
Extracting and Utilizing Structured Data from the Open Web Index

Author: Caspari, Laura; Dinzinger, Michael; Granitzer, Michael; Mitrovic, Jelena
Publisher: Zenodo
DOI: 10.5281/zenodo.17228339
Source: https://zenodo.org/records/17228339/files/ossym-2025--n05-s83--Extracting-and-Utilizing-Structured-Data-from-the-Open-Web-Index--L-Caspari--v1-a4-doi.pdf
L. Caspari, M. Dinzinger, J. Mitrovic, M. Granitzer, University of Passau, Passau, Germany
Abstract
Structured data is a valuable source of information that can be found on many web pages and can be extracted efficiently during crawling. It is often encoded in the form of JavaScript Object Notation for Linked Data (JSON-LD) or Microdata using schema.org definitions of entities such as FAQ pages or addresses, allowing efficient parsing and extraction of data. The OpenWebSearch.EU (OWS) [Hendriksen et al. (2024)] project, which publicly releases crawled web data on a regular basis, is a useful source of fresh structured data, as published datasets contain specific columns for JSON-LD and Microdata encountered during crawling. In this paper, we present initial statistics on the occurrence of structured data in the OWS datasets, focusing on the presence of certain entities in schema.org, namely Frequently Asked Questions (FAQs), opening hours, phone numbers, and addresses. Additionally, we discuss two practical application scenarios for the extracted data. In our first use case, in line with our previous work [Dinzinger et al. (2025)], we demonstrate how FAQ data can be used to construct multilingual Q&A-style datasets, which can be used to train large language models (LLMs) for tasks like question answering or retrieval. In our second case, we show the potential of structured data to enrich map applications and improve user experience. These use cases exemplify the value of structured data and demonstrate the benefits of its systematic extraction and integration into real-world applications.
INTRODUCTION
Structured data provides an important source of information about webpages that can easily be parsed and ingested by downstream applications like search engines or map providers. It can be used to enrich the results shown to users, e.g. by displaying the address and opening hours of a shop. Structured data is specified by webmasters using schema.org definitions of entities like addresses, opening hours or FAQs. While it can be specified in a variety of formats, JSON-LD and Microdata have emerged as popular choices [Volpini et al. (2024)]. As defining information in these formats requires additional effort from webmasters, the data is generally of high quality and can easily be extracted during crawling. However, many downstream applications require the extracted data to be fresh, necessitating frequent revisits to webpages. An important source of fresh structured data is the OpenWebSearch project [Granitzer et al. (2024)], which releases crawled JSON-LD and Microdata as part of its regularly released datasets. Thus, the project presents an important resource of publicly available and fresh structured data. To better understand the prevalence of JSON-LD and Microdata within the OWS index, we analyze datasets from five days in February 2025 with a total size of 1.6TB. We find that more than half (54.2%) of the crawled webpages use JSON-LD or Microdata. However, when attempting to extract specific schemas, the ratio drops significantly, e.g. phone numbers can only be found on 1.8% of webpages.
Given these extracted schemas, we establish two use cases for structured data, namely leveraging FAQ pages to construct a Q&A dataset that can be used to train or evaluate LLMs on question answering or retrieval tasks, and extracting FAQs, opening hours, addresses and phone numbers to enhance map applications. Our use cases demonstrate the potential of structured data in downstream applications and the importance of having publicly available and fresh data to enable researchers and companies to build upon the wealth of information contained within. The code to reproduce our results is available on GitHub¹.
The remainder of this paper is structured as follows: After looking at related work, the methodology section gives an overview of the schema definitions we focus on and explains our extraction approach. The following section contains statistics about the occurrence of Microdata and JSON-LD as well as of specific schemas like addresses or phone numbers. We then detail our two use cases for the extracted data before discussing limitations and problems with data quality. Finally, we conclude our paper with a summary of our findings and future extensions of our work.
RELATED WORK
Using schema.org² classes to represent information about entities like restaurants, events or products has become increasingly prevalent since the introduction of schema specifications in 2011 [Brinkmann et al. (2023), Volpini et al. (2024)]. Established by Google, Bing, Yahoo and Yandex to offer webmasters a unified way of defining structured data, it provides information in machine-readable and widely accepted formats, with JSON-LD and Microdata being common ways in which structured data is made available. Microdata is an extension to the HTML5 specification³ and is thus added directly to the HTML tags themselves, whereas JSON-LD is specified within one set of script tags⁴, making it easy to extract. While both formats have seen increasing adoption in the past years with JSON-LD being present on 41% of webpages and Microdata on 26% in 2024 [Volpini et al. (2024)], their usage patterns differ. As analyzed by Volpini et al., JSON-LD is most commonly used for organization data, local businesses and product listings, whereas Microdata often specifies webpage structure or site navigation. Apart from the imbalance of schema usage between the different formats, adoption of schema annotations also varies depending on the domain with a much higher usage of entities like products or local businesses [Brinkmann et al. (2023)] than entities related to educational resources [Navarrete et al. (2019)].

¹ https://github.com/padas-lab-de/owi-sdm
² https://schema.org/
³ https://html.spec.whatwg.org/multipage/microdata.html
⁴ https://json-ld.org/
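The difference between the two formats can be illustrated with a minimal sketch that uses only Python's standard library. The class name StructuredDataParser and the flat (itemprop, value) output are illustrative assumptions and not part of the OWS pipeline; nested Microdata items (itemscope) are deliberately not resolved here.

```python
import json
from html.parser import HTMLParser


class StructuredDataParser(HTMLParser):
    """Collects JSON-LD blocks and flat Microdata itemprop values."""

    def __init__(self):
        super().__init__()
        self.json_ld = []      # parsed JSON-LD documents
        self.microdata = []    # (itemprop, value) pairs
        self._in_json_ld = False
        self._itemprop = None  # itemprop whose text is being collected
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "script" and attrs.get("type") == "application/ld+json":
            # JSON-LD lives in one self-contained script element.
            self._in_json_ld = True
            self._buffer = []
        elif "itemprop" in attrs:
            # Microdata is attached to ordinary HTML elements.
            if attrs.get("content") is not None:
                self.microdata.append((attrs["itemprop"], attrs["content"]))
            else:
                self._itemprop = attrs["itemprop"]
                self._buffer = []

    def handle_data(self, data):
        if self._in_json_ld or self._itemprop is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_ld:
            self._in_json_ld = False
            try:
                self.json_ld.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # malformed JSON-LD block: skip it
        elif self._itemprop is not None:
            text = "".join(self._buffer).strip()
            self.microdata.append((self._itemprop, text))
            self._itemprop = None


if __name__ == "__main__":
    page = (
        '<script type="application/ld+json">'
        '{"@type": "LocalBusiness", "telephone": "(425) 614-3256"}'
        "</script>"
        '<span itemprop="telephone">(425) 614-3256</span>'
    )
    parser = StructuredDataParser()
    parser.feed(page)
    print(parser.json_ld, parser.microdata)
```

The JSON-LD branch needs a single json.loads call per script element, whereas the Microdata branch has to track element boundaries, which reflects why JSON-LD is described as the easier format to extract.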
The higher prevalence of structured data increases the value of extracting its content for downstream applications. Apart from using structured data to increase search visibility [Recalde et al. (2021)], it can also be extracted to provide training data for machine learning models [Peeters et al. (2020), Dinzinger et al. (2025)]. While schema-based annotations require additional effort by webmasters and are thus generally of high quality, applications using structured data still need to filter out low-quality samples. For instance, a high percentage of schema.org dataset annotations do not describe actual datasets [Alrashed et al. (2021)], drastically limiting the usability of this schema for dataset search. Similarly, certain properties of common entities like products, e.g. the product ID or category, are seldom filled [Brinkmann et al. (2023)].
METHODOLOGY
The following paragraphs offer a general introduction into the specification of structured data and define the exact schema classes that are of interest. Subsequently, an overview of the extraction pipeline is provided along with the format in which extracted data is stored.
Defining Structured Data with Schemas
In our context, structured data is specified using entities defined by the schema.org type hierarchy. For each entity, e.g. an FAQ page, schema.org defines the properties and its type, which are specified in a key-value-based manner. While there are various ways of specifying structured schema data, this paper will focus on JSON-LD and Microdata, which are part of the datasets published by OWS. Due to our current use cases, we will specifically consider schemas for defining phone numbers⁵, addresses⁶, opening hours⁷ and FAQ pages⁸.
Figure 1 shows an excerpt of the aforementioned schemas in JSON-LD which were encountered when crawling the webpage of a Subway store located in Seattle. The structured data contains important information about the store which can be used to enrich downstream applications.
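As a rough illustration of how such key-value data can be consumed, the following sketch walks a parsed JSON-LD document and collects the values of the properties considered here. The function names and the choice to search every nested object (including "@graph" wrappers) are assumptions made for this example, not a description of the authors' extraction code.

```python
TARGET_KEYS = ("telephone", "address", "openingHours")


def iter_nodes(node):
    """Yield every dict nested anywhere inside a parsed JSON-LD document."""
    if isinstance(node, dict):
        yield node
        for value in node.values():
            yield from iter_nodes(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_nodes(item)


def extract_fields(document):
    """Collect all values stored under the target property names."""
    found = {key: [] for key in TARGET_KEYS}
    for node in iter_nodes(document):
        for key in TARGET_KEYS:
            if key in node:
                found[key].append(node[key])
    return found


if __name__ == "__main__":
    store = {
        "@type": "LocalBusiness",
        "telephone": "(425) 614-3256",
        "address": {"@type": "PostalAddress", "postalCode": "98007"},
        "openingHours": ["Mo 08:00-22:00", "Sa 09:00-22:00"],
    }
    print(extract_fields(store))
```

Searching by property name rather than by entity type keeps the sketch short, at the cost of also matching properties on entity types that a stricter extractor might want to exclude.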
Extracting and Merging Schemas
While structured data is a valuable resource, it is not available for every webpage. Therefore, we first use owilix⁹ to download OWS datasets from five different days in February 2025, which contain files in Parquet format¹⁰. Specifically, we use the datasets published on the 19th and 21st-24th of February. As the Parquet files include specific columns for Microdata and JSON-LD, we subsequently filter out all entries for which both columns are empty and only use columns that are of interest to us, reducing the initial size of 1.6TB by a factor of four. We then apply our extraction code on the filtered data to obtain FAQs, opening hours, addresses and phone numbers contained in the structured data, with the extracted information being saved to Parquet files. As the structure of the data is quite dependent on the schema, we store each in a separate Parquet file with the exception of phone numbers and addresses, which are merged together. The resulting files organized per day are available for download from our MinIO instance¹¹.

{
  ...
  "telephone": "(425) 614-3256",
  "address": {
    "@type": "PostalAddress",
    "addressCountry": "US",
    "addressLocality": "Bellevue King",
    "addressRegion": "WA",
    "postalCode": "98007",
    "streetAddress": "1410 156th Ave NE"
  },
  "openingHours": ["Mo 08:00-22:00", "Tu 08:00-22:00", "We 08:00-22:00", "Th 08:00-22:00", "Fr 08:00-22:00", "Sa 09:00-22:00", "Su 09:00-22:00"],
  "@type": "FAQPage",
  "mainEntity": [{
    "@type": "Question",
    "name": "How can I place a Subway Catering order?",
    "acceptedAnswer": {
      "@type": "Answer",
      "text": "To place an order, visit us online at catering.subway.com or call your local restaurant."
    }
    ...
  }]
  ...
}
Figure 1: An excerpt of JSON-LD extracted from the page of a Subway store in Seattle.

⁵ https://schema.org/telephone
⁶ https://schema.org/address
⁷ https://schema.org/openingHours
⁸ https://schema.org/FAQPage
⁹ https://opencode.it4i.eu/openwebsearcheu-public/owi-cli
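The filtering step described above (dropping entries whose JSON-LD and Microdata columns are both empty and keeping only the columns of interest) could look roughly as follows. This is a sketch over plain Python dictionaries rather than Parquet files, and the column names "url", "json_ld" and "microdata" are hypothetical placeholders; the actual OWS column names may differ.

```python
# Hypothetical column names; the actual OWS Parquet schema may differ.
STRUCTURED_COLUMNS = ("json_ld", "microdata")
KEEP_COLUMNS = ("url",) + STRUCTURED_COLUMNS


def has_structured_data(row):
    """True if at least one structured-data column holds non-blank text."""
    return any((row.get(col) or "").strip() for col in STRUCTURED_COLUMNS)


def filter_rows(rows):
    """Drop rows without structured data and project onto KEEP_COLUMNS."""
    for row in rows:
        if has_structured_data(row):
            yield {col: row.get(col) for col in KEEP_COLUMNS}


if __name__ == "__main__":
    rows = [
        {"url": "https://example.org/a", "json_ld": "{}", "microdata": None, "plain_text": "..."},
        {"url": "https://example.org/b", "json_ld": "", "microdata": "", "plain_text": "..."},
    ]
    print(list(filter_rows(rows)))
```

Both the row filter and the column projection reduce the data volume, which is in line with the size reduction reported for the filtered datasets.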
DATA EXPLORATION
To get an initial idea about how often structured data appears in the OWS crawls, we analyze the filtered datasets from February 2025 and find that 54.2% of webpages contain Microdata or JSON-LD. However, as shown in Table 1, this number quickly drops when looking at a specific schema. In fact, all schemas we are interested in occur on less than 2.1% of webpages.
Taking a closer look at the individual schemas, we also observe that a significant number of them are malformed or contain invalid data when applying simple sanity checks. To ensure some basic quality of the extracted data, we ignore entries that only contain empty values. Furthermore, for phone numbers and opening hours, we ensure that the extracted string contains at least one digit. Figure 2 illustrates that these simple measures already lead to a large number of discarded entries, showing that many webmasters struggle to adhere to the schema format or insert empty or invalid values. A manual inspection of parts of the extracted data further revealed that while most entries contain sensible information, some webmasters used unhelpful default values, e.g. "question" and "answer" in extracted Q&A pairs. Another issue with data quality is posed by entries that only contain partial information, e.g. an address that only mentions the city, but not the street address of the entity. Further processing of the extracted data to ensure high quality thus poses an important but non-trivial task for our multilingual data.

Table 1: Occurrence of specific schemata in OWS datasets.

Schema Name                  Count        % of webpages
Telephone                    5,352,078    1.78
Address                      6,121,713    2.04
FAQPage                      1,645,691    0.55
OpeningHours                 2,210,530    0.73
OpeningHoursSpecification    1,639,338    0.54

¹⁰ https://parquet.apache.org/
¹¹ https://console.share.innkube.fim.uni-passau.de/browser/public/ows-extracted%2F
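The sanity checks mentioned above (discarding entries with only empty values and requiring at least one digit in phone numbers and opening hours) can be sketched as follows. The helper names are illustrative, and the placeholder filter for Q&A pairs goes beyond what the text states was applied: it is an assumed extension based on the default values found during manual inspection.

```python
import re

# Default values observed during manual inspection; filtering them is an
# assumed extension, not a check the text reports as applied.
PLACEHOLDERS = {"question", "answer"}


def is_empty(value):
    """True if a value (str, list or dict) carries no non-blank text."""
    if value is None:
        return True
    if isinstance(value, str):
        return not value.strip()
    if isinstance(value, dict):
        # Ignore JSON-LD keywords such as "@type" when judging emptiness.
        return all(is_empty(v) for k, v in value.items() if not k.startswith("@"))
    if isinstance(value, (list, tuple)):
        return all(is_empty(v) for v in value)
    return False


def has_digit(text):
    """True if the string contains at least one digit."""
    return bool(re.search(r"\d", text))


def keep_digit_field(value):
    """Check for phone numbers and opening-hours strings."""
    return isinstance(value, str) and not is_empty(value) and has_digit(value)


def keep_qa(question, answer):
    """Reject blank or placeholder-only question/answer pairs."""
    return all(
        isinstance(text, str)
        and bool(text.strip())
        and text.strip().lower() not in PLACEHOLDERS
        for text in (question, answer)
    )
```

These checks are purely syntactic; they cannot tell whether a well-formed phone number or address is actually correct for the entity it is attached to.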
USE CASES
In the following sections, we describe two real-world use cases for structured data. The first use case, the extraction of FAQ-style annotations to build a Q&A dataset, has already been implemented. The second use case of extracting structured data to enrich map applications is a work in progress in collaboration with Murena¹², a company that provides deGoogled and privacy-preserving smartphones and cloud services.
FAQ Dataset
FAQ pages represented in structured data provide an interesting resource for building Q&A datasets. Their natural separation into questions and answers makes it easy to leverage them for question answering tasks. Furthermore, as the FAQ page schema requires an answer to be specified as either accepted or suggested, the schema contains an implicit relevance signal which can be extracted to make the dataset usable for retrieval tasks. Our recent work [Dinzinger et al. (2025)], in which we built a large-scale multilingual retrieval dataset by extracting FAQ page schemas from data provided by the Web Data Commons (WDC) project¹³, clearly demonstrates the use of FAQ-style structured data. Furthermore, we show that multilingual FAQs can be used to build bilingual corpora for a large number of language combinations. Both WebFAQ retrieval¹⁴ and WebFAQ bitext¹⁵ are available on HuggingFace and as part of the Massive Text Embedding Benchmark (MTEB) [Muennighoff et al. (2023)] Python package.

Figure 2: The number of found and extracted entries per schema in millions (bar chart over telephone, address and openingHours; y-axis from 0.0M to 40.0M; series: found, extracted).

¹² https://murena.com/
¹³ https://webdatacommons.org/
¹⁴ https://huggingface.co/datasets/PaDaS-Lab/webfaq-retrieval
¹⁵ https://huggingface.co/datasets/PaDaS-Lab/webfaq-bitexts
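To make the relevance signal concrete, the sketch below turns a FAQPage node into (question, answer, label) triples, where the label records whether the answer was given as acceptedAnswer or suggestedAnswer. The function names and the triple format are assumptions for illustration and do not reproduce the WebFAQ procedure.

```python
def as_list(value):
    """Wrap single values in a list; JSON-LD allows both forms."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def has_type(node, name):
    """True if the node's @type (string or list) includes the given name."""
    return name in as_list(node.get("@type"))


def faq_pairs(page):
    """Yield (question, answer, label) triples from a FAQPage node."""
    if not isinstance(page, dict) or not has_type(page, "FAQPage"):
        return
    for question in as_list(page.get("mainEntity")):
        if not isinstance(question, dict) or not has_type(question, "Question"):
            continue
        name = question.get("name")
        if not isinstance(name, str):
            continue
        for label in ("acceptedAnswer", "suggestedAnswer"):
            for answer in as_list(question.get(label)):
                if isinstance(answer, dict) and isinstance(answer.get("text"), str):
                    yield name.strip(), answer["text"].strip(), label


if __name__ == "__main__":
    page = {
        "@type": "FAQPage",
        "mainEntity": [{
            "@type": "Question",
            "name": "How can I place a Subway Catering order?",
            "acceptedAnswer": {"@type": "Answer", "text": "To place an order, visit us online."},
        }],
    }
    for triple in faq_pairs(page):
        print(triple)
```

For a retrieval dataset, each question can then serve as a query and its answer as the relevant passage, with the label available as an optional relevance grade.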
While the WDC dumps provide a large resource of natural Q&A data, they are updated only on a yearly basis, thus likely containing many stale FAQs. The regularly published OWS datasets can alleviate this problem by providing fresh data of crawled web pages. We therefore apply the procedure developed to generate WebFAQ on the OWS data, extracting around 9.95 million Q&A pairs across the five days. To build a multilingual retrieval corpus, we perform language classification on the extracted Q&A pairs using FastText [Joulin et al. (2016)]. Figure 3 shows the distribution of the 10 most common languages found in the extracted FAQ data. While English unsurprisingly occurs most often, we also extract a large number of Q&A pairs for other languages like German, Spanish or French. Similarly to WebFAQ, the FAQs extracted from OWS data are available as a collection of multilingual retrieval datasets on HuggingFace¹⁶.
Enriching Map Applications
Apart from the FAQPage schema serving as a valuable starting point for Q&A datasets, the schemas we have extracted can also serve as a valuable resource for (non-)commercial map applications. To this end, we are currently collaborating with Murena, in an effort to enhance the data provided by OpenStreetMap¹⁷. While OpenStreetMap provides useful information like addresses or opening hours for points of interest, driven by a community of human mappers that contribute the data, certain parts of this information like the opening hours of a shop might change too frequently to be kept up to date. This can lead to undesirable situations if users rely on incorrect data, e.g. if they choose to visit a shop just to find that the opening hours are outdated and the shop has already closed for the day. To alleviate this problem, we aim to crawl and extract information available in structured data for specific URLs that Murena is interested in on a regular basis. As an initial test, we crawled 10,547 URLs representing points of interest in the area of Seattle and extracted phone numbers from 122 (1.2%) webpages, addresses from 326 (3.1%), FAQs from 290 (2.7%) and opening hours from 711 (6.7%). Although the absolute number of extracted schemas remains low, they can still contribute valuable and fresh information for a large number of locations. As an example, Figure 4 demonstrates how the data extracted from the JSON-LD partially shown in Figure 1 can be presented to users. The data was extracted from the webpage of a Subway store in Seattle and clearly contains information that would benefit a map application.

Language      Share
english       52.8%
other         15.4%
german        7.0%
spanish       5.3%
french        4.7%
russian       3.7%
japanese      3.1%
dutch         2.4%
portuguese    2.1%
italian       2.0%
polish        1.4%
Figure 3: Language distribution of the 10 most common languages on FAQ pages.

¹⁶ https://huggingface.co/datasets/PaDaS-Lab/owi-faq-retrieval
¹⁷ https://www.openstreetmap.org
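One small step between the extracted JSON-LD and a user-facing display is condensing the per-day opening hours into ranges. The sketch below shows one possible way to do this; it assumes every entry has the form "<day code> <hours>" with a single two-letter day, as in Figure 1, and does not handle entries that already contain day ranges. The function name is illustrative.

```python
DAYS = ["Mo", "Tu", "We", "Th", "Fr", "Sa", "Su"]


def group_opening_hours(entries):
    """Merge runs of consecutive days that share the same hours."""
    hours_by_day = {}
    for entry in entries:
        day, _, hours = entry.strip().partition(" ")
        if day in DAYS and hours.strip():
            hours_by_day[day] = hours.strip()

    groups = []  # each group is [first_day, last_day, hours]
    previous_index = None
    for index, day in enumerate(DAYS):
        hours = hours_by_day.get(day)
        if hours is None:
            continue
        if groups and groups[-1][2] == hours and previous_index == index - 1:
            groups[-1][1] = day  # extend the current run
        else:
            groups.append([day, day, hours])
        previous_index = index

    return [
        f"{first}: {hours}" if first == last else f"{first}-{last}: {hours}"
        for first, last, hours in groups
    ]


if __name__ == "__main__":
    opening_hours = [
        "Mo 08:00-22:00", "Tu 08:00-22:00", "We 08:00-22:00", "Th 08:00-22:00",
        "Fr 08:00-22:00", "Sa 09:00-22:00", "Su 09:00-22:00",
    ]
    print(group_opening_hours(opening_hours))
    # -> ['Mo-Fr: 08:00-22:00', 'Sa-Su: 09:00-22:00']
```

Days with identical hours are only merged when they are adjacent in the week, so a shop closed on Wednesday yields separate entries before and after the gap.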
LIMITATIONS
While extracting information from structured data seems straightforward at first glance, working with real-world data has proven to be more challenging. One such challenge is posed by the schema definitions themselves. For instance, there are two different ways to specify opening hours: the OpeningHoursSpecification schema, which provides the information as a dictionary with a defined set of keys, and the openingHours¹⁸ property, which provides it as simple text as shown in Figure 1. Other common issues are missing values for some fields, fields containing placeholder values and data not conforming to the specified schema.
As our main focus lies on extracting the information, we address the first problem by implementing extractors specific to each schema type and storing the information in separate columns of the output Parquet files. Thus, we leave it to downstream applications to merge data from different schemas describing the same entity. While we apply simple sanity checks to the extracted data like checking if the schema contains only empty strings or whether dates or phone numbers contain at least one digit, doing comprehensive filtering on a multilingual corpus is a non-trivial task. As such, we do not apply any complex filtering techniques on the extracted data to ensure its semantic validity. Additionally, we are unable to verify the correctness or freshness of the extracted data, e.g. whether a phone number found on the page of a shop really belongs to it and whether the number is still up to date. Thus, downstream applications wishing to use the extracted information will likely have to implement additional filtering techniques on top of our data to ensure high quality.

1410 156th Ave NE, Bellevue King, WA 98007, US
(425) 614-3256
Mo-Fr: 08:00-22:00
Sa-Su: 09:00-22:00
Subway
Questions and Answers
Question: How can I place a Subway catering order?
Answer: To place an order, visit us online at catering.subway.com or call your local restaurant.
View all questions and answers
Figure 4: The data extracted for a Subway store in Seattle and how it could be presented to users.

¹⁸ https://schema.org/openingHours
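A downstream application that wants to merge the two opening-hours representations could, for example, normalize OpeningHoursSpecification dictionaries into the textual openingHours form. The following is a minimal sketch under the assumption that each specification uses the schema.org properties dayOfWeek, opens and closes; validity periods and other optional properties are ignored, and the function name is illustrative.

```python
DAY_CODES = {
    "Monday": "Mo", "Tuesday": "Tu", "Wednesday": "We", "Thursday": "Th",
    "Friday": "Fr", "Saturday": "Sa", "Sunday": "Su",
}


def spec_to_text(spec):
    """Convert one OpeningHoursSpecification dict to openingHours-style strings."""
    days = spec.get("dayOfWeek")
    days = days if isinstance(days, list) else [days]
    opens, closes = spec.get("opens"), spec.get("closes")
    if not (isinstance(opens, str) and isinstance(closes, str)):
        return []
    result = []
    for day in days:
        if not isinstance(day, str):
            continue
        # dayOfWeek may be a plain name or a URL such as https://schema.org/Monday
        code = DAY_CODES.get(day.rstrip("/").rsplit("/", 1)[-1])
        if code:
            # Keep only HH:MM, dropping a possible seconds component.
            result.append(f"{code} {opens[:5]}-{closes[:5]}")
    return result


if __name__ == "__main__":
    spec = {
        "@type": "OpeningHoursSpecification",
        "dayOfWeek": ["Monday", "https://schema.org/Tuesday"],
        "opens": "08:00",
        "closes": "22:00:00",
    }
    print(spec_to_text(spec))
```

Converting towards the simpler text form loses no information for plain weekly schedules, but the reverse direction would be needed if date-specific exceptions have to be preserved.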
CONCLUSION
Structured data has proven to be an easily extractable and valuable resource for various application scenarios. In this work, we focused on analyzing and extracting certain types of Microdata and JSON-LD from datasets provided by the OWS project. We found that while structured data is available on more than half of the crawled webpages, the occurrence of specific schemas like addresses or opening hours is much less common. Nevertheless, we demonstrate the usefulness of the extracted data in two application scenarios, first generating a question answering and retrieval dataset using the FAQPage schema, and then providing additional information like opening hours, addresses and phone numbers for points of interest, which can be used to enrich map applications.
As we believe that providing information extracted from structured data is of general interest, we plan to integrate the extraction mechanism as a regular step in the OWS preprocessing pipeline and create a new collection index of extracted structured data. This would allow interested parties to download only the extracted data instead of the much larger standard OWS datasets, as well as to update information on points of interest on a regular basis without having to set up their own extraction pipelines. We will also expand our work with Murena to crawl more points of interest and provide the extracted information as part of the new collection index.
ACKNOWLEDGEMENTS
This work has received funding from the European Union’s Horizon Europe research and innovation program under grant agreement No 101070014 (OpenWebSearch.EU, https://doi.org/10.3030/101070014).
REFERENCES
[Alrashed et al. (2021)] Tarfah Alrashed, Dimitris Paparas, Omar Benjelloun, Ying Sheng, and Natasha Noy. 2021. Dataset or Not? A Study on the Veracity of Semantic Markup for Dataset Pages. In The Semantic Web – ISWC 2021, Andreas Hotho, Eva Blomqvist, Stefan Dietze, Achille Fokoue, Ying Ding, Payam Barnaghi, Armin Haller, Mauro Dragoni, and Harith Alani (Eds.). Springer International Publishing, Cham, 338–356.

[Brinkmann et al. (2023)] Alexander Brinkmann, Anna Primpeli, and Christian Bizer. 2023. The Web Data Commons Schema.org Data Set Series. In Companion Proceedings of the ACM Web Conference 2023 (Austin, TX, USA) (WWW ’23 Companion). Association for Computing Machinery, New York, NY, USA, 136–139. https://doi.org/10.1145/3543873.3587331

[Dinzinger et al. (2025)] Michael Dinzinger, Laura Caspari, Kanishka Ghosh Dastidar, Jelena Mitrović, and Michael Granitzer. 2025. WebFAQ: A Multilingual Collection of Natural Q&A Datasets for Dense Retrieval. arXiv:2502.20936 [cs.CL] https://arxiv.org/abs/2502.20936

[Granitzer et al. (2024)] Michael Granitzer, Stefan Voigt, Noor Afshan Fathima, Martin Golasowski, Christian Guetl, Tobias Hecking, Gijs Hendriksen, Djoerd Hiemstra, Jan Martinovič, Jelena Mitrović, Izidor Mlakar, Stavros Moiras, Alexander Nussbaumer, Per Öster, Martin Potthast, Marjana Senčar Srdič, Sharikadze Megi, Kateřina Slaninová, Benno Stein, Arjen P. de Vries, Vít Vondrák, Andreas Wagner, and Saber Zerhoudi. 2024. Impact and development of an Open Web Index for open web search. Journal of the Association for Information Science and Technology 75, 5 (2024), 512–520. https://doi.org/10.1002/asi.24818

[Hendriksen et al. (2024)] Gijs Hendriksen, Michael Dinzinger, Sheikh Mastura Farzana, Noor Afshan Fathima, Maik Fröbe, Sebastian Schmidt, Saber Zerhoudi, Michael Granitzer, Matthias Hagen, Djoerd Hiemstra, Martin Potthast, and Benno Stein. 2024. The Open Web Index. In Advances in Information Retrieval, Nazli Goharian, Nicola Tonellotto, Yulan He, Aldo Lipani, Graham McDonald, Craig Macdonald, and Iadh Ounis (Eds.). Springer Nature Switzerland, Cham, 130–143.

[Joulin et al. (2016)] Armand Joulin, Edouard Grave, Piotr Bojanowski, Matthijs Douze, Hervé Jégou, and Tomas Mikolov. 2016. FastText.zip: Compressing text classification models. https://doi.org/10.48550/ARXIV.1612.03651

[Muennighoff et al. (2023)] Niklas Muennighoff, Nouamane Tazi, Loïc Magne, and Nils Reimers. 2023. MTEB: Massive Text Embedding Benchmark. arXiv:2210.07316 [cs.CL]

[Navarrete et al. (2019)] Rosa Navarrete, Lorena Recalde, Carlos Montenegro, and Sergio Luján-Mora. 2019. Analyzing Embedded Semantic with JSON-LD and Microdata for Educational Resources in Large Scale Web Datasets. In 2019 International Conference on Computational Science and Computational Intelligence (CSCI). 1133–1138. https://doi.org/10.1109/CSCI49370.2019.00214

[Peeters et al. (2020)] Ralph Peeters, Anna Primpeli, Benedikt Wichtlhuber, and Christian Bizer. 2020. Using schema.org Annotations for Training and Maintaining Product Matchers. In Proceedings of the 10th International Conference on Web Intelligence, Mining and Semantics (Biarritz, France) (WIMS 2020). Association for Computing Machinery, New York, NY, USA, 195–204. https://doi.org/10.1145/3405962.3405964

[Recalde et al. (2021)] Lorena Recalde, Rosa Navarrete, and Fernando Pogo. 2021. Making Open Educational Resources Discoverable: A JSON-LD Generator for OER Semantic Annotation. In 2021 Eighth International Conference on eDemocracy & eGovernment (ICEDEG). 182–187. https://doi.org/10.1109/ICEDEG52154.2021.9530872

[Volpini et al. (2024)] Andrea Volpini, Jarno van Driel, Ryan Levering, Nurullah Demir, and James Gallagher. 2024. Structured data. HTTP Archive, Chapter 3. https://doi.org/10.5281/zenodo.14065771