Finding Language Examples and Information on the Web | Phora Nova
The Phora is an OSINT forum, so every once in a while we publish collections of links.
I was working on a project to index certain files being used by LLMs to navigate the web, and found some older but still maintained resources.
Back ca. 1998, the Open Directory Project set out to create an index of the known web. Netscape picked it up later that year and ran it as the DMOZ directory (short for directory.mozilla.org), in the days of the Mozilla browser. At its height, ca. 2006, it had 95,000 curators (editors) and over a million categories. AOL took over Netscape, but work continued on DMOZ.
Then in 2017, for reasons unknown (money, likely), AOL decided to kill funding for the project. However, the open-source community stepped up to the plate and continued the project as ‘Curlie’, losing the lizard icon and selecting a red squirrel.
Be forewarned that it is a rare category that has more than, say, 10 items, but the level of curation compensates for that, and some of these sites you can only find this way. The search function is still aughties tier, probably original, which means it is helpful but so-so by modern standards.
I predict that such human curation will become more important in the future, as AI slop takes over search engines and the web in general.
As a sample of the scope of indexed pages, here is the coverage of languages from a linguistics perspective:
https://curlie.org/en/Science/Social_Sciences/Linguistics/Languages/Natural/
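For anyone scripting against the directory, the link above suggests that a category URL is just the site root, a language code, and the category path with underscores in place of spaces. Here is a minimal sketch of that pattern in Python. The function name `curlie_category_url` and the `lang` parameter are my own illustration, and I am assuming other categories follow the same layout as this one link; check a few by hand before relying on it.

```python
from urllib.parse import quote

# Site root, taken from the link above.
CURLIE_BASE = "https://curlie.org"


def curlie_category_url(path_segments, lang="en"):
    """Build a category URL from a list of category names.

    Hypothetical helper: it assumes the <root>/<lang>/<Category>/<Sub_Category>/
    layout seen in the linguistics link holds for other categories too.
    """
    # Category names use underscores instead of spaces in the link above.
    cleaned = [quote(seg.strip().replace(" ", "_")) for seg in path_segments]
    return f"{CURLIE_BASE}/{lang}/" + "/".join(cleaned) + "/"


if __name__ == "__main__":
    print(curlie_category_url(
        ["Science", "Social Sciences", "Linguistics", "Languages", "Natural"]
    ))
```

Run as a script, this prints the same linguistics URL given above. It only builds the string; it does not fetch anything or confirm that the category exists.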
The index itself sports 92 language translations, which I list below. (Scroll past the table of contents to the end for the languages.)
Curlie – The Collector of URLs
1 Curlie – Adult – hidden from top level and structure only (no content)
dmoz – 15 subcategories
2 Curlie – Arts 54 subcategories
dmoz – 55 subcategories [[llms-arts]]
3 Curlie – Business 48 subcategories
dmoz – [[llms-business]]
4 Curlie – Computers 54 subcategories
dmoz – 53 subcategories [[llms-computers]]
- Curlie – Computers: Internet (49) – dmoz has 50
- Curlie – Computers: Software (89)
5 Curlie – Games 32 Subcategories
dmoz – 31 subcategories [[llms-games]]
6 Curlie – Health 47 Subcategories
dmoz – [[llms-health]]
7 Curlie – Home 25 Subcategories
dmoz – no category page [[llms-homes]]
8 Curlie – Kids and Teens 14 Subcategories [[Curlie L2 Kids and Teens]]
dmoz – [[llms-kids-and-teens]]
9 is missing because it is obsolete. It was the category ‘Netscape’.
10 Curlie – News 26 Subcategories
dmoz – 25 subcategories [[llms-news]]
11 Curlie – Recreation 49 Subcategories
dmoz – 48 subcategories [[llms-recreation]]
12 Curlie – Reference 27 Subcategories
dmoz – [[llms-reference]]
13 Curlie – Regional 11 Subcategories
dmoz – [[llms-regional]]
14 Curlie – Science 35 Subcategories
dmoz – [[llms-science]]
- Curlie – Science: Technology (63)
15 Curlie – Shopping 45 Subcategories
dmoz – no category page [[llms-shopping]]
- Curlie – Shopping: Consumer Electronics; see also Consumer Electronics Complete List
16 Curlie – Society 35 Subcategories
dmoz – 37 subcategories [[llms-society]]
17 Curlie – Sports
dmoz – [[llms-sports]]
18 World is at language level – 92 languages (2 artificial + Latin)
dmoz – no category pages but many similar languages [[llms-world]]
NOTE: the numbers after the languages below are a category id for the language, and not useful except for keeping straight which language is which.
Curlie Languages (92)
See [[dmoz languages]]
EN English
DE German 7648
FR French 7735
JA Japanese 7944
IT Italian 7929
ES Spanish 7678
RU Russian 9431
NL Dutch 9257
PL Polish 9360
TR Turkish 9753
Northern Europe
DA Danish 7633
SV Swedish 9566
NO Norwegian 9288
IS Icelandic 7914
FO Faroese ???
FI Finnish 9551
ET Estonian 7665
LT Lithuanian 9189
LV Latvian 9162
Western Europe Regional
CY Welsh 7617
GA Irish 7772
GD Scots Gaelic 7786
BR Breton 7527
FY West Frisian 7751
FRR North Frisian XXXX
GEM Saterland Frisian 9462
LB Luxembourgish XXXX
RM Romansh XXXX
Southern Europe
PT Portuguese 9375
CA Catalan 7557
GL Galego 7801
EU Basque 7709
AST Asturian 7412
AN Aragonese
FUR Furlan / Friulian
SC Sardinian
SCN Sicilian
OC Occitan
Central Europe
BE Belarussian 7498
CS Czech 7572
HU Hungarian 9219
SK Slovak 9522
UK Ukrainian 9784
CSB Kashubian 8874
TT Tatar 9625
BA Bashkir 7482
OS Ossetian 9315
Southeastern Europe
SL Slovene 9509
SR Serbian 9537
HR Croatian 7886
BS Bosnian 7512
BG Bulgarian 7541
SQ Albanian 9476
RO Romanian 9401
MK Macedonian 9235
EL Greek 7814
Middle East and Africa
IW Hebrew 7849
FA Persian 9346
AR Arabic XXXX
KU Kurdish XXXX
AZ Azerbaijani
HY Armenian XXXX
AF Afrikaans XXXX
SW Swahili 8798
Central Asia
UZ Uzbek 9330
KK Kazakh 8783
KY Kyrgyz 9174
TG Tajik XXXX
TK Turkmen XXXX
UG Uyghurche XXX
Southern Asia
HI Hindi 7865
SI Sinhalese 9494
GU Gujarati 7829
UR Urdu 9800
MR Marathi 9249
PA Punjabi Gurmukhi 9390
BN Bengali 7469
TA Tamil 9611
TE Telugu 9640
KN Kannada 8759
Eastern Asia
ZH-CN Chinese_Simplified 7588
ZH-TW Chinese_Traditional 7602
KO Korean 8808
CFR Taiwanese 9596
TH Thai 9654
VI Vietnamese 9758
IN Bahasa Indonesia – Indonesia 7440
MS Bahasa Melayu – Malay XXXX
TL Tagalog 9581
Other
EO Esperanto 7694
IA Interlingua 7900
LA Latin 9204