Finding Language Examples and Information on the Web


The Phora is an OSINT forum, so every once in a while we publish collections of links ;)

I was working on a project to index certain files being used by LLMs to navigate the web, and found some older but still maintained resources.

Back ca. 1998, the Open Directory Project started to create an index of the known web. It was soon picked up by Netscape and became known as the DMOZ directory (from directory.mozilla.org), in the days of the Mozilla browser. At its height, ca. 2006, it had 95,000 curators (editors) and over a million categories. AOL took over Netscape, but work continued on DMOZ.

Then in 2017, for reasons unknown (money, likely), AOL decided to kill funding for the project. However, the open-source community stepped up to the plate and continued the project as ‘Curlie’, losing the lizard icon and selecting a red squirrel.

Be forewarned that it is a rare category that has more than, say, 10 items, but the level of curation compensates for that, and you will find things you can only find this way. The search function is still aughties-tier, probably original, which means it is helpful but so-so by modern standards.

I predict that such human curation will become more important in the future, as AI slop takes over search engines and the web in general.

As a sample of the scope of indexed pages, here is the coverage of languages from a linguistics perspective:

https://curlie.org/en/Science/Social_Sciences/Linguistics/Languages/Natural/
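If you want to reach a category from a script rather than by clicking, the URL above suggests a simple pattern: the site root, a language code, then the category path with underscores in place of spaces. Here is a minimal Python sketch, assuming that pattern holds for other categories as well. The function name `curlie_category_url` is a made-up helper for illustration, not anything Curlie provides.

```python
from urllib.parse import quote

# Site root, taken from the sample URL in this post.
CURLIE_BASE = "https://curlie.org"


def curlie_category_url(segments, lang="en"):
    """Build a category URL from a list of category names.

    Assumption: every category follows the pattern of the sample URL,
    i.e. /<lang>/<Category>/<Subcategory>/.../ with spaces written as
    underscores. This is inferred from one URL, not from any documentation.
    """
    # Replace spaces with underscores, then percent-encode anything unusual.
    path = "/".join(quote(name.replace(" ", "_"), safe="") for name in segments)
    return f"{CURLIE_BASE}/{lang}/{path}/"


if __name__ == "__main__":
    print(curlie_category_url(
        ["Science", "Social Sciences", "Linguistics", "Languages", "Natural"]
    ))
    # prints https://curlie.org/en/Science/Social_Sciences/Linguistics/Languages/Natural/
```

Whether the `lang` segment can simply be swapped for other language codes is untested here, so treat that parameter as a guess.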

The index itself sports 92 language translations, which I list below (scroll past the table of contents to the end for the languages).

Curlie – The Collector of URLs

1 Curlie – Adult – hidden from top level and structure only (no content)

dmoz – 15 subcategories

2 Curlie – Arts 54 Subcategories

dmoz – 55 subcategories [[llms-arts]]

3 Curlie – Business 48 Subcategories

dmoz – [[llms-business]]

4 Curlie – Computers 54 Subcategories

dmoz – 53 subcategories [[llms-computers]]

5 Curlie – Games 32 Subcategories

dmoz – 31 subcategories [[llms-games]]

6 Curlie – Health 47 Subcategories

dmoz – [[llms-health]]

7 Curlie – Home 25 Subcategories

dmoz – no category page [[llms-homes]]

8 Curlie – Kids and Teens 14 Subcategories [[Curlie L2 Kids and Teens]]

dmoz – [[llms-kids-and-teens]]

9 is missing because it is obsolete. It was the category ‘Netscape’.

10 Curlie – News 26 Subcategories

dmoz – 25 subcategories [[llms-news]]

11 Curlie – Recreation 49 Subcategories

dmoz – 48 subcategories [[llms-recreation]]

12 Curlie – Reference 27 Subcategories

dmoz – [[llms-reference]]

13 Curlie – Regional 11 Subcategories

dmoz – [[llms-regional]]

14 Curlie – Science 35 Subcategories

dmoz – [[llms-science]]

15 Curlie – Shopping 45 Subcategories

dmoz – no category page, see [[llms-shopping]]

Curlie – Shopping: Consumer Electronics (complete list)

16 Curlie – Society 35 Subcategories

dmoz – 37 subcategories [[llms-society]]

17 Curlie – Sports

dmoz – [[llms-sports]]

18 Curlie – World is at the language level – 92 languages (2 artificial + Latin)

dmoz – no category pages but many similar languages [[llms-world]]

NOTE: the numbers after the languages below are category ids for each language, and are not useful except for keeping straight which language is which.

Curlie Languages (92)

See [[dmoz languages]]

EN English

DE German 7648

FR French 7735

JA Japanese 7944

IT Italian 7929

ES Spanish 7678

RU Russian 9431

NL Dutch 9257

PL Polish 9360

TR Turkish 9753

Northern Europe

DA Danish 7633

SV Swedish 9566

NO Norwegian 9288

IS Icelandic 7914

FO Faroese ???

FI Finnish 9551

ET Estonian 7665

LT Lithuanian 9189

LV Latvian 9162

Western Europe Regional

CY Welsh 7617

GA Irish 7772

GD Scots Gaelic 7786

BR Breton 7527

FY West Frisian 7751

FRR North Frisian XXXX

GEM Saterland Frisian 9462

LB Luxembourgish XXXX

RM Romansh XXXX

Southern Europe

PT Portuguese 9375

CA Catalan 7557

GL Galego 7801

EU Basque 7709

AST Asturian 7412

AN Aragonese

FUR Furlan / Friulian

SC Sardinian

SCN Sicilian

OC Occitan

Central Europe

BE Belarusian 7498

CS Czech 7572

HU Hungarian 9219

SK Slovak 9522

UK Ukrainian 9784

CSB Kashubian 8874

TT Tatar 9625

BA Bashkir 7482

OS Ossetian 9315

Southeastern Europe

SL Slovene 9509

SR Serbian 9537

HR Croatian 7886

BS Bosnian 7512

BG Bulgarian 7541

SQ Albanian 9476

RO Romanian 9401

MK Macedonian 9235

EL Greek 7814

Middle East and Africa

IW Hebrew 7849

FA Persian 9346

AR Arabic XXXX

KU Kurdish XXXX

AZ Azerbaijani

HY Armenian XXXX

AF Afrikaans XXXX

SW Swahili 8798

Central Asia

UZ Uzbek 9330

KK Kazakh 8783

KY Kyrgyz 9174

TG Tajik XXXX

TK Turkmen XXXX

UG Uyghurche XXX

Southern Asia

HI Hindi 7865

SI Sinhalese 9494

GU Gujarati 7829

UR Urdu 9800

MR Marathi 9249

PA Punjabi Gurmukhi 9390

BN Bengali 7469

TA Tamil 9611

TE Telugu 9640

KN Kannada 8759

Eastern Asia

ZH-CN Chinese_Simplified 7588

ZH-TW Chinese_Traditional 7602

KO Korean 8808

CFR Taiwanese 9596

TH Thai 9654

VI Vietnamese 9758

IN Bahasa Indonesia – Indonesia 7440

MS Bahasa Melayu – Malay XXXX

TL Tagalog 9581

Other

EO Esperanto 7694

IA Interlingua 7900

LA Latin 9204
