Common Crawl has been running a low-resource language project for 1.5 years now -- it's a hard problem.
reply
There are many nation states working on this; have you looked into the availability of those data sets?

What languages are you prioritizing?

reply
Yes, there are government datasets, language "academies" (or "regulators"), which are organizations focused on preserving and teaching the language, and often smaller, local publishers that publish material in their local language.

I'm living in Guatemala, so I have been focusing on the Mayan languages here (22 languages, millions of speakers).

reply
As an aside, I remember visiting Guatemala (in the border area near Chiapas) in the early 90s and discovering that “Mayan” was not the monolith that I had been led to believe by my culturally narrow American education, but was a diverse collection of related cultures with multiple languages.

In one of the villages we visited, there was a language school where foreigners were learning Jacalteco. One student was from Israel, and where most of the students had vocabulary lists in three columns (Jacalteco - Spanish - English), his had four, with one more step of translation into Hebrew.
