> The model is natively multilingual, achieving strong transcription performance in 13 languages, including English, Chinese, Hindi, Spanish, Arabic, French, Portuguese, Russian, German, Japanese, Korean, Italian, and Dutch. With a 4B parameter footprint, it runs efficiently on edge devices, ensuring privacy and security for sensitive deployments.
I wonder how much having languages with the same roots (e.g. the Romance languages in the list above, or multiple Slavic languages) affects the parameter count and the training set. Do you need more training data to differentiate between multiple similar languages? How would swapping, for example, Hindi (fairly distinct from the other 12 supported languages) for Ukrainian and Polish (both of which share some roots with Russian) affect the parameter count?
edit: I stand corrected lol. I'll go with "Gaelic" instead.
39 million people speak Polish, and most of those also speak English or another more common language.
Try sticking to the supported languages
The base model was likely pretrained on data that included Polish and Ukrainian. You shouldn't be surprised to learn it doesn't perform great on languages it wasn't trained on, or that perhaps didn't have the largest share of the training data.
Polish works with the Latin alphabet just fine.
"Do kraju tego, gdzie kruszynę chleba podnoszą z ziemi przez uszanowanie dla darów Nieba.... Tęskno mi, Panie..."
"Mimozami jesień się zaczyna, złotawa, krucha i miła. To ty, to ty jesteś ta dziewczyna, która do mnie na ulicę wychodziła."