upvote
Sometimes what is on the page is ambiguous. Imagine a scan where the dot over the i is missing in a word like "this". What's on the page is "thls" but to transcribe it that way would be an error outside of forensic contexts.

I am reminded it's basically impossible to read cursive writing in a language you don't know even if it's the same alphabet.

reply
Interesting. Won't stuff like entity extraction suffer? Especially in multilingual use cases. My worry is that a smaller model might not realize some text is actually a persons name because it is very unusual.
reply
The model does not need to be that smart to understand that a name it does not know that starts with a capital letter is a the name of a place or a person. It does not need to be aware of whom this refers to, it just needs to transcribe it.

Also, there are generalist models that have enough of a grasp of a dozen or so languages that fit comfortably in 7B parameters. Like the older Mistral, which had the best multi-lingual support at the time, but newer models around that size are probably good candidates. I am not surprised that a multilingual specialised model can fit in 8B or so.

reply