In general there aren't really models that can understand the nuances of your speech yet. Gemini 2.5's voice mode changed that only recently, and I think it can understand emotions, but I'm not sure it can detect things like accent and mispronunciation. The problem is data: we'd need a large corpus of audio labeled with exactly how each sample mispronounces a word, so the model can cluster those errors. Maybe self-supervised techniques without human feedback can get there somehow. Other than that, I don't see how it's even possible to train such a model with what's currently available.
Yes, we do have this issue, but it's improved a bit over ChatGPT because we use multiple transcribers (rough sketch of the idea below).
The models are improving, though, and they're in a very good place for English at the moment. I expect that by next year we'll switch over to full voice-to-voice models.
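To give a flavor of the multiple-transcriber idea, here's a heavily simplified sketch, not the real pipeline: if one backend "helpfully" fixes a slip while another keeps it as heard, the disagreement itself flags the word for pronunciation feedback.

```python
# Simplified illustration only. The transcripts would come from separate
# speech-to-text backends run over the same audio clip.

def flag_suspect_words(expected_text: str, transcripts: list[str]) -> list[str]:
    """Return words in the expected text that at least one backend heard differently."""
    expected = expected_text.lower().split()
    suspects = []
    for i, word in enumerate(expected):
        heard = {t.lower().split()[i] for t in transcripts if i < len(t.split())}
        if heard and heard != {word}:  # some backend heard something other than the expected word
            suspects.append(word)
    return suspects

expected = "the weather is comfortable today"
transcripts = [
    "the weather is comfortable today",  # backend that auto-corrected the slip
    "the weather is comftable today",    # backend that kept the slip as heard
]
print(flag_suspect_words(expected, transcripts))  # -> ['comfortable']
```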
This reply seems to miss the question, or at least doesn't answer it clearly: is this service overly tolerant of mispronunciations? Foundation models are becoming more tolerant over time, not less, which is the opposite of what I'd want in this case.
It's less tolerant of mispronunciations. There is custom prompting that explicitly tells the model to leave mistakes in rather than fix them. It's still not perfect, and the speech-to-text module sometimes corrects the user's pronunciation mistakes anyway.
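The prompting is along these lines (illustrative only, not the exact production prompt; the model name and the OpenAI endpoint are just stand-ins for any speech-to-text backend that accepts a text prompt, and how well a given model follows such an instruction varies):

```python
from openai import OpenAI

client = OpenAI()

# Instruction asking the transcriber to preserve errors instead of cleaning them up.
LEAVE_MISTAKES_PROMPT = (
    "Transcribe the speech verbatim. Keep mispronunciations, grammatical errors, "
    "hesitations, and repeated words exactly as spoken. Do not correct or normalize anything."
)

with open("learner_sample.wav", "rb") as audio:
    result = client.audio.transcriptions.create(
        model="gpt-4o-transcribe",  # assumption: any STT model that takes a text prompt
        file=audio,
        prompt=LEAVE_MISTAKES_PROMPT,
    )

print(result.text)
```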