The hardest one I did was for a sports network: a motocross event where most of what you could hear was the roar of the bikes. I had to transcribe two commentators over the top of that mess, and they were using insider slang nicknames for all the riders, not their published names, so I had to sit and Google forums to find the riders' names while I was listening. I'm not sure how these local models could handle that insanity at all, because they almost certainly lack the domain knowledge.
[0] https://www.microsoft.com/en-us/research/wp-content/uploads/...
- familiarity with the accent and/or speaker;
- speed and style/cadence of the speech;
- any background audio that can muffle or distort the speech;
- etc.
It can also take multiple passes to get a decent transcription.