I did transcription for a while in 2021. It is absurdly hard, especially as these days humans only get the difficult jobs that AI has already taken a stab at.

The hardest one I did was for a sports network: a motocross event where most of what you could hear was the roar of the bikes. There were two commentators I had to transcribe over the top of that mess, and they were using insider slang nicknames for all the riders, not their published names, so I had to sit and Google forums to find the riders' names while I was listening. I'm not sure how these local models would be able to handle that insanity at all, because they almost certainly lack enough domain knowledge.

reply
Oh wow, I thought humans had something like a 0.1% error rate, if they are native speakers and familiar with the subject being discussed.
reply
I was skeptical upon hearing the figure, but various sources do indeed back it up, and [0] is a pretty interesting paper (old but still relevant, since human transcribers haven't changed in accuracy).

[0] https://www.microsoft.com/en-us/research/wp-content/uploads/...

reply
I think it's actually hard to verify how correct a transcription is, at scale. Curious where those error rate numbers come from, because they should be measured on people actually doing the job.
reply
It can depend a lot on different factors like:

- familiarity with the accent and/or speaker;

- speed and style/cadence of the speech;

- any background audio that can muffle or distort the speech;

- etc.

It can also take multiple passes to get a decent transcription.

reply
You missed a giant factor: domain knowledge. Transcribing something outside your area of knowledge is very hard. I posted above about transcribing the commentary of a motorbike race where the commentators only used the slang names of the riders.
reply
Most of these errors will not be meaningful. Real speech is full of ambiguities. 3% is low.
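
For anyone wondering how percentages like these are usually computed: the thread doesn't say which metric the figures refer to, but the common one is word error rate (WER), the word-level edit distance between a reference transcript and the transcriber's output, divided by the number of reference words. Below is a minimal sketch under that assumption; the function name and the example sentences are made up for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words to match an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one misheard word out of eight.
reference = "the rider took the lead on lap two"
hypothesis = "the writer took the lead on lap two"
print(word_error_rate(reference, hypothesis))  # 0.125
```

Note that this kind of score counts every substitution, insertion, and deletion equally, so a misheard rider nickname costs exactly as much as a dropped filler word. That is part of why a raw percentage says little about how meaningful the errors are.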