With OCR, the risk is that you get another Xerox[1] incident, where all your data looks plausible but is incorrect. Hope you kept the originals!
(This is why, for my personal doc scans, I use OCR only for full-text search but retain the original raw scans forever.)
[1] https://www.dkriesel.com/en/blog/2013/0802_xerox-workcentres...
For example, if the prompt mentions that Caitlin is an accountant and Kaitlyn is an engineer, then when you transcribe "Tell Kaitlyn to review my PR" it will know who you're referring to. That's something WER doesn't really capture.
BTW, I built an open-source Mac tool for using gpt-4o-transcribe with an OpenAI API key and custom prompts: https://github.com/corlinp/voibe
Probably the answer is simply to tweak the metric so it's a bit smarter than WER: allow an "unclear" output that is penalised less than an actually incorrect answer. I'd be surprised if nobody has done that.
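A rough sketch of what that tweak could look like, in Python. This is not an established metric, just one way to do it: a word-level edit distance where a hypothesis token marked as unclear costs less than a wrong word. The function name, the "[unclear]" marker and the 0.5 cost are all made up for illustration.

```python
def abstention_aware_wer(reference, hypothesis,
                         unclear_token="[unclear]", unclear_cost=0.5):
    """WER variant where an explicit 'unclear' token is penalised less
    than a wrong word. The token and its cost are illustrative assumptions."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    marker = unclear_token.lower()

    # prev[j] = minimum cost of aligning the first i-1 reference words
    # with the first j hypothesis words (standard edit-distance DP).
    prev = [float(j) for j in range(len(hyp) + 1)]
    for i in range(1, len(ref) + 1):
        curr = [float(i)] + [0.0] * len(hyp)
        for j in range(1, len(hyp) + 1):
            if hyp[j - 1] == marker:
                sub = unclear_cost   # abstained: partial penalty
            elif hyp[j - 1] == ref[i - 1]:
                sub = 0.0            # correct word
            else:
                sub = 1.0            # confidently wrong word
            curr[j] = min(prev[j - 1] + sub,  # match / substitution
                          prev[j] + 1.0,      # reference word deleted
                          curr[j - 1] + 1.0)  # extra hypothesis word
        prev = curr
    return prev[-1] / len(ref)


reference = "Tell Kaitlyn to review my PR"
wrong_name = abstention_aware_wer(reference, "Tell Caitlin to review my PR")
abstained = abstention_aware_wer(reference, "Tell [unclear] to review my PR")
print(wrong_name, abstained)
```

With these assumed costs, the transcript that admits it could not make out the name scores better than the one that confidently picks the wrong person, which plain WER would not distinguish from any other single-word error.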