I think most folks already can't tell, given modern TTS.
It's like AI photos: they fool you unless you're looking for the tells.
So, I agree. But I believe the problem is pretty solvable with enough tokens.
This is the critical data: how many people hang up on the AI chatbot vs how many people hang up on the voice message prompt.
If it is even close, well, the AI needs to be improved.
If the AI is way ahead but still loses more callers than a live receptionist (outsourced or in-house), the AI either needs improvement or needs to be dumped for a live receptionist. That's kind of a spreadsheet problem: how many jobs you lose in each case vs what each option costs.
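For what it's worth, here is a rough sketch of that spreadsheet in Python. Everything in it is an assumption for illustration: the `Option` class, the `monthly_net` helper, and every number (costs, hang-up rates, call volume, close rate, profit per job) are placeholders I made up, not measurements. Plug in your own call logs before drawing any conclusion.

```python
from dataclasses import dataclass


@dataclass
class Option:
    """One way of answering the phone (hypothetical structure)."""
    name: str
    monthly_cost: float   # what this option costs you per month
    hangup_rate: float    # fraction of callers who hang up on it (0.0 to 1.0)


def monthly_net(option: Option, calls_per_month: float,
                close_rate: float, profit_per_job: float) -> float:
    """Expected monthly profit from phone calls, minus the cost of the option."""
    callers_who_stay = calls_per_month * (1.0 - option.hangup_rate)
    jobs_won = callers_who_stay * close_rate
    return jobs_won * profit_per_job - option.monthly_cost


# Placeholder numbers only; replace with your own data.
options = [
    Option("voice message prompt", monthly_cost=0.0, hangup_rate=0.60),
    Option("AI chatbot", monthly_cost=300.0, hangup_rate=0.35),
    Option("live receptionist", monthly_cost=2500.0, hangup_rate=0.10),
]

for opt in options:
    net = monthly_net(opt, calls_per_month=200, close_rate=0.25,
                      profit_per_job=400.0)
    print(f"{opt.name:22s} net per month: {net:10.2f}")
```

Which option comes out ahead depends entirely on the inputs, so the hang-up rates are the numbers worth measuring first.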
But the real question you should also ask is: what else can that human do for you that the AI can't, because they have eyes and ears and hands?