So far, with Claude and Gemini (the models we've been testing most), I've observed that the language model has been pretty good at recognizing that its initial interpretation was faulty once it queries the system for more information.
Running out of tokens is a more significant issue. We saw it a lot when queries involved images, which led us to try writing better image interpreters within the MCP server itself (credit to my collaborators at Hanyang University in Korea) to defend the context window. The free tiers of the language models also run out of tokens quite quickly.
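To give a rough idea of what I mean by "defending the context window": instead of handing the model a raw base64-encoded image (which burns an enormous number of tokens), the server interprets the image first and returns a compact text summary. This is just an illustrative sketch, not our actual server code — the function names and the caption are made up:

```python
import base64


def raw_image_payload(data: bytes) -> str:
    # Naive approach: base64-encode the entire image into the context window.
    return base64.b64encode(data).decode()


def interpreted_payload(data: bytes, caption: str) -> str:
    # "Image interpreter" approach (hypothetical): return a short text
    # description instead of the raw pixels, saving the model's tokens.
    return f"[image: {len(data)} bytes; {caption}]"


img = bytes(100_000)  # stand-in for a real image file
raw = raw_image_payload(img)
summary = interpreted_payload(img, "circuit diagram with two op-amps")
print(len(raw), len(summary))  # the summary is orders of magnitude smaller
```

The real interpreters obviously do more work (actually describing what's in the image), but the payoff is the same: the model sees a few dozen tokens per image instead of tens of thousands.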
PS - Thank you for the questions, I'm enjoying talking about this here on HN with people who look at it critically and challenge us!