Doing it in production also helps: you can run simulations by replaying those production conversations to make sure you are catching regressions.
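A minimal sketch of that replay idea, assuming logged conversations are stored as (user message, logged reply) turns; `agent_respond` is a placeholder for whatever drives the real agent, and exact string comparison stands in for whatever pass/fail check you actually use (semantic similarity, an LLM judge, etc.):

```python
def agent_respond(message: str) -> str:
    # Placeholder agent: a real system would call the model here.
    return {"hi": "Hello! How can I help?"}.get(message.lower(), "Sorry?")

def replay(conversation, respond):
    """Re-run each logged turn through the current agent and collect
    any replies that differ from what was observed in production."""
    regressions = []
    for turn, (user_msg, logged_reply) in enumerate(conversation):
        new_reply = respond(user_msg)
        if new_reply != logged_reply:
            regressions.append((turn, user_msg, logged_reply, new_reply))
    return regressions

# Replaying the logged turns against the unchanged agent finds nothing.
logged = [("hi", "Hello! How can I help?"), ("bye", "Sorry?")]
assert replay(logged, agent_respond) == []
```

Each mismatch carries the turn index plus the old and new replies, so a regression report can point at exactly where behavior drifted.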
I have found, when using agents to verify agents, that the agent may observe something a human would immediately find off-putting and obviously wrong, yet the smart-but-dumb verifier raises no flags.
Broadly speaking, we see people experiment with this architecture a lot, often with a great deal of success. Another approach is an orchestrator architecture with an intent-recognition agent that routes to different sub-agents.
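The orchestrator pattern mentioned above can be sketched like this; the intent labels, keyword matching, and sub-agents are all illustrative placeholders (a real router would use a classifier or an LLM call):

```python
def recognize_intent(message: str) -> str:
    # Toy intent recognition via keywords; a real system would
    # classify with a model instead.
    if "refund" in message.lower():
        return "billing"
    if "password" in message.lower():
        return "account"
    return "general"

# Each sub-agent is just a stub here; in practice these would be
# full agents with their own tools and prompts.
SUB_AGENTS = {
    "billing": lambda m: "Routing to billing agent...",
    "account": lambda m: "Routing to account agent...",
    "general": lambda m: "Routing to general agent...",
}

def orchestrate(message: str) -> str:
    """Route a user message to the sub-agent for its recognized intent."""
    return SUB_AGENTS[recognize_intent(message)](message)
```

The nice property for testing is that intent recognition and each sub-agent can then be evaluated in isolation.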
Obviously there are endless cases possible in production, and the best approach is to build your evals from that data.
Architecturally, we focus on episodic memory with a feedback system. That stored experience is retrieved the next time something similar happens:
https://github.com/rush86999/atom/blob/main/docs/EPISODIC_ME...
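A minimal sketch of that episodic-memory-plus-feedback loop. Everything here is illustrative, not the linked repo's actual design: episodes are (situation, action, feedback) records, and retrieval uses crude word overlap where a real system would use embeddings.

```python
from dataclasses import dataclass

@dataclass
class Episode:
    situation: str
    action: str
    feedback: float  # e.g. +1 for good outcome, -1 for bad

memory: list[Episode] = []

def record(situation: str, action: str, feedback: float) -> None:
    """Store an episode along with the feedback it received."""
    memory.append(Episode(situation, action, feedback))

def recall(situation: str, top_k: int = 1) -> list[Episode]:
    """Return the most similar past episodes by word overlap,
    breaking ties in favor of positively rated ones."""
    words = set(situation.lower().split())
    def score(ep: Episode):
        overlap = len(words & set(ep.situation.lower().split()))
        return (overlap, ep.feedback)
    return sorted(memory, key=score, reverse=True)[:top_k]

record("user asked to cancel subscription", "offered pause option", +1)
record("user asked about pricing", "sent pricing page", +1)
best = recall("customer wants to cancel their subscription")[0]
assert best.action == "offered pause option"
```

When a similar situation recurs, the agent retrieves what worked (or failed) before instead of starting cold.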
If we miss some cases, there's always a feedback loop to help improve the test suite.
Moreover, we even generate test scenarios from the knowledge base.
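One simple way to picture scenario generation from a knowledge base; this template-based version is purely illustrative (the sample KB entries and question template are made up, and a real pipeline would prompt an LLM to write varied questions):

```python
# Hypothetical knowledge base of doc-id -> content.
KNOWLEDGE_BASE = {
    "refund-policy": "Refunds are available within 30 days.",
    "shipping": "Orders ship within 2 business days.",
}

def generate_scenarios(kb: dict[str, str]) -> list[dict]:
    """Turn each KB article into a test scenario: a question the
    agent should answer plus the passage it should be grounded in."""
    scenarios = []
    for doc_id, text in kb.items():
        scenarios.append({
            "question": f"What does the policy on '{doc_id}' say?",
            "expected_grounding": text,
        })
    return scenarios

scenarios = generate_scenarios(KNOWLEDGE_BASE)
assert len(scenarios) == 2
```

Pairing each generated question with its source passage gives the eval a ground truth to score answers against.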
Let us know what your agent can connect to, and we can advise on the best way to test it.
One of our learnings has been to make it easy to plug into existing frameworks, for example livekit, pipecat, etc.
Happy to talk if you can reach out to me on linkedin - https://www.linkedin.com/in/tarush-agarwal/