Doing it in production also helps: you can run simulations by replaying those production conversations to make sure you are catching regressions.
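A minimal sketch of that replay idea, assuming logged conversations are stored as (user message, logged reply) turns; `agent_respond` is a placeholder for whatever drives the real agent, and exact string comparison stands in for whatever pass/fail check you actually use (semantic similarity, an LLM judge, etc.):

```python
def agent_respond(message: str) -> str:
    # Placeholder agent: a real system would call the model here.
    return {"hi": "Hello! How can I help?"}.get(message.lower(), "Sorry?")

def replay(conversation, respond):
    """Re-run each logged turn through the current agent and collect
    any replies that differ from what was observed in production."""
    regressions = []
    for turn, (user_msg, logged_reply) in enumerate(conversation):
        new_reply = respond(user_msg)
        if new_reply != logged_reply:
            regressions.append((turn, user_msg, logged_reply, new_reply))
    return regressions

# Replaying the logged turns against the unchanged agent finds nothing.
logged = [("hi", "Hello! How can I help?"), ("bye", "Sorry?")]
assert replay(logged, agent_respond) == []
```

Each mismatch carries the turn index plus the old and new replies, so a regression report can point at exactly where behavior drifted.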
I have found, when using agents to verify agents, that the agent may observe something a human would immediately find off-putting and obviously wrong, yet the smart-but-dumb verifier raises no flags.
Broadly speaking, we see people experiment with this architecture a lot, often with a great deal of success. Another approach is an orchestrator architecture with an intent-recognition agent that routes to different sub-agents.
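The orchestrator pattern mentioned above can be sketched like this; the intent labels, keyword matching, and sub-agents are all illustrative placeholders (a real router would use a classifier or an LLM call):

```python
def recognize_intent(message: str) -> str:
    # Toy intent recognition via keywords; a real system would
    # classify with a model instead.
    if "refund" in message.lower():
        return "billing"
    if "password" in message.lower():
        return "account"
    return "general"

# Each sub-agent is just a stub here; in practice these would be
# full agents with their own tools and prompts.
SUB_AGENTS = {
    "billing": lambda m: "Routing to billing agent...",
    "account": lambda m: "Routing to account agent...",
    "general": lambda m: "Routing to general agent...",
}

def orchestrate(message: str) -> str:
    """Route a user message to the sub-agent for its recognized intent."""
    return SUB_AGENTS[recognize_intent(message)](message)
```

The nice property for testing is that intent recognition and each sub-agent can then be evaluated in isolation.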
Obviously there are endless cases possible in production, and the best approach is to build your evals from that data.
Architecturally, we focus on episodic memory with a feedback system. That stored experience is retrieved the next time something similar happens:
https://github.com/rush86999/atom/blob/main/docs/EPISODIC_ME...
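A minimal sketch of that episodic-memory-plus-feedback loop. Everything here is illustrative, not the linked repo's actual design: episodes are (situation, action, feedback) records, and retrieval uses crude word overlap where a real system would use embeddings.

```python
from dataclasses import dataclass

@dataclass
class Episode:
    situation: str
    action: str
    feedback: float  # e.g. +1 for good outcome, -1 for bad

memory: list[Episode] = []

def record(situation: str, action: str, feedback: float) -> None:
    """Store an episode along with the feedback it received."""
    memory.append(Episode(situation, action, feedback))

def recall(situation: str, top_k: int = 1) -> list[Episode]:
    """Return the most similar past episodes by word overlap,
    breaking ties in favor of positively rated ones."""
    words = set(situation.lower().split())
    def score(ep: Episode):
        overlap = len(words & set(ep.situation.lower().split()))
        return (overlap, ep.feedback)
    return sorted(memory, key=score, reverse=True)[:top_k]

record("user asked to cancel subscription", "offered pause option", +1)
record("user asked about pricing", "sent pricing page", +1)
best = recall("customer wants to cancel their subscription")[0]
assert best.action == "offered pause option"
```

When a similar situation recurs, the agent retrieves what worked (or failed) before instead of starting cold.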
If we miss some cases, there's always a feedback loop to help improve the test suite.
Moreover, we even generate test scenarios from the knowledge base.
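One simple way to picture scenario generation from a knowledge base; this template-based version is purely illustrative (the sample KB entries and question template are made up, and a real pipeline would prompt an LLM to write varied questions):

```python
# Hypothetical knowledge base of doc-id -> content.
KNOWLEDGE_BASE = {
    "refund-policy": "Refunds are available within 30 days.",
    "shipping": "Orders ship within 2 business days.",
}

def generate_scenarios(kb: dict[str, str]) -> list[dict]:
    """Turn each KB article into a test scenario: a question the
    agent should answer plus the passage it should be grounded in."""
    scenarios = []
    for doc_id, text in kb.items():
        scenarios.append({
            "question": f"What does the policy on '{doc_id}' say?",
            "expected_grounding": text,
        })
    return scenarios

scenarios = generate_scenarios(KNOWLEDGE_BASE)
assert len(scenarios) == 2
```

Pairing each generated question with its source passage gives the eval a ground truth to score answers against.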
Let us know what your agent can connect to, and we can advise on the best way to test it.
One of our learnings has been to make it easy to plug into existing frameworks, for example livekit, pipecat, etc.
Happy to talk if you can reach out to me on linkedin - https://www.linkedin.com/in/tarush-agarwal/