I guess I am late to the party here, but this thread was brought to my attention, so I will try to answer based on what I know about the topic.

You can get good results using the built-in Agent Evals workflow when the agent is built on LiveKit Agents. Instead of trying to manually "play calls" over and over, write small behavioral tests (pytest + pytest-asyncio) that exercise a turn or two of conversation and assert what the agent should do: tone, tool usage, error handling, grounding, etc.

A few things that have worked well:
- Text-only evaluation first (fast + cheap), with audio tests only when needed
- judge() for qualitative checks, e.g., "did it politely greet and offer help?"
- Tool-call assertions to confirm the right tool was called with the right args
- Mocked tools to force errors and edge cases without touching real systems
- Multi-turn tests so we can catch regressions in memory and workflows
- Running the tests in CI with LLM keys as secrets, so every PR gets evaluated

It's been great for catching subtle regressions, especially when we tweak prompts or add new capabilities, so that older flows keep working.

I know many folks I have spoken to also find value in 3rd party tools like:
https://getbluejay.ai/
https://hamming.ai/

If you are in the San Francisco area on January 27th and your focus for 2026 is agent reliability, we are having a meetup with industry experts on voice agent reliability techniques and testing. I will share the Luma here in a few days, in case you can join.

Another great option is to use Agent tasks (I've seen some people call these templates) and Task groups, particularly if you need to use a specific flow across many customers. You can refine the task and reuse it. You can learn more here: https://docs.livekit.io/agents/logic/tasks/

When diagnosing issues, it can be super helpful to use Langfuse or a feature like LiveKit Agents' observability.
Here is a nice overview of how and why you would use something like this: https://docs.livekit.io/deploy/observability/insights/ (YouTube demo)
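To make the testing ideas above a bit more concrete, here is a rough, library-agnostic sketch of the pattern: a text-only turn, a tool-call assertion, and a mocked tool that forces an error. Please note that this is not the LiveKit Agents testing API. `ToyAgent`, `ToolCall`, `TurnResult`, `lookup_weather`, and the check functions are all hypothetical names I made up so the example runs with only the Python standard library. With a real agent you would swap `ToyAgent` for your own agent and session, and each check would become a pytest + pytest-asyncio test function.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Awaitable, Callable

# Hypothetical types for illustration only; not part of any real library.
ToolFn = Callable[..., Awaitable[str]]


@dataclass
class ToolCall:
    name: str
    arguments: dict


@dataclass
class TurnResult:
    reply: str
    tool_calls: list[ToolCall] = field(default_factory=list)


class ToyAgent:
    """Stand-in for a real agent: routes weather questions to a tool."""

    def __init__(self, tools: dict[str, ToolFn]) -> None:
        self.tools = tools

    async def run(self, user_input: str) -> TurnResult:
        text = user_input.lower()
        if "weather in" not in text:
            return TurnResult(reply="Hello! How can I help you today?")
        city = text.split("weather in", 1)[1].strip(" ?.!").title()
        call = ToolCall("lookup_weather", {"location": city})
        try:
            weather = await self.tools["lookup_weather"](**call.arguments)
            reply = f"The weather in {city} is {weather}."
        except Exception:
            # Error handling under test: apologize, do not leak the exception.
            reply = f"Sorry, I could not look up the weather for {city} right now."
        return TurnResult(reply=reply, tool_calls=[call])


# Mocked tools: no real weather service is touched.
async def fake_weather(location: str) -> str:
    return "sunny"


async def failing_weather(location: str) -> str:
    raise RuntimeError("weather service unavailable")


async def check_tool_call() -> TurnResult:
    agent = ToyAgent({"lookup_weather": fake_weather})
    result = await agent.run("What is the weather in Tokyo?")
    # Tool-call assertion: right tool, right arguments.
    assert result.tool_calls == [ToolCall("lookup_weather", {"location": "Tokyo"})]
    assert "sunny" in result.reply
    return result


async def check_tool_failure() -> TurnResult:
    agent = ToyAgent({"lookup_weather": failing_weather})
    result = await agent.run("What is the weather in Tokyo?")
    # The forced error should produce a graceful reply.
    assert result.reply.startswith("Sorry")
    assert "RuntimeError" not in result.reply
    return result


async def check_greeting() -> TurnResult:
    agent = ToyAgent({"lookup_weather": fake_weather})
    result = await agent.run("Hi there")
    # A plain greeting should not trigger any tool.
    assert result.tool_calls == []
    return result


if __name__ == "__main__":
    for check in (check_tool_call, check_tool_failure, check_greeting):
        asyncio.run(check())
    print("all checks passed")
```

The toy agent is deterministic, so plain string assertions are enough here. With a real LLM behind the agent, replies vary from run to run, which is where a qualitative check like judge() comes in instead of exact string matching.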