LLM's as judges and adversarial testing

A lot of gurus already teach the bits on how everyone can build agents with tonnes of skills to build their $30M ARR micro SaaS, or replace their $500K a year small business. So I'm going to go off the beaten track and cover a topic no one likes hearing: your agents can make mistakes, LLMs can hallucinate, and somehow you need to figure out when it happens and fix it.

I have this process where as I am building, I will have my co-pilot or LLM work out tests. It will execute the agent, run a couple of scenarios and prompts, and determine if the agent's responses measure up to pre-determined pass or fail conditions.

It keeps a record of all the tests we've done throughout the build, and at the end, I get co-pilot or coding agent to make me a scripted standardised test suite that we can run to score the agent's performance.

The first part of the tests involve my coding agent acting as a judge = how good the responses are and how well they stick to what we know to be reasonably good responses. Sometimes an LLM as a judge is not needed because some builds don't return subjective responses. They are needed when human-like reasoning is required to interpret responses.

The second part is adversarial testing = I get the coding agent to help me design scenarios intended to trip up or trick the agent's into giving the wrong answers.

I usually run these tests at every major milestone in the build and periodically when running the agents (even in production environments).
I walk through the scores with the coding agent to root cause and perform interim fixes
Then we monitor and run the scores again at a later time to see if the fixes held.
Tests and results are always recorded.
When we've run enough of these tests, they get turned into some automated gate for determining whether agents should be monitored closely, triaged, or discarded.

The screenshots are from my most recent build. I was designing a different, more compact, memory system, and needed to know if I could objectively trust (1) the responses coming from both the agent running the system and (2) the coding agent that's building it.

I use the transcripts and actual memory files to audit both.

6 comments