Trust, But Verify: Testing AI Agents

When Ronald Reagan said “trust, but verify,” he was quoting the Russian proverb “doveryai, no proveryai,” which he learned from his adviser Suzanne Massie and used during nuclear arms control negotiations with the Soviet Union. The same attitude applies to AI agents: they are powerful and useful, but however much we trust them, we need to verify their behavior, just as we verify our code with unit and end-to-end tests.
AI agents are non-deterministic, so you can’t write a standard assertion against a single expected response. Instead, you evaluate the agent.
Testing all your code with automated tests is a core DevOps principle in CI/CD, as well as in other Agile frameworks. That discipline is what gives you confidence in the behavior of your AI agent application.
To illustrate how this process would look, I’ve created a simple REPL chatbot which uses an LLM and acts as an expert on gardening and raising tomatoes.
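Here is a minimal sketch of what such a REPL chatbot can look like, assuming the OpenAI Python SDK; the model name and system prompt are illustrative rather than the actual app’s:

```python
# Minimal sketch of a tomato-expert REPL chatbot.
# Assumes the OpenAI Python SDK; model name and system prompt are illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = (
    "You are an expert on gardening and raising tomatoes. "
    "Answer only tomato-related questions; politely decline anything off-topic."
)

def ask(question: str, history: list[dict]) -> str:
    """Send the conversation so far plus the new question to the LLM."""
    messages = [{"role": "system", "content": SYSTEM_PROMPT}, *history,
                {"role": "user", "content": question}]
    response = client.chat.completions.create(
        model="gpt-4o-mini",   # illustrative model choice
        messages=messages,
        temperature=0,         # reduce variation to make testing easier
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    history: list[dict] = []
    while True:
        question = input("You: ")
        if question.lower() in {"quit", "exit"}:
            break
        answer = ask(question, history)
        history += [{"role": "user", "content": question},
                    {"role": "assistant", "content": answer}]
        print(f"Bot: {answer}")
```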
Breaking Down the Agent for Testing

When you start developing tests for LLMs, keep in mind that the outputs are non-deterministic. In standard unit testing you break the program down into smaller components; in the same way, you break your AI agent down into smaller components that are testable.
But what do you actually test, given that the output is non-deterministic? You can’t judge the agent by the exact answer so much as by whether it follows certain actions. For example, when it is told to use a tool such as a search-engine MCP server, does it actually make the tool call, query the database, and so on?
In the case of the AI Tomato Chat App, I have it evaluate the following:
- TestTomatoExpertiseQuality: Core quality metrics
  - test_answer_relevancy – Parametrized test with 3 tomato Q&A pairs
  - test_zero_toxicity_friendly_response – Validates encouragement responses
  - test_zero_toxicity_pest_response – Validates pest control responses
  - test_off_topic_rejection_quality – Ensures polite refusals
  - test_topic_adherence_planting – Custom metric for planting advice
  - test_topic_adherence_disease – Custom metric for disease advice
- TestFaithfulness: Factual accuracy verification
  - test_ph_level_accuracy – Validates pH range 6.0-6.8
  - test_spacing_accuracy – Validates plant spacing guidelines
  - test_watering_advice_accuracy – Validates watering recommendations
- TestChatbotIntegration: Integration tests with mock LLM and DeepEval
  - test_chatbot_produces_relevant_response – End-to-end container growing test
  - test_chatbot_refusal_is_polite – Off-topic refusal toxicity check
- TestExpertiseScenarios: Domain-specific scenarios
  - test_expertise_coverage – Parametrized test validating 4 scenarios (disease, pruning, climate, blossom end rot)
  - test_variety_knowledge – Validates knowledge of 9 common tomato varieties
- TestOffTopicResponse: Off-topic handling quality
  - test_off_topic_response_structure – Validates polite, helpful structure
  - test_off_topic_response_not_dismissive – Ensures no dismissive language
If you can consistently get the AI agent, via the prompt, to execute the intended tasks or commands, and you can test for that consistency, then the agent and its prompt effectively become reliable code: an actual piece of software that is testable, rather than something that is not.
So you test for a number of things: whether it uses its tools, whether its answers are consistent, whether it follows through on instructions, and whether its answers are hallucinated.
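Because tool calls are structured data, the first two of those checks look much like ordinary unit tests. A rough sketch, assuming a hypothetical agent object that records the tool calls it makes (the `agent.run()` interface and attribute names are assumptions):

```python
# Sketch of asserting on agent behavior rather than exact wording.
# `agent` is a hypothetical fixture whose run() returns the final answer
# plus a record of the tool calls made along the way.
def test_uses_search_tool_for_local_frost_dates(agent):
    result = agent.run("What is the last frost date in Minneapolis?")
    tools_used = [call.name for call in result.tool_calls]
    # The answer text will vary; the behavior we require is that the agent
    # actually consulted the search tool instead of guessing.
    assert "web_search" in tools_used

def test_answers_are_consistent(agent):
    answers = {agent.run("What soil pH do tomatoes prefer?").text for _ in range(3)}
    # Wording may differ run to run, but every answer should mention the
    # accepted 6.0-6.8 range.
    assert all("6.0" in a and "6.8" in a for a in answers)
```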
AI as a Judge

Besides breaking the agent down into smaller parts that can be tested, the second component you need is an AI acting as a judge. Chip Huyen describes this process in her book AI Engineering: use your second most powerful model as the judge. For example, if you’re running on GPT-5, use GPT-4 as the evaluator judge.
The AI judge generates various inputs, establishes what the criteria for the output should be, and grades how the agent performed against them. If you find something inconsistent, that is where you update or change the prompts and make adjustments. As I’ve said in earlier articles, use an AI to help you write the prompts, and factor what you learn from these evaluations back into them.
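One way to wire this up is DeepEval’s GEval metric, which hands your grading criteria to a judge model and returns a score with a rationale. A minimal sketch, with illustrative criteria and content:

```python
# Sketch of an LLM-as-judge check using DeepEval's GEval metric.
# The criteria text, threshold, and example content are illustrative.
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

topic_adherence = GEval(
    name="Topic adherence",
    criteria=(
        "The response should give practical tomato-growing advice that "
        "directly addresses the question and should not drift into "
        "unrelated gardening topics."
    ),
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
    # GEval also accepts a `model` argument if you want to pin the judge
    # to a second-tier model, per Huyen's advice.
)

test_case = LLMTestCase(
    input="How far apart should I plant determinate tomatoes?",
    actual_output="Space determinate varieties about 18-24 inches apart in rows 3-4 feet apart.",
)

topic_adherence.measure(test_case)
print(topic_adherence.score, topic_adherence.reason)  # judge's score and rationale
```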
Testing Tools and Frameworks
What type of tools do I use for this? In this example I used DeepEval for unit testing the application. DeepEval is a Python framework that works like pytest, except it evaluates how the AI responds and behaves. The AI_TEST.md file covers the evaluation of the AI using DeepEval.
There are other frameworks for testing and here are some alternatives worth exploring:
- LangSmith – Observability and evaluation platform by the LangChain team
- Ragas – Framework specifically built for RAG pipeline evaluation
- MLflow – Modular package for running evaluations in your own pipelines
- TruLens – Open-source library focused on qualitative analysis of LLM responses
- Opik – Open-source LLM evaluation platform by Comet
- Langfuse – Open-source LLM engineering platform for observability and evaluation
Logging and User Feedback
The other thing to consider implementing in your application is a place for user feedback. User feedback is important because it tells you which direction things are going. You also need traceability so you can test your application: add a place to log how API calls are made and how the interactions occur, including the reasoning or thought process involved.
I would incorporate logging into the AI application wherever possible. Since the output is non-deterministic and the answers will differ every run, what I would log in particular is behavior: whether the agent uses the proper tools, makes the proper calls, accesses the right resources, and follows the rules and guidelines in the prompts.
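As a rough sketch of what that structured logging could look like, using Python’s standard logging module (the field names and the agent interface are assumptions, not the actual app’s):

```python
# Sketch of structured logging around an agent call.
# Field names are illustrative; the point is to capture behavior
# (tools used, prompt version, latency), not just the answer text.
import json
import logging
import time

logger = logging.getLogger("tomato_agent")
logging.basicConfig(level=logging.INFO)

def logged_run(agent, question: str, prompt_version: str):
    start = time.monotonic()
    result = agent.run(question)  # hypothetical agent interface
    logger.info(json.dumps({
        "prompt_version": prompt_version,
        "question": question,
        "tools_used": [call.name for call in result.tool_calls],
        "latency_s": round(time.monotonic() - start, 2),
        "refused": result.text.lower().startswith("i can only help with tomato"),
    }))
    return result
```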
Prompts as Code
This also brings up another point: treat the prompts you develop for your AI agent as code. In this sense, you are unit testing the prompts as well as the overall AI application, with end-to-end tests plus unit tests of each individual component. This aligns with Test-Driven Development (TDD), where you write tests before writing code, ensuring your prompts meet defined criteria before deployment.
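A lightweight way to put this into practice is to keep the prompt under version control as its own file and write ordinary unit tests against it, TDD-style, before wiring it into the app. The file path and required phrases below are assumptions for illustration:

```python
# Sketch of treating the system prompt as versioned, testable code.
# The file path and the required guardrail phrases are illustrative.
from pathlib import Path

PROMPT_PATH = Path("prompts/tomato_expert_v3.txt")

def load_system_prompt() -> str:
    return PROMPT_PATH.read_text(encoding="utf-8")

def test_prompt_contains_required_guardrails():
    prompt = load_system_prompt()
    # TDD-style: these assertions are written first, then the prompt is
    # edited until they pass.
    assert "tomato" in prompt.lower()
    assert "politely decline" in prompt.lower()  # off-topic handling rule
    assert len(prompt) < 4000                    # keep the prompt within budget
```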
While you cannot expect determinism from an AI application, you can expect certain consistencies. One thing you do when developing an agentic application is set the temperature to zero rather than one, so the model behaves with far less variation (though, as the references below note, temperature zero still does not guarantee identical outputs).
There are other factors to consider, but these are some of the basics. They will change over time as LLMs and AI technology become more advanced and as we make new discoveries.
I am open to feedback and welcome what you have to say. Otherwise, have a nice day.
References
Historical & Conceptual
AI Engineering & LLM Evaluation
- Chip Huyen – AI Engineering: Building Applications with Foundation Models
- Chip Huyen’s Website
- DeepEval – GitHub
- DeepEval Documentation
- Does Temperature 0 Guarantee Deterministic LLM Outputs?
- LLM Temperature: How It Works and When You Should Use It
- LLM Prompts as Code
- Prompt Version Control – Langfuse
DevOps & CI/CD
- What is CI/CD? – Red Hat
- What Are CI/CD And The CI/CD Pipeline? – IBM
- What is Automated Testing in Continuous Delivery? – JetBrains
- CI/CD Best Practices to Streamline DevOps – LaunchDarkly
Agile & Test-Driven Development
- What is Test Driven Development (TDD)? – Agile Alliance
- Test Driven Development – Martin Fowler
- Test-Driven Development – Wikipedia
- Test-Driven Development – Scaled Agile Framework