Trust But Verify: Testing AI Agents

Trust, But Verify: Testing AI Agents

When Ronald Reagan said “trust, but verify,” he was referring to a Russian proverb “doveryai, no proveryai,” which he learned from his adviser Suzanne Massie and used during nuclear arms control negotiations with the former Soviet Union. In the same way, these powerful AI agents are useful but we need to test them. As much as we trust them, we need to verify them as much as we test and verify our code with unit and end to end testing.

AI agents are non-deterministic so you can’t do the standard ASSERT and get an expected response. Instead, we do an evaluation of the agent.

This is a core DevOps principle in CI/CD as well as other Agile coding frameworks, to test all your code with automated tests. That way you can have confidence in the behavior of your AI agent application.

To illustrate how this process would look, I’ve created a simple REPL chatbot which uses an LLM and acts as an expert on gardening and raising tomatoes.

Breaking Down the Agent for Testing

When you start developing tests for LLMs, you need to consider that as these tests are non-deterministic, the standard process for unit testing is to break down the program into smaller components. In the same way, you take your AI agent and break it down into smaller components that are testable.

However, what do you really test since it is non-deterministic? You cannot judge it by the answer as much as you have to judge it by whether it follows certain actions. For example, when it is told to use a tool such as a search engine MCP Server, does it actually execute the command for acquiring a tool or accessing a database and so on and so forth?
In the case of the AI Tomato Chat App, I have it evaluate the following:

  1. TestTomatoExpertiseQuality: Core quality metrics
    • test_answer_relevancy – Parametrized test with 3 tomato Q&A pairs
    • test_zero_toxicity_friendly_response – Validates encouragement responses
    • test_zero_toxicity_pest_response – Validates pest control responses
    • test_off_topic_rejection_quality – Ensures polite refusals
    • test_topic_adherence_planting – Custom metric for planting advice
    • test_topic_adherence_disease – Custom metric for disease advice
  2. TestFaithfulness: Factual accuracy verification
    • test_ph_level_accuracy – Validates pH range 6.0-6.8
    • test_spacing_accuracy – Validates plant spacing guidelines
    • test_watering_advice_accuracy – Validates watering recommendations
  3. TestChatbotIntegration: Integration tests with mock LLM and DeepEval
    • test_chatbot_produces_relevant_response – End-to-end container growing test
    • test_chatbot_refusal_is_polite – Off-topic refusal toxicity check
  4. TestExpertiseScenarios: Domain-specific scenarios
    • test_expertise_coverage – Parametrized test validating 4 scenarios (disease, pruning, climate, blossom end rot)
    • test_variety_knowledge – Validates knowledge of 9 common tomato varieties
  5. TestOffTopicResponse: Off-topic handling quality
    • test_off_topic_response_structure – Validates polite, helpful structure
    • test_off_topic_response_not_dismissive – Ensures no dismissive language

If you are able to consistently have the AI agent via the prompt execute tasks or commands, and you’re able to test this and test for consistency, then the AI agent and the prompt actually becomes code that’s reliable. It’s an actual piece of software that is testable versus something that is not.

So you test for a number of things: you test whether it uses tools, you test for whether there’s consistency in its answers, whether it is able to follow through, and you test for whether the answers are hallucinated.

AI as a Judge

The second component that you need besides breaking down the tasks and breaking down the AI into smaller parts that can be tested is to use an AI as a judge. Chip Huyen in her book AI Engineering describes this process of using an AI as a judge. She says to use the second most powerful AI to act as a judge. For example, if you’re using GPT-4 or GPT-5, use GPT-4 as an evaluator judge.

The AI in this case would generate various inputs and then establish what the criteria for the output should be. Based on that, you can grade how the AI agent performed. If you find that there is something that’s inconsistent, this is where you would update or change the prompts and make the adjustments. As I’ve said in earlier articles, use an AI to help you write the prompts—just factor into it what’s happening with the prompts.

Testing Tools and Frameworks

What type of tools do I use for this? In this example I used DeepEval for unit testing the application. DeepEval is set up to work with Python and work like pytest except it tests the AI for how it responds and how it works. The AI_TEST.md covers the evaluation of the AI using DeepEval.

There are other frameworks for testing and here are some alternatives worth exploring:

  • LangSmith – Observability and evaluation platform by the LangChain team
  • Ragas – Framework specifically built for RAG pipeline evaluation
  • MLflow – Modular package for running evaluations in your own pipelines
  • TruLens – Open-source library focused on qualitative analysis of LLM responses
  • Opik – Open-source LLM evaluation platform by Comet
  • Langfuse – Open-source LLM engineering platform for observability and evaluation

Logging and User Feedback

The other thing to consider implementing in your application is a place for user feedback. User feedback is important because that can tell you what direction things are going. You need to have traceability and you need to be able to test your application, so you add a place to log how API calls are made and how the interactions occur, including what kind of thought process was involved in the logs.

I would incorporate logging into the AI application if at all possible. But what I would be logging in particular, since it is non-deterministic and the answers will be different every time, is whether it is doing stuff like using proper tools or doing proper calls or accessing resources or following rules or guidelines in the prompts.

Prompts as Code

This also brings up another point: treat your prompts that you develop for your AI agent as code itself. In this sense, you are also unit testing the prompts as well as the overall AI application with end-to-end tests and unit tests of each individual component. This aligns with Test-Driven Development (TDD) principles where you write tests before writing code, ensuring your prompts meet defined criteria before deployment.

While you cannot expect determinism from an AI application, you can expect it to have certain consistencies. One of the things you do when you develop an agentic application is set the temperature to zero versus the temperature to one, meaning it will work without too many variations offered in its behavior.

There are other kinds of factors to use or to consider, but these are some of the basics. These basics will change over time as the LLMs and the AI technology becomes more advanced and as we make new discoveries.

I am open to feedback and welcome what you have to say. Otherwise, have a nice day.


References

Historical & Conceptual

AI Engineering & LLM Evaluation

DevOps & CI/CD

Agile & Test-Driven Development

Testing Frameworks