July 9, 2025

Welcome back to Agent Support—a developer advice column for those head-scratching moments when you’re building an AI agent! Each post answers a real question from the community with simple, practical guidance to help you build smarter agents.
Today’s question comes from someone curious about measuring how well their agent responds:
💬 Dear Agent Support
My agent seems to be responding well—but I want to make sure I’m not just guessing. Ideally, I’d like a way to check how accurate or helpful its answers really are. How can I measure the quality of my agent’s responses?
🧠 What are Evaluations?
Evaluations are how we move from “this feels good” to “this performs well.”
They’re structured ways of checking how your agent is doing, based on specific goals you care about.
At the simplest level, evaluations help answer:
- Did the agent actually answer the question?
- Was the output relevant and grounded in the right info?
- Was it easy to read or did it ramble?
- Did it use the tool it was supposed to?
That might mean checking whether the model pulled the correct file in a retrieval task. Or whether it used the right tool when multiple are registered. Or even something subjective, like whether the tone felt helpful or aligned with your brand.
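To make those questions concrete, here’s a minimal sketch of what a few of them might look like as code-level checks. Everything in it is illustrative: the response record, the expected tool name, and the heuristics are hypothetical stand-ins for whatever your agent actually returns.

```python
# A hypothetical record of one agent response (illustrative only).
response = {
    "answer": "Try Inception, a mind-bending sci-fi thriller about dreams.",
    "tool_called": "movie_search",
}

def answered_the_question(answer: str) -> bool:
    """Crude check: did the agent give a non-empty, non-deflecting answer?"""
    return bool(answer.strip()) and "i don't know" not in answer.lower()

def used_expected_tool(record: dict, expected_tool: str) -> bool:
    """Did the agent call the tool we expected it to call?"""
    return record.get("tool_called") == expected_tool

def stayed_concise(answer: str, max_words: int = 80) -> bool:
    """A rough proxy for 'did it ramble?'"""
    return len(answer.split()) <= max_words

print(answered_the_question(response["answer"]))     # True
print(used_expected_tool(response, "movie_search"))  # True
print(stayed_concise(response["answer"]))            # True
```

None of these checks is sophisticated, and that’s the point: even simple pass/fail signals give you something to track across changes.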
🎯 Why Do We Do Evaluations?
When you’re building an agent, it’s easy to rely on instinct. You run a prompt, glance at the response, and think: “Yeah, that sounds right.”
But what happens when you change a system prompt?
Or upgrade the model?
Or wire in a new tool?
Without evaluations, there’s no way to know if things are getting better or quietly breaking.
Evaluations help you:
- Catch regressions early: Maybe your new prompt is more detailed, but now the agent rambles. A structured evaluation can spot that before users do.
- Compare options: Trying out two different models? Testing a retrieval-augmented version vs. a base version? Evaluations give you a side-by-side look at which one performs better.
- Build trust in output quality: Whether you’re handing this to a client, a customer, or just your future self, evaluations help you say, “Yes, I’ve tested this. Here’s how I know it works.”
They also make debugging faster. If something’s off, a good evaluation setup helps you narrow down where it went wrong: Was the tool call incorrect? Was the retrieved content irrelevant? Did the prompt confuse the model?
Ultimately, evaluations turn your agent into a system you can improve with intention, not guesswork.
⏳ When Should I Start Evaluating?
Short answer: Sooner than you think!
You don’t need a finished agent or a fancy framework to start evaluating. In fact, the earlier you begin, the easier it is to steer things in the right direction.
Here’s a simple rule of thumb: If your agent is generating output, you can evaluate it.
That could be:
- Manually checking if it answers the user’s question
- Spotting when it picks the wrong tool
- Comparing two prompt versions to see which sounds clearer
Even informal checks can reveal big issues early before you’ve built too much around a flawed behavior.
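If you want those informal checks to be a little more repeatable without building anything heavy, a plain loop and a notes file go a long way. This is just a sketch; `run_agent` is a hypothetical placeholder for however you actually invoke your agent.

```python
# Lightweight spot-checking: run a handful of prompts, eyeball each response,
# and record a quick verdict. `run_agent` is a hypothetical placeholder.
def run_agent(prompt: str) -> str:
    return f"(agent response to: {prompt})"  # replace with your real agent call

test_prompts = [
    "What's a good sci-fi movie to watch?",
    "Recommend a comedy for a family movie night.",
    "I didn't like the last thriller you suggested. Try again.",
]

with open("spot_checks.txt", "a", encoding="utf-8") as notes:
    for prompt in test_prompts:
        output = run_agent(prompt)
        print(f"PROMPT:   {prompt}")
        print(f"RESPONSE: {output}")
        verdict = input("Good enough? (y/n): ")  # your manual gut check
        notes.write(f"{verdict}\t{prompt}\n")
```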
As your agent matures, you can add more structure:
- Create a small evaluation set with expected outputs
- Define categories you want to score (like fluency, groundedness, relevance)
- Run batch tests when you update a model
Think of it like writing tests for code. You don’t wait until the end; you build them alongside your system. That way, every change gets feedback fast.
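Carrying the test analogy further, a small evaluation set can live right next to your unit tests. Here’s a minimal sketch: the cases and the `run_agent` function are hypothetical, and the checks look for expected properties rather than exact strings, since agent output rarely matches verbatim.

```python
# A tiny, hand-written evaluation set: each case pairs an input with a
# property we expect in the output. `run_agent` is a hypothetical stand-in.
EVAL_SET = [
    {"input": "What's a good horror movie to watch?", "must_mention": "horror"},
    {"input": "What's a good comedy movie to watch?", "must_mention": "comedy"},
]

def run_agent(prompt: str) -> str:
    return f"(agent response to: {prompt})"  # replace with your real agent call

def run_batch(eval_set: list[dict]) -> None:
    passed = 0
    for case in eval_set:
        output = run_agent(case["input"]).lower()
        if case["must_mention"] in output:
            passed += 1
        else:
            print(f"FAIL: expected '{case['must_mention']}' in response to: {case['input']}")
    print(f"{passed}/{len(eval_set)} cases passed")

if __name__ == "__main__":
    run_batch(EVAL_SET)
```

Run something like this after every prompt or model change and regressions show up the same way a failing test flags a bad code change.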
The key is momentum. Start light, then layer on depth as you go. You’ll save yourself debugging pain down the line, and build an agent that gets better over time.
📊 AI Toolkit
You don’t need a full evaluation pipeline or scoring rubric on day one. In fact, the best place to begin is with a simple gut check—run a few test prompts and decide whether you like the agent’s response. And if you don’t have a dataset handy, no worries! With the AI Toolkit, you can both generate datasets and keep track of your manual evaluations with the Agent Builder’s Evaluation feature.
Sidebar: If you’re curious about deeper eval workflows, like using AI to help judge your agent’s output against evaluators such as Fluency, Relevance, and Tool Call, or even custom evaluators, we’ll cover that in a future edition of Agent Support. For now, let’s keep it simple!
Here’s how to do it:
- Open the Agent Builder from the AI Toolkit panel in Visual Studio Code.
- Click the + New Agent button and provide a name for your agent.
- Select a Model for your agent.
- Within the System Prompt section, enter: You recommend a movie based on the user’s favorite genre.
- Within the User Prompt section, enter: What’s a good {{genre}} movie to watch?
- On the right side of the Agent Builder, select the Evaluation tab.
- Click the Generate Data icon (the first icon above the table).
- For the Rows of data to generate field, increase the total to 5.
- Click Generate.
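Conceptually, each generated row supplies a value for the {{genre}} placeholder in your user prompt. The exact columns Agent Builder produces may differ, but the idea looks something like this (illustrative values only):

```python
# Each generated row fills in the {{genre}} variable from the user prompt
# template. (Illustrative values only; the actual rows Agent Builder
# generates may differ.)
generated_rows = [
    {"genre": "sci-fi"},
    {"genre": "comedy"},
    {"genre": "horror"},
    {"genre": "romance"},
    {"genre": "documentary"},
]

user_prompt_template = "What's a good {{genre}} movie to watch?"

for row in generated_rows:
    # The prompt the agent will actually see for each row.
    print(user_prompt_template.replace("{{genre}}", row["genre"]))
```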
You’re now ready to start evaluating the agent responses!
🧪 Test Before You Build
You can run the rows of data either individually or in bulk. I’d suggest starting with a single run to get an initial feel for how the feature works.
When you click Run, the agent’s response will appear in the Response column. Review the output. In the Manual Evaluation column, select either thumbs up or thumbs down. You can continue to run the other rows, or even add your own row and pass in a value for {{genre}}.
Want to share the evaluation run and results with a colleague? Click the Export icon to save the run as a .jsonl file.
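Once you have that file, it’s easy to tally the results outside of VS Code. Here’s a small sketch, assuming the export is standard JSON Lines; the file name and field names below are assumptions, so check the keys in your actual export.

```python
import json

# Load an exported evaluation run and tally the manual verdicts.
# The file name and the "manual_evaluation" key are assumptions;
# inspect your actual export to confirm the field names.
results = []
with open("evaluation_run.jsonl", encoding="utf-8") as f:
    for line in f:
        if line.strip():
            results.append(json.loads(line))

thumbs_up = sum(1 for r in results if r.get("manual_evaluation") == "thumbs_up")
print(f"{thumbs_up}/{len(results)} responses marked thumbs up")
```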
You’ve just taken the first step toward building a more structured, reliable process for evaluating your agent’s responses!
🔁 Recap
Here’s a quick rundown of what we covered:
- Evaluations help you measure quality and consistency in your agent’s responses.
- They’re useful for debugging, comparing, and iterating.
- Start early—even rough checks can guide better decisions.
- The AI Toolkit makes it easier to run and track evaluations right inside your workflow.
📺 Want to Go Deeper?
Check out my previous live stream for AgentHack, Evaluating Agents, where I explore concepts and methodologies for evaluating generative AI applications. Although I focus on leveraging the Azure AI Evaluation SDK, it’s still a valuable intro to learning more about evaluations.
The Evaluate and Improve the Quality and Safety of your AI Applications lab from Microsoft Build 2025 provides a comprehensive self-guided introduction to getting started with evaluations. You’ll learn what each evaluator means, how to analyze the scores, and why observability matters—plus how to use telemetry data locally or in the cloud to assess and debug your app’s performance!
👉 Explore the lab: https://github.com/microsoft/BUILD25-LAB334/
And for all your general AI and AI agent questions, join us in the Azure AI Foundry Discord! You can find me hanging out there answering your questions about the AI Toolkit. I’m looking forward to chatting with you there!
Whether you’re debugging a tool call, comparing prompt versions, or prepping for production, evaluations are how you turn responses from plausible to dependable.