An approach to AI evals for smarter, more reliable agents in Agency Swarm

Learn how to apply AI evals to agents in the Agency Swarm framework

Inconsistent AI results waste time and create more work. That’s why I started using AI evals.
They help me test and refine my agents to deliver consistent, reliable outputs every time. Here’s how I use them with Agency Swarm, my favorite framework for AI agents, and how you can apply them to your systems.

1. Start With Manual Evaluation

Before setting up automated evaluations, I first perform the agent's tasks manually. This helps me define what constitutes a high-quality response and identify the characteristics of A+ execution. For example, when working with my research agent, I ran queries and carefully reviewed the results to understand what made them useful. This process allowed me to identify patterns in effective outputs and set clear expectations for the system's performance.

2. Define Clear Evaluation Criteria

After testing manually, I set criteria to measure performance. For my research agent, the criteria were:
  • Source usage: Did it use the provided sources?
  • Relevance: Did it answer the query?
  • Completeness: Did it include key points?
  • Overall quality: Was the response structured and actionable?
These criteria gave me a way to consistently measure the quality of outputs.
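One way to make criteria like these reusable is to capture them as a small rubric the judge can score against. Here's a minimal sketch; the criterion names, questions, and weights are purely illustrative and not part of Agency Swarm itself:

# A hypothetical rubric: each criterion gets a short judging question and a weight.
RUBRIC = [
    {"name": "source_usage", "question": "Did the response use the provided sources?", "weight": 0.25},
    {"name": "relevance", "question": "Did the response answer the query?", "weight": 0.25},
    {"name": "completeness", "question": "Did the response include the key points?", "weight": 0.25},
    {"name": "overall_quality", "question": "Was the response structured and actionable?", "weight": 0.25},
]

def weighted_score(scores: dict[str, float]) -> float:
    """Combine per-criterion scores (0-1) into a single weighted score."""
    return sum(c["weight"] * scores[c["name"]] for c in RUBRIC)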

3. Automate With LLM-as-Judge

To save time, I used GPT-4o-mini (the cheapest model at the time of writing) to score outputs. I wrote test cases and set up a framework to compare the agent's responses against expected results.
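As a rough sketch of what the judge step can look like, here's one way to do it with the OpenAI Python SDK (assuming OPENAI_API_KEY is set in the environment). The prompt wording and JSON fields are illustrative, not my exact setup:

import json
from openai import OpenAI  # assumes the official OpenAI Python SDK is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def judge_response(query: str, response: str, expected_points: list[str]) -> dict:
    """Ask GPT-4o-mini to grade an agent response against expected key points."""
    prompt = (
        "You are grading an AI research agent.\n"
        f"Query: {query}\n"
        f"Response: {response}\n"
        f"Expected key points: {expected_points}\n"
        "Return JSON with fields: relevance (0-1), completeness (0-1), "
        "overall_quality (0-1), and missing_points (list of strings)."
    )
    result = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},  # ask for strict JSON output
    )
    return json.loads(result.choices[0].message.content)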

Example Test Case

{
    "test_id": "forum_trends",
    "query": "What are users' biggest frustrations?",
    "sources": ["url1", "url2"],
    "expected_points": [
        "API limitations",
        "Time tracking gaps"
    ]
}
This setup allowed me to test different prompts, models, and approaches quickly and consistently.
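Wiring the test cases into a small runner is straightforward. A minimal sketch, assuming the cases live in a JSON file, a run_agent(query, sources) helper that calls your agent (a placeholder, not an Agency Swarm API), and the judge_response function sketched above; the pass threshold is arbitrary:

import json

def run_evals(path: str = "test_cases.json") -> None:
    """Run every test case, judge the output, and print a simple pass/fail report."""
    with open(path) as f:
        test_cases = json.load(f)

    for case in test_cases:
        response = run_agent(case["query"], case["sources"])  # placeholder for your agent call
        scores = judge_response(case["query"], response, case["expected_points"])
        # Illustrative pass rule: high completeness and no missing key points.
        passed = scores["completeness"] >= 0.8 and not scores["missing_points"]
        print(f"{case['test_id']}: {'PASS' if passed else 'FAIL'} {scores}")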
 
 
Evals make AI systems more consistent and scalable. They help you define success upfront, test against it, and reduce time spent on rework. Whether you’re building AI-powered tools or integrations, evals are essential.
 
And that’s it! What do you think? I’d love to hear your thoughts—feel free to share them. For more insights like this, subscribe to my newsletter.


Written by

Lola

Lola is the founder of Lunch Pail Labs. She enjoys discussing product, SaaS integrations, and running a business. Feel free to connect with her on Twitter or LinkedIn.