How to eval agent skills

I stopped manually spot-checking skill edits and moved to a fixed eval loop. The result was faster revisions, clearer failures, and more trustworthy agent behavior.

I hit a point this week that felt small at first but expensive in practice. Writing skills was not my bottleneck anymore. Trusting them was.
I could ship changes quickly, but I was still validating each update with manual spot checks. I would run one prompt, skim the result, tweak instructions, and repeat. It looked productive until subtle edits changed behavior in ways I did not catch.

The workflow pain I could not ignore

The real problem was inconsistency in my review process. Every check depended on my memory and attention in that moment. That made quality feel subjective, and subjective quality does not scale when you are a one-woman operator trying to run delivery with AI.
In practice, this created three issues:
  • I missed regressions after seemingly minor instruction changes.
  • I could not compare revisions against a stable baseline.
  • I spent too much time proving a change was safe.

What I changed in the system

I adapted Claude's skill creator for my OpenCode setup, including built-in evals.
Instead of ad hoc checks, I created a fixed eval set using the same prompts and source documents every run. Then I treated each skill edit like a code change with a test harness:
  1. Run evals on the current revision.
  2. Review failed cases like traces, not opinions.
  3. Patch the skill instructions.
  4. Rerun the exact same eval set.
That gave me a repeatable QA gate before I considered a skill "ready" for real workflows.
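The loop above can be sketched in a few lines. This is a minimal, illustrative harness, not my actual setup: `run_skill`, the `EVAL_SET` cases, and the keyword checks are all hypothetical stand-ins for whatever invokes your agent and scores its output.

```python
import json

# Fixed eval set: the same prompts and expected signals every run.
# These cases are illustrative placeholders.
EVAL_SET = [
    {"prompt": "Summarize the attached brief in three bullets.",
     "must_include": ["bullet"]},
    {"prompt": "Extract the client name and deadline from this doc.",
     "must_include": ["client", "deadline"]},
]

def run_skill(prompt: str) -> str:
    """Placeholder: swap in the call that runs your agent skill."""
    return f"stub response for: {prompt}"

def run_evals(revision: str) -> list[dict]:
    """Run the fixed eval set against one skill revision."""
    results = []
    for case in EVAL_SET:
        output = run_skill(case["prompt"])
        passed = all(kw in output.lower() for kw in case["must_include"])
        results.append({"revision": revision,
                        "prompt": case["prompt"],
                        "passed": passed})
    return results

# Running the same set before and after an edit gives you
# comparable pass/fail traces instead of one-off spot checks.
print(json.dumps(run_evals("rev-before"), indent=2))
```

Because the eval set never changes between runs, a diff of the two result lists tells you exactly which cases a skill edit fixed or broke.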

The result

My feedback loop shifted from manual spot checks to fast, repeatable eval passes on the same test set.
The speed gain was useful, but the bigger win was decision quality. I stopped arguing with output vibes and started making changes based on comparable evidence.

What builders can copy

If you are building agent systems, do not start by polishing prompts endlessly. Start by hardening your evaluation loop.
A fixed eval set gives you traceability, cleaner handoffs, and better project hygiene. Once that exists, prompt and skill improvements become easier to trust because you can see what changed, why it changed, and whether it actually improved outcomes.
I'm building PailFlow in the open and sharing how I use AI systems to scale a one-woman business.
If you work in client services and want to see how AI can increase your project delivery capacity, book a PailFlow Delivery Audit.

Written by

Lola

Lola is the founder of Lunch Pail Labs. She enjoys discussing product, app marketplaces, and running a business. Feel free to connect with her on Twitter or LinkedIn.