How to make your agents prove your product works

Unit tests tell me whether a piece of code behaves as expected. Behavioral agent tests tell me whether a real workflow still makes sense when a person has to use it.

I have an internal agent I use to run Lunch Pail Labs and do a lot of my day-to-day work. I made parts of that system accessible through PailFlow, and now I have connected flows across Slack, browser pages, email, Supabase, Stripe, sandboxes, and OpenCode running in the cloud.
When a system has that many seams, testing one piece is not enough.
A prompt eval can tell me whether an agent answered the way I expected. A unit test can tell me whether one function behaved. An integration test can tell me whether two services exchanged data. All very valuable, but they don't answer the question I actually care about before shipping:
Could a person complete the workflow?
That question catches a different class of problems. The button might work, but the page might send someone to the wrong state. The email might send, but the magic link might not land the user in the right place. The agent might know how to call a tool, but the Slack experience might still fail because the bot did not respond in the right thread or the scheduled follow-up never arrived.
Behavioral agent tests are useful because they let me test the experience from the outside. I can give an agent a workflow, ask it to behave like a user, and require evidence that the outcome actually happened.
Here's how it works.

Write the skill around the human workflow

At the heart of this approach are agent skills. Whatever test harness you are using, the useful unit of product memory is the skill: a folder that can hold the instructions, setup steps, scripts, safety notes, and verification behavior for one workflow.
Pro tip: if you need help creating skills, start with the [skill creator skill](https://github.com/anthropics/skills/blob/main/skills/skill-creator/SKILL.md).
Pick one critical workflow. For me, good candidates are:
  • A user signs in with email and lands in the dashboard.
  • A waitlist visitor submits interest and completes the priority access flow.
  • A Slack user asks the agent to create, list, and clean up an automation.
Then turn that happy path into the skill. Write when the skill should be used, what environment it should run against, and what it is supposed to touch.
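To make that concrete, here is a minimal sketch of what a SKILL.md for the sign-in workflow could look like. The frontmatter follows the conventions from the skill-creator skill linked above; the URL, copy, and evidence requirements are all placeholders for your own.
```markdown
---
name: test-email-signin
description: Behavioral test for the email sign-in flow. Use when verifying
  that a user can sign in with a magic link and land in the dashboard.
---

# Sign-in flow behavioral test

## Environment
- Run against staging only (https://staging.example.com), never production.
- Use the dedicated agent inbox, never a real user's address.

## Steps
1. Open the sign-in page and submit the test email.
2. Confirm the UI tells the user to check their email.
3. Retrieve the magic link from the test inbox and open it.
4. Verify the authenticated dashboard loads.

## Evidence required
- Screenshot of the authenticated dashboard.
- The new session row for the test user.

## Cleanup
- Delete only the accounts and sessions created during this run.
```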
The next question is tool access: what can the agent connect to so it can verify its own work?
  • If it's a web app, give it browser automation so it can open the product, click the UI, and read the visible result.
  • Give it a test email account so it can confirm magic links, receipts, invites, and notifications actually arrive (shout out to [AgentMail](https://agentmail.to/)).
  • Give it database read access, or a narrow verification query, so it can confirm the backend state changed (a sketch of such a query follows this list).
  • Give it integration access like a dedicated Slack test channel or dev-user script so it can send the same kind of message a human would send.
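Here is a rough sketch of what that narrow verification query could look like, assuming a Supabase backend and the supabase-py client. The table name, columns, and status value are hypothetical:
```python
# Hypothetical narrow verification helper: the agent gets a yes/no answer
# about one record instead of broad database access. Table and column
# names are illustrative; assumes the supabase-py client.
import os
from supabase import create_client

def verify_waitlist_row(test_email: str) -> bool:
    """Return True if the waitlist signup landed in the database."""
    client = create_client(
        os.environ["SUPABASE_URL"],
        os.environ["SUPABASE_SERVICE_KEY"],
    )
    result = (
        client.table("waitlist_signups")  # illustrative table name
        .select("id, status")
        .eq("email", test_email)
        .execute()
    )
    # One record, in the expected state, is all the evidence needed.
    return len(result.data) == 1 and result.data[0]["status"] == "pending"
```
The narrow scope is the point: the agent can verify its own work without getting free rein over the database.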
The skill becomes product memory. It captures what “working” means in terms a human would recognize and gives an agent a way to test like a human.

Examples of testing skills I run

Here are a few I currently run:
  • One skill tests my landing page signup flow. The agent opens the sign-in page, enters a test email, submits the form, checks that the UI says to check email, retrieves the magic link from an agent-accessible inbox, opens the link, and verifies that the authenticated dashboard loads (there's a sketch of this flow after the list).
  • Another skill tests a waitlist flow. The agent opens a specific waitlist page, fills out the form, confirms the success state, clicks the priority access CTA, completes Stripe Checkout with a test card, checks webhook ingestion, verifies the Supabase record, and confirms that the success page shows the right next step.
  • I also have a Slack agent skill. Instead of mocking the conversation, the eval posts into a dedicated Slack test channel as the dev user. It tags the PailFlow bot, asks it a query, asks it to schedule a recurring daily automation, waits for the scheduled DM to arrive, lists automations, deletes the one it created, and confirms cleanup. That one skill tests the sandbox, Slack ingestion, webhooks, database access, automation scheduling, cleanup, and whether the answer still comes back to the user in a sensible way.
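Here is roughly what the signup-flow skill's browser steps could compile down to, if the agent drives Playwright. The URL, selectors, expected copy, and the fetch_latest_magic_link helper are all hypothetical stand-ins:
```python
# A rough sketch of the signup-flow check from the first example above.
# Assumes Playwright and an agent-accessible test inbox.
from playwright.sync_api import sync_playwright

TEST_EMAIL = "agent-test@example.com"  # dedicated test inbox

def fetch_latest_magic_link(inbox: str) -> str:
    """Poll the test inbox and return the sign-in link (stubbed here)."""
    raise NotImplementedError("wire this to your inbox provider")

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()

    # Step 1: submit the sign-in form the way a user would.
    page.goto("https://staging.example.com/signin")
    page.fill("input[name=email]", TEST_EMAIL)
    page.click("button[type=submit]")

    # Step 2: the UI should tell the user to check their email.
    assert page.get_by_text("Check your email").is_visible()

    # Step 3: follow the magic link retrieved from the inbox.
    page.goto(fetch_latest_magic_link(TEST_EMAIL))

    # Step 4: evidence that the authenticated dashboard loaded.
    page.wait_for_url("**/dashboard")
    page.screenshot(path="evidence/signin-dashboard.png")
    browser.close()
```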
If you want to push this further, have the agent send you a screen recording. I find this especially helpful for browser automations and for agent-created PRs that make meaningful changes to the user experience. The recording shows what a pass/fail result cannot: whether the page felt slow, whether the copy made sense, whether the wrong thing flashed on screen, or whether the flow technically worked but looked awkward.
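If the agent already drives the browser with Playwright, recording needs no extra tooling: a context created with record_video_dir captures everything the agent did. A minimal sketch, with illustrative paths:
```python
# Playwright saves a video of the whole session when the context is
# created with record_video_dir; closing the context finalizes the file.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    context = browser.new_context(record_video_dir="recordings/")
    page = context.new_page()
    # ... run the workflow here ...
    context.close()  # flushes the video to recordings/
    browser.close()
```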

Caveats

This is not a replacement for other tests. I still want unit tests, integration tests, prompt evals, and code review. Behavioral agent tests sit above those layers. They tell me whether the pieces still add up to a usable product experience.
They are also slower and more expensive than normal tests. A browser run that starts servers, checks an inbox, waits on Slack, or follows Stripe webhooks should not run on every tiny code change. I treat these as smoke tests for important workflows, release checks, or regression tests when I touch a flow that has broken before.
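One way to enforce that, assuming a pytest harness, is to mark the behavioral runs and only select them deliberately; the marker name is arbitrary:
```python
import pytest

# Register the marker in pytest.ini:  markers = behavioral: full workflow checks
@pytest.mark.behavioral
def test_signup_flow_end_to_end():
    # ...drive the browser, check the inbox, verify the database...
    ...

# Day-to-day runs skip them:   pytest -m "not behavioral"
# Release or smoke-test runs:  pytest -m behavioral
```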
The safety layer matters too. Use test accounts, test cards, dedicated Slack channels, scoped database access, and cleanup rules. The agent should know what it created during the run and should only delete that.
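A sketch of what that cleanup discipline can look like in code; every name here is illustrative, but the shape is the point: the run records what it creates, and teardown deletes only those records.
```python
# Run-scoped cleanup registry: anything not recorded during this run is
# out of bounds for deletion, no matter what the agent decides mid-run.
import uuid

class RunRegistry:
    """Tracks what this test run created so cleanup can't overreach."""

    def __init__(self):
        self.run_id = f"agent-test-{uuid.uuid4().hex[:8]}"
        self.created: list[tuple[str, str]] = []  # (resource_type, resource_id)

    def record(self, resource_type: str, resource_id: str) -> None:
        self.created.append((resource_type, resource_id))

    def cleanup(self, delete_fn) -> None:
        # Delete in reverse creation order, and only what was recorded.
        for resource_type, resource_id in reversed(self.created):
            delete_fn(resource_type, resource_id)
        self.created.clear()
```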
Even with those caveats, this gives me a practical trust layer before I ship. It tells me whether the product still behaves like the product I intended to build.

Conclusion

The more I work with agents, the more I care about verification loops.
If I want agents to help me ship, I need agents to help me check what they shipped. Unit tests and integration tests are still part of that loop, but behavioral skills add something different: they make the agent use the product like a person and come back with evidence.

Written by

Lola

Lola is the founder of Lunch Pail Labs. She enjoys discussing product, app marketplaces, and running a business. Feel free to connect with her on Twitter or LinkedIn.