When something works twice across builds, it becomes a checklist. When it works five times, it earns a name. Below are the frameworks I actually run before AI-generated code reaches a branch — not aspirational ones.
Current · stable
5-Check
A pre-merge check for AI-generated code. Trust, verify, override before merge. Five gates between "agent finished the task" and "this lands on main".
v1 · stable
AI lets you ship code faster than you can read it. The 5-Check is the part where you read it.
01
Task list
Before reviewing a diff, reconcile what was actually agreed. Did the agent stick to the brief, or did it solve an adjacent problem? If the task expanded, did it expand for a good reason? This is the cheapest place to catch scope drift.
Does the diff match the task that was given? If not, is the divergence improvement or drift?
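One way to make the scope check mechanical is to compare the changed files against the paths the task was supposed to touch. This is a minimal sketch, not part of the 5-Check itself: the glob patterns, the file list, and the function name `out_of_scope` are all hypothetical, and in practice the changed files might come from something like `git diff --name-only`.

```python
from fnmatch import fnmatch


def out_of_scope(changed_files, allowed_patterns):
    """Return changed files that match none of the allowed glob patterns."""
    return [
        path
        for path in changed_files
        if not any(fnmatch(path, pattern) for pattern in allowed_patterns)
    ]


# Hypothetical brief: "fix the date parser" should only touch these paths.
allowed = ["src/dates/*", "tests/dates/*"]
# Hypothetical list of files the agent changed.
changed = ["src/dates/parse.py", "tests/dates/test_parse.py", "package.json"]

drift = out_of_scope(changed, allowed)
print(drift)  # prints ['package.json']
```

Anything in the result is not automatically wrong. It is the list of files where you decide whether the divergence is improvement or drift.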
02
Review commands
What did the agent actually run? Shell history, file writes, git ops, network calls. If it touched things you didn't expect — package managers, env files, secrets paths — that's a signal before it's a diff.
Any commands the agent ran that weren't necessary for the task?
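If the agent's commands are available as plain text, a first pass can flag the ones worth reading closely. The sketch below assumes you already have the commands as a list of strings. The watchlists and the `flag_commands` name are hypothetical and should be tuned to your own repo.

```python
import shlex

# Hypothetical watchlists: executables and path fragments worth a second look.
SUSPECT_EXECUTABLES = {"npm", "pip", "curl", "wget", "sudo"}
SUSPECT_PATH_FRAGMENTS = (".env", ".ssh", "secrets")


def flag_commands(commands):
    """Return (command, reasons) pairs for commands that hit a watchlist."""
    flagged = []
    for command in commands:
        tokens = shlex.split(command)
        if not tokens:
            continue
        reasons = []
        if tokens[0] in SUSPECT_EXECUTABLES:
            reasons.append(f"runs {tokens[0]}")
        for token in tokens[1:]:
            if any(fragment in token for fragment in SUSPECT_PATH_FRAGMENTS):
                reasons.append(f"touches {token}")
        if reasons:
            flagged.append((command, reasons))
    return flagged


# Hypothetical command log from an agent session.
log = ["git status", "pip install requests", "cat .env", "pytest tests/"]
for command, reasons in flag_commands(log):
    print(command, "->", ", ".join(reasons))
```

A flagged command is a prompt to ask why it was needed, not proof of a problem.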
03
Security check
Secrets in logs, expanded auth scopes, new outbound dependencies, surface area changes. AI generates plausible patterns from training data — and a lot of plausible auth code is wrong. Treat every new credential touchpoint with suspicion.
What's the new attack surface introduced by this diff? Who owns it?
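A crude scan of the lines a diff adds can surface new credential touchpoints before you read the whole thing. This is a sketch under stated assumptions: the three patterns are illustrative only, the `scan_added_lines` name is hypothetical, and a maintained secret scanner will catch far more than this does.

```python
import re

# Illustrative patterns only; not a complete rule set.
PATTERNS = {
    "AWS access key ID": re.compile(r"AKIA[0-9A-Z]{16}"),
    "private key header": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
    "hardcoded credential": re.compile(
        r"(?i)(password|secret|token|api_key)\s*[:=]\s*['\"][^'\"]+['\"]"
    ),
}


def scan_added_lines(diff_text):
    """Return (line_number, label) for added diff lines matching a pattern."""
    findings = []
    for number, line in enumerate(diff_text.splitlines(), start=1):
        # Only inspect lines the diff adds; skip the "+++ b/file" header.
        if not line.startswith("+") or line.startswith("+++"):
            continue
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                findings.append((number, label))
    return findings


# Hypothetical unified-diff fragment.
sample = "\n".join([
    "+++ b/config.py",
    "+API_KEY = 'abc123'",
    "-password = 'old'",
    "+timeout = 30",
])
print(scan_added_lines(sample))  # prints [(2, 'hardcoded credential')]
```

An empty result says nothing about auth logic or expanded scopes. Those still need a human reading the diff.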
04
Manual testing
Run the unhappy path by hand. Pass it a bad input, a slow network, a permission failure. AI-written code passes its own tests; manual testing is where you discover the cases the AI never considered.
What did the agent not test? Test that.
05
PR + CI review
Final pass. Read the diff like you didn't write it. Check what CI is checking, and what it isn't. If your CI doesn't catch the failure modes the agent introduced, fix CI before fixing the PR.
If this PR were submitted by a contractor, would you merge it?
Current · evolving
AI-Assisted Development Loop
The outer loop the 5-Check sits inside. Task boundaries, agent pairing, the 5-Check, and a feedback step that updates the loop itself.
v2 · evolving
01 Frame the task small enough to verify
02 Pair with the agent, watch its commands
03 Run the 5-Check before any merge
04 Merge and observe in production
05 Capture what broke into the next loop
The loop is the part most teams skip. They optimize the agent and the merge gate but never close the feedback loop, so the same failure mode keeps slipping through. The fifth step matters more than the first.
Current · stable
Agent QA Harness
A file-based template for using AI agents as browser-based QA operators. Clone it, point the agent at your app, and get a structured test run — no test code required.
v1 · stable
The agent generates the test cases, runs them, and hands you a bug list. You didn't write a single test file.
01
Clone and onboard
Clone agent-qa-harness, open it in Claude Code or Codex, and say "start onboarding". The agent reads the harness config, asks for your app URL, and generates an initial test plan — categorised by feature area.
Does the generated test plan cover the feature areas you care about? Add any the agent missed.
02
Let the agent run
The agent executes each test case in a browser session, logging results to JSONL as it goes. Pass, fail, and observations are structured — not prose. You can watch the run or come back when it finishes.
Are any test categories failing at a rate that suggests a systemic issue, not isolated bugs?
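Because the results are structured, the systemic-versus-isolated question can be answered with a few lines of code. The sketch below assumes each JSONL record has a `category` field and a `status` field set to `"pass"` or `"fail"`. Those field names, the `failure_rates` helper, and the 0.5 threshold are hypothetical; check them against what the harness actually writes.

```python
import json
from collections import defaultdict


def failure_rates(jsonl_text):
    """Return {category: fraction of records with status 'fail'}."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines between records
        record = json.loads(line)
        totals[record["category"]] += 1
        if record["status"] == "fail":
            failures[record["category"]] += 1
    return {category: failures[category] / totals[category] for category in totals}


# Hypothetical run log in the assumed schema.
run_log = "\n".join([
    '{"category": "auth", "status": "fail"}',
    '{"category": "auth", "status": "fail"}',
    '{"category": "search", "status": "pass"}',
    '{"category": "search", "status": "fail"}',
    '{"category": "search", "status": "pass"}',
    '{"category": "search", "status": "pass"}',
])

rates = failure_rates(run_log)
# Hypothetical threshold for "probably systemic".
systemic = [category for category, rate in rates.items() if rate >= 0.5]
print(rates)
print(systemic)  # prints ['auth']
```

A category that fails most of its cases usually points at one shared cause, which is cheaper to chase than the individual bugs.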
03
Read the handoff
The agent writes a development handoff: bugs found, UX observations, and edge cases flagged during the run. It's structured so you can triage directly into your issue tracker. The PromptMate worked example produced 86 test cases across 18 categories from a single run.
Which bugs in the handoff would a human tester have caught on the first try? Those are your regression gaps.
04
Iterate the harness
After each run, update the harness config with anything the agent missed or over-tested. The config is the institutional knowledge — not the test code. New team members run "start onboarding" and inherit everything.
What did this run miss that the last run caught? Add it to the harness before closing the branch.
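One reading of that question is "which cases failed last time and were never exercised this time". A minimal sketch, assuming each result carries a stable `id` field; the field name, the test IDs, and the `missed_since_last_run` helper are hypothetical.

```python
def missed_since_last_run(previous_failures, current_results):
    """Return previously failing test IDs that this run never executed."""
    executed = {result["id"] for result in current_results}
    return sorted(set(previous_failures) - executed)


# Hypothetical IDs of cases that failed in the previous run.
previous_failures = {"login-expired-session", "search-empty-query"}
# Hypothetical results from the current run.
current_results = [
    {"id": "search-empty-query", "status": "pass"},
    {"id": "checkout-happy-path", "status": "pass"},
]

print(missed_since_last_run(previous_failures, current_results))
# prints ['login-expired-session']
```

Anything this returns is a candidate for the harness config, so the next run covers it again.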
Drafting
What's next.
Frameworks I'm using but haven't written down yet — they'll land here when they survive another month of real work.
Drafting · MCP server boundary checklist. What belongs in the MCP layer vs. the app it wraps.
Drafting · Parallel-agent worktree pattern. Running multiple Claude Code sessions on one repo, safely.
Notes · Token-budget design for agent flows. Designing flows so you don't burn 10k tokens before message one.