Frameworks

The repeatable parts of working with AI tools.

When something works twice across builds, it becomes a checklist. When it works five times, it earns a name. Below are the frameworks I actually run before AI-generated code reaches a branch — not aspirational ones.

Current · stable

5-Check

A pre-merge check for AI-generated code. Trust, verify, override before merge. Five gates between "agent finished the task" and "this lands on main".

v1 · stable
AI lets you ship code faster than you can read it. The 5-Check is the part where you read it.
01

Task list

Before reviewing a diff, reconcile what was actually agreed. Did the agent stick to the brief, or did it solve an adjacent problem? If the task expanded, did it expand for a good reason? This is the cheapest place to catch scope drift.

Does the diff match the task that was given? If not, is the divergence improvement or drift?
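
One way to make this reconciliation mechanical is to compare the files the diff touched against the paths the brief covered. The sketch below assumes the brief can be expressed as a few glob patterns; `find_scope_drift` and the example paths are hypothetical names for illustration, not part of any tool.

```python
from fnmatch import fnmatchcase


def find_scope_drift(changed_files, allowed_patterns):
    """Return changed files that match none of the patterns agreed in the brief."""
    return [
        path
        for path in changed_files
        if not any(fnmatchcase(path, pattern) for pattern in allowed_patterns)
    ]


# Hypothetical example: the brief covered only the login form and its tests.
changed = [
    "src/auth/login_form.py",
    "tests/auth/test_login_form.py",
    "src/billing/invoice.py",
]
allowed = ["src/auth/*", "tests/auth/*"]

for path in find_scope_drift(changed, allowed):
    print(f"outside the brief: {path}")
```

A file outside the brief is not automatically drift. It is the prompt to ask whether the expansion was an improvement.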
02

Review commands

What did the agent actually run? Shell history, file writes, git ops, network calls. If it touched things you didn't expect — package managers, env files, secrets paths — that's a signal before it's a diff.

Any commands the agent ran that weren't necessary for the task?
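
If the agent's session leaves a command log, a small filter can surface the entries worth a second look. This is a minimal sketch: the watchlist patterns and the `flag_commands` helper are assumptions for illustration, and the input is taken to be a plain list of shell command strings.

```python
import re

# Hypothetical watchlist; tune it to your own repo and tooling.
SENSITIVE_PATTERNS = {
    "package manager": re.compile(
        r"\b(npm|pnpm|yarn|pip|brew|apt-get)\s+(install|add|remove|uninstall)\b"
    ),
    "env or secrets file": re.compile(r"(^|[\s/])\.env\b|secrets?/|\.pem\b|id_rsa"),
    "destructive git op": re.compile(r"\bgit\s+(push\s+.*--force|reset\s+--hard)"),
    "network call": re.compile(r"\b(curl|wget)\b"),
}


def flag_commands(commands):
    """Return (command, reason) pairs for commands that match the watchlist."""
    flagged = []
    for command in commands:
        for reason, pattern in SENSITIVE_PATTERNS.items():
            if pattern.search(command):
                flagged.append((command, reason))
    return flagged


# Hypothetical command log from one agent session.
log = ["pytest tests/auth", "npm install left-pad", "cat .env", "git status"]
for command, reason in flag_commands(log):
    print(f"{reason}: {command}")
```

A match is not a verdict. It marks the commands to explain before reading the diff.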
03

Security check

Look for secrets in logs, expanded auth scopes, new outbound dependencies, and changes to the exposed surface area. AI generates plausible patterns from training data, and a lot of plausible auth code is wrong. Treat every new credential touchpoint with suspicion.

What's the new attack surface introduced by this diff? Who owns it?
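
Part of this gate can be pre-screened by scanning only the lines a diff adds. The sketch below assumes a unified diff as input and uses two illustrative patterns; `find_secret_candidates` is a hypothetical helper, and a dedicated secret scanner will cover far more cases.

```python
import re

# Illustrative patterns only, not a complete secret-detection rule set.
SECRET_PATTERNS = [
    re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
    re.compile(
        r"(?i)(api[_-]?key|secret|token|password)\w*\s*[:=]\s*['\"][^'\"]{8,}['\"]"
    ),
]


def find_secret_candidates(diff_text):
    """Return (line_number, text) for added diff lines that look like credentials."""
    hits = []
    for line_number, line in enumerate(diff_text.splitlines(), start=1):
        # Only added lines matter; skip the "+++ b/file" header line.
        if not line.startswith("+") or line.startswith("+++"):
            continue
        if any(pattern.search(line) for pattern in SECRET_PATTERNS):
            hits.append((line_number, line[1:].strip()))
    return hits


# Hypothetical diff fragment with a placeholder value, not a real key.
diff = "\n".join(
    [
        "+++ b/config.py",
        '-password = "oldpassword1"',
        '+API_KEY = "not-a-real-key-123"',
        "+timeout = 30",
    ]
)
for line_number, text in find_secret_candidates(diff):
    print(f"diff line {line_number}: {text}")
```

A clean scan says nothing about auth logic. It only clears the cheapest class of mistake before the manual read.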
04

Manual testing

Run the unhappy path by hand. Pass it a bad input, a slow network, a permission failure. AI-written code passes its own tests; manual testing is where you discover the cases the AI never considered.

What did the agent not test? Test that.
05

PR + CI review

Final pass. Read the diff like you didn't write it. Check what CI is checking, and what it isn't. If your CI doesn't catch the failure modes the agent introduced — fix CI before fixing the PR.

If this PR were submitted by a contractor, would you merge it?
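
The five gates can also be tracked as a plain record so that none gets skipped under time pressure. This is a sketch under my own assumptions: the `FiveCheck` class and its method names are hypothetical, and the gate names simply mirror the steps above.

```python
from dataclasses import dataclass, field

GATES = (
    "task list",
    "review commands",
    "security check",
    "manual testing",
    "PR + CI review",
)


@dataclass
class FiveCheck:
    """Tracks which gates a reviewer has signed off, with a note for each."""

    notes: dict = field(default_factory=dict)  # gate name -> reviewer note

    def sign_off(self, gate, note):
        if gate not in GATES:
            raise ValueError(f"unknown gate: {gate}")
        if not note.strip():
            raise ValueError("a sign-off needs a note saying what was checked")
        self.notes[gate] = note

    def pending(self):
        """Gates not yet signed off, in the order they should be run."""
        return [gate for gate in GATES if gate not in self.notes]

    def ready_to_merge(self):
        return not self.pending()


check = FiveCheck()
check.sign_off("task list", "diff matches the brief, no extra files")
check.sign_off("review commands", "only test and lint commands were run")
print(check.pending())
print(check.ready_to_merge())
```

Requiring a note for each sign-off is a deliberate choice: it forces the reviewer to say what was actually checked rather than tick a box.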
Current · evolving

AI-Assisted Development Loop

The outer loop the 5-Check sits inside. Task boundaries, agent pairing, the 5-Check, and a feedback step that updates the loop itself.

v2 · evolving
01 Frame the task small enough to verify
02 Pair with the agent, watch its commands
03 Run the 5-Check before any merge
04 Merge and observe in production
05 Capture what broke into the next loop

The feedback step is the part most teams skip. They optimize the agent and the merge gate but never close the feedback loop, so the same failure mode keeps slipping through. The fifth step matters more than the first.

Current · stable

Agent QA Harness

A file-based template for using AI agents as browser-based QA operators. Clone it, point the agent at your app, and get a structured test run — no test code required.

v1 · stable
The agent generates the test cases, runs them, and hands you a bug list. You didn't write a single test file.
01

Clone and onboard

Clone agent-qa-harness, open it in Claude Code or Codex, and say "start onboarding". The agent reads the harness config, asks for your app URL, and generates an initial test plan — categorised by feature area.

Does the generated test plan cover the feature areas you care about? Add any the agent missed.
02

Let the agent run

The agent executes each test case in a browser session, logging results to JSONL as it goes. Pass, fail, and observations are structured — not prose. You can watch the run or come back when it finishes.

Are any test categories failing at a rate that suggests a systemic issue, not isolated bugs?
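
Because the results are JSONL, the systemic-versus-isolated question can be answered with a few lines of aggregation. The sketch below assumes each record carries a `category` field and a `status` field with the value "pass" or "fail"; those field names, the 0.5 threshold, and both helper functions are assumptions, since the harness's actual log schema is not shown here.

```python
import json
from collections import defaultdict


def failure_rates(jsonl_text):
    """Map each category to the fraction of its test cases that failed."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines in the log
        record = json.loads(line)
        totals[record["category"]] += 1
        if record["status"] == "fail":
            failures[record["category"]] += 1
    return {category: failures[category] / totals[category] for category in totals}


def systemic_categories(rates, threshold=0.5):
    """Categories whose failure rate meets the (assumed) threshold."""
    return sorted(category for category, rate in rates.items() if rate >= threshold)


# Hypothetical log lines; the real schema may differ.
log = "\n".join(
    [
        '{"category": "auth", "status": "fail"}',
        '{"category": "auth", "status": "fail"}',
        '{"category": "auth", "status": "pass"}',
        '{"category": "search", "status": "pass"}',
    ]
)
print(systemic_categories(failure_rates(log)))
```

A category that fails most of its cases usually points at one shared cause, so it is worth investigating once rather than filing a bug for every case.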
03

Read the handoff

The agent writes a development handoff: bugs found, UX observations, and edge cases flagged during the run. It's structured so you can triage directly into your issue tracker. The PromptMate worked example produced 86 test cases across 18 categories from a single run.

Which bugs in the handoff would a human tester have caught on the first try? Those are your regression gaps.
04

Iterate the harness

After each run, update the harness config with anything the agent missed or over-tested. The config is the institutional knowledge — not the test code. New team members run "start onboarding" and inherit everything.

What did this run miss that the last run caught? Add it to the harness before closing the branch.
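
That question reduces to a set difference between two runs. The sketch below assumes each failing test case has a stable identifier across runs; the identifiers and the `missed_since_last_run` helper are hypothetical.

```python
def missed_since_last_run(previous_failures, current_failures):
    """Test IDs that failed in the previous run but were not reported in this one."""
    return sorted(set(previous_failures) - set(current_failures))


# Hypothetical test IDs from two consecutive runs.
previous = ["login-empty-password", "search-unicode-query", "export-large-file"]
current = ["search-unicode-query", "profile-avatar-upload"]

for test_id in missed_since_last_run(previous, current):
    print(f"check whether fixed or missed: {test_id}")
```

Each ID in the result is either a bug that was fixed or a case the agent stopped exercising. Only the second kind belongs in the harness config.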
Drafting

What's next.

Frameworks I'm using but haven't written down yet — they'll land here when they survive another month of real work.

  • Drafting · MCP server boundary checklist: what belongs in the MCP layer vs. the app it wraps
  • Drafting · Parallel-agent worktree pattern: running multiple Claude Code sessions on one repo, safely
  • Notes · Token-budget design for agent flows: designing flows so you don't burn 10k tokens before message one
Where these came from

Frameworks live downstream of builds.
