I Built a Bug-Hunting Loop That Doesn't Quit: The BugBot Methodology

  • ai-agents
  • code-review
  • debugging
  • claude-code
  • developer-tools
  • testing

Every code review I’ve ever done was a single pass. You open the diff, read through it, maybe catch a few things, leave some comments, approve. The problem is that bugs don’t care about your review flow. They hide in the interactions between features, at the edges of data types, in the paths that only fire under load. One pass from one angle misses them.

I wanted to build something different — a code review tool that attacks the same codebase from every angle, iterates until it genuinely finds nothing new, and doesn’t stop just because the first pass looked clean. I called it BugBot.

BugBot: Adversarial Iterative Code Review

The Core Idea: Loop Until Clean

BugBot is built around a simple premise: keep reviewing until a complete pass finds zero new CRITICAL or HIGH severity issues. Not “until you’ve looked at the diff once.” Until you’ve exhausted every meaningful attack angle and the code genuinely holds up.

Under the hood, it runs inside a persistent execution loop that spawns a fresh agent context each iteration, fed by a shared state file on disk. Each iteration picks a new set of attack angles, runs parallel review agents, fixes what it finds, then decides: keep going, or declare clean.

# .claude/review-state.md (after iteration 2)

## Trigger Coverage
- [x] Simple path    → Iteration 1: angles 1, 5
- [x] Complex path   → Iteration 1: angle 6
- [x] Boundary       → Iteration 2: angle 22
- [ ] Error recovery
- [x] Stress/volume  → Iteration 2: angle 23
- [x] Interaction    → Iteration 1: angles 1, 2, 3
- [ ] Configuration

The state file is the only memory across iterations. The loop starts fresh each time, reads the file, picks untried angles, and continues.
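As a minimal sketch of how a fresh iteration could recover its bearings from that file, the snippet below parses the checklist and lists the triggers nobody has exercised yet. The function names and the regular expression are illustrative assumptions, not BugBot's actual code; only the checklist format comes from the example above.

```python
import re

# Matches checklist lines such as "- [x] Boundary       → Iteration 2: angle 22".
CHECKBOX = re.compile(r"^- \[(?P<mark>[ xX])\] (?P<name>.+?)(?:\s+→.*)?$")


def parse_trigger_coverage(state_text: str) -> dict[str, bool]:
    """Map each trigger under '## Trigger Coverage' to whether it is ticked."""
    coverage: dict[str, bool] = {}
    in_section = False
    for raw in state_text.splitlines():
        line = raw.strip()
        if line.startswith("## "):
            # Only lines under the coverage heading are treated as triggers.
            in_section = line == "## Trigger Coverage"
            continue
        match = CHECKBOX.match(line) if in_section else None
        if match:
            coverage[match["name"]] = match["mark"].lower() == "x"
    return coverage


def uncovered_triggers(state_text: str) -> list[str]:
    """Triggers that no iteration has exercised yet, in file order."""
    coverage = parse_trigger_coverage(state_text)
    return [name for name, done in coverage.items() if not done]


# Shortened, hypothetical copy of the state file for illustration.
SAMPLE_STATE = """\
# .claude/review-state.md (after iteration 2)

## Trigger Coverage
- [x] Simple path    → Iteration 1: angles 1, 5
- [ ] Error recovery
- [x] Boundary       → Iteration 2: angle 22
- [ ] Configuration
"""

print(uncovered_triggers(SAMPLE_STATE))  # ['Error recovery', 'Configuration']
```

Because the checklist is plain markdown, the same file stays readable for a human skimming the review history and parseable for the next iteration.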

Standing on the Shoulders of Giants

I didn’t invent any of this from scratch. BugBot is a synthesis of techniques from across the industry:

| Source | Technique Borrowed |
| --- | --- |
| Cursor BugBot | Parallel passes with shuffled file ordering + majority voting for consensus |
| Meta ACH | Fault-aware test generation: every fix gets a test that would have caught it |
| Trail of Bits | Evidence-based findings with exact file:line citations, no hallucinated bugs |
| IBM ODC | Trigger coverage tracking: completeness measured by which trigger categories are covered |
| HP Defect Origins | Missing/Wrong/Unclear lens applied to every code element |
| CERT | Severity × Likelihood × Remediation Cost prioritization |
| SmartBear/Cisco | <400 LOC per agent focus, <60 minutes per pass |
| MAGIS (NeurIPS) | Developer ↔ QA feedback loop: rejection feeds back as context for next attempt |
| Ellipsis | Multi-stage filtering: confidence threshold → dedup → hallucination check |

The insight from studying all these tools is that they each attack different failure modes. No single technique dominates — you need all of them.
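To make one of these borrowed ideas concrete, here is a rough sketch of majority voting across parallel passes. The post does not describe how the vote is actually counted, so the strict-majority threshold, the `file:line` key, and the sample findings are all assumptions for illustration.

```python
from collections import Counter


def majority_findings(passes: list[set[str]]) -> set[str]:
    """Keep findings reported by a strict majority of the parallel passes."""
    counts = Counter(key for findings in passes for key in findings)
    return {key for key, n in counts.items() if n > len(passes) / 2}


# Three hypothetical passes over the same diff, each finding keyed by file:line.
passes = [
    {"cart.py:88", "auth.py:12"},
    {"cart.py:88"},
    {"cart.py:88", "auth.py:12", "views.py:301"},
]
print(sorted(majority_findings(passes)))  # ['auth.py:12', 'cart.py:88']
```

A finding that only one pass reports is not necessarily wrong, but it has to earn its place some other way, for example through stronger evidence.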

Seven Triggers for Completeness

One of the most useful frameworks I borrowed was IBM’s Orthogonal Defect Classification — specifically the concept of ODC triggers. These are the conditions that cause bugs to manifest:

| Trigger | Description |
| --- | --- |
| Simple path | Happy path, normal inputs |
| Complex path | Multi-step flows, conditional branches |
| Boundary | Edge values, empty/null/max |
| Error recovery | What happens when things fail? |
| Stress/volume | High load, large data, rapid interaction |
| Interaction | Cross-feature, cross-component effects |
| Configuration | Different settings, roles, environments |

BugBot won’t declare ALL_CLEAN until all seven triggers have been exercised. This is the key difference from “we reviewed the diff.” It’s not about lines read — it’s about which failure modes you’ve actually tested.
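The stopping rule can be sketched in a few lines. This is an illustration under assumptions: `next_action` and the `"CONTINUE"` label are made-up names, and the only facts taken from the post are the seven triggers, the ALL_CLEAN verdict, and the requirement that a pass finds zero new CRITICAL or HIGH issues.

```python
TRIGGERS = (
    "Simple path", "Complex path", "Boundary", "Error recovery",
    "Stress/volume", "Interaction", "Configuration",
)


def next_action(new_blocking_findings: int, covered: set[str]) -> str:
    """Declare ALL_CLEAN only when the pass is clean AND every trigger is covered."""
    missing = [t for t in TRIGGERS if t not in covered]
    if new_blocking_findings == 0 and not missing:
        return "ALL_CLEAN"
    return "CONTINUE"


# A clean pass with only 5 of 7 triggers covered still has to keep going.
covered_so_far = {"Simple path", "Complex path", "Boundary",
                  "Stress/volume", "Interaction"}
print(next_action(0, covered_so_far))  # CONTINUE
print(next_action(0, set(TRIGGERS)))   # ALL_CLEAN
```

Requiring both conditions is what separates "the last pass was quiet" from "every failure mode has been looked at".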

Confidence Scoring Every Finding

Not all bugs are equal, and not all bug reports are equally credible. BugBot requires every finding to be scored on two axes, severity (S1 to S3) and confidence (C1 to C3), before it gets acted on:

| Severity \ Confidence | C3 Confirmed | C2 Probable | C1 Possible |
| --- | --- | --- | --- |
| S3 Critical | CRITICAL (fix before merge) | HIGH (fix before merge) | MEDIUM |
| S2 Moderate | HIGH (fix before merge) | MEDIUM | LOW |
| S1 Minor | MEDIUM | LOW | INFO |

Only CRITICAL and HIGH findings block the ALL_CLEAN promise. This stops the tool from generating a wall of low-confidence noise and forcing you to fix speculative issues before shipping.
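The matrix is small enough to write down as a lookup table. The sketch below copies the nine cells from the table above; the `PRIORITY` and `blocks_all_clean` names are illustrative assumptions rather than BugBot's own identifiers.

```python
# (severity, confidence) -> priority, transcribed from the scoring matrix.
PRIORITY = {
    ("S3", "C3"): "CRITICAL", ("S3", "C2"): "HIGH",   ("S3", "C1"): "MEDIUM",
    ("S2", "C3"): "HIGH",     ("S2", "C2"): "MEDIUM", ("S2", "C1"): "LOW",
    ("S1", "C3"): "MEDIUM",   ("S1", "C2"): "LOW",    ("S1", "C1"): "INFO",
}
BLOCKING = {"CRITICAL", "HIGH"}


def priority(severity: str, confidence: str) -> str:
    """Look up the priority for a scored finding; unknown labels raise KeyError."""
    return PRIORITY[(severity, confidence)]


def blocks_all_clean(severity: str, confidence: str) -> bool:
    """Only CRITICAL and HIGH findings stop the loop from declaring ALL_CLEAN."""
    return priority(severity, confidence) in BLOCKING


print(priority("S3", "C2"))          # HIGH
print(blocks_all_clean("S3", "C1"))  # False
```

Note that a critical-severity finding at C1 confidence lands on MEDIUM, so a speculative report never blocks a merge on its own.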

The evidence requirement is strict: every finding needs an exact file:line, a code snippet, a trigger scenario, and a Missing/Wrong/Unclear classification. No handwavy “this might be a bug.” Concrete or it’s downgraded.
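One way to enforce that rule is to make the evidence fields part of the finding itself. The sketch below is hypothetical: the `Finding` fields mirror the four requirements listed above, but the post does not say how far a finding is downgraded, so dropping confidence to C1 is an assumption. The location pattern is also simplified and would reject paths containing spaces or drive letters.

```python
import re
from dataclasses import dataclass, replace

LOCATION = re.compile(r"^[^\s:]+:\d+$")  # e.g. "src/cart.py:88"
DEFECT_TYPES = {"Missing", "Wrong", "Unclear"}


@dataclass(frozen=True)
class Finding:
    severity: str          # "S1", "S2" or "S3"
    confidence: str        # "C1", "C2" or "C3"
    location: str          # exact file:line
    snippet: str           # the offending code
    trigger_scenario: str  # how the bug manifests
    defect_type: str       # Missing / Wrong / Unclear


def has_full_evidence(f: Finding) -> bool:
    """True when all four pieces of required evidence are present."""
    return bool(
        LOCATION.match(f.location)
        and f.snippet.strip()
        and f.trigger_scenario.strip()
        and f.defect_type in DEFECT_TYPES
    )


def enforce_evidence(f: Finding) -> Finding:
    """Assumed downgrade rule: incomplete evidence drops confidence to C1."""
    return f if has_full_evidence(f) else replace(f, confidence="C1")


# Hypothetical finding with no line number and no snippet.
vague = Finding("S3", "C3", "cart.py", "", "might break on empty cart", "Wrong")
print(enforce_evidence(vague).confidence)  # C1
```

Under the scoring matrix, a C1 finding tops out at MEDIUM, so under this assumed rule an unsupported claim cannot block ALL_CLEAN however severe it sounds.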

The Attack Angle Catalog

Each iteration picks 4–5 untried angles from a catalog of 28. The catalog covers seven categories:

  • Cross-feature interactions — what other features read data this one writes?
  • Data integrity — round-trip consistency, NULL vs empty vs missing, type coercions
  • Client-server contract — response shape, validation mismatch, error path UX
  • Security & safety — XSS, authorization gaps, audit trail completeness
  • Template & display — missing variables, i18n, accessibility
  • Edge cases & stress — empty state, max capacity, concurrent editing
  • Ecosystem impact — search index, exports, API consumers
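Selecting angles for an iteration might look like the sketch below. It assumes angles are numbered 1 to 28 and chosen at random from those not yet tried; the post does not say how the choice is actually made, so `pick_angles` and its parameters are illustrative only.

```python
import random


def pick_angles(tried, catalog_size=28, k=5, rng=None):
    """Pick up to k angle numbers that no earlier iteration has used."""
    rng = rng or random.Random()
    untried = [a for a in range(1, catalog_size + 1) if a not in tried]
    # Near the end of the catalog there may be fewer than k angles left.
    return sorted(rng.sample(untried, min(k, len(untried))))


tried_so_far = {1, 2, 3, 5, 6, 22, 23}  # angles recorded in the state file
print(pick_angles(tried_so_far, rng=random.Random(7)))
```

A smarter selector could weight the draw towards angles that exercise still-uncovered triggers, which would help the loop reach full coverage sooner.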

Critically, each agent reviews the files in its own shuffled order. This is borrowed from Cursor BugBot’s research, which showed that reading order affects which patterns an agent notices first.
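Per-agent shuffling is simple to sketch with the standard library. The helper below is an assumption about how it could be done, not a description of Cursor's or BugBot's implementation; it retries a shuffle when two agents would otherwise get the same order, as long as enough distinct orders exist.

```python
import math
import random


def shuffled_orders(files, n_agents, seed=None):
    """Give each agent its own reading order over the same files."""
    rng = random.Random(seed)
    max_distinct = math.factorial(len(files))
    seen = set()
    orders = []
    while len(orders) < n_agents:
        order = list(files)
        rng.shuffle(order)
        key = tuple(order)
        # Retry on a repeated order unless every permutation is already taken.
        if key in seen and len(seen) < max_distinct:
            continue
        seen.add(key)
        orders.append(order)
    return orders


# Hypothetical file names for illustration.
files = ["models.py", "views.py", "serializers.py", "tasks.py"]
for agent, order in enumerate(shuffled_orders(files, n_agents=3, seed=1), start=1):
    print(f"agent {agent}: {order}")
```

Passing a seed keeps a run reproducible, which helps when a finding needs to be traced back to the pass that produced it.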

What a Typical Run Looks Like

Iteration 1: angles 1, 5, 6     →  2 HIGH bugs found  →  fix + regression tests  →  continue
Iteration 2: angles 3, 9, 12    →  0 CRITICAL/HIGH    →  5/7 triggers covered    →  continue
Iteration 3: angles 11, 14, 16  →  0 CRITICAL/HIGH    →  7/7 triggers covered    →  ALL_CLEAN

Three iterations. Two bugs fixed. Two regression tests written. The loop does the work.

What I Learned

Building BugBot taught me that completeness is the hard problem in code review, not thoroughness on the happy path. Most reviews are thorough on the obvious path. They fall down at interaction effects, error handling, and configuration edge cases — exactly the places bugs hide in production.

The ODC trigger framework gives you a way to know when you’re done. Not “when you feel done” — when specific failure mode categories have been exercised. That’s a meaningful standard.


Part 2 covers what happened when I ran this in practice: why fresh context per iteration turned out to be a feature, how consensus signals emerged from independent agents, and the lessons I’d apply to any code review process — with or without a tool.