OpenClaw Auto Review Skill¶
Type: Agent skill specification
Source: GitHub — openclaw/agent-skills
Date: 2026-06-04
What It Is¶
A structured code review system that runs as a closeout check before commit or ship. This is code review, not approval routing — the output is advisory, never authoritative. Codex is the default engine, Claude is optional, and multi-reviewer panels require explicit opt-in.
Review Modes¶
| Mode | Flag | When to Use |
|---|---|---|
| Local | --mode local | Dirty unstaged/staged/untracked changes in current checkout |
| Branch | --mode branch --base origin/main | PR or branch diff against a base |
| Commit | --mode commit --commit HEAD | Already-landed or pushed single changes |
The helper auto-detects the best mode: checks for dirty local changes first, then current PR base via gh pr view, then falls back to origin/main.
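The detection order can be sketched as a small pure function. This is an illustration, not the helper's actual code: the name `choose_review_args` is hypothetical, the two inputs are assumed to be computed elsewhere (for example from `git status --porcelain` and `gh pr view`), and the fallback is assumed to be branch mode against origin/main.

```python
from typing import List, Optional


def choose_review_args(has_dirty_changes: bool, pr_base: Optional[str]) -> List[str]:
    """Return helper flags following the documented detection order."""
    if has_dirty_changes:
        # Dirty staged, unstaged, or untracked files take priority.
        return ["--mode", "local"]
    if pr_base:
        # An open PR supplies the base ref (hypothetical input, e.g. "origin/dev").
        return ["--mode", "branch", "--base", pr_base]
    # Assumed fallback: diff the current branch against origin/main.
    return ["--mode", "branch", "--base", "origin/main"]
```

Keeping the decision separate from the git and gh calls makes the ordering easy to test without a real checkout.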
The Contract (Core Rules)¶
- Advisory output — Never blindly apply review findings. Verify every finding by reading the real code path and adjacent files.
- Reject speculative risks — Unrealistic edge cases, broad rewrites, and over-complicated fixes are rejected outright.
- Small targeted fixes — Prefer fixes at the right ownership boundary. No refactors unless they clearly improve the bug class.
- Fix bug classes, not individual bugs — When a finding shows a repeated pattern, inspect the current PR scope for sibling instances and fix them all at once.
- Keep going until clean — Structured review must return no accepted/actionable findings before stopping.
- Rerun after changes — If a review-triggered fix changes code, rerun focused tests and the structured review helper.
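Taken together, the rules above describe a loop: review, verify each finding, fix what is accepted, rerun tests, and review again. A minimal sketch, assuming the four callables are supplied by the calling agent; the function name and the `max_rounds` budget are hypothetical and not part of the skill:

```python
from typing import Callable, List


def closeout_loop(
    run_review: Callable[[], List[dict]],
    verify: Callable[[dict], bool],
    apply_fix: Callable[[dict], None],
    run_tests: Callable[[], bool],
    max_rounds: int = 5,
) -> bool:
    """Repeat review -> verify -> fix -> retest until no finding is accepted."""
    for _ in range(max_rounds):
        # Findings are advisory: only verified ones count as accepted.
        accepted = [f for f in run_review() if verify(f)]
        if not accepted:
            return True  # clean: nothing actionable remains
        for finding in accepted:
            apply_fix(finding)
        if not run_tests():
            return False  # a fix broke the focused tests; stop and investigate
    return False  # round budget exhausted without a clean review
```

Rejected findings never trigger a fix, which mirrors the "never blindly apply" rule.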
Security Integration¶
Security perspective is always included in reviews. However, it should not cripple legitimate functionality. Report security findings only when:
- The change creates a concrete, actionable risk
- The change removes an important safety check
Suppressed findings remain in the structured output, and the active output always carries a suppression notice that cannot itself be suppressed. An aggregate finding cannot hide an unrelated active risk.
Regression Provenance¶
When reviewing code, the skill tracks blame across multiple roles:
- Blamed code author
- Blamed PR author
- PR merger/committer
- Current PR author
- PR/date metadata
If no blamed PR is traceable, the blamed commit becomes the provenance (SHA, date, author). If the blamed PR was merged by automation (clawsweeper[bot]), the skill identifies the human trigger when practical — looking for maintainer commands like @clawsweeper automerge or /landpr.
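The bot-merge case can be illustrated with a short sketch. The `Comment` type, the function name, and the choice to scan comments newest-first are assumptions made for the example; only the bot login and the two maintainer commands come from the text above.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Maintainer commands named in the text above.
TRIGGER_RE = re.compile(r"@clawsweeper automerge|/landpr")
BOT_LOGIN = "clawsweeper[bot]"


@dataclass
class Comment:
    author: str
    body: str


def human_trigger(merged_by: str, comments: List[Comment]) -> Optional[str]:
    """Return the human behind a merge, or None when no trigger is traceable."""
    if merged_by != BOT_LOGIN:
        return merged_by  # merged by a person directly
    # Assumption: the most recent matching command is the likeliest trigger.
    for comment in reversed(comments):
        if comment.author != BOT_LOGIN and TRIGGER_RE.search(comment.body):
            return comment.author
    return None
```

Returning None rather than guessing matches the "when practical" wording: the provenance record simply omits the human trigger.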
Multi-Model Review¶
Codex is the default and should remain the normal final closeout engine. Claude is available as an alternative.
Panel mode (opt-in):
autoreview --panel # Codex + Claude
autoreview --reviewers codex,claude --model codex=gpt-5.1 --thinking codex=high
Panels run multiple reviewers against one frozen bundle. The main agent still verifies every accepted finding before fixing. Panels are used when explicitly requested or when risk justifies the extra spend.
Parallel Closeout¶
Tests and review can run in parallel:
autoreview --parallel-tests "pytest tests/ -x"
Tradeoff: tests may force code changes that make the review stale. If either side leads to edits, rerun the affected side until no accepted findings remain.
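The idea behind parallel closeout can be sketched with the standard library. This is not how the helper implements `--parallel-tests`; the function names are hypothetical and both sides are passed in as plain callables.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple


def parallel_closeout(
    run_tests: Callable[[], bool],
    run_review: Callable[[], List[dict]],
) -> Tuple[bool, List[dict]]:
    """Run tests and review concurrently and return both results."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        tests_future = pool.submit(run_tests)
        review_future = pool.submit(run_review)
        return tests_future.result(), review_future.result()


def needs_rerun(tests_passed: bool, accepted_findings: List[dict]) -> bool:
    """Either a test failure or an accepted finding implies edits, so rerun."""
    return (not tests_passed) or bool(accepted_findings)
```

A caller would loop on `needs_rerun` after applying edits, since a change prompted by one side can invalidate the other side's result.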
The Helper CLI¶
The autoreview helper manages:
- Target selection (auto-detects dirty/branch/commit)
- Engine selection (codex, claude, droid, copilot)
- Structured validation of engine output
- Parallel test execution
- Exit status (0 = clean, nonzero = findings present)
- Heartbeat lines for long-running reviews (up to 30 minutes)
Key behaviors:
- Writes to stdout unless --output or --json-output is set
- Supports --dry-run, --prompt-file, --dataset, --no-tools, --no-web-search
- Allows read-only tools and web search by default
- Forbids nested review from inside the review
- Prints autoreview clean when exit 0 with no findings
Final Report Requirements¶
Every review must include:
1. Review command used
2. Tests/proof run
3. Findings accepted/rejected (briefly why)
4. The clean review result from the final run, or why a remaining finding was consciously rejected
Relevance to Factory Patterns¶
This skill defines a concrete pattern for advisory code review as a pipeline closeout step. Key takeaways for factory design:
- Review is advisory, not gating — it informs decisions without blocking them
- Verify then fix — every finding requires code-path verification before action
- Bug-class thinking — find the pattern, fix the class, stop at boundaries
- Multi-model as opt-in — additional reviewers are a cost/risk tradeoff, not a default
- Structured output — the helper enforces a contract on review results (accepted/rejected/clean)
Factory Pipeline Evaluation (Dark Factory)¶
Our Current Pipeline¶
Router → Carson (research) → Amelia (build) → QA (test) → Phil (deploy)
QA subphases in web-build:
- 6.1-deploy-qa — Deploy to QA env, CSP audit, Playwright console errors, DOM snapshot, story cross-check, interaction tests, screenshot
- 6.2-qa-review — Review all test results, verify stories complete, check design match, QA report
- 6.3-adversarial-qa — BMAD cynical outsider review (edge cases, failure modes, assumptions)
- 6.4-django-admin-qa — Django admin verification (when applicable)
Existing quality gates: bash scripts in factory/scripts/gates/ that check integration-report.json status, db smoke tests, API 200s. Git hooks: pre-commit file size check. Amelia's BUILD gate: 8/10 code quality score.
Q1: Could autoreview replace or augment QA?¶
Augment, not replace. The QA agent does browser-based functional testing (CSP, Playwright, DOM, interactions). Autoreview does static code analysis. They operate on different axes:
| Concern | Current QA | Autoreview |
|---|---|---|
| Browser rendering | ✅ Playwright DOM snapshot | ❌ |
| CSP/security headers | ✅ csp-audit.sh | ⚠️ Code-level only |
| Functional bugs | ✅ Interaction tests | ❌ |
| Code quality | ⚠️ Manual review in 6.2 | ✅ Structured findings |
| Logic errors | ❌ | ✅ |
| Security vulnerabilities | ⚠️ BMAD 6.3 only | ✅ Always included |
| Regression tracing | ❌ | ✅ Provenance tracking |
| Edge case detection | ⚠️ BMAD adversarial | ✅ Per-file analysis |
Recommendation: Insert autoreview as subphase 6.0-closeout-review between BUILD completion and QA entry. This catches code-level issues before the QA agent spends time on browser testing. The QA agent then validates behavior, not code structure.
Q2: Would it work with LLM-written code?¶
Yes, and it is arguably the ideal use case. The autoreview contract is indifferent to who wrote the code: its output is advisory, and the consuming agent (Amelia, QA, or the Router) verifies each finding before acting.
Key fit with LLM agent output:
- LLMs produce pattern-based bugs — The "fix bug classes, not individual bugs" principle directly addresses LLM tendencies (e.g., always forgetting error handling, always using the same anti-pattern)
- Advisory = self-correction loop — Amelia could run autoreview on her own output as a pre-commit check, fixing findings before the DONE marker
- No approval bottleneck — Advisory output means no human-in-the-loop for code review; the agent verifies and acts autonomously
- LLM review of LLM code — This is meta but valid. Different models catch different things (Claude vs Codex reasoning patterns)
Q3: Codex engine compatibility¶
The Codex CLI is not installed on this machine. The claude CLI is available at /opt/homebrew/bin/claude.
Options:
- Use --engine claude — Claude CLI works, just not the default engine
- Install Codex — Would need npm install -g @openai/codex or equivalent
- Use --engine copilot — If GitHub Copilot CLI is available
Recommendation: Start with --engine claude since it's already installed. Codex is preferred but not required. The structured validation schema is engine-agnostic.
Q4: Setup effort¶
Minimal. The autoreview helper is a single Python script (~600 lines) with these dependencies:
- git (installed ✅)
- gh (GitHub CLI — check if installed)
- One review engine CLI (claude ✅)
- Python 3 (installed ✅)
No need to install the full OpenClaw agent-skills repo. We can:
1. Copy just the skills/autoreview/scripts/autoreview file
2. Or adapt the pattern: build a lightweight wrapper that calls claude with the structured review prompt and validates the JSON schema output
Pattern adaptation is cleaner than full adoption. The core value is:
1. Build a git diff bundle (local/branch/commit modes)
2. Send to an LLM with the structured review prompt
3. Validate output against the JSON schema
4. Return findings with exit status
This is ~100 lines of Python to implement from scratch using the claude CLI.
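The four steps above can be sketched as follows. Everything here is an assumption made for illustration: the finding keys in `REQUIRED_KEYS` are a hypothetical schema, not the one in the OpenClaw repo, the git commands are one plausible way to build each diff, and the engine is an injected callable so that no particular CLI invocation is implied.

```python
import json
from typing import Callable, Dict, List

# One plausible git command per review mode; the real helper's bundle may differ.
# Note: `git diff HEAD` does not include untracked files.
DIFF_COMMANDS: Dict[str, List[str]] = {
    "local": ["git", "diff", "HEAD"],
    "branch": ["git", "diff", "origin/main...HEAD"],
    "commit": ["git", "show", "HEAD"],
}

REQUIRED_KEYS = {"file", "line", "severity", "message"}  # hypothetical schema


def validate_findings(raw: str) -> List[dict]:
    """Parse engine output and reject anything that is not a list of findings."""
    data = json.loads(raw)
    if not isinstance(data, list):
        raise ValueError("expected a JSON list of findings")
    for item in data:
        if not isinstance(item, dict):
            raise ValueError("each finding must be a JSON object")
        missing = REQUIRED_KEYS - set(item)
        if missing:
            raise ValueError(f"finding missing keys: {sorted(missing)}")
    return data


def review(diff: str, prompt: str, run_engine: Callable[[str], str]) -> int:
    """Send the diff to an engine, validate the reply, return an exit status."""
    findings = validate_findings(run_engine(f"{prompt}\n\n{diff}"))
    if not findings:
        print("autoreview clean")
        return 0
    for f in findings:
        print(f"{f['file']}:{f['line']} [{f['severity']}] {f['message']}")
    return 1
```

In a real wrapper, `diff` would come from running one of the `DIFF_COMMANDS` entries with `subprocess.run`, and `run_engine` would shell out to the chosen review engine.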
Q5: Overlap with existing tooling¶
| Existing Tool | What It Does | Autoreview Overlap |
|---|---|---|
| QA agent (6.2) | Test result review, story completion | Low — different scope (behavior vs code) |
| BMAD adversarial (6.3) | Cynical outsider review | Medium — both find bugs, but BMAD is broader |
| lint.py | KB article linting (frontmatter, links) | None — different domain |
| Gate scripts | Integration smoke tests, status checks | None — different layer |
| Git pre-commit hooks | File size check | Low — autoreview is deeper |
| Amelia's self-test | Agent writes + tests code | Complementary — autoreview is a second opinion |
Biggest overlap: BMAD adversarial (6.3). Both find edge cases and failure modes. But autoreview is code-focused and structured (JSON output with file paths and line numbers), while BMAD is broader and more narrative. They could stack: autoreview for code-level findings, BMAD for system-level assumptions.
Integration Proposal¶
BUILD (Amelia) → [6.0-closeout-review] → QA (6.1-6.4) → DEPLOY (Phil)
                          ↑
                 autoreview --engine claude
                   --mode branch --base origin/main
                   --parallel-tests "cd backend && python manage.py test"
The closeout review runs on the branch diff and catches code-level issues, so the QA agent can focus on browser-based validation. Autoreview's output itself remains advisory; the blocking behavior is a factory gate layered on top of it. If autoreview reports Critical/Major issues, the gate blocks QA entry and sends the findings back to Amelia.
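The proposed gate can be sketched as a single decision function. The `severity` field, its lowercase values, and the shape of the returned dictionary are all hypothetical; the sketch only shows the blocking rule described above.

```python
from typing import Dict, List

# Severities that block QA entry in this proposal.
BLOCKING_SEVERITIES = {"critical", "major"}


def gate_qa_entry(findings: List[Dict[str, str]]) -> Dict[str, object]:
    """Decide whether the build may enter QA or must return to the builder."""
    blocking = [
        f for f in findings
        if f.get("severity", "").lower() in BLOCKING_SEVERITIES
    ]
    return {
        "allow_qa": not blocking,
        "return_to": "Amelia" if blocking else None,
        "blocking": blocking,
    }
```

Lower-severity findings pass through without blocking, so they can still be reported without holding up QA.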
Related¶
- advisory-code-review — The advisory review pattern
- review-as-closeout-check — Review as pipeline closeout step
- multi-model-review — Multi-model review panels
- regression-provenance — Blame tracking across PRs and commits
- quality-gates — Factory quality gate patterns
- ralph-protocol — Retry and learn on failure
- three-questions-test — Does autoreview pass the delegation test?
Bibliography¶
- OpenClaw agent-skills repository: https://github.com/openclaw/agent-skills
- autoreview SKILL.md: https://github.com/openclaw/agent-skills/blob/main/skills/autoreview/SKILL.md
- autoreview helper script: https://github.com/openclaw/agent-skills/blob/main/skills/autoreview/scripts/autoreview