Multi-Factory Comparison
As AI-assisted development matures, a handful of distinct "factory" patterns have emerged: systems that orchestrate multiple AI agents to plan, implement, test, and ship software with varying degrees of autonomy, structure, and human oversight. This article compares seven such systems across ten operational dimensions: work tracking, persistence, quality gates, multi-agent roles, memory, handoff protocols, observability, escalation, pipeline structure, and human-in-the-loop design.
The seven factories span a wide spectrum: an opinionated internal methodology (Kelly), an open-source orchestration framework (Gas Town), IDE-native agent platforms (Windsurf, Cursor), a CLI-first autonomous coding tool (Claude Code), a GitHub-integrated agent service (GitHub Copilot), and a solo-founder AI CEO experiment (Yuki Capital). Each represents a different bet on where the tradeoffs between structure, flexibility, autonomy, and human control should land.
Kelly Router
The Kelly Router is an internal dark factory methodology implemented in the OpenClaw ecosystem. It takes product ideas through a six-stage pipeline — Intake → Research → Planning → Implementation → Testing → Release — with structured gate files between each phase. The router itself never does the work; it routes tasks to specialized lead agents (research-lead, project-lead, test-lead), which each spawn their own ephemeral sub-agents to execute in parallel. Quality gates between stages use explicit READY/NOT-READY or PASS/FAIL signals stored as text files; the final release gate requires a human SHIP or NO-SHIP decision. Kelly's TEA (Test, Evaluate, Assess) audit at the testing stage produces three possible outcomes — PASS, PASS-WITH-FOLLOWUPS, or REMEDIATE — giving the factory nuanced quality signals beyond binary go/no-go.
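The gate-file mechanism described above can be sketched in a few lines of Python. This is an illustration only, not Kelly's actual code: the file layout, the function names (`read_gate`, `may_advance`), and the choice to treat PASS-WITH-FOLLOWUPS as an advancing signal are all assumptions, since the article only states that gates are text files carrying these signals.

```python
from pathlib import Path

# Signals named in the article. How each one is acted on is an assumption:
# here PASS-WITH-FOLLOWUPS lets the pipeline advance, REMEDIATE blocks it.
ADVANCE_SIGNALS = {"READY", "PASS", "PASS-WITH-FOLLOWUPS", "SHIP"}
BLOCK_SIGNALS = {"NOT-READY", "FAIL", "REMEDIATE", "NO-SHIP"}


def read_gate(gate_file: Path) -> str:
    """Return the signal on the first non-empty line of a gate file."""
    for line in gate_file.read_text().splitlines():
        if line.strip():
            return line.strip().upper()
    raise ValueError(f"gate file {gate_file} is empty")


def may_advance(gate_file: Path) -> bool:
    """True if the gate allows the next stage to start, False if it blocks."""
    signal = read_gate(gate_file)
    if signal in ADVANCE_SIGNALS:
        return True
    if signal in BLOCK_SIGNALS:
        return False
    # Unknown signals are rejected rather than guessed at.
    raise ValueError(f"unknown gate signal: {signal!r}")
```

Keeping the signal on the first line leaves the rest of the file free for human-readable notes, such as the follow-up items a PASS-WITH-FOLLOWUPS audit would list.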
The factory has two speed paths: a full pipeline for new products and a quick path for features and bug fixes (which skip research). Work items are tracked in structured directories per project (research-artifacts/, planning-artifacts/, etc.) with summary gate files at each stage. Persistence across sessions relies on files written to disk (memory for long-term curated memory, daily session logs, pipeline state for machine-readable state). The Kelly gap analysis acknowledges that its file-based tracking is less powerful than Gas Town's git-versioned SQL-backed Beads, but the structure and semantics are well-developed. Kelly's RALPH (Retry And Learn Protocol) governs sub-agent failures: three attempts max, same error twice means escalate immediately.
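The RALPH rule (three attempts at most, escalate as soon as the same error appears twice) maps naturally onto a small retry loop. The sketch below is a hypothetical rendering in Python: `run_with_ralph` and `Escalation` are invented names, and comparing errors by exception type and message is an assumption about what "same error" means.

```python
from typing import Callable, Set


class Escalation(Exception):
    """Raised when the task must be handed to the operator."""


def run_with_ralph(task: Callable[[], str], max_attempts: int = 3) -> str:
    """Run task, retrying on failure under the RALPH limits."""
    seen_errors: Set[str] = set()
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception as exc:
            # Assumption: two errors are "the same" if type and message match.
            fingerprint = f"{type(exc).__name__}: {exc}"
            if fingerprint in seen_errors:
                raise Escalation(
                    f"same error twice on attempt {attempt}: {fingerprint}"
                ) from exc
            seen_errors.add(fingerprint)
    raise Escalation(f"no success after {max_attempts} attempts")
```

A sub-agent that fails identically twice escalates after the second attempt instead of using its third, which is the point of the rule: repeating an approach that has already failed in the same way is unlikely to help.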
Windsurf CUA (Cascade)
Windsurf, Codeium's AI-native IDE, centers on Cascade, a context-aware AI agent that works across an entire codebase. In 2025, Windsurf 2.0 introduced the Agent Command Center: a Kanban-style interface showing all agent sessions (local Cascade sessions and cloud Devin sessions) in a single view, grouped by status (in flight, blocked, ready for review). Work is organized into Spaces, which group agent sessions, PRs, files, and context for a specific task or project. The Agent Command Center supports parallel agents: concurrent execution across Git Worktrees and multi-pane/multi-tab Cascade sessions, with a dedicated terminal for reliable command execution. Cloud Devin agents generate demos and screenshots of their work for human review, and local-to-cloud handoff lets you move a session to the cloud so that it keeps running while you are offline. The changelog notes that agent sessions persist across editor restarts, as does the workspace setup (panes, agent tiles, keybindings).
Windsurf's multi-agent workflow is built around Cascade's Compose model: multiple specialized agents can be spawned within a project, each working on a different layer (frontend, backend, DB). The platform has a plugin marketplace for MCPs, skills, and subagents, plus private team marketplaces for internal plugins. Wave 13 (December 2025) introduced parallel agents and made SWE-1.5 available at standard throughput for free. InfoQ's review of Cursor 3 notes that Windsurf was the first IDE to market with a purpose-built agent command surface, and that its Cascade agent is repo-aware, with scoped responsibilities and project context, which makes it better suited than flat autocomplete tools to coordinating multi-part changes.
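As a rough mental model of the Kanban view, grouping sessions by status is a simple bucketing operation. The Python sketch below is purely illustrative and does not reflect Windsurf's data model or any Windsurf API: `AgentSession`, `build_board`, and the field names are hypothetical, and only the three status labels come from the description above.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List

# Column names taken from the article; everything else is hypothetical.
COLUMNS = ("in flight", "blocked", "ready for review")


@dataclass(frozen=True)
class AgentSession:
    name: str
    location: str  # "local" (Cascade) or "cloud" (Devin)
    status: str    # one of COLUMNS


def build_board(sessions: Iterable[AgentSession]) -> Dict[str, List[str]]:
    """Group session names into Kanban columns, keeping column order fixed."""
    board: Dict[str, List[str]] = {column: [] for column in COLUMNS}
    for session in sessions:
        if session.status not in board:
            raise ValueError(f"unknown status: {session.status!r}")
        board[session.status].append(f"{session.name} ({session.location})")
    return board
```

Local and cloud sessions land in the same columns, which mirrors the single-view idea: the board is organized by what needs attention, not by where the agent runs.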
Claude Code (Anthropic)
Claude Code is Anthropic's terminal/IDE-native agentic coding environment, built on the Claude Sonnet 4.5 model. It was recently upgraded with a native VS Code extension (beta), checkpoints for autonomous operation, subagents for parallel workflows, hooks for triggering actions at specific points, and background tasks for non-blocking dev servers. The key architecture is the subagent system: the main agent can spawn sub-agents that run in their own context with their own allowed tools — enabling parallel execution (e.g., building a backend API while the main agent builds the frontend). Hooks can automatically trigger actions at defined points (post-change test runs, pre-commit linting). Checkpoints automatically save code state before each change; two taps of Esc or /rewind restores prior state. The checkpoint + subagent combination is the foundation for confident autonomous operation: pursue wide-scale refactors knowing you can always rewind.
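The checkpoint idea (save state before each change, restore it on demand) can be illustrated with a small snapshot stack. This is a conceptual sketch in Python, not how Claude Code implements checkpoints: the `CheckpointedWorkspace` class and its methods are hypothetical, and it snapshots only the single file touched by each write.

```python
from pathlib import Path
from typing import List, Optional, Tuple


class CheckpointedWorkspace:
    """Records a file's prior content before every write so it can be undone."""

    def __init__(self, root: Path) -> None:
        self.root = root
        # Each entry is (relative path, content before the write or None).
        self._checkpoints: List[Tuple[str, Optional[str]]] = []

    def write(self, relative_path: str, content: str) -> None:
        target = self.root / relative_path
        before = target.read_text() if target.exists() else None
        self._checkpoints.append((relative_path, before))
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(content)

    def rewind(self, steps: int = 1) -> None:
        """Undo the most recent writes, newest first."""
        for _ in range(min(steps, len(self._checkpoints))):
            relative_path, before = self._checkpoints.pop()
            target = self.root / relative_path
            if before is None:
                # The file did not exist before this write, so remove it.
                target.unlink(missing_ok=True)
            else:
                target.write_text(before)
```

Because every write is preceded by a snapshot, a wide refactor can be attempted and then unwound step by step, which is the property the paragraph above credits for confident autonomous operation.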
The Claude Agent SDK (formerly Claude Code SDK) exposes the same core tools, context management, and permissions frameworks for custom agentic experiences. It supports subagents and hooks natively, enabling teams to build specialized agents (financial compliance, cybersecurity, code debugging). Persistence is session-based: Claude Code stores conversation context and can read/write files locally. There is no built-in structured work-tracking database — tasks are tracked conversationally or via external tools. Quality gates are implicit: Claude Code best practices emphasize giving the agent verification criteria (tests, screenshots, expected outputs) so it can self-check. Human oversight is maintained through the checkpoint/rewind mechanism and explicit approval points for high-stakes actions. Anthropic's recommended workflow separates exploration (Plan Mode), planning, implementation, and commit — with Plan Mode preventing premature coding on poorly-understood problems.
Feature Comparison Table
| Dimension | Kelly | Gas Town | Yuki Capital | Windsurf CUA | Cursor | Claude Code | GitHub Copilot |
|---|---|---|---|---|---|---|---|
| **Work tracking** | Structured project directories + pipeline state per project | Beads (git-versioned SQL-queryable issue-trackers in Dolt) | Per-business GitHub repos; todo.md tagged by owner; decisions/ folder for institutional memory | Agent Command Center Kanban view + Spaces (grouping of sessions, PRs, files) | Agent sidebar with session tiles; `.agent.md` files; persistent pane layouts | Conversational / checkpoint-based; no built-in structured tracker; relies on external tools | GitHub Issues and PRs (native); agents panel for status |
| **Persistence** | Files on disk (memory, daily logs, pipeline state); session-based continuity | Beads in Dolt (git-versioned SQL); full work history queryable across sessions and projects | GitHub repo as persistent brain (CLAUDE.md, authority.md, decisions/, businesses/, todo.md); per-business Claude Code instances; n8n for scheduled automation | Sessions persist across editor restarts; Spaces persist project context; cloud Devin runs async | Sessions persist; cloud/local handoff preserves context | Checkpoints (code state per change); session context; SDK state | GitHub infrastructure (branches, commits, PRs); agents panel state |
| **Quality gates** | TEA audit (Test/Evaluate/Assess → PASS/PASS-WITH-FOLLOWUPS/REMEDIATE); READY/NOT-READY and PASS/FAIL gate files; 5-agent verdict adversarial review | Witness (continuous quality auditor watching all workers); multi-agent adversarial review; Bead-state quality gate results | Board reviews (quarterly); 30-day outcome reviews; mistake log for learnings; informal compared to Kelly | Implicit via Devin demo/screenshots for human review; limited formalized gate | Implicit via PR review; community patterns (e.g., copilot-orchestra) define multi-agent review steps | Self-verification via explicit test/screenshot criteria; no formal gate; checkpoint rollback on failure | Draft PR as gate; human reviews before merge; MCP-based test/lint runs pre-PR |
| **Multi-agent roles** | Named lead agents (research-lead, project-lead, test-lead); ephemeral sub-agents; router never does work | Mayor (orchestrator/control plane), Crew (named persistent agents), Polecats (ephemeral workers), Refinery, Witness, Deacon, Boot | Single persistent CEO agent (Judy Win) + per-product Claude Code agents; per-business CLAUDE.md context files; no dedicated orchestrator beyond CEO | Cascade (primary agent) + Devin (cloud); multiple parallel Cascade sessions possible | Agent-first interface; `.agent.md` defined roles; Composer 2 (frontier model) for orchestration | Main agent + sub-agents; hooks; Claude Agent SDK for custom role definitions | Copilot as coding agent; no named roles beyond the agent itself; MCP servers as extensions |
| **Memory** | memory (curated long-term), memory/YYYY-MM-DD.md (daily logs); pipeline state; narrative-rich but not machine-queryable | Beads/Dolt MEOW graph (versioned knowledge graph with typed edges); reason field captures Why per bead; SQL-queryable | GitHub repo + CLAUDE.md + decisions/ + learnings/ + per-business context files; narrative-dominant; progressive disclosure (CLAUDE.md shrank 36%) | Session-scoped context; Spaces group project context; no long-term cross-project memory described | Session context preserved; cloud sessions accessible across devices; no cross-project memory | Checkpoint history per session; SDK state; no built-in long-term memory protocol | GitHub Issues and PR history; repository context via MCP; no cross-repository memory |
| **Handoff protocol** | Artifact directory summaries (research-summary.md, planning-summary.md, etc.) read by next-stage agent before proceeding | Beads as universal data plane; Mayor's handoff to Refinery → Polecats; explicit `gt handoff` command after every task | Per-business repo handoff; per-business CLAUDE.md read per task; todo.md tagged by owner; n8n for scheduled automated handoffs | Agent Command Center status transitions; local-to-cloud and cloud-to-local session handoff; one-click Devin handoff | Agent session tiles; drag-and-drop between panes; session handoff between local and cloud | Subagent spawn with explicit task definition and output directory; SDK handoff via structured task objects | Issue/PR assignment to Copilot; draft PR as the handoff artifact to human reviewer |
| **Observability** | pipeline state (machine), done markers (human), heartbeat (liveness), memory logs; file-based retrospective | Light Factory model: all workers visible and addressable; Dolt SQL queries for real-time state; hook age for liveness; Deacon tracks stuck workers | Screen tracking for founder's work visibility; mistake log; board review docs; GitHub commit history; n8n execution logs | Agent Command Center Kanban (in flight, blocked, needs review); Spaces give project-level view; real-time agent status | Agent sidebar showing all running agents; inline diffs in VS Code sidebar; pane-level status; screenshots/demos from cloud agents | Checkpoint listing; status line for context usage; `/rewind` to restore prior state; CLI output | Agents panel (github.com/copilot/agents) with real-time task status; PR-based review; detailed logs in GitHub Actions |
| **Escalation** | RALPH protocol: 3 attempts max, same error twice = escalate immediately; operator notified; heartbeat for stuck detection | Deacon kills stuck agents and re-queues their Beads; no structured retry-with-diagnostics (GUPP handles throughput, not retry logic) | Authority matrix escalation (decide alone → propose → founder-only); authority transfer log tracks earned autonomy progression; 30-day review cadence | Session handoff to human for blocked tasks; no formalized escalation protocol described | Human review via PR; session can be pulled back to local for hands-on debugging; no structured escalation protocol | Checkpoint rollback on failure; `/rewind` to prior state; explicit approval for high-stakes actions via `disable-model-invocation` | Assign-to-Copilot workflow handles failures via retry; human gets notified via PR review request; no structured retry protocol |
| **Pipeline structure** | Explicit six-stage: Intake → Research → Planning → Implementation → Testing → Release; quick path for features/bug fixes; full path for new products | Fluid, Mayor-driven workflow; Refinery decomposes epics into Bead sequences; no fixed stages; Rigs cycle through Crew in Bezos-style review loops | Work organized by business unit (not pipeline stage); three autonomous compounding loops in production (New AI Models 3am, Bug Autofix 6am, SEO weekly); per-product Claude Code instances | Cascade sessions as units of work; Spaces group related sessions; parallel Cascade across Git Worktrees; no fixed pipeline stages | Agent-first: agent orchestration is primary, IDE is fallback; Composer 2 orchestrates; no fixed pipeline | No fixed pipeline; recommended workflow (Explore → Plan → Implement → Commit) is advisory, not enforced | No fixed pipeline; task assigned → Copilot works → draft PR returned; human reviews and merges |
| **Human in loop** | SHIP / NO-SHIP operator decision at Release gate; TEA audit output reviewed; pipeline gate files require human/signal | Mayor filters output (reduces human reading load); NO-SHIP equivalent is implicit in Mayor editorial judgment; no formal pre-deploy human gate | Founder sets authority levels; CEO runs autonomously within authority tier; board reviews quarterly; authority transfer log for earned autonomy | Human reviews cloud agent demos/screenshots; PR review before merge; Agent Command Center for oversight; not fully autonomous | PR review before merge; local-to-cloud handoff requires human action; 35% of Cursor's own PRs are agent-authored | Checkpoint/rewind for self-correction; explicit approval hooks for high-stakes actions; Plan Mode separation for exploration | Explicit: PR is always the gate; human reviews and merges; "Project Padawan" may change this |
Summary
Most mature: Kelly and Gas Town are the most mature factories by architectural depth. Kelly has the most formally specified pipeline (six stages, named artifacts, TEA gates, RALPH escalation) — it's a methodology designed for regulated or auditable software production. Gas Town is the most fully realized open-source implementation: built, shipped, and iterated in production with 20k GitHub stars, a community, and a successor SDK (Gas City). If you want a factory that has been stress-tested with real multi-agent workloads and has the broadest conceptual coverage, Gas Town wins. If you want a factory with the most explicit process structure for human oversight, Kelly wins.
Most innovative: Gas Town is the most innovative — Beads as the universal data plane (solving "The Missing Why" problem), the MEOW knowledge graph, the Wasteland reputation economy, the Light Factory observability framing, and the 11-stage AI adoption curve are all original contributions that other factories are still absorbing. Claude Code's checkpoint system is a practical innovation for autonomous safety. Windsurf's Agent Command Center Kanban is a UX innovation that changed how agent orchestration surfaces are designed.
Simplest: GitHub Copilot (specifically, the Copilot coding agent) is the simplest in concept: assign a task, get a PR back. It delegates all orchestration complexity to GitHub's existing infrastructure (Issues, Actions, PRs) and requires the least new learning. Claude Code is also relatively simple to adopt, since it is a CLI/IDE tool that runs in your environment. The tradeoff is that simplicity on the orchestration side pushes complexity onto the human, who must define good verification criteria and manage context. Kelly and Gas Town are the most complex to set up but reduce ongoing human cognitive load through structure.
Most production-viable for enterprise: GitHub Copilot (with its agents panel and Copilot coding agent) wins on enterprise viability — it runs in GitHub's infrastructure, integrates with enterprise authentication and policies, and produces PRs that fit existing code review workflows. Kelly is the most production-viable for teams that need formal quality gates, audit trails, and human accountability without GitHub dependency.
The emerging convergence: All seven factories are converging toward the same mental model of a primary orchestrator, specialized parallel workers, explicit work items, human review at key gates, and observability surfaces. The differences are in the substrate (files vs. SQL ledger vs. IDE session vs. GitHub issues), the formality of the pipeline (explicit stages vs. fluid workflow), and where the human sits in the loop (every gate vs. PR review only vs. Mayor-filtered summary). The next generation of factories will likely combine Gas Town's Beads substrate, Kelly's pipeline formality, and the IDE-native observability of Cursor 3 and Windsurf 2.0.