Multi-Factory Comparison

As AI-assisted development matures, a handful of distinct "factory" patterns have emerged — systems that orchestrate multiple AI agents to plan, implement, test, and ship software with varying degrees of autonomy, structure, and human oversight. This article compares seven such systems across ten operational dimensions: work tracking, persistence, quality gates, multi-agent roles, memory, handoff protocols, observability, escalation, pipeline structure, and human-in-the-loop design.

The seven factories span the full spectrum from opinionated internal methodologies (Kelly) to open-source orchestration frameworks (Gas Town), from IDE-native agent platforms (Windsurf, Cursor) to CLI-first autonomous coding tools (Claude Code), GitHub-integrated agent services (GitHub Copilot), and a solo-founder AI CEO experiment (Yuki Capital). Each represents a different bet on where the tradeoffs between structure, flexibility, autonomy, and human control should land.

Kelly Router

The Kelly Router is an internal dark factory methodology implemented in the OpenClaw ecosystem. It takes product ideas through a six-stage pipeline — Intake → Research → Planning → Implementation → Testing → Release — with structured gate files between each phase. The router itself never does the work; it routes tasks to specialized lead agents (research-lead, project-lead, test-lead), which each spawn their own ephemeral sub-agents to execute in parallel. Quality gates between stages use explicit READY/NOT-READY or PASS/FAIL signals stored as text files; the final release gate requires a human SHIP or NO-SHIP decision. Kelly's TEA (Test, Evaluate, Assess) audit at the testing stage produces three possible outcomes — PASS, PASS-WITH-FOLLOWUPS, or REMEDIATE — giving the factory nuanced quality signals beyond binary go/no-go.
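
The gate mechanism lends itself to a very small implementation. A minimal sketch, assuming a hypothetical one-verdict-per-file layout — the file names and helper functions here are illustrative, not Kelly's actual spec:

```python
from pathlib import Path

# Hypothetical gate-file layout: each stage writes a one-line verdict file.
GATE_VERDICTS = {"READY", "NOT-READY", "PASS", "FAIL"}
TEA_VERDICTS = {"PASS", "PASS-WITH-FOLLOWUPS", "REMEDIATE"}

def read_gate(gate_file: Path, allowed: set) -> str:
    """Read a stage gate file and validate its verdict."""
    verdict = gate_file.read_text().strip().upper()
    if verdict not in allowed:
        raise ValueError(f"{gate_file}: unexpected verdict {verdict!r}")
    return verdict

def may_advance(project_dir: Path, stage: str) -> bool:
    """The next stage may start only on an affirmative gate signal."""
    verdict = read_gate(project_dir / f"{stage}-gate.txt", GATE_VERDICTS)
    return verdict in {"READY", "PASS"}
```

Because the signals are plain text files, both the router and a human operator can inspect pipeline state with nothing more than `cat`.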

The factory has two speed paths: a full pipeline for new products and a quick path for features and bug fixes (which skips research). Work items are tracked in structured directories per project (research-artifacts/, planning-artifacts/, etc.) with summary gate files at each stage. Persistence across sessions relies on files written to disk: a curated long-term memory file, daily session logs, and machine-readable pipeline state. The Kelly gap analysis acknowledges that its file-based tracking is less powerful than Gas Town's git-versioned, SQL-backed Beads, but the structure and semantics are well developed. Kelly's RALPH (Retry And Learn Protocol) governs sub-agent failures: three attempts maximum, and the same error twice means escalate immediately.
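
The RALPH rule above fits in a few lines. A minimal sketch — the names are illustrative, not Kelly's actual interface:

```python
# RALPH-style retry: at most three attempts, and the same error twice in a
# row triggers immediate escalation rather than a third blind retry.
class EscalateToOperator(Exception):
    """Raised when the sub-agent must hand the task back to a human."""

def run_with_ralph(task, max_attempts=3):
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception as exc:
            if last_error is not None and str(exc) == str(last_error):
                # Same error twice: retrying is unlikely to help, so escalate now.
                raise EscalateToOperator(f"repeated failure: {exc}") from exc
            last_error = exc
    raise EscalateToOperator(f"exhausted {max_attempts} attempts: {last_error}")
```

The "same error twice" short-circuit is the interesting design choice: it distinguishes transient failures (worth retrying) from deterministic ones (worth a human's attention) without any diagnostics machinery.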

Windsurf CUA (Cascade)

Windsurf, Codeium's AI-native IDE, centers on Cascade — a context-aware AI agent that works across an entire codebase. In 2025, Windsurf 2.0 introduced the Agent Command Center: a Kanban-style interface showing all agent sessions (local Cascade sessions and cloud Devin sessions) in a single view, grouped by status (in flight, blocked, ready for review). Work is organized into Spaces, which group agent sessions, PRs, files, and context for a specific task or project. The Agent Command Center supports parallel agents: true concurrent execution across Git Worktrees and multi-pane/multi-tab Cascade sessions, with a dedicated terminal for reliable command execution. Cloud Devin agents generate demos and screenshots of their work for human review, and local-to-cloud handoff lets you hand a session to the cloud so work continues even while your machine is offline. The changelog notes that agent sessions persist across editor restarts, and that the workspace setup (panes, agent tiles, keybindings) persists as well.
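
The session/Space grouping described above can be modeled with a small data structure. Everything here is a hypothetical illustration of the concept, not Windsurf's actual API:

```python
from dataclasses import dataclass, field
from enum import Enum

# Illustrative model: sessions carry a status and a location (local Cascade
# vs. cloud Devin) and are collected into Spaces for a specific task.
class Status(Enum):
    IN_FLIGHT = "in flight"
    BLOCKED = "blocked"
    READY_FOR_REVIEW = "ready for review"

@dataclass
class AgentSession:
    name: str
    status: Status
    location: str  # "local" or "cloud"

@dataclass
class Space:
    task: str
    sessions: list = field(default_factory=list)

    def board(self):
        """Group this Space's sessions by status, Kanban-style."""
        columns = {s: [] for s in Status}
        for session in self.sessions:
            columns[session.status].append(session.name)
        return columns
```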

Windsurf's multi-agent workflow is built around Cascade's Compose model: multiple specialized agents can be spawned within a project, each working on a different layer (frontend, backend, DB). The platform has a plugin marketplace for MCPs, skills, and subagents, plus private team marketplaces for internal plugins. Wave 13 (December 2025) introduced parallel agents and SWE-1.5 at standard throughput for free. An InfoQ review of Cursor 3 notes that Windsurf was the first IDE to market with a purpose-built agent command surface, and that its Cascade agent is repo-aware with scoped responsibilities and project context — making it better suited than flat autocomplete tools for coordinating multi-part changes.

Claude Code (Anthropic)

Claude Code is Anthropic's terminal/IDE-native agentic coding environment, built on the Claude Sonnet 4.5 model. It was recently upgraded with a native VS Code extension (beta), checkpoints for autonomous operation, subagents for parallel workflows, hooks for triggering actions at specific points, and background tasks for non-blocking dev servers. The key architectural feature is the subagent system: the main agent can spawn sub-agents that run in their own context with their own allowed tools — enabling parallel execution (e.g., building a backend API while the main agent builds the frontend). Hooks can automatically trigger actions at defined points (post-change test runs, pre-commit linting). Checkpoints automatically save code state before each change; two taps of Esc, or the /rewind command, restore the prior state. The checkpoint + subagent combination is the foundation for confident autonomous operation: pursue wide-scale refactors knowing you can always rewind.
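
The checkpoint-then-rewind idea reduces to a snapshot stack: save state before every change, pop to undo. A minimal sketch of the concept — not Claude Code's implementation:

```python
# Conceptual checkpoint store: snapshot file contents before each change so
# any autonomous edit can be rolled back one or more steps.
class CheckpointStore:
    def __init__(self):
        self._stack = []  # snapshots, oldest first

    def checkpoint(self, files: dict):
        """Save a snapshot of file contents before a change is applied."""
        self._stack.append(dict(files))

    def rewind(self, steps: int = 1) -> dict:
        """Restore the state from `steps` checkpoints ago."""
        if steps > len(self._stack):
            raise IndexError("not enough checkpoints to rewind that far")
        for _ in range(steps - 1):
            self._stack.pop()  # discard intermediate snapshots
        return self._stack.pop()
```

The point of snapshotting *before* each change is that a rewind of one step always lands on a state the agent had not yet touched, which is what makes wide-scale autonomous refactors recoverable.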

The Claude Agent SDK (formerly Claude Code SDK) exposes the same core tools, context management, and permissions frameworks for custom agentic experiences. It supports subagents and hooks natively, enabling teams to build specialized agents (financial compliance, cybersecurity, code debugging). Persistence is session-based: Claude Code stores conversation context and can read/write files locally. There is no built-in structured work-tracking database — tasks are tracked conversationally or via external tools. Quality gates are implicit: Claude Code best practices emphasize giving the agent verification criteria (tests, screenshots, expected outputs) so it can self-check. Human oversight is maintained through the checkpoint/rewind mechanism and explicit approval points for high-stakes actions. Anthropic's recommended workflow separates exploration (Plan Mode), planning, implementation, and commit — with Plan Mode preventing premature coding on poorly understood problems.
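
The self-verification practice can be sketched as a loop: run a work cycle, run the explicit checks the agent was given, and iterate on failures. `implement` and `checks` below are hypothetical stand-ins for a real agent call and real test commands, not the Claude Agent SDK:

```python
def implement_with_verification(implement, checks, max_rounds=3):
    """Run an agent work cycle, then self-check against explicit criteria.

    `implement` stands in for one agent work cycle; `checks` is a list of
    (name, predicate) verification criteria, e.g. ("tests pass", run_pytest).
    All names here are illustrative.
    """
    failures = []
    for round_ in range(max_rounds):
        result = implement()
        failures = [name for name, ok in checks if not ok(result)]
        if not failures:
            return result  # all verification criteria satisfied
        # Feed the failed criteria back so the next cycle can correct them.
        print(f"round {round_ + 1}: failed checks: {failures}")
    raise RuntimeError(f"unverified after {max_rounds} rounds: {failures}")
```

This is why the best practices stress concrete criteria: the loop is only as good as the predicates the human supplies.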

Feature Comparison Table

| Dimension | Kelly | Gas Town | Yuki Capital | Windsurf CUA | Cursor | Claude Code | GitHub Copilot |
| --- | --- | --- | --- | --- | --- | --- | --- |
| **Work tracking** | Structured project directories + pipeline state per project | Beads (git-versioned, SQL-queryable issue trackers in Dolt) | Per-business GitHub repos; todo.md tagged by owner; decisions/ folder for institutional memory | Agent Command Center Kanban view + Spaces (grouping of sessions, PRs, files) | Agent sidebar with session tiles; `.agent.md` files; persistent pane layouts | Conversational / checkpoint-based; no built-in structured tracker; relies on external tools | GitHub Issues and PRs (native); agents panel for status |
| **Persistence** | Files on disk (memory, daily logs, pipeline state); session-based continuity | Beads in Dolt (git-versioned SQL); full work history queryable across sessions and projects | GitHub repo as persistent brain (CLAUDE.md, authority.md, decisions/, businesses/, todo.md); per-business Claude Code instances; n8n for scheduled automation | Sessions persist across editor restarts; Spaces persist project context; cloud Devin runs async | Sessions persist; cloud/local handoff preserves context | Checkpoints (code state per change); session context; SDK state | GitHub infrastructure (branches, commits, PRs); agents panel state |
| **Quality gates** | TEA audit (Test/Evaluate/Assess → PASS/PASS-WITH-FOLLOWUPS/REMEDIATE); READY/NOT-READY and PASS/FAIL gate files; 5-agent verdict adversarial review | Witness (continuous quality auditor watching all workers); multi-agent adversarial review; Bead-state quality gate results | Board reviews (quarterly); 30-day outcome reviews; mistake log for learnings; informal compared to Kelly | Implicit via Devin demos/screenshots for human review; limited formalized gate | Implicit via PR review; community patterns (e.g., copilot-orchestra) define multi-agent review steps | Self-verification via explicit test/screenshot criteria; no formal gate; checkpoint rollback on failure | Draft PR as gate; human reviews before merge; MCP-based test/lint runs pre-PR |
| **Multi-agent roles** | Named lead agents (research-lead, project-lead, test-lead); ephemeral sub-agents; router never does work | Mayor (orchestrator/control plane), Crew (named persistent agents), Polecats (ephemeral workers), Refinery, Witness, Deacon, Boot | Single persistent CEO agent (Judy Win) + per-product Claude Code agents; per-business CLAUDE.md context files; no dedicated orchestrator beyond CEO | Cascade (primary agent) + Devin (cloud); multiple parallel Cascade sessions possible | Agent-first interface; `.agent.md`-defined roles; Composer 2 (frontier model) for orchestration | Main agent + sub-agents; hooks; Claude Agent SDK for custom role definitions | Copilot as coding agent; no named roles beyond the agent itself; MCP servers as extensions |
| **Memory** | memory (curated long-term), memory/YYYY-MM-DD.md (daily logs); pipeline state; narrative-rich but not machine-queryable | Beads/Dolt MEOW graph (versioned knowledge graph with typed edges); reason field captures the Why per bead; SQL-queryable | GitHub repo + CLAUDE.md + decisions/ + learnings/ + per-business context files; narrative-dominant; progressive disclosure (CLAUDE.md shrank 36%) | Session-scoped context; Spaces group project context; no long-term cross-project memory described | Session context preserved; cloud sessions accessible across devices; no cross-project memory | Checkpoint history per session; SDK state; no built-in long-term memory protocol | GitHub Issues and PR history; repository context via MCP; no cross-repository memory |
| **Handoff protocol** | Artifact directory summaries (research-summary.md, planning-summary.md, etc.) read by next-stage agent before proceeding | Beads as universal data plane; Mayor's handoff to Refinery → Polecats; explicit `gt handoff` command after every task | Per-business repo handoff; per-business CLAUDE.md read per task; todo.md tagged by owner; n8n for scheduled automated handoffs | Agent Command Center status transitions; local-to-cloud and cloud-to-local session handoff; one-click Devin handoff | Agent session tiles; drag-and-drop between panes; session handoff between local and cloud | Subagent spawn with explicit task definition and output directory; SDK handoff via structured task objects | Issue/PR assignment to Copilot; draft PR as the handoff artifact to human reviewer |
| **Observability** | pipeline state (machine), done markers (human), heartbeat (liveness), memory logs; file-based retrospective | Light Factory model: all workers visible and addressable; Dolt SQL queries for real-time state; hook age for liveness; Deacon tracks stuck workers | Screen tracking for founder's work visibility; mistake log; board review docs; GitHub commit history; n8n execution logs | Agent Command Center Kanban (in flight, blocked, needs review); Spaces give project-level view; real-time agent status | Agent sidebar showing all running agents; inline diffs in VS Code sidebar; pane-level status; screenshots/demos from cloud agents | Checkpoint listing; status line for context usage; `/rewind` to restore prior state; CLI output | Agents panel (github.com/copilot/agents) with real-time task status; PR-based review; detailed logs in GitHub Actions |
| **Escalation** | RALPH protocol: 3 attempts max, same error twice = escalate immediately; operator notified; heartbeat for stuck detection | Deacon kills stuck agents and re-queues their Beads; no structured retry-with-diagnostics (GUPP handles throughput, not retry logic) | Authority matrix escalation (decide alone → propose → founder-only); authority transfer log tracks earned autonomy progression; 30-day review cadence | Session handoff to human for blocked tasks; no formalized escalation protocol described | Human review via PR; session can be pulled back to local for hands-on debugging; no structured escalation protocol | Checkpoint rollback on failure; `/rewind` to prior state; explicit approval for high-stakes actions via `disable-model-invocation` | Assign-to-Copilot workflow handles failures via retry; human notified via PR review request; no structured retry protocol |
| **Pipeline structure** | Explicit six-stage: Intake → Research → Planning → Implementation → Testing → Release; quick path for features/bug fixes; full path for new products | Fluid, Mayor-driven workflow; Refinery decomposes epics into Bead sequences; no fixed stages; Rigs cycle through Crew in Bezos-style review loops | Work organized by business unit (not pipeline stage); three autonomous compounding loops in production (New AI Models 3am, Bug Autofix 6am, SEO weekly); per-product Claude Code instances | Cascade sessions as units of work; Spaces group related sessions; parallel Cascade across Git Worktrees; no fixed pipeline stages | Agent-first: agent orchestration is primary, IDE is fallback; Composer 2 orchestrates; no fixed pipeline | No fixed pipeline; recommended workflow (Explore → Plan → Implement → Commit) is advisory, not enforced | No fixed pipeline; task assigned → Copilot works → draft PR returned; human reviews and merges |
| **Human in loop** | SHIP / NO-SHIP operator decision at Release gate; TEA audit output reviewed; pipeline gate files require an explicit human signal | Mayor filters output (reduces human reading load); NO-SHIP equivalent is implicit in Mayor editorial judgment; no formal pre-deploy human gate | Founder sets authority levels; CEO runs autonomously within authority tier; board reviews quarterly; authority transfer log for earned autonomy | Human reviews cloud agent demos/screenshots; PR review before merge; Agent Command Center for oversight; not fully autonomous | PR review before merge; local-to-cloud handoff requires human action; 35% of Cursor's own PRs are agent-authored | Checkpoint/rewind for self-correction; explicit approval hooks for high-stakes actions; Plan Mode separation for exploration | Explicit: PR is always the gate; human reviews and merges; "Project Padawan" may change this |

Summary

Most mature: Kelly and Gas Town are the most mature factories by architectural depth. Kelly has the most formally specified pipeline (six stages, named artifacts, TEA gates, RALPH escalation) — it's a methodology designed for regulated or auditable software production. Gas Town is the most fully realized open-source implementation: built, shipped, and iterated in production with 20k GitHub stars, a community, and a successor SDK (Gas City). If you want a factory that has been stress-tested with real multi-agent workloads and has the broadest conceptual coverage, Gas Town wins. If you want a factory with the most explicit process structure for human oversight, Kelly wins.

Most innovative: Gas Town is the most innovative — Beads as the universal data plane (solving "The Missing Why" problem), the MEOW knowledge graph, the Wasteland reputation economy, the Light Factory observability framing, and the 11-stage AI adoption curve are all original contributions that other factories are still absorbing. Claude Code's checkpoint system is a practical innovation for autonomous safety. Windsurf's Agent Command Center Kanban is a UX innovation that changed how agent orchestration surfaces are designed.

Simplest: GitHub Copilot Workspace (the current Copilot coding agent) is the simplest in concept: assign a task, get a PR back. It delegates all orchestration complexity to GitHub's existing infrastructure (Issues, Actions, PRs) and requires the least new learning. Claude Code is also relatively simple to adopt — it's a CLI/IDE tool that runs in your environment. The tradeoff is that simplicity on the orchestration side pushes complexity onto the human (who must define good verification criteria and manage context). Kelly and Gas Town are the most complex to set up but reduce ongoing human cognitive load through structure.

Most production-viable for enterprise: GitHub Copilot (with its agents panel and Copilot coding agent) wins on enterprise viability — it runs in GitHub's infrastructure, integrates with enterprise authentication and policies, and produces PRs that fit existing code review workflows. Kelly is the most production-viable for teams that need formal quality gates, audit trails, and human accountability without GitHub dependency.

The emerging convergence: All seven factories are converging toward the same mental model: a primary orchestrator, specialized parallel workers, explicit work items, human review at key gates, and observability surfaces. The differences are in the substrate (files vs. SQL ledger vs. IDE session vs. GitHub issues), the formality of the pipeline (explicit stages vs. fluid workflow), and where the human sits in the loop (every gate vs. PR review only vs. Mayor-filtered summary). The next generation of factories will likely combine Gas Town's Beads substrate, Kelly's pipeline formality, and the IDE-native observability of Cursor 3 and Windsurf 2.0.