Multi-Factory Comparison

As AI-assisted development matures, a handful of distinct "factory" patterns have emerged: systems that orchestrate multiple AI agents to plan, implement, test, and ship software with varying degrees of autonomy, structure, and human oversight. This article compares seven such systems across ten operational dimensions: work tracking, persistence, quality gates, multi-agent roles, memory, handoff protocols, observability, escalation, pipeline structure, and human-in-the-loop design.

The seven factories span the full spectrum: an opinionated internal methodology (Kelly), an open-source orchestration framework (Gas Town), IDE-native agent platforms (Windsurf, Cursor), a CLI-first autonomous coding tool (Claude Code), a GitHub-integrated agent service (GitHub Copilot), and a solo-founder AI CEO experiment (Yuki Capital). Each represents a different bet on where the tradeoffs between structure, flexibility, autonomy, and human control should land.

Kelly Router

The Kelly Router is an internal dark factory methodology implemented in the OpenClaw ecosystem. It takes product ideas through a six-stage pipeline — Intake → Research → Planning → Implementation → Testing → Release — with structured gate files between each phase. The router itself never does the work; it routes tasks to specialized lead agents (research-lead, project-lead, test-lead), which each spawn their own ephemeral sub-agents to execute in parallel. Quality gates between stages use explicit READY/NOT-READY or PASS/FAIL signals stored as text files; the final release gate requires a human SHIP or NO-SHIP decision. Kelly's TEA (Test, Evaluate, Assess) audit at the testing stage produces three possible outcomes — PASS, PASS-WITH-FOLLOWUPS, or REMEDIATE — giving the factory nuanced quality signals beyond binary go/no-go.
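The gate logic described above can be sketched in a few lines of Python. This is a minimal illustration, not Kelly's actual implementation: the one-signal-per-file layout, the function names, and the choice to let PASS-WITH-FOLLOWUPS advance the pipeline are all assumptions made for the example.

```python
from pathlib import Path

STAGES = ["Intake", "Research", "Planning", "Implementation", "Testing", "Release"]

# Assumption: PASS-WITH-FOLLOWUPS lets work advance, with follow-ups tracked elsewhere.
ADVANCE = {"READY", "PASS", "PASS-WITH-FOLLOWUPS", "SHIP"}
BLOCK = {"NOT-READY", "FAIL", "REMEDIATE", "NO-SHIP"}


def read_gate(gate_file: Path) -> str:
    """Return the signal on the first line of a gate file (hypothetical layout)."""
    lines = gate_file.read_text().splitlines()
    signal = lines[0].strip().upper() if lines else ""
    if signal not in ADVANCE | BLOCK:
        raise ValueError(f"unknown gate signal: {signal!r}")
    return signal


def next_stage(current: str, gate_file: Path) -> str:
    """Advance past `current` only if its exit gate carries a passing signal."""
    if read_gate(gate_file) in BLOCK:
        return current  # stay put; the stage must be reworked
    idx = STAGES.index(current)
    # Release is the last stage, so a SHIP decision leaves the stage unchanged.
    return STAGES[min(idx + 1, len(STAGES) - 1)]
```

For example, a gate file containing READY after Research moves the work to Planning, while a REMEDIATE outcome from the TEA audit keeps it in Testing.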

The factory has two speed paths: a full pipeline for new products and a quick path for features and bug fixes (which skip research). Work items are tracked in structured directories per project (research-artifacts/, planning-artifacts/, etc.) with summary gate files at each stage. Persistence across sessions relies on files written to disk: a curated long-term memory file, daily session logs, and a machine-readable pipeline state file. The Kelly gap analysis acknowledges that its file-based tracking is less powerful than Gas Town's git-versioned, SQL-backed Beads, but the structure and semantics are well-developed. Kelly's RALPH (Retry And Learn Protocol) governs sub-agent failures: a sub-agent gets at most three attempts, and the same error occurring twice triggers immediate escalation.
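The RALPH rule (three attempts at most, escalate as soon as an error repeats) can be expressed as a small retry wrapper. The sketch below is illustrative only: the function and exception names are hypothetical, and treating two errors as "the same" when their exception type and message match is an assumption, since the source does not define how errors are compared.

```python
from typing import Callable, Set

MAX_ATTEMPTS = 3


class Escalation(Exception):
    """Raised when a task must be handed to the operator."""


def run_with_ralph(task: Callable[[], str]) -> str:
    """Run `task`, retrying on failure under the RALPH limits."""
    seen_errors: Set[str] = set()
    for _ in range(MAX_ATTEMPTS):
        try:
            return task()
        except Exception as exc:
            # Assumption: errors are "the same" if type and message match.
            error = f"{type(exc).__name__}: {exc}"
            if error in seen_errors:
                raise Escalation(f"same error twice: {error}") from exc
            seen_errors.add(error)
    raise Escalation(f"gave up after {MAX_ATTEMPTS} attempts")
```

A task that fails with a different error each time gets all three attempts; a task that fails identically twice is escalated after the second attempt without using the third.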

Windsurf CUA (Cascade)

Windsurf, Codeium's AI-native IDE, centers on Cascade, a context-aware AI agent that works across an entire codebase. In 2025, Windsurf 2.0 introduced the Agent Command Center: a Kanban-style interface showing all agent sessions (local Cascade sessions and cloud Devin sessions) in a single view, grouped by status (in flight, blocked, ready for review). Work is organized into Spaces, which group agent sessions, PRs, files, and context for a specific task or project. The Agent Command Center supports parallel agents: true concurrent execution across Git Worktrees and multi-pane/multi-tab Cascade sessions, with a dedicated terminal for reliable command execution. Cloud Devin agents generate demos and screenshots of their work for human review, and local-to-cloud handoff lets you move a session to the cloud so that work continues while you are offline. The changelog notes that agent sessions persist across editor restarts and that the workspace setup (panes, agent tiles, keybindings) also persists.

Windsurf's multi-agent workflow is built around Cascade's Compose model: multiple specialized agents can be spawned within a project, each working on a different layer (frontend, backend, DB). The platform has a plugin marketplace for MCPs, skills, and subagents, plus private team marketplaces for internal plugins. Wave 13 (December 2025) introduced parallel agents and SWE-1.5 at standard throughput for free. InfoQ's review of Cursor 3 notes that Windsurf was the first IDE to market with a purpose-built agent command surface, and that its Cascade agent is repo-aware with scoped responsibilities and project context, which makes it better than flat autocomplete tools at coordinating multi-part changes.

Claude Code (Anthropic)

Claude Code is Anthropic's terminal/IDE-native agentic coding environment, built on the Claude Sonnet 4.5 model. It was recently upgraded with a native VS Code extension (beta), checkpoints for autonomous operation, subagents for parallel workflows, hooks for triggering actions at specific points, and background tasks for non-blocking dev servers. The key architecture is the subagent system: the main agent can spawn sub-agents that run in their own context with their own allowed tools, enabling parallel execution (e.g., building a backend API while the main agent builds the frontend). Hooks can automatically trigger actions at defined points (post-change test runs, pre-commit linting). Checkpoints automatically save code state before each change; pressing Esc twice or running /rewind restores the prior state. The checkpoint + subagent combination is the foundation for confident autonomous operation: you can pursue wide-scale refactors knowing you can always rewind.
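The idea behind checkpoints (snapshot the state before every change so that any change can be undone) can be modeled with a toy in-memory workspace. This is not how Claude Code implements checkpoints; the class and method names are hypothetical, and the sketch only illustrates the snapshot-before-change pattern.

```python
from typing import Dict, List


class CheckpointedWorkspace:
    """Toy in-memory workspace that snapshots state before every change."""

    def __init__(self) -> None:
        self.files: Dict[str, str] = {}
        self._checkpoints: List[Dict[str, str]] = []

    def write(self, path: str, content: str) -> None:
        # Save a copy of the current state before applying the change.
        self._checkpoints.append(dict(self.files))
        self.files[path] = content

    def rewind(self, steps: int = 1) -> None:
        """Undo the last `steps` changes by restoring an earlier snapshot."""
        if not 1 <= steps <= len(self._checkpoints):
            raise ValueError("no checkpoint that far back")
        # The snapshot taken before the earliest undone change is the target.
        target = self._checkpoints[-steps]
        del self._checkpoints[-steps:]
        self.files = target
```

Because every write is preceded by a snapshot, an agent working against such a workspace can attempt a broad change and fall back to any earlier state if verification fails.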

The Claude Agent SDK (formerly Claude Code SDK) exposes the same core tools, context management, and permissions frameworks for custom agentic experiences. It supports subagents and hooks natively, enabling teams to build specialized agents (financial compliance, cybersecurity, code debugging). Persistence is session-based: Claude Code stores conversation context and can read/write files locally. There is no built-in structured work-tracking database — tasks are tracked conversationally or via external tools. Quality gates are implicit: Claude Code best practices emphasize giving the agent verification criteria (tests, screenshots, expected outputs) so it can self-check. Human oversight is maintained through the checkpoint/rewind mechanism and explicit approval points for high-stakes actions. Anthropic's recommended workflow separates exploration (Plan Mode), planning, implementation, and commit — with Plan Mode preventing premature coding on poorly-understood problems.

Feature Comparison Table

Dimension	Kelly	Gas Town	Yuki Capital	Windsurf CUA	Cursor	Claude Code	GitHub Copilot
Work tracking	Structured project directories + pipeline state per project	Beads (git-versioned SQL-queryable issue-trackers in Dolt)	Per-business GitHub repos; todo.md tagged by owner; decisions/ folder for institutional memory	Agent Command Center Kanban view + Spaces (grouping of sessions, PRs, files)	Agent sidebar with session tiles; `.agent.md` files; persistent pane layouts	Conversational / checkpoint-based; no built-in structured tracker; relies on external tools	GitHub Issues and PRs (native); agents panel for status
Persistence	Files on disk (memory, daily logs, pipeline state); session-based continuity	Beads in Dolt (git-versioned SQL); full work history queryable across sessions and projects	GitHub repo as persistent brain (CLAUDE.md, authority.md, decisions/, businesses/, todo.md); per-business Claude Code instances; n8n for scheduled automation	Sessions persist across editor restarts; Spaces persist project context; cloud Devin runs async	Sessions persist; cloud/local handoff preserves context	Checkpoints (code state per change); session context; SDK state	GitHub infrastructure (branches, commits, PRs); agents panel state
Quality gates	TEA audit (Test/Evaluate/Assess → PASS/PASS-WITH-FOLLOWUPS/REMEDIATE); READY/NOT-READY and PASS/FAIL gate files; 5-agent verdict adversarial review	Witness (continuous quality auditor watching all workers); multi-agent adversarial review; Bead-state quality gate results	Board reviews (quarterly); 30-day outcome reviews; mistake log for learnings; informal compared to Kelly	Implicit via Devin demo/screenshots for human review; limited formalized gate	Implicit via PR review; community patterns (e.g., copilot-orchestra) define multi-agent review steps	Self-verification via explicit test/screenshot criteria; no formal gate; checkpoint rollback on failure	Draft PR as gate; human reviews before merge; MCP-based test/lint runs pre-PR
Multi-agent roles	Named lead agents (research-lead, project-lead, test-lead); ephemeral sub-agents; router never does work	Mayor (orchestrator/control plane), Crew (named persistent agents), Polecats (ephemeral workers), Refinery, Witness, Deacon, Boot	Single persistent CEO agent (Judy Win) + per-product Claude Code agents; per-business CLAUDE.md context files; no dedicated orchestrator beyond CEO	Cascade (primary agent) + Devin (cloud); multiple parallel Cascade sessions possible	Agent-first interface; `.agent.md` defined roles; Composer 2 (frontier model) for orchestration	Main agent + sub-agents; hooks; Claude Agent SDK for custom role definitions	Copilot as coding agent; no named roles beyond the agent itself; MCP servers as extensions
Memory	memory (curated long-term), memory/YYYY-MM-DD.md (daily logs); pipeline state; narrative-rich but not machine-queryable	Beads/Dolt MEOW graph (versioned knowledge graph with typed edges); reason field captures Why per bead; SQL-queryable	GitHub repo + CLAUDE.md + decisions/ + learnings/ + per-business context files; narrative-dominant; progressive disclosure (CLAUDE.md shrank 36%)	Session-scoped context; Spaces group project context; no long-term cross-project memory described	Session context preserved; cloud sessions accessible across devices; no cross-project memory	Checkpoint history per session; SDK state; no built-in long-term memory protocol	GitHub Issues and PR history; repository context via MCP; no cross-repository memory
Handoff protocol	Artifact directory summaries (research-summary.md, planning-summary.md, etc.) read by next-stage agent before proceeding	Beads as universal data plane; Mayor's handoff to Refinery → Polecats; explicit `gt handoff` command after every task	Per-business repo handoff; per-business CLAUDE.md read per task; todo.md tagged by owner; n8n for scheduled automated handoffs	Agent Command Center status transitions; local-to-cloud and cloud-to-local session handoff; one-click Devin handoff	Agent session tiles; drag-and-drop between panes; session handoff between local and cloud	Subagent spawn with explicit task definition and output directory; SDK handoff via structured task objects	Issue/PR assignment to Copilot; draft PR as the handoff artifact to human reviewer
Observability	pipeline state (machine), done markers (human), heartbeat (liveness), memory logs; file-based retrospective	Light Factory model: all workers visible and addressable; Dolt SQL queries for real-time state; hook age for liveness; Deacon tracks stuck workers	Screen tracking for founder's work visibility; mistake log; board review docs; GitHub commit history; n8n execution logs	Agent Command Center Kanban (in flight, blocked, needs review); Spaces give project-level view; real-time agent status	Agent sidebar showing all running agents; inline diffs in VS Code sidebar; pane-level status; screenshots/demos from cloud agents	Checkpoint listing; status line for context usage; `/rewind` to restore prior state; CLI output	Agents panel (github.com/copilot/agents) with real-time task status; PR-based review; detailed logs in GitHub Actions
Escalation	RALPH protocol: 3 attempts max, same error twice = escalate immediately; operator notified; heartbeat for stuck detection	Deacon kills stuck agents and re-queues their Beads; no structured retry-with-diagnostics (GUPP handles throughput, not retry logic)	Authority matrix escalation (decide alone → propose → founder-only); authority transfer log tracks earned autonomy progression; 30-day review cadence	Session handoff to human for blocked tasks; no formalized escalation protocol described	Human review via PR; session can be pulled back to local for hands-on debugging; no structured escalation protocol	Checkpoint rollback on failure; `/rewind` to prior state; explicit approval for high-stakes actions via `disable-model-invocation`	Assign-to-Copilot workflow handles failures via retry; human gets notified via PR review request; no structured retry protocol
Pipeline structure	Explicit six-stage: Intake → Research → Planning → Implementation → Testing → Release; quick path for features/bug fixes; full path for new products	Fluid, Mayor-driven workflow; Refinery decomposes epics into Bead sequences; no fixed stages; Rigs cycle through Crew in Bezos-style review loops	Work organized by business unit (not pipeline stage); three autonomous compounding loops in production (New AI Models 3am, Bug Autofix 6am, SEO weekly); per-product Claude Code instances	Cascade sessions as units of work; Spaces group related sessions; parallel Cascade across Git Worktrees; no fixed pipeline stages	Agent-first: agent orchestration is primary, IDE is fallback; Composer 2 orchestrates; no fixed pipeline	No fixed pipeline; recommended workflow (Explore → Plan → Implement → Commit) is advisory, not enforced	No fixed pipeline; task assigned → Copilot works → draft PR returned; human reviews and merges
Human in loop	SHIP / NO-SHIP operator decision at Release gate; TEA audit output reviewed; pipeline gate files require human/signal	Mayor filters output (reduces human reading load); NO-SHIP equivalent is implicit in Mayor editorial judgment; no formal pre-deploy human gate	Founder sets authority levels; CEO runs autonomously within authority tier; board reviews quarterly; authority transfer log for earned autonomy	Human reviews cloud agent demos/screenshots; PR review before merge; Agent Command Center for oversight; not fully autonomous	PR review before merge; local-to-cloud handoff requires human action; 35% of Cursor's own PRs are agent-authored	Checkpoint/rewind for self-correction; explicit approval hooks for high-stakes actions; Plan Mode separation for exploration	Explicit: PR is always the gate; human reviews and merges; "Project Padawan" may change this

Summary

Most mature: Kelly and Gas Town are the most mature factories by architectural depth. Kelly has the most formally specified pipeline (six stages, named artifacts, TEA gates, RALPH escalation) — it's a methodology designed for regulated or auditable software production. Gas Town is the most fully realized open-source implementation: built, shipped, and iterated in production with 20k GitHub stars, a community, and a successor SDK (Gas City). If you want a factory that has been stress-tested with real multi-agent workloads and has the broadest conceptual coverage, Gas Town wins. If you want a factory with the most explicit process structure for human oversight, Kelly wins.

Most innovative: Gas Town is the most innovative — Beads as the universal data plane (solving "The Missing Why" problem), the MEOW knowledge graph, the Wasteland reputation economy, the Light Factory observability framing, and the 11-stage AI adoption curve are all original contributions that other factories are still absorbing. Claude Code's checkpoint system is a practical innovation for autonomous safety. Windsurf's Agent Command Center Kanban is a UX innovation that changed how agent orchestration surfaces are designed.

Simplest: GitHub Copilot (specifically the Copilot coding agent) is the simplest in concept: assign a task, get a PR back. It delegates all orchestration complexity to GitHub's existing infrastructure (Issues, Actions, PRs) and requires the least new learning. Claude Code is also relatively simple to adopt: it is a CLI/IDE tool that runs in your environment. The tradeoff is that simplicity on the orchestration side pushes complexity onto the human, who must define good verification criteria and manage context. Kelly and Gas Town are the most complex to set up but reduce ongoing human cognitive load through structure.

Most production-viable for enterprise: GitHub Copilot (with its agents panel and Copilot coding agent) wins on enterprise viability — it runs in GitHub's infrastructure, integrates with enterprise authentication and policies, and produces PRs that fit existing code review workflows. Kelly is the most production-viable for teams that need formal quality gates, audit trails, and human accountability without GitHub dependency.

The emerging convergence: All seven factories are converging toward the same mental model: a primary orchestrator, specialized parallel workers, explicit work items, human review at key gates, and observability surfaces. The differences are in the substrate (files vs. SQL ledger vs. IDE session vs. GitHub issues), the formality of the pipeline (explicit stages vs. fluid workflow), and where the human sits in the loop (every gate vs. PR review only vs. Mayor-filtered summary). The next generation of factories will likely combine Gas Town's Beads substrate, Kelly's pipeline formality, and the IDE-native observability of Cursor 3 and Windsurf 2.0.