Multi-Factory Comparison

Date Compiled: 2026-04-27

As AI-assisted development matures, a handful of distinct "factory" patterns have emerged: systems that orchestrate multiple AI agents to plan, implement, test, and ship software with varying degrees of autonomy, structure, and human oversight. This article compares seven such systems across ten operational dimensions: work tracking, persistence, quality gates, multi-agent roles, memory, handoff protocols, observability, escalation, pipeline structure, and human-in-the-loop design.

The seven factories span a wide spectrum: an opinionated internal methodology (Kelly), an open-source orchestration framework (Gas Town), IDE-native agent platforms (Windsurf, Cursor), a CLI-first autonomous coding tool (Claude Code), a GitHub-integrated agent service (GitHub Copilot), and a solo-founder AI CEO experiment (Yuki Capital). Each represents a different bet on where the tradeoffs between structure, flexibility, autonomy, and human control should land.

Yuki Capital AI CEO

Yuki Capital is a solo founder (Romain) running a Claude AI as CEO of a small holding company that operates a portfolio of digital businesses (SaaS products, content sites, developer tools). The experiment began in January 2026 and ran through April 2026, producing three public board reviews that document the system's evolution in detail.

The AI CEO ("Judy Win") operates from a private GitHub repository that serves as persistent operational headquarters: CLAUDE.md (identity and mission), authority.md (three-tier authority matrix with explicit transfer log), decisions/ (institutional memory), todo.md (action queue tagged by owner), businesses/ (per-business folders), and a public learnings file (mistake log). The system evolved from a "smart notepad" (January) to a 24/7 operation with n8n automation and an email identity (March) to a system with three autonomous compounding loops running in production (April).

Work is organized by business unit, not by pipeline stage: each product has its own Claude Code instance and per-business CLAUDE.md context files. Quality gates are informal (quarterly board reviews) rather than pipeline-enforced. The most significant architectural contribution is the set of autonomous compounding loops (models, bugs, SEO) that read their own prior outputs and improve over time. The board reviews present these loops as a real-world demonstration that "an AI agent that gets better at specific tasks over time, without anyone asking" is achievable.
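The three-tier authority matrix and its transfer log can be pictured with a small data structure. The sketch below is only an illustration: the tier names follow the "decide alone, propose, founder-only" progression used in the comparison table, while the class names, the action names, and the log layout are assumptions and do not reflect the actual contents of authority.md.

```python
from dataclasses import dataclass, field
from datetime import date
from enum import IntEnum
from typing import Dict, List, Tuple


class Tier(IntEnum):
    """Authority tiers, ordered from most to least agent autonomy."""
    DECIDE_ALONE = 1   # agent acts without asking
    PROPOSE = 2        # agent drafts, founder approves
    FOUNDER_ONLY = 3   # agent may not act


@dataclass
class AuthorityMatrix:
    # Hypothetical action names mapped to their current tier.
    tiers: Dict[str, Tier]
    # Append-only log of (date, action, old tier, new tier, reason).
    transfer_log: List[Tuple[date, str, Tier, Tier, str]] = field(default_factory=list)

    def tier_for(self, action: str) -> Tier:
        # Unknown actions fall back to the most restrictive tier.
        return self.tiers.get(action, Tier.FOUNDER_ONLY)

    def transfer(self, action: str, new_tier: Tier, reason: str, when: date) -> None:
        # Record the change before applying it so the log shows old and new tiers.
        old = self.tier_for(action)
        self.transfer_log.append((when, action, old, new_tier, reason))
        self.tiers[action] = new_tier


matrix = AuthorityMatrix({"fix_bug": Tier.DECIDE_ALONE, "change_pricing": Tier.PROPOSE})
matrix.transfer("change_pricing", Tier.DECIDE_ALONE, "illustrative reason", date(2026, 4, 1))
```

Defaulting unknown actions to the most restrictive tier is a design choice of this sketch: autonomy has to be granted explicitly and every grant leaves a log entry.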

Kelly Router

The Kelly Router is an internal dark factory methodology implemented in the OpenClaw ecosystem. It takes product ideas through a six-stage pipeline — Intake → Research → Planning → Implementation → Testing → Release — with structured gate files between each phase. The router itself never does the work; it routes tasks to specialized lead agents (research-lead, project-lead, test-lead), which each spawn their own ephemeral sub-agents to execute in parallel. Quality gates between stages use explicit READY/NOT-READY or PASS/FAIL signals stored as text files; the final release gate requires a human SHIP or NO-SHIP decision. Kelly's TEA (Test, Evaluate, Assess) audit at the testing stage produces three possible outcomes — PASS, PASS-WITH-FOLLOWUPS, or REMEDIATE — giving the factory nuanced quality signals beyond binary go/no-go.
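A file-based gate check of the kind described above might look like the following sketch. The stage names and signal words come from the text; the `<stage>.gate` file naming, the helper names, and the rule that the signal sits on the first line of the file are assumptions made for illustration, not Kelly's actual layout.

```python
import tempfile
from pathlib import Path
from typing import Optional

STAGES = ["intake", "research", "planning", "implementation", "testing", "release"]

# Signals that let the pipeline advance past a gate.
PASSING = {"READY", "PASS", "PASS-WITH-FOLLOWUPS", "SHIP"}


def read_gate(project: Path, stage: str) -> Optional[str]:
    """Return the signal on the first line of the stage's gate file, or None."""
    gate = project / f"{stage}.gate"  # hypothetical file name
    if not gate.exists():
        return None
    lines = gate.read_text().splitlines()
    return lines[0].strip().upper() if lines else None


def next_stage(project: Path) -> Optional[str]:
    """Return the first stage whose gate has not passed, or None if all passed."""
    for stage in STAGES:
        if read_gate(project, stage) not in PASSING:
            return stage
    return None


with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp)
    (project / "intake.gate").write_text("READY\n")
    (project / "research.gate").write_text("NOT-READY\nmissing competitor analysis\n")
    blocked_at = next_stage(project)
    print(blocked_at)
```

In this sketch REMEDIATE, NOT-READY, FAIL, NO-SHIP, and a missing gate file all hold the pipeline at the same stage, and the release gate passes only once a human has written SHIP.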

The factory has two speed paths: a full pipeline for new products and a quick path for features and bug fixes (which skip research). Work items are tracked in structured directories per project (research-artifacts/, planning-artifacts/, etc.) with summary gate files at each stage. Persistence across sessions relies on files written to disk (memory for long-term curated memory, daily session logs, pipeline state for machine-readable state). The Kelly gap analysis acknowledges that its file-based tracking is less powerful than Gas Town's git-versioned SQL-backed Beads, but the structure and semantics are well-developed. Kelly's RALPH (Retry And Learn Protocol) governs sub-agent failures: three attempts max, same error twice means escalate immediately.
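The RALPH rule (three attempts at most, immediate escalation when the same error appears twice) can be expressed as a small retry wrapper. This is a minimal sketch under stated assumptions: the function name, the `Escalation` exception, and the choice to compare errors by exception type plus message are all hypothetical, since the source does not say how Kelly decides that two errors are "the same".

```python
from typing import Callable, Set


class Escalation(Exception):
    """Raised when a task must be handed to the operator."""


def run_with_ralph(task: Callable[[], str], max_attempts: int = 3) -> str:
    seen_errors: Set[str] = set()
    for _ in range(max_attempts):
        try:
            return task()
        except Exception as exc:
            # Assumption: two errors are "the same" if type and message match.
            signature = f"{type(exc).__name__}: {exc}"
            if signature in seen_errors:
                raise Escalation(f"same error twice: {signature}") from exc
            seen_errors.add(signature)
    raise Escalation(f"gave up after {max_attempts} attempts")
```

A task that fails with two different errors and then succeeds returns normally on the third attempt; a task that fails twice with an identical error escalates after the second attempt without using the third.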

Gas Town

Gas Town, built by Steve Yegge in early 2026, is an open-source Go-based orchestrator for Claude Code and its competitors. It implements a three-tier agent hierarchy: the Mayor (the human's chief-of-staff, reading all agent output so the human doesn't have to), Crew (named long-lived per-Rig agents with persistent context, like a PR Sheriff), and Polecats (ephemeral unmonitored workers given a Bead and let loose). Supporting roles include the Refinery (decomposes vague epics into well-specified bead sequences), Witness (quality auditor watching all workers), and Deacon (patrol daemon that kills stuck agents and re-queues their Beads). Work flows through Beads — git-versioned, SQL-queryable issue-trackers built on Dolt — which serve as the universal data plane for all work, coordination, messages, quality gates, and patrol routes. Git stores What/Where/Who/How; Beads store Why.
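The Deacon's patrol behavior (find stuck workers, stop them, put their Beads back on the queue) can be sketched in a few lines. Gas Town itself is written in Go; the Python below is only a conceptual model, and the `Worker` fields, the heartbeat-based liveness check, and the ten-minute threshold are assumptions rather than Gas Town's actual data model.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, List


@dataclass
class Worker:
    name: str
    bead_id: str
    last_heartbeat: float  # seconds since epoch


def patrol(workers: List[Worker], queue: Deque[str], now: float,
           max_idle: float = 600.0) -> List[Worker]:
    """Drop workers idle longer than max_idle and re-queue their beads.

    Returns the workers still considered alive. Killing the underlying
    process is out of scope for this sketch.
    """
    alive = []
    for worker in workers:
        if now - worker.last_heartbeat > max_idle:
            queue.append(worker.bead_id)  # work goes back on the queue
        else:
            alive.append(worker)
    return alive


queue: Deque[str] = deque()
workers = [Worker("polecat-1", "bead-a", 1000.0), Worker("polecat-2", "bead-b", 1900.0)]
workers = patrol(workers, queue, now=2000.0)
```

With these illustrative numbers, polecat-1 has been silent for 1000 seconds and is dropped with its bead re-queued, while polecat-2 stays alive.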

Persistence is Beads + Dolt: every state transition is a git commit with author, timestamp, and reason. The GUPP (Gas Town Universal Propulsion Principle) drives execution: if your hook is non-empty, you MUST run — no yielding allowed. The Wasteland extends Gas Town into a federated reputation economy: the Wanted Board lets any Rig claim public work, and multi-dimensional stamps (quality, reliability, creativity) from validators build portable reputation. Gas Town evolved into Gas City (an SDK of composable "packs" for building custom factories) and is MIT-licensed with an active Discord community. Gas City's "Light Factory" framing maximizes observability — all workers are visible and addressable, with polecats in back rooms being the only normally-invisible workers.

Windsurf CUA (Cascade)

Windsurf, Codeium's AI-native IDE, centers on Cascade, a context-aware AI agent that works across an entire codebase. In 2025, Windsurf 2.0 introduced the Agent Command Center: a Kanban-style interface showing all agent sessions (local Cascade sessions and cloud Devin sessions) in a single view, grouped by status (in flight, blocked, ready for review). Work is organized into Spaces, which group agent sessions, PRs, files, and context for a specific task or project. The Agent Command Center supports parallel agents: true concurrent execution across Git worktrees and multi-pane/multi-tab Cascade sessions, with a dedicated terminal for reliable command execution. Cloud Devin agents generate demos and screenshots of their work for human review, and local-to-cloud handoff lets you move a session to the cloud so that it keeps running while you are offline. The changelog notes that agent sessions persist across editor restarts and that the setup (panes, agent tiles, keybindings) also persists.

Windsurf's multi-agent workflow is built around Cascade's Compose model: multiple specialized agents can be spawned within a project, each working on a different layer (frontend, backend, DB). The platform has a plugin marketplace for MCPs, skills, and subagents, plus private team marketplaces for internal plugins. Wave 13 (December 2025) introduced parallel agents and made SWE-1.5 available at standard throughput for free. InfoQ's review of Cursor 3 notes that Windsurf was the first IDE to market with a purpose-built agent command surface, and that its Cascade agent is repo-aware with scoped responsibilities and project context, which makes it better suited than flat autocomplete tools to coordinating multi-part changes.

Cursor Agent Mode

Cursor, the AI-first code editor from Anysphere, launched Cursor 2.0 in October 2025 with native parallel agent support — multiple AI agents running simultaneously, each working on independent parts of a codebase. Cursor 3 (April 2026) redesigned the interface from scratch around an "agent-first" model: the primary interaction is no longer file editing but managing parallel coding agents. The interface surfaces all running agents (local and cloud) in a single sidebar, including agents kicked off from mobile, web, desktop, Slack, GitHub, or Linear. Cloud agents generate demos and screenshots of their work for review. Cursor's Composer 2 is its own frontier coding model, used for cloud execution with higher usage limits than third-party models.

Cursor defines specialized agents via .agent.md files (similar to AGENTS.md patterns), and the platform supports subagents, skills, and MCP integrations through its plugin marketplace. The changelog notes that pane layouts and agent organization persist across sessions. On Cursor's own engineering team, 35% of merged pull requests are written by autonomous cloud agents. Observability is strong: the agent sidebar shows running agents and their status, and you can drag agents into tiles for organization. Cursor supports local-to-cloud and cloud-to-local session handoff. Notably, Cursor's agent mode is proprietary: the orchestration layer is not open-source. Community tools like copilot-orchestra provide multi-agent workflow patterns for Copilot within Cursor's VS Code fork.

Claude Code (Anthropic)

Claude Code is Anthropic's terminal/IDE-native agentic coding environment, built on the Claude Sonnet 4.5 model. It was recently upgraded with a native VS Code extension (beta), checkpoints for autonomous operation, subagents for parallel workflows, hooks for triggering actions at specific points, and background tasks for non-blocking dev servers. The key architecture is the subagent system: the main agent can spawn sub-agents that run in their own context with their own allowed tools, enabling parallel execution (e.g., building a backend API while the main agent builds the frontend). Hooks can automatically trigger actions at defined points (post-change test runs, pre-commit linting). Checkpoints automatically save code state before each change; pressing Esc twice or running /rewind restores the prior state. The checkpoint + subagent combination is the foundation for confident autonomous operation: the agent can pursue wide-scale refactors knowing that any change can be rewound.
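The checkpoint idea (snapshot the state before every change, restore a snapshot on rewind) can be modeled in a few lines. The sketch below is a conceptual stand-in that keeps file contents in memory; the class and method names are hypothetical, and it says nothing about how Claude Code actually stores checkpoints.

```python
from typing import Dict, List


class CheckpointedFiles:
    """In-memory stand-in for a working tree with automatic checkpoints."""

    def __init__(self) -> None:
        self.files: Dict[str, str] = {}
        self._checkpoints: List[Dict[str, str]] = []

    def edit(self, path: str, content: str) -> None:
        # Snapshot the current state before applying the change.
        self._checkpoints.append(dict(self.files))
        self.files[path] = content

    def rewind(self, steps: int = 1) -> None:
        # Restore the state that existed before the last `steps` edits.
        if not 1 <= steps <= len(self._checkpoints):
            raise ValueError("no such checkpoint")
        self.files = self._checkpoints[-steps]
        del self._checkpoints[-steps:]


tree = CheckpointedFiles()
tree.edit("api.py", "v1")
tree.edit("api.py", "v2 (risky refactor)")
tree.rewind()  # undo the refactor; api.py holds "v1" again
```

Because every edit is preceded by a snapshot, a failed refactor costs one rewind rather than a manual cleanup, which is the property the paragraph above relies on.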

The Claude Agent SDK (formerly Claude Code SDK) exposes the same core tools, context management, and permissions frameworks for custom agentic experiences. It supports subagents and hooks natively, enabling teams to build specialized agents (financial compliance, cybersecurity, code debugging). Persistence is session-based: Claude Code stores conversation context and can read/write files locally. There is no built-in structured work-tracking database — tasks are tracked conversationally or via external tools. Quality gates are implicit: Claude Code best practices emphasize giving the agent verification criteria (tests, screenshots, expected outputs) so it can self-check. Human oversight is maintained through the checkpoint/rewind mechanism and explicit approval points for high-stakes actions. Anthropic's recommended workflow separates exploration (Plan Mode), planning, implementation, and commit — with Plan Mode preventing premature coding on poorly-understood problems.

GitHub Copilot Workspace / Agent Mode

GitHub Copilot Workspace — an agentic dev environment from GitHub Next — was sunset on May 30, 2025. Its core concept (plan agent captures intent, proposes a plan, implements changes) lives on in GitHub's current Copilot coding agent, launched August 2025. The Copilot coding agent is an asynchronous, autonomous developer agent: you assign it an issue or task, it works in the background in a GitHub Actions-powered environment, and returns a draft pull request. It integrates with GitHub Issues (assign an issue to Copilot), VS Code (via GitHub Pull Requests extension), JetBrains and Visual Studio (via Copilot Chat), and any MCP-enabled tool. The new agents panel on github.com (August 2025) lets you hand tasks to Copilot from any page on GitHub, with real-time status monitoring and no context-switching.

The Copilot coding agent has access to GitHub MCP (repository data), Playwright MCP (web page testing), and custom MCP servers. It can run builds, tests, and linters without asking for per-step approval. Quality is validated via the draft PR workflow — the human reviews the PR before merging. Observability is provided through the agents panel (running tasks with real-time status), detailed logs, and PR-based approvals. Work tracking is GitHub-native: issues and PRs are the primary work items, with Copilot creating branches and PRs as the output. Persistence is GitHub's infrastructure (branches, commits, PRs). Human-in-the-loop is explicit: the PR is the gating artifact — nothing ships without human review and merge. "Project Padawan" was announced as a future fully-autonomous agent for independent end-to-end task handling.

Feature Comparison Table

| Dimension | Kelly | Gas Town | Yuki Capital | Windsurf CUA | Cursor | Claude Code | GitHub Copilot |
|-----------|-------|----------|--------------|--------------|--------|-------------|----------------|
| Work tracking | Structured project directories + pipeline state per project | Beads (git-versioned SQL-queryable issue-trackers in Dolt) | Per-business GitHub repos; todo.md tagged by owner; decisions/ folder for institutional memory | Agent Command Center Kanban view + Spaces (grouping of sessions, PRs, files) | Agent sidebar with session tiles; .agent.md files; persistent pane layouts | Conversational / checkpoint-based; no built-in structured tracker; relies on external tools | GitHub Issues and PRs (native); agents panel for status |
| Persistence | Files on disk (memory, daily logs, pipeline state); session-based continuity | Beads in Dolt (git-versioned SQL); full work history queryable across sessions and projects | GitHub repo as persistent brain (CLAUDE.md, authority.md, decisions/, businesses/, todo.md); per-business Claude Code instances; n8n for scheduled automation | Sessions persist across editor restarts; Spaces persist project context; cloud Devin runs async | Sessions persist; cloud/local handoff preserves context | Checkpoints (code state per change); session context; SDK state | GitHub infrastructure (branches, commits, PRs); agents panel state |
| Quality gates | TEA audit (Test/Evaluate/Assess → PASS/PASS-WITH-FOLLOWUPS/REMEDIATE); READY/NOT-READY and PASS/FAIL gate files; 5-agent verdict adversarial review | Witness (continuous quality auditor watching all workers); multi-agent adversarial review; Bead-state quality gate results | Board reviews (quarterly); 30-day outcome reviews; mistake log for learnings; informal compared to Kelly | Implicit via Devin demo/screenshots for human review; limited formalized gate | Implicit via PR review; community patterns (e.g., copilot-orchestra) define multi-agent review steps | Self-verification via explicit test/screenshot criteria; no formal gate; checkpoint rollback on failure | Draft PR as gate; human reviews before merge; MCP-based test/lint runs pre-PR |
| Multi-agent roles | Named lead agents (research-lead, project-lead, test-lead); ephemeral sub-agents; router never does work | Mayor (orchestrator/control plane), Crew (named persistent agents), Polecats (ephemeral workers), Refinery, Witness, Deacon, Boot | Single persistent CEO agent (Judy Win) + per-product Claude Code agents; per-business CLAUDE.md context files; no dedicated orchestrator beyond CEO | Cascade (primary agent) + Devin (cloud); multiple parallel Cascade sessions possible | Agent-first interface; .agent.md defined roles; Composer 2 (frontier model) for orchestration | Main agent + sub-agents; hooks; Claude Agent SDK for custom role definitions | Copilot as coding agent; no named roles beyond the agent itself; MCP servers as extensions |
| Memory | memory (curated long-term), memory/YYYY-MM-DD.md (daily logs); pipeline state; narrative-rich but not machine-queryable | Beads/Dolt MEOW graph (versioned knowledge graph with typed edges); reason field captures Why per bead; SQL-queryable | GitHub repo + CLAUDE.md + decisions/ + learnings/ + per-business context files; narrative-dominant; progressive disclosure (CLAUDE.md shrank 36%) | Session-scoped context; Spaces group project context; no long-term cross-project memory described | Session context preserved; cloud sessions accessible across devices; no cross-project memory | Checkpoint history per session; SDK state; no built-in long-term memory protocol | GitHub Issues and PR history; repository context via MCP; no cross-repository memory |
| Handoff protocol | Artifact directory summaries (research-summary.md, planning-summary.md, etc.) read by next-stage agent before proceeding | Beads as universal data plane; Mayor's handoff to Refinery → Polecats; explicit gt handoff command after every task | Per-business repo handoff; per-business CLAUDE.md read per task; todo.md tagged by owner; n8n for scheduled automated handoffs | Agent Command Center status transitions; local-to-cloud and cloud-to-local session handoff; one-click Devin handoff | Agent session tiles; drag-and-drop between panes; session handoff between local and cloud | Subagent spawn with explicit task definition and output directory; SDK handoff via structured task objects | Issue/PR assignment to Copilot; draft PR as the handoff artifact to human reviewer |
| Observability | pipeline state (machine), done markers (human), heartbeat (liveness), memory logs; file-based retrospective | Light Factory model: all workers visible and addressable; Dolt SQL queries for real-time state; hook age for liveness; Deacon tracks stuck workers | Screen tracking for founder's work visibility; mistake log; board review docs; GitHub commit history; n8n execution logs | Agent Command Center Kanban (in flight, blocked, needs review); Spaces give project-level view; real-time agent status | Agent sidebar showing all running agents; inline diffs in VS Code sidebar; pane-level status; screenshots/demos from cloud agents | Checkpoint listing; status line for context usage; /rewind to restore prior state; CLI output | Agents panel (github.com/copilot/agents) with real-time task status; PR-based review; detailed logs in GitHub Actions |
| Escalation | RALPH protocol: 3 attempts max, same error twice = escalate immediately; operator notified; heartbeat for stuck detection | Deacon kills stuck agents and re-queues their Beads; no structured retry-with-diagnostics (GUPP handles throughput, not retry logic) | Authority matrix escalation (decide alone → propose → founder-only); authority transfer log tracks earned autonomy progression; 30-day review cadence | Session handoff to human for blocked tasks; no formalized escalation protocol described | Human review via PR; session can be pulled back to local for hands-on debugging; no structured escalation protocol | Checkpoint rollback on failure; /rewind to prior state; explicit approval for high-stakes actions via disable-model-invocation | Assign-to-Copilot workflow handles failures via retry; human gets notified via PR review request; no structured retry protocol |
| Pipeline structure | Explicit six-stage: Intake → Research → Planning → Implementation → Testing → Release; quick path for features/bug fixes; full path for new products | Fluid, Mayor-driven workflow; Refinery decomposes epics into Bead sequences; no fixed stages; Rigs cycle through Crew in Bezos-style review loops | Work organized by business unit (not pipeline stage); three autonomous compounding loops in production (New AI Models 3am, Bug Autofix 6am, SEO weekly); per-product Claude Code instances | Cascade sessions as units of work; Spaces group related sessions; parallel Cascade across Git Worktrees; no fixed pipeline stages | Agent-first: agent orchestration is primary, IDE is fallback; Composer 2 orchestrates; no fixed pipeline | No fixed pipeline; recommended workflow (Explore → Plan → Implement → Commit) is advisory, not enforced | No fixed pipeline; task assigned → Copilot works → draft PR returned; human reviews and merges |
| Human in loop | SHIP / NO-SHIP operator decision at Release gate; TEA audit output reviewed; pipeline gate files require human/signal | Mayor filters output (reduces human reading load); NO-SHIP equivalent is implicit in Mayor editorial judgment; no formal pre-deploy human gate | Founder sets authority levels; CEO runs autonomously within authority tier; board reviews quarterly; authority transfer log for earned autonomy | Human reviews cloud agent demos/screenshots; PR review before merge; Agent Command Center for oversight; not fully autonomous | PR review before merge; local-to-cloud handoff requires human action; 35% of Cursor's own PRs are agent-authored | Checkpoint/rewind for self-correction; explicit approval hooks for high-stakes actions; Plan Mode separation for exploration | Explicit: PR is always the gate; human reviews and merges; "Project Padawan" may change this |

Key Findings

Pattern 1: Work Tracking Is Where Factories Diverge Most

The most significant divergence across these seven factories is how they track what work exists and where it sits. Kelly uses structured project directories with gate files, which are functional and human-readable but not machine-queryable across projects. Gas Town's Beads/Dolt is the most sophisticated solution: git-versioned, SQL-queryable, with typed fields that capture not just status but reason. Yuki Capital tracks work per business in a GitHub repository, with todo.md tagged by owner. Windsurf uses a Kanban interface as the primary work surface, with Spaces for grouping. Cursor tracks work as agent sessions in a sidebar. Claude Code has no built-in work tracker at all; tasks are conversational. GitHub Copilot uses GitHub Issues and PRs as the work layer, which is both its greatest strength (familiar, enterprise-grade) and its main limitation (not designed for async autonomous agent workflows).

Pattern 2: Two Architectures for Persistence

Factories split into two persistence camps: session-plus-files (Kelly, Cursor, Claude Code, Copilot) and structured ledger (Gas Town's Beads/Dolt). Session-plus-files is simpler to deploy — Kelly's memory is just a text file — but becomes harder to query and audit as work volume grows. Structured ledgers (Dolt) enable cross-project SQL queries, immutable audit trails, and branch-based experimentation, but require running database infrastructure. Windsurf's approach is hybrid: session state persists, but the work layer (Spaces, Kanban) is more compositional than file-based.
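The practical difference between the two camps shows up when asking a cross-project question such as "which items are currently blocked, and why?". The sketch below uses Python's built-in sqlite3 module as a stand-in for a structured ledger; it is not Dolt and does not use the Beads schema. The table layout, item names, and rows are invented purely for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE transitions (
           item TEXT, project TEXT, status TEXT,
           author TEXT, reason TEXT, ts TEXT)"""
)
# Illustrative rows only: one row per state transition, never updated in place.
rows = [
    ("item-1", "alpha", "open", "mayor", "epic decomposed", "2026-04-01T09:00"),
    ("item-1", "alpha", "blocked", "polecat-7", "tests need a staging database", "2026-04-01T10:30"),
    ("item-2", "beta", "open", "mayor", "bug report triaged", "2026-04-02T08:15"),
    ("item-2", "beta", "done", "polecat-3", "fix merged", "2026-04-02T11:00"),
]
conn.executemany("INSERT INTO transitions VALUES (?, ?, ?, ?, ?, ?)", rows)

# Items whose latest transition is 'blocked', across every project, with the reason.
latest_blocked = conn.execute(
    """SELECT t.item, t.project, t.status, t.reason
       FROM transitions t
       JOIN (SELECT item, MAX(ts) AS ts FROM transitions GROUP BY item) m
         ON t.item = m.item AND t.ts = m.ts
       WHERE t.status = 'blocked'"""
).fetchall()
print(latest_blocked)
```

With a session-plus-files layout, the same question means opening each project's state files and parsing them by hand, which is the query and audit cost the paragraph above describes.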

Pattern 3: Quality Gates Range From Formal to Implicit

Kelly's TEA audit is the most formally specified: three phases (Test, Evaluate, Assess), three possible outcomes, a named artifact (tea-summary.md). Gas Town's Witness is continuous and daemon-based, but its output is less formally structured. Claude Code's quality model is entirely implicit — give the agent tests and it will use them, but there's no gate enforcement. GitHub Copilot's gate is the PR review, which is the most natural for developer workflows but requires human effort. Cursor and Windsurf lack formalized quality gates beyond what the human reviewer applies. The gap between Kelly's formal TEA and the implicit quality models of IDE-centric factories is significant for teams with compliance or audit requirements.

Pattern 4: Multi-Agent Roles Converge on Hierarchy

Despite very different implementations, most factories converge on some form of three-tier hierarchy: an orchestrator (Kelly Router, Gas Town Mayor, Cursor's Composer, Windsurf's Cascade, Claude Code main agent), specialized workers (Kelly lead agents, Gas Town Crew/Polecats, Cursor sub-agents, Claude sub-agents), and human-facing filter (Kelly's router-to-human interface, Gas Town's Mayor, Windsurf's Agent Command Center). The specific names differ but the functions are structurally similar. GitHub Copilot is the outlier — it's a single agent without named roles, relying on MCP servers for extension rather than internal role differentiation.

Pattern 5: Human-in-the-Loop Is the Biggest Differentiator

The factories vary enormously on how much human involvement they require and when. Kelly and GitHub Copilot are the most human-anchored: Kelly requires SHIP/NO-SHIP at release; Copilot requires PR review before merge. Gas Town is the most autonomous — the Mayor editorializes for the human but doesn't require approval gates. Claude Code's checkpoint system enables high autonomy with rollback safety. Cursor and Windsurf sit in the middle: significant autonomous capability but with PR-based human review as the safety net. This split — autonomy-first vs. human-gate-first — is the core strategic bet each factory is making.

Pattern 6: Observability Improves With Centralized Work Tracking

Factories with centralized work tracking (Gas Town's Dolt SQL, Windsurf's Kanban, Copilot's agents panel) have stronger observability than those with distributed file-based state (Kelly's multiple files) or conversational tracking (Claude Code). Kelly's observability is functional, in that state can be reconstructed from files, but doing so requires reading several of them. The Light Factory principle from Gas Town (all workers visible and addressable) is the aspirational model, but it requires work-tracking infrastructure that can support it. Cursor 3 and Windsurf 2.0 both made major observability investments across 2025 and 2026, reflecting an industry-wide recognition that agent-first interfaces need dashboard-first surfaces.

What's Missing Across All of Them

No factory except Kelly has a formalized adversarial multi-agent review (Kelly's 5-agent verdict / Angry Mob). Gas Town's single Witness is less robust for high-stakes quality decisions. Similarly, no factory except Kelly has an explicit pre-deployment human gate built into the pipeline structure — most rely on PR review post-implementation rather than a structured human decision point before release. Cross-factory work migration remains unsolved: Beads/Dolt is promising but not adopted by other factories. And long-term memory across projects is uniformly weak — Kelly's narrative memory is the most developed, while others rely on session-scoped or repository-scoped context only.

Summary

Most mature: Kelly and Gas Town are the most mature factories by architectural depth. Kelly has the most formally specified pipeline (six stages, named artifacts, TEA gates, RALPH escalation) — it's a methodology designed for regulated or auditable software production. Gas Town is the most fully realized open-source implementation: built, shipped, and iterated in production with 20k GitHub stars, a community, and a successor SDK (Gas City). If you want a factory that has been stress-tested with real multi-agent workloads and has the broadest conceptual coverage, Gas Town wins. If you want a factory with the most explicit process structure for human oversight, Kelly wins.

Most innovative: Gas Town is the most innovative — Beads as the universal data plane (solving "The Missing Why" problem), the MEOW knowledge graph, the Wasteland reputation economy, the Light Factory observability framing, and the 11-stage AI adoption curve are all original contributions that other factories are still absorbing. Claude Code's checkpoint system is a practical innovation for autonomous safety. Windsurf's Agent Command Center Kanban is a UX innovation that changed how agent orchestration surfaces are designed.

Simplest: GitHub Copilot (the Copilot coding agent, which succeeded Copilot Workspace) is the simplest in concept: assign a task, get a PR back. It delegates all orchestration complexity to GitHub's existing infrastructure (Issues, Actions, PRs) and requires the least new learning. Claude Code is also relatively simple to adopt, since it is a CLI/IDE tool that runs in your environment. The tradeoff is that simplicity on the orchestration side pushes complexity onto the human, who must define good verification criteria and manage context. Kelly and Gas Town are the most complex to set up but reduce ongoing human cognitive load through structure.

Most production-viable for enterprise: GitHub Copilot (with its agents panel and Copilot coding agent) wins on enterprise viability — it runs in GitHub's infrastructure, integrates with enterprise authentication and policies, and produces PRs that fit existing code review workflows. Kelly is the most production-viable for teams that need formal quality gates, audit trails, and human accountability without GitHub dependency.

The emerging convergence: All seven factories are converging toward the same mental model: a primary orchestrator, specialized parallel workers, explicit work items, human review at key gates, and observability surfaces. The differences are in the substrate (files vs. SQL ledger vs. IDE session vs. GitHub issues), the formality of the pipeline (explicit stages vs. fluid workflow), and where the human sits in the loop (every gate vs. PR review only vs. Mayor-filtered summary). The next generation of factories will likely combine Gas Town's Beads substrate, Kelly's pipeline formality, and the IDE-native observability of Cursor 3 and Windsurf 2.0.

kelly-factory-overview, kelly-gas-town-gap-analysis, steve-yegge-gas-town, yukicapital-board-review-2, yukicapital-board-review-3, yukicapital-the-agentic-economy