Chapter 7: Multi-Agent Orchestration
A single agent has hard limits: a finite context window, strictly sequential work, no specialization, and no self-check when it goes wrong. Multi-agent systems address each of these: sub-agents work in parallel, each one is specialized, agents validate each other's work, and work is broken into context-sized chunks. The Kelly Router pattern is the reference architecture: the main agent (Router) never does the work itself. It only routes tasks to specialized leads (research-lead, project-lead, test-lead), validates quality gates, and communicates with the operator. Each lead is an orchestrator that spawns parallel sub-agents for its domain. The main agent reads AGENTS.md at session start to learn its routing rules, named agents, intake procedures, quality gates, and escalation protocol.
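
The routing step can be pictured as a plain lookup from task type to lead. The sketch below is illustrative only: the lead names come from the text, but the task types, the `ROUTES` table, and the `route` function are hypothetical stand-ins for rules that would actually live in AGENTS.md.

```python
from dataclasses import dataclass

# Hypothetical routing table: the task types are illustrative,
# the lead names are the ones named in the chapter.
ROUTES = {
    "research": "research-lead",
    "build": "project-lead",
    "test": "test-lead",
}


@dataclass(frozen=True)
class Task:
    task_type: str
    description: str


def route(task: Task) -> str:
    """Return the name of the lead that should own this task."""
    lead = ROUTES.get(task.task_type)
    if lead is None:
        # The router never does the work itself, so a task with no
        # matching rule is surfaced as an error instead of being handled here.
        raise LookupError(f"no routing rule for task type {task.task_type!r}")
    return lead


print(route(Task("research", "Compare three competitors")))  # prints research-lead
```

Raising on an unknown task type keeps the router honest: it either hands the task to a lead or reports that it cannot, and it never quietly picks the work up itself.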
The subagents tool spawns independent worker agents: spawn creates a labeled sub-agent with a task description; list shows running agents; steer sends additional instructions to a running agent; kill stops one. The key power move is parallel spawning—three research agents can simultaneously cover three competitors in 5 minutes instead of 15 sequentially. All sub-agents run independently and report back when complete. AGENTS.md is the operating manual that defines the router's role, named agent configurations (their triggers, capabilities, and output directories), routing rules (which agent gets which task type), explicit quality gate criteria, and the RALPH escalation protocol. A minimal quality gate before marking any task complete checks: output file exists, file is non-empty, content addresses the original request.
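
The parallel-spawning idea above can be approximated with `asyncio`. This is a sketch under stated assumptions, not the subagents tool's real interface: `run_subagent` and `spawn_parallel` are hypothetical names, and the sleep stands in for whatever work a real sub-agent would do.

```python
import asyncio


async def run_subagent(label: str, task: str, seconds: float) -> dict:
    """Hypothetical stand-in for one spawned sub-agent; the sleep simulates its work."""
    await asyncio.sleep(seconds)
    return {"label": label, "task": task, "status": "complete"}


async def spawn_parallel(tasks: dict[str, str], seconds: float = 0.1) -> list[dict]:
    """Start all sub-agents together and wait until every one has reported back."""
    workers = [run_subagent(label, task, seconds) for label, task in tasks.items()]
    # gather runs the workers concurrently and returns results in input order.
    return await asyncio.gather(*workers)


competitors = {
    "research-a": "Profile competitor A",
    "research-b": "Profile competitor B",
    "research-c": "Profile competitor C",
}
results = asyncio.run(spawn_parallel(competitors))
for result in results:
    print(result["label"], result["status"])
```

Because the three workers wait concurrently, total wall-clock time tracks the slowest worker instead of the sum of all three, which is the same effect as the 5-minutes-instead-of-15 example.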
RALPH (Retry And Learn Protocol) handles failures: any sub-agent failure triggers a retry; same failure twice → escalate immediately; three failures on any task → mandatory escalation; unrecoverable blocks → immediate escalation. The escalation message to the operator is structured with project ID, phase, what failed, error description, attempt count, and recommended next steps. Beyond RALPH, common failure modes include context overflow (break tasks into smaller chunks), agents going off-script (be more prescriptive in task instructions with explicit scope boundaries), lost results (always explicitly validate artifacts exist, don't trust "I'm done"), and infinite tool loops (add explicit stopping conditions: "research until you have 3+ credible sources or 30 minutes elapsed").
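
The RALPH rules above can be sketched as a small retry wrapper. Everything named here is an assumption made for illustration: `run_with_ralph`, the `Unrecoverable` exception, and the `Escalation` dataclass are hypothetical, and "same failure" is approximated by comparing error messages. The `Escalation` fields mirror the escalation message fields listed in the text.

```python
from dataclasses import dataclass
from typing import Callable, Union

MAX_ATTEMPTS = 3  # three failures on any task means mandatory escalation


class Unrecoverable(Exception):
    """Hypothetical marker for a block that retrying cannot fix."""


@dataclass
class Escalation:
    project_id: str
    phase: str
    what_failed: str
    error: str
    attempts: int
    next_steps: str


def run_with_ralph(
    task: Callable[[], str], project_id: str, phase: str, name: str
) -> Union[str, Escalation]:
    """Run task with retries; return its result or an Escalation for the operator."""
    seen_errors: list[str] = []
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            return task()
        except Unrecoverable as exc:
            # Unrecoverable block: escalate immediately, no retry.
            return Escalation(project_id, phase, name, str(exc), attempt,
                              "Operator decision needed; task is blocked")
        except Exception as exc:
            message = f"{type(exc).__name__}: {exc}"
            if message in seen_errors:
                # Same failure twice: escalate instead of retrying again.
                return Escalation(project_id, phase, name, message, attempt,
                                  "Same failure repeated; change the approach")
            seen_errors.append(message)
    # Three distinct failures: mandatory escalation.
    return Escalation(project_id, phase, name, seen_errors[-1], MAX_ATTEMPTS,
                      "Three failures; review task scope and instructions")
```

A caller would check whether the return value is an `Escalation` and, if so, forward its fields to the operator instead of marking the task complete.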
Key Patterns
- **Kelly Router architecture:** Main agent routes, validates gates, escalates; never does the work itself
- **Named leads with spawn/steer/kill:** research-lead, project-lead, test-lead spawn parallel sub-agents for their domain
- **AGENTS.md as operating manual:** Routing rules, named agents, quality gates, RALPH protocol, memory protocol
- **Parallel sub-agent spawning:** Three simultaneous research agents covering three competitors = 3x speedup (5 minutes instead of 15)
- **Structured handoffs with gate validation:** Each phase produces named artifacts; receiving agent checks gate before starting
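
The gate check that a receiving agent runs before starting can be sketched as follows, using the three minimal criteria from this chapter (file exists, file is non-empty, content addresses the request). The function name `check_gate` is hypothetical, and the keyword match is only a crude assumed proxy for "addresses the original request"; a real gate would likely need a stronger check.

```python
from pathlib import Path


def check_gate(output: Path, required_terms: list[str]) -> tuple[bool, str]:
    """Return (passed, reason) for a handoff artifact."""
    # Gate 1: the named artifact must exist.
    if not output.is_file():
        return False, "output file does not exist"
    # Gate 2: the artifact must not be empty.
    text = output.read_text(encoding="utf-8")
    if not text.strip():
        return False, "output file is empty"
    # Gate 3 (assumed proxy): the content mentions every required term.
    lowered = text.lower()
    missing = [term for term in required_terms if term.lower() not in lowered]
    if missing:
        return False, "content does not mention: " + ", ".join(missing)
    return True, "gate passed"
```

Returning a reason alongside the boolean gives the router something concrete to put in a retry instruction or an escalation message.
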
Related Concepts
- [[kelly-handbook-ch2-architecture]] for session architecture and the subagents tool in context
- [[kelly-handbook-ch8-memory]] for memory protocol and daily logs that multi-agent work writes to
- [[kelly-tweets-agents]] for practical multi-agent patterns from the Kelly Twitter corpus
- [[kelly-gas-town-gap-analysis]] for factory architecture comparison including multi-agent patterns