RALPH Refinements — Lessons from Production Use

Type: Operational pattern
Related: kelly-handbook-ch7-multi-agent, kelly-handbook-ch15-troubleshooting


What RALPH Is

RALPH is the factory's error-handling protocol: Retry → Ask → Log → Pause → Handoff. When a sub-agent fails, the Router follows this escalation ladder:

  1. Retry — up to 3 attempts, with diagnostics passed between retries
  2. Ask — if retries are exhausted, escalate to the operator
  3. Log — record the failure with log-failure.sh
  4. Pause — stop the pipeline at the failed step
  5. Handoff — let the operator decide next steps
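
The ladder above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the factory's implementation: `run_step`, `ask_operator`, and `log_failure` are hypothetical callables standing in for the real sub-agent spawn, the operator escalation, and log-failure.sh.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

MAX_RETRIES = 3  # "up to 3 attempts" from the ladder above


@dataclass
class StepResult:
    ok: bool
    error: Optional[str] = None


@dataclass
class RalphOutcome:
    status: str                      # "done" or "paused"
    attempts: int
    diagnostics: List[str] = field(default_factory=list)


def run_with_ralph(
    run_step: Callable[[List[str]], StepResult],  # hypothetical: runs the sub-agent step
    ask_operator: Callable[[List[str]], None],    # hypothetical: escalates to the operator
    log_failure: Callable[[List[str]], None],     # hypothetical: wraps log-failure.sh
) -> RalphOutcome:
    diagnostics: List[str] = []
    for attempt in range(1, MAX_RETRIES + 1):
        # Retry: each attempt receives the diagnostics from earlier attempts.
        result = run_step(list(diagnostics))
        if result.ok:
            return RalphOutcome("done", attempt, diagnostics)
        diagnostics.append(result.error or "unknown error")
    ask_operator(diagnostics)   # Ask
    log_failure(diagnostics)    # Log
    # Pause + Handoff: stop here and return control; the operator decides next steps.
    return RalphOutcome("paused", MAX_RETRIES, diagnostics)
```

A step that keeps failing ends with status "paused" and one entry in the diagnostics list per attempt, which is what the operator sees at handoff.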

The protocol itself hasn't changed. These refinements are about how RALPH is applied — the edge cases that production use exposed.

Refinement 1: QA Agents Must Test CRUD, Not Just READ

The Problem

QA agents were testing whether pages loaded (READ) but never testing whether data could be created, updated, or deleted (CREATE, UPDATE, DELETE). The QA gate passed, but the application was broken for any write operation.

This went undetected because:
- QA agents defaulted to GET requests — the easiest thing to test
- No workflow template specified "test all CRUD operations"
- The quality gate checked "did QA run?" not "did QA test writes?"

The Fix

Every QA workflow now explicitly requires testing all CRUD operations:

  • CREATE — submit a form, verify data appears
  • READ — load the page, verify data displays
  • UPDATE — edit an existing record, verify changes persist
  • DELETE — remove a record, verify it's gone

The QA agent's task description must include: "Test all CRUD operations — not just page loads."
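
One way to make the quality gate ask "did QA test writes?" rather than "did QA run?" is to check the QA report for all four operations. The sketch below assumes a hypothetical report format (a list of dicts with `operation` and `passed` keys); the factory's real QA output may look different.

```python
from typing import Dict, List, Set

REQUIRED_OPS = ("CREATE", "READ", "UPDATE", "DELETE")


def missing_crud_ops(qa_results: List[Dict]) -> List[str]:
    """Return the CRUD operations that have no passing test in the QA results."""
    passed: Set[str] = {
        str(r.get("operation", "")).upper()
        for r in qa_results
        if r.get("passed") is True
    }
    return [op for op in REQUIRED_OPS if op not in passed]


def crud_gate(qa_results: List[Dict]) -> bool:
    """Pass the gate only if every CRUD operation has at least one passing test."""
    return not missing_crud_ops(qa_results)
```

With this check, a report containing only a passing READ test fails the gate and names CREATE, UPDATE, and DELETE as the untested operations.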

Why This Matters for RALPH

If QA only tests READ, it will pass broken applications. RALPH will never trigger because the failure isn't caught until the user tries to write data. By the time the user reports it, the pipeline has already moved on and the context is lost.

Lesson: RALPH can only catch failures that QA actually tests for. Expand QA scope before relying on RALPH to catch bugs.

Refinement 2: Lessons Must Go into Agent Skill Files

The Problem

When the Router learned a lesson (e.g., "always test CRUD"), it wrote it to SELF_IMPROVEMENT.md or memory files. But sub-agents don't read the Router's memory. They read their own AGENTS.md, their skill files, and the task description. Lessons in the Router's notes never propagated to the agents that needed them.

A lesson the Router learned in session 1 never reached the sub-agent spawned in session 3.

The Fix

Lessons that apply to specific agents must go into those agents' skill files or AGENTS.md. The Router's memory is for the Router. Agent-specific knowledge belongs in agent-specific files.

Before: Router writes "QA must test CRUD" to memory. Next QA spawn doesn't test CRUD.

After: "QA must test CRUD" goes into the QA agent's AGENTS.md or task template. Every QA spawn reads it.

The Propagation Path

Knowledge Type             | Where It Goes
Router operational lessons | Router's AGENTS.md or memory
Agent-specific behavior    | That agent's AGENTS.md
Workflow-level rules       | Workflow markdown file (in factory/workflows/)
Cross-agent patterns       | Factory-level AGENTS.md or this KB
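
The propagation path can be expressed as a small lookup. The sketch below is illustrative only: apart from factory/workflows/, which the table names, every path is a hypothetical placeholder for wherever the corresponding file lives in the factory.

```python
from enum import Enum
from typing import Optional


class LessonScope(Enum):
    ROUTER = "router"
    AGENT = "agent"
    WORKFLOW = "workflow"
    CROSS_AGENT = "cross_agent"


def lesson_destination(
    scope: LessonScope,
    agent: Optional[str] = None,
    workflow: Optional[str] = None,
) -> str:
    """Map a lesson's scope to the file that the affected reader actually loads."""
    if scope is LessonScope.ROUTER:
        return "router/AGENTS.md"               # hypothetical path
    if scope is LessonScope.AGENT:
        if not agent:
            raise ValueError("agent-specific lessons need an agent name")
        return f"agents/{agent}/AGENTS.md"      # hypothetical path
    if scope is LessonScope.WORKFLOW:
        if not workflow:
            raise ValueError("workflow-level rules need a workflow name")
        return f"factory/workflows/{workflow}.md"
    return "factory/AGENTS.md"                  # hypothetical path
```

Requiring an agent or workflow name for the narrower scopes forces the writer to decide who needs the lesson before it is stored.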

Lesson: RALPH fixes the immediate failure. Skill file updates prevent the next one. Both are required.

Refinement 3: Parallel Pipelines Confirmed Working

The Question

Does RALPH work when multiple pipelines run concurrently? If two pipelines both hit failures at the same time, does the Router handle both correctly?

The Answer: Yes

The April 2026 validation run tested this directly — test-web-run and factory-dashboard-rebuild ran simultaneously with three agents (amelia, testlead, phil). Both pipelines had steps that needed retries. RALPH handled them independently:

  • Each pipeline has its own bead tracking — failures are pipeline-scoped
  • RALPH retries are step-scoped — failing step 4.2 in pipeline A doesn't affect step 3.1 in pipeline B
  • The Router processes completion events sequentially — no race conditions
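
The three properties above can be sketched as a single loop that drains completion events one at a time and counts failures per (pipeline, step) pair. The event shape and the function name are assumptions made for illustration; the Router's actual event format is not described here.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, List, Tuple

MAX_RETRIES = 3

# Assumed event shape: (pipeline, step, succeeded)
Event = Tuple[str, str, bool]
StepKey = Tuple[str, str]


def process_events(events: List[Event]) -> Dict[StepKey, str]:
    """Process completion events sequentially, tracking failures per (pipeline, step)."""
    queue: Deque[Event] = deque(events)
    failures: Dict[StepKey, int] = defaultdict(int)
    status: Dict[StepKey, str] = {}
    while queue:
        pipeline, step, succeeded = queue.popleft()  # one event at a time
        key = (pipeline, step)
        if status.get(key) == "escalated":
            continue  # an escalated step stays paused until the operator acts
        if succeeded:
            status[key] = "done"
        else:
            failures[key] += 1
            status[key] = "escalated" if failures[key] >= MAX_RETRIES else "retrying"
    return status
```

Because the counter is keyed by both pipeline and step, three failures of step 4.2 in one pipeline escalate that step without touching the retry count of step 3.1 in the other.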

What We Learned

  • Token usage varies wildly by step type — Python scripts use ~15-30K tokens per step; TypeScript security scans use ~270K. Budget accordingly for parallel runs.
  • Total runtime was ~1.5 hours for two full pipelines (SCAFFOLD through DEPLOY). Parallel execution saved roughly 40% wall-clock time vs sequential.
  • QA catches what build misses — in factory-dashboard-rebuild, QA added 22 new tests and caught a CSS class name bug (in_progress/closed vs active/complete/pending). The build agent never noticed.

Refinement 4: Never Spin on the Same Error

The Rule

From the factory's AGENTS.md: "Never spin if same error 2x in a row, escalate immediately."

This is a RALPH refinement that prevents wasted retries. If attempt 1 fails with error X, and attempt 2 fails with the same error X, don't try attempt 3 with the same inputs. Escalate.

How It Works

Between retries, the Router collects diagnostics:
- The error message
- The output produced so far
- Any partial artifacts

These diagnostics are passed to the next retry attempt so it can learn from the failure. But if the diagnostics show the same error twice, the Router skips retry 3 and goes straight to escalation.
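
The same-error check can be sketched as a small decision function. The `normalize` helper is an assumption: it compares errors after lowercasing and collapsing whitespace, whereas real error messages may contain timestamps or IDs that need stripping before two failures count as "the same".

```python
from typing import List

MAX_RETRIES = 3


def normalize(error: str) -> str:
    """Crude comparison key (an assumption): lowercase and collapse whitespace."""
    return " ".join(error.lower().split())


def should_escalate(errors: List[str]) -> bool:
    """Decide whether to stop retrying, given the errors seen so far (oldest first)."""
    if len(errors) >= MAX_RETRIES:
        return True   # retries exhausted
    if len(errors) >= 2 and normalize(errors[-1]) == normalize(errors[-2]):
        return True   # same error twice in a row: stop spinning
    return False
```

Two different errors in a row still allow a third attempt, since the second attempt made progress of a kind; two identical errors do not.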

Refinement 5: Failure Type Matters

Not all failures are equal. RALPH handles them differently:

Failure Type                                 | RALPH Response
Context overflow                             | Break work into smaller chunks, retry
Lost results (agent said done but no output) | Respawn (don't ask, just respawn)
Infinite tool loop                           | Add stopping conditions, retry
Agent bug (wrong output)                     | Log + escalate (the agent needs fixing)
Timeout                                      | Respawn immediately (don't wait)

The key insight: respawn-first for transient failures, escalate-first for persistent failures. RALPH's job is to distinguish between the two.
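
The failure-type table maps naturally onto a lookup. The enum and function names below are illustrative assumptions; only the failure types and their responses come from the table.

```python
from enum import Enum


class FailureType(Enum):
    CONTEXT_OVERFLOW = "context_overflow"
    LOST_RESULTS = "lost_results"
    INFINITE_TOOL_LOOP = "infinite_tool_loop"
    AGENT_BUG = "agent_bug"
    TIMEOUT = "timeout"


class Response(Enum):
    CHUNK_AND_RETRY = "break work into smaller chunks, retry"
    RESPAWN = "respawn without asking"
    ADD_STOP_AND_RETRY = "add stopping conditions, retry"
    LOG_AND_ESCALATE = "log and escalate"


ROUTING = {
    FailureType.CONTEXT_OVERFLOW: Response.CHUNK_AND_RETRY,
    FailureType.LOST_RESULTS: Response.RESPAWN,
    FailureType.INFINITE_TOOL_LOOP: Response.ADD_STOP_AND_RETRY,
    FailureType.AGENT_BUG: Response.LOG_AND_ESCALATE,
    FailureType.TIMEOUT: Response.RESPAWN,
}


def route_failure(failure: FailureType) -> Response:
    """Return the response for a classified failure."""
    return ROUTING[failure]


def is_escalate_first(failure: FailureType) -> bool:
    """True for persistent failures that go to the operator instead of being retried."""
    return route_failure(failure) is Response.LOG_AND_ESCALATE
```

Keeping the mapping in one dictionary makes the respawn-first versus escalate-first split easy to audit when a new failure type is added.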

Summary of Refinements

Refinement           | What Changed
CRUD testing         | QA must test all operations, not just READ
Lesson propagation   | Agent-specific knowledge → agent-specific files
Parallel pipelines   | Confirmed RALPH works across concurrent runs
Same-error detection | Skip retry 3 if error 2 matches error 1
Failure type routing | Different failure types get different RALPH responses

None of these change RALPH's core protocol. They're all about making the protocol work correctly in the messy reality of production pipelines.


These refinements were discovered between March and May 2026, across 10+ pipeline runs. The CRUD gap alone was responsible for two production bugs before it was identified and fixed.

Concept Cross-References