RALPH Protocol

Type: Agent failure escalation procedure

Definition

RALPH (Retry And Learn Protocol) is the failure-handling procedure for multi-agent work in OpenClaw. The first failure of a sub-agent triggers an automatic retry. If the same failure occurs twice in a row, the Router escalates immediately, and three failures on any task make escalation mandatory. Unrecoverable blocks (an infinite loop, data corruption, a security breach) trigger immediate escalation regardless of retry count. The escalation message to the operator contains the project ID, phase, what failed, an error description, the attempt count, and recommended next steps.

How It Works

RALPH follows a simple rule: try once, try again, then escalate. The first time a sub-agent fails, the Router logs the failure and retries the task, possibly with corrected instructions. If the retry fails with the same error, that is the same failure twice in a row, so the Router escalates immediately. If it fails with a different error, the Router treats it as a new failure, but it still counts toward the task's total.

Three total failures on any task trigger mandatory escalation, regardless of whether they are the same error or different ones. At that point the Router should stop retrying, because the problem requires human intervention.
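
The counting rule in the two paragraphs above can be sketched as a small decision function. This is a minimal illustration, not OpenClaw's actual Router code: the names `Action`, `MAX_FAILURES`, and `decide`, and the choice to represent each failure as a plain error string, are assumptions made for the example.

```python
from __future__ import annotations

from enum import Enum


class Action(Enum):
    RETRY = "retry"
    ESCALATE = "escalate"


MAX_FAILURES = 3  # three failures on a task make escalation mandatory


def decide(errors: list[str]) -> Action:
    """Choose the next step from the errors seen so far on one task, oldest first.

    Hypothetical helper: the strings stand in for whatever failure
    signature a real Router would compare.
    """
    if not errors:
        raise ValueError("decide() expects at least one recorded failure")
    # The same error twice in a row: escalate immediately.
    if len(errors) >= 2 and errors[-1] == errors[-2]:
        return Action.ESCALATE
    # Three failures of any kind: escalate, whatever the errors were.
    if len(errors) >= MAX_FAILURES:
        return Action.ESCALATE
    return Action.RETRY
```

Under these assumptions, one recorded failure yields a retry, two identical failures in a row yield an escalation, and two different failures allow one more retry before the third failure forces escalation.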

For unrecoverable situations — infinite tool loops, data corruption, security issues, context overflow with no recovery path — escalate immediately without waiting for three attempts. These are situations where continued retry is likely to make things worse.
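
The immediate-escalation case can be sketched as a check that runs before any retry counting. The category labels and the function name `must_escalate_now` are hypothetical; how a real Router detects these conditions is not specified here.

```python
from __future__ import annotations

# Categories taken from the paragraph above; the string labels are illustrative.
UNRECOVERABLE = frozenset({
    "infinite_tool_loop",
    "data_corruption",
    "security_issue",
    "context_overflow_no_recovery",
})


def must_escalate_now(failure_kind: str) -> bool:
    """Return True when the failure should bypass the retry count entirely."""
    return failure_kind in UNRECOVERABLE
```

Running this check first means an unrecoverable failure escalates even on the first attempt, while every other failure falls through to the normal retry accounting.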

The escalation message to the operator is structured to be actionable: project ID and phase (so the operator knows where in the pipeline the failure occurred), what failed specifically, the error description with relevant details, attempt count (so the operator knows what was already tried), and recommended next steps. This makes it easy for the operator to assess and respond.
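
As a rough sketch, the six fields listed above map naturally onto a small record type. The class name `Escalation`, the field names, and the plain-text layout produced by `render` are assumptions for illustration; the source specifies only which pieces of information the message carries.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Escalation:
    """Hypothetical container for the fields of a RALPH escalation message."""

    project_id: str
    phase: str
    what_failed: str
    error_description: str
    attempt_count: int
    recommended_next_steps: str

    def render(self) -> str:
        """Format the escalation as plain text for the operator."""
        return "\n".join([
            f"Project ID: {self.project_id}",
            f"Phase: {self.phase}",
            f"What failed: {self.what_failed}",
            f"Error: {self.error_description}",
            f"Attempts: {self.attempt_count}",
            f"Recommended next steps: {self.recommended_next_steps}",
        ])
```

Making every field required means an escalation cannot be constructed with a piece missing, which keeps the message actionable by construction.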

Common failure modes that RALPH handles, each with its usual remedy: context overflow (break the task into smaller chunks), agents going off-script (make task instructions more prescriptive, with explicit scope boundaries), lost results (always validate that artifacts exist before marking a task complete), and infinite tool loops (add explicit stopping conditions).
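
One of the remedies above, validating that artifacts exist before marking a task complete, can be sketched as follows. The example assumes artifacts are files on local disk and that an empty file counts as a lost result; both are assumptions, and `missing_artifacts` is a hypothetical helper name.

```python
from __future__ import annotations

from pathlib import Path


def missing_artifacts(expected: list[str]) -> list[str]:
    """Return the expected artifact paths that are absent or empty."""
    missing = []
    for name in expected:
        path = Path(name)
        # Assumption: a zero-byte file is treated as lost output.
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(name)
    return missing
```

A task would be marked complete only when this list comes back empty; otherwise the missing paths give the Router something concrete to log as the failure.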

Key Properties

  • Retry once, escalate on second identical failure: the same error twice in a row means immediate escalation
  • Three failures = mandatory escalation: regardless of error type, three failed attempts trigger human involvement
  • Unrecoverable = immediate escalation: infinite loops, corruption, and security issues skip the retry count
  • Structured escalation message: project ID, phase, what failed, error, attempt count, recommended next steps
  • Router-initiated: the Router (not individual sub-agents) performs escalation
  • Context for operator: escalation includes everything needed to assess and respond

Related

  • kelly-router: the Router applies RALPH and performs escalations
  • quality-gates: quality gate failures can trigger RALPH escalation
  • subagent-spawning: sub-agent failures are the trigger events for RALPH

Source Chapters