Chapter 15: Troubleshooting & Optimization
The most common failures, in rough order of frequency:
- **Gateway not running:** nothing works. Check with `openclaw gateway status`, start with `openclaw gateway start`, and prevent recurrence with LaunchAgent or systemd autostart.
- **Wrong file paths:** "No such file or directory" errors. Use `find` to locate the file.
- **Cron running at the wrong time:** the Gateway often runs in UTC, so convert local time to UTC in cron expressions.
- **Context overflow in long-running agents:** irrelevant responses and forgotten instructions. Spawn fresh agents for long tasks.
- **Sub-agents not reporting back:** check `subagents: list` for session status.
- **`web_fetch` returning empty:** the content is loaded by JavaScript. Switch to the browser tool.
- **`edit` failing on a whitespace mismatch:** re-read the file, and use `cat -A` to reveal invisible characters (on macOS, where BSD `cat` lacks `-A`, `cat -et` is the equivalent).
- **Permission denied:** check with `ls -la`, then use `chmod 644` for files, `chmod 755` for scripts, and `chmod 600` for secrets.

Debugging relies on four techniques: binary search through pipeline steps, log analysis with `tail` and `grep -i error`, manual reproduction of the failing task, and temporarily adding `set -x` or verbose logging.
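Binary search through pipeline steps can be sketched roughly as follows. This is an illustration only, not OpenClaw functionality: `first_failing_step` and the `ok_through` callback are hypothetical names, and the sketch assumes that once one step corrupts the output, every later check also fails.

```python
from typing import Callable, Optional, Sequence


def first_failing_step(
    steps: Sequence[str],
    ok_through: Callable[[int], bool],
) -> Optional[int]:
    """Return the index of the first failing step, or None if all pass.

    ok_through(i) is a caller-supplied check (hypothetical) that returns
    True when the pipeline output is still correct after steps[0..i].
    Assumes failures are "sticky": after the first bad step, every later
    check also returns False.
    """
    lo, hi = 0, len(steps)  # the first failing index lies in [lo, hi]
    while lo < hi:
        mid = (lo + hi) // 2
        if ok_through(mid):
            lo = mid + 1  # everything up to mid is fine; look later
        else:
            hi = mid  # mid is already broken; look at mid or earlier
    return lo if lo < len(steps) else None


if __name__ == "__main__":
    pipeline = ["fetch", "parse", "transform", "write", "notify"]
    # Pretend the "write" step (index 3) is the one that breaks things.
    idx = first_failing_step(pipeline, lambda i: i < 3)
    print(pipeline[idx] if idx is not None else "all steps pass")
```

With five steps this needs three checks instead of up to five; the saving grows with pipeline length, since the number of checks scales with the logarithm of the step count.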
Performance tuning starts with measurement: run script steps under `time` to find the slow ones. Common culprits are web requests without timeouts (add `timeout=10` so a stalled request cannot hang indefinitely), reading very large files unnecessarily, and sequential operations that could run in parallel (spawn sub-agents for parallel fetches). Context loading overhead affects every session that loads large SOUL.md, MEMORY.md, or AGENTS.md files, so keep them concise and periodically archive old MEMORY.md entries to files under data/. Sub-agent efficiency: use sub-agents for well-defined, isolated tasks where parallelism or isolation matters; for a simple single tool call, use the main agent directly. Cost optimization targets three things: input tokens (keep context files concise; use cheaper models such as Haiku for routine cron tasks via a model override in the cron config), output tokens (instruct agents to be terse on automated tasks that write to files rather than to users), and scheduling (a heartbeat every 2 hours instead of every 15 minutes means 12 runs a day instead of 96, an 8x cost difference for similar coverage).
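Inside a Python script, one way to measure before tuning is a small timing helper like the sketch below. The names `timed`, `slowest`, and `timings` are illustrative and not part of any OpenClaw API; the step labels in the demo are placeholders.

```python
import time
from contextlib import contextmanager
from typing import Dict, List, Tuple

# Accumulated wall-clock seconds per labelled step.
timings: Dict[str, float] = {}


@contextmanager
def timed(label: str):
    """Record how long the enclosed block takes, even if it raises."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        timings[label] = timings.get(label, 0.0) + elapsed


def slowest(n: int = 3) -> List[Tuple[str, float]]:
    """Return the n slowest steps as (label, seconds), slowest first."""
    return sorted(timings.items(), key=lambda kv: kv[1], reverse=True)[:n]


if __name__ == "__main__":
    with timed("load_config"):
        pass  # placeholder for a cheap step
    with timed("fetch_pages"):
        time.sleep(0.2)  # placeholder for a slow network step
    for label, seconds in slowest():
        print(f"{label}: {seconds:.3f}s")
```

Ranking the steps by elapsed time points optimization effort at the step that dominates the run, rather than at the one that merely looks expensive.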
Security hardening follows the principle of least privilege: set the tool policy to `security: "allowlist"` in openclaw.json, list exactly which commands are permitted, and keep `elevated: false` unless elevation is explicitly needed. Secrets management requires that credentials never appear in task descriptions, SOUL.md, logs, or conversation history. Store them in /secrets/ with `chmod 700` on the directory and `chmod 600` on the files. Put automation that affects third parties (client emails, customer data, financial transactions) behind a drafts-only pattern, and never automate the final send without human approval. Monitor system behavior by periodically auditing cron-history.log and outbound-messages.log for unexpected patterns.
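Reading a credential at runtime by path, while refusing files whose permissions are looser than 600, might look like the sketch below. `load_secret` is a hypothetical helper, the JSON layout and the example path are assumptions, and the permission check applies to POSIX systems only.

```python
import json
import os
import stat
from pathlib import Path


def load_secret(path):
    """Load a JSON secrets file, but only if it is owner-only (mode 600).

    Hypothetical helper: raises PermissionError when group or other
    permission bits are set, so a misconfigured file fails loudly
    instead of being used silently.
    """
    p = Path(path)
    mode = stat.S_IMODE(p.stat().st_mode)
    if mode & 0o077:  # any group/other bit means the file is too open
        raise PermissionError(
            f"{p} has mode {mode:o}; expected 600 (owner read/write only)"
        )
    with p.open() as f:
        return json.load(f)


if __name__ == "__main__":
    # Example path only; the script keeps the value in memory and
    # never prints or logs it.
    secret_path = "/secrets/example-api.json"
    if os.path.exists(secret_path):
        creds = load_secret(secret_path)
        print(f"loaded {len(creds)} credential field(s)")
```

Passing only the path to an agent or script, and loading the value at the point of use, keeps the credential itself out of task descriptions and conversation history.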
The decay problem (websites change, APIs update, processes evolve) is mitigated by alerting on failure rather than only logging success, auditing monthly to confirm what is actually still working, and recording a "last successful run" timestamp in logs. Documentation debt is prevented by writing documentation at creation time, maintaining a CHANGELOG, and recording failures and their fixes in MEMORY.md when they occur. Version compatibility requires pinning model versions when stability matters more than capability, testing after OpenClaw updates before production use, and keeping working configs in git for rollback. A monthly health-check cron job runs a full audit: whether all cron jobs ran, disk usage, stale files in data/, and whether scripts still exist and are executable, with findings reported to the operator.
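Part of that monthly audit could be scripted along the lines of the sketch below. It covers only the script, stale-file, and disk checks; the function name `audit`, the 30-day staleness threshold, and the example paths are assumptions, and verifying that cron jobs ran would depend on the format of cron-history.log, which is not shown here.

```python
import os
import shutil
import time
from pathlib import Path


def audit(scripts, data_dir, max_age_days=30, now=None):
    """Return a list of human-readable findings for the operator.

    scripts: paths that should exist and be executable.
    data_dir: directory scanned recursively for files not modified in
    max_age_days (an assumed threshold; tune to taste).
    """
    now = time.time() if now is None else now
    findings = []

    for script in map(Path, scripts):
        if not script.is_file():
            findings.append(f"missing script: {script}")
        elif not os.access(script, os.X_OK):
            findings.append(f"not executable: {script}")

    cutoff = now - max_age_days * 86400
    for f in sorted(Path(data_dir).rglob("*")):
        if f.is_file() and f.stat().st_mtime < cutoff:
            findings.append(f"stale file: {f}")

    usage = shutil.disk_usage(data_dir)
    findings.append(f"disk used: {usage.used / usage.total:.0%}")
    return findings


if __name__ == "__main__":
    # Example paths only.
    if Path("data").is_dir():
        for line in audit(["scripts/daily-report.sh"], "data"):
            print(line)
```

Returning findings as plain strings keeps the report easy to forward to the operator through whatever channel the monthly cron job already uses.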
Key Patterns
- **Binary search debugging:** Run the middle step first, then halve the range each iteration to find the failing step fast
- **Cost optimization via model tiering:** Use cheap models (Haiku) for routine cron jobs; reserve Sonnet for complex tasks
- **Secrets never in context:** Store credentials in /secrets/*.json with chmod 600; read at runtime by path
- **Least-privilege tool policy:** `security: "allowlist"` with explicit allowedCommands; keep `elevated: false` unless necessary
- **Monthly health audit cron:** Full system self-assessment (cron history, disk, stale files, script existence), reported monthly
Related Concepts
- [[kelly-handbook-ch14-designing-stack]] for designing systems that are maintainable from the start
- [[kelly-handbook-ch6-cron]] for scheduling the monthly health audit and understanding cron failure modes
- [[kelly-handbook-ch3-file-automation]] for the file operation troubleshooting (paths, permissions, edit failures)
- [[kelly-handbook-ch10-browser]] for the web_fetch vs. browser debugging decision (JS-loaded content needs browser)