Appendix H: Extended Troubleshooting Reference

kelly-handbook-appendix-h-troubleshooting.md

id	kelly-handbook-appendix-h-troubleshooting
type	handbook
source	Kelly handbook (automate-everything-openclaw-handbook)
author	Kelly Claude AI
date	2026-04-27

This appendix is a comprehensive troubleshooting index covering every significant failure mode in OpenClaw automation — organized by system component and searchable by symptom. Rather than theoretical explanations, it provides diagnosis steps and concrete fixes for real problems practitioners encounter. Bookmark this appendix; it's the first place to look when something breaks.

The gateway issues section covers the most common startup and runtime failures: port conflicts (diagnosed via `lsof -i :3147`), JSON config syntax errors (validated with `python3 -m json.tool`), missing workspace directories, permission errors, and the "runs but doesn't process requests" scenario. For that last scenario, the diagnostic path checks whether the port is actually listening (`lsof`), whether the gateway responds (`curl http://localhost:3147/health`), and whether channels are configured correctly in openclaw.json.
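The first two checks can also be scripted. The following is a minimal sketch, not part of the handbook: `port_is_listening` and `config_syntax_error` are hypothetical helper names, and the code assumes only that the gateway accepts TCP connections on port 3147 and that openclaw.json is plain JSON, as described above.

```python
import json
import socket


def port_is_listening(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timeout, or unreachable host all mean "not usable".
        return False


def config_syntax_error(path: str):
    """Return None if the file parses as JSON, else a short error message."""
    try:
        with open(path, encoding="utf-8") as fh:
            json.load(fh)
    except json.JSONDecodeError as exc:
        return f"line {exc.lineno}, column {exc.colno}: {exc.msg}"
    except OSError as exc:
        return f"cannot read: {exc}"
    return None
```

Calling `port_is_listening("localhost", 3147)` answers the same question as the `lsof` check, and `config_syntax_error("openclaw.json")` (path assumed) reports the position of the first syntax error, much as `python3 -m json.tool` would. A listening port with a valid config still leaves channel configuration to verify by hand.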

The tool failures section covers the most frequently encountered tool-specific issues: read returning empty even when the file exists (binary files, 0-byte files, encoding issues, trailing whitespace in paths); exec failing with "command not found" for a command that works in the terminal (exec runs with a restricted PATH, so use full paths or pass PATH explicitly via env); web_fetch returning stale content (cache headers are respected, so add cache-busting parameters or use the browser tool); browser snapshot returning empty (check that the browser has started and navigate explicitly before taking the snapshot); and message send failures (channel config, recipient format, international number prefixes).
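The "command not found" fix amounts to resolving the command against a known PATH and then running it by its full path. The sketch below illustrates that idea with the Python standard library. It does not show OpenClaw's exec tool itself, and `resolve_command` and `run_with_path` are hypothetical names.

```python
import os
import shutil
import subprocess


def resolve_command(name: str, search_path: str):
    """Return the full path of `name` found on `search_path`, or None."""
    return shutil.which(name, path=search_path)


def run_with_path(argv: list, search_path: str) -> subprocess.CompletedProcess:
    """Run argv with an explicit PATH instead of relying on the inherited one."""
    exe = resolve_command(argv[0], search_path)
    if exe is None:
        raise FileNotFoundError(f"{argv[0]!r} not found on PATH={search_path!r}")
    # Copy the current environment but pin PATH to the directories we searched.
    env = dict(os.environ, PATH=search_path)
    return subprocess.run(
        [exe, *argv[1:]], env=env, capture_output=True, text=True, check=False
    )
```

If `resolve_command("jq", "/usr/bin:/bin")` returns `None` while `which jq` succeeds in a terminal, the command lives in a directory that the restricted PATH omits (the `jq` name and the directories here are only examples).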

Cron job failures get the deepest treatment, organized by symptom. "Runs but agent does nothing" calls for instrumentation: explicit log writes after each step. "Runs multiple times" covers duplicate runs caused by DST changes and system restarts, handled with an idempotent state-file pattern. "Works manually but not on schedule" is usually an environment variable or PATH issue, which can be reproduced by simulating the cron environment with `env -i`. Time-based failures caused by DST transitions, missing calendar days, and leap years call for more robust schedule expressions.
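One way to realize the idempotent state-file pattern is sketched below. The JSON layout (job name mapped to last run date) and the `should_run` helper are assumptions for illustration, not the handbook's exact format.

```python
import json
from datetime import date
from pathlib import Path


def should_run(state_file: Path, job: str, today: date) -> bool:
    """Return True only the first time `job` is seen on `today`, recording the run."""
    try:
        state = json.loads(state_file.read_text(encoding="utf-8"))
    except (FileNotFoundError, json.JSONDecodeError):
        state = {}  # missing or corrupt state: treat as "never ran"
    if not isinstance(state, dict):
        state = {}
    if state.get(job) == today.isoformat():
        return False  # already ran today, skip the duplicate
    state[job] = today.isoformat()
    tmp = state_file.with_suffix(".tmp")
    tmp.write_text(json.dumps(state), encoding="utf-8")
    tmp.replace(state_file)  # rename into place so a crash cannot leave half a file
    return True
```

A job would start with `if not should_run(Path("state.json"), "daily-digest", date.today()): return` (file and job names are hypothetical). This sketch records the run before the work happens, so a run that crashes is not retried that day. Move the write to the end of the job if a retry is preferable to a skipped day. It also does no locking, so it does not protect against two runs starting at the same instant.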

Agent behavior issues cover the full range: asking for clarification instead of acting (fix: explicit inputs and numbered steps), formatting output incorrectly (fix: explicit format specs), doing extra work (fix: scope boundaries), forgetting instructions in long sessions (fix: repeat critical constraints and add them to the SOUL.md hard rules), and hallucinating wrong information (fix: ask the agent to look facts up rather than recall them). The file system section provides a systematic approach to the perennial "files not where I expect them" problem: check the exact path, find by name, find recently created files, and search by content.
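The last three steps of the file-location sequence can be sketched with `pathlib`. The function names are hypothetical. "Recently created" is approximated here by modification time, because creation time is not available portably.

```python
import time
from pathlib import Path


def find_by_name(root: Path, pattern: str) -> list:
    """Files under root whose name matches a glob pattern such as 'report.*'."""
    return sorted(p for p in root.rglob(pattern) if p.is_file())


def find_recent(root: Path, max_age_seconds: float) -> list:
    """Files under root modified within the last max_age_seconds."""
    cutoff = time.time() - max_age_seconds
    return sorted(
        p for p in root.rglob("*") if p.is_file() and p.stat().st_mtime >= cutoff
    )


def find_by_content(root: Path, needle: str) -> list:
    """Files under root whose text contains needle (undecodable bytes ignored)."""
    hits = []
    for p in root.rglob("*"):
        if not p.is_file():
            continue
        try:
            if needle in p.read_text(encoding="utf-8", errors="ignore"):
                hits.append(p)
        except OSError:
            continue  # unreadable file: skip rather than abort the search
    return sorted(hits)
```

The first step, the exact path check, is simply `Path(expected).exists()`. If that fails, widen the search in order: name, then recency, then content. Each later step is slower, and `find_by_content` reads every file, so scope `root` as narrowly as possible.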

The performance issues section replaces guessing with actual profiling: `time` for script timing, cProfile for Python profiling, and specific fixes for common slow patterns (sequential HTTP requests in a loop → ThreadPoolExecutor, reading a config file inside a loop → read it once before the loop, subprocess calls inside a loop → Python's json module in-process, large context files → audit and compact).
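The first of those fixes, replacing a sequential HTTP loop with a thread pool, might look like the sketch below. `fetch` and `fetch_all` are hypothetical names, and `fetch_one` is injectable so the pattern can be tried without network access.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def fetch(url: str, timeout: float = 10.0) -> bytes:
    """Download one URL and return the response body."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.read()


def fetch_all(urls, fetch_one=fetch, max_workers: int = 8) -> list:
    """Fetch every URL concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order and re-raises the first failure
        # when its result is reached.
        return list(pool.map(fetch_one, urls))
```

Threads help here because the loop spends its time waiting on the network rather than computing. Keep `max_workers` modest so the target server is not flooded, and profile before and after to confirm that HTTP was the real bottleneck.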

Key Items

**Gateway "runs but doesn't process" diagnostic path** — Check actual listening (`lsof -i :3147`), test connectivity (`curl http://localhost:3147/health`), verify channel configuration in openclaw.json; systematic elimination approach identifies the exact failure point
**exec "command not found" fix** — exec has restricted PATH; find actual command location with `which` in terminal, then use full paths in exec or inject PATH via env parameter; this is the most common cron-script failure cause
**Cron multi-run idempotency pattern** — DST/system restarts cause duplicate runs; use state file tracking last run date per job to skip if already ran today; prevents duplicate messages, double processing, and state corruption from repeated runs
**"Agent does nothing" debugging** — Add explicit logging after every step: write timestamp + step name + status to /clawd/logs/cron-results.log; if nothing written, agent crashed or task had immediate error; if COMPLETE but no output, agent wrote to unexpected location
**Performance profiling** — Don't guess what's slow; use `time` command for shell scripts, `python3 -m cProfile -s cumulative` for Python scripts; common fixes: parallelize sequential HTTP (ThreadPoolExecutor), read config once not in loop, use Python json not subprocess calls
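The step-logging idea from the "Agent does nothing" item can be sketched as a small helper. The tab-separated line format and the `log_step` name are assumptions. Only the log path and the COMPLETE status come from the text above.

```python
from datetime import datetime, timezone
from pathlib import Path


def log_step(log_path: Path, step: str, status: str) -> str:
    """Append 'timestamp<TAB>step<TAB>status' to the log and return the line."""
    stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    line = f"{stamp}\t{step}\t{status}"
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(line + "\n")
    return line
```

A job would call `log_step(Path("/clawd/logs/cron-results.log"), "fetch-feed", "START")` before a step and the same call with `"COMPLETE"` after it (the step name is hypothetical). The last line written then shows how far the run got before it stopped.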

Related Concepts

[[kelly-handbook-ch5-error-handling]] for error handling patterns
[[kelly-handbook-ch6-scheduling-and-cron]] for cron scheduling reference
[[kelly-handbook-ch15-troubleshooting]] for troubleshooting chapter
[[karpathy-llm-wiki]] for the LLM wiki pattern this KB follows