Appendix F: The Practitioner's Field Guide
This appendix is organized around questions that come up during real implementation work — the "I'm in the middle of building this and I'm stuck" questions that theory chapters don't answer. It covers the entire lifecycle: getting started when you don't know what to automate first, building pipelines that actually work, keeping a running system healthy, and scaling up as your automation practice matures. This is the appendix experienced practitioners will return to most often.
The "getting started" section addresses the most common paralysis points: choosing what to automate first (write down everything you did last week, circle anything that took >10 minutes or was done >once, pick the smallest circled item), workspace setup (directory structure, secrets permissions, writing SOUL.md/MEMORY.md/USER.md before anything else), and cron job verification (canary logging pattern to confirm jobs actually ran). The "building pipelines" section tackles the most common failure modes: same-step failures (isolate the step, check preceding output, verify environment), slow large-dataset processing (parallelize, use the right tool, cache external data, process incrementally), non-deterministic results (explicit datetime parameters, snapshot state not accumulated state, cache external data, pin random seeds), and email automation safeguards (draft mode first, hard recipient limits, dry run mode, unsubscribe tracking, bounce handling).
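The canary logging pattern mentioned above can be sketched as a small pair of helpers: the cron job appends a heartbeat line on every run, and a separate check reads the last heartbeat to confirm the job actually ran. This is a minimal illustration, not the appendix's exact code; the file path and function names are hypothetical.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

CANARY = Path("/tmp/canary_demo.jsonl")  # hypothetical path for illustration

def log_canary(job_name: str, status: str = "ok") -> None:
    """Append one heartbeat line so a later check can confirm the job ran."""
    entry = {
        "job": job_name,
        "status": status,
        "ts": datetime.now(timezone.utc).isoformat(),
    }
    with CANARY.open("a") as f:
        f.write(json.dumps(entry) + "\n")

def last_run(job_name: str):
    """Return the most recent canary entry for a job, or None if it never ran."""
    if not CANARY.exists():
        return None
    entries = [json.loads(line) for line in CANARY.read_text().splitlines() if line]
    runs = [e for e in entries if e["job"] == job_name]
    return runs[-1] if runs else None
```

A monitoring script can then alert when `last_run("daily_report")` is `None` or older than the job's schedule interval, which catches the "cron silently stopped firing" failure the appendix warns about.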
The "system running" section covers the essential debugging playbook: checking logs first (cron history, error grep, file changes, disk space, memory), diagnosing empty cron output (missing files, wrong location, swallowed errors), alert fatigue (review thresholds, make alerts actionable, time-aware escalation), and sub-agent failure modes (context too large, vague instructions, file permissions, network failures). The "scaling up" section covers cron inventory management, disk space triage, team sharing with namespace/access control, and audit trail patterns with structured JSONL logging.
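The structured JSONL audit trail pattern might look like the following sketch: every significant action gets one appended JSON line recording who did what, to what, and with what result. The log filename and field names here are illustrative assumptions, not the appendix's prescribed schema.

```python
import json
import os
from datetime import datetime, timezone

AUDIT_LOG = "audit.jsonl"  # hypothetical filename

def audit(action: str, target: str, result: str, **details) -> None:
    """Append one structured audit record per action.

    JSONL (one JSON object per line) keeps the log appendable,
    greppable, and parseable line-by-line without loading the
    whole file.
    """
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "actor": os.environ.get("USER", "unknown"),
        "action": action,
        "target": target,
        "result": result,
        "details": details,
    }
    with open(AUDIT_LOG, "a") as f:
        f.write(json.dumps(record) + "\n")
```

Because each line is independent, the trail stays readable even if a run crashes mid-write: at worst the final line is truncated and every earlier record is intact.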
The mental models section is the philosophical core: the contract model (preconditions/postconditions/error conditions for every operation), idempotency-first (design for "same result if run twice"), explicit state (write assumptions to files, don't rely on implicit context), the boundary model (validate at every system boundary), minimal surface area (fewer moving parts), and reversibility (archive before delete, backup before overwrite, draft before send). These six models can be applied immediately, without any additional technical knowledge, and they are what make experienced automation engineers consistently better than beginners.
The failure stories section provides hard-won lessons from real automation disasters: the wrong timezone deletion (store timestamps not durations), duplicate WhatsApp alert storm (persist last-checked state, not in-memory state), infinite research loop (explicit stopping criteria, not "be thorough"), schema mismatch silent failure (validate at input boundaries), cron job concurrency (lock files to prevent overlap), agent rewrite of production config (allowlists not blocklists, git for recovery), and cascading alert storm (cooldowns and deduplication). Each story follows a consistent format: what happened, how it was discovered, the fix applied, and the core lesson distilled.
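The lock-file fix from the cron concurrency story can be sketched with `fcntl.flock` (Unix only): the first instance takes a non-blocking exclusive lock, and any overlapping instance sees the lock held and exits instead of racing. The lock path is a hypothetical example.

```python
import fcntl

def acquire_lock(lock_path: str):
    """Try to take an exclusive, non-blocking lock on lock_path.

    Returns the open file handle on success (keep it open for the
    life of the job; the lock is released when the process exits),
    or None if another run already holds the lock.
    """
    f = open(lock_path, "w")
    try:
        fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        f.close()
        return None
    return f
```

A cron entrypoint would then start with `if acquire_lock("/tmp/myjob.lock") is None: sys.exit(0)`. Because the OS releases `flock` locks automatically when the process dies, a crashed run cannot leave a stale lock behind, which is the usual pitfall of hand-rolled "does the lock file exist" checks.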
Key Items
- **First-Automation Choice Exercise** — Write down everything you did last week, circle items that took >10 min or were repeated, pick the smallest circled item first; this breaks paralysis by starting with achievable wins rather than theoretically optimal automations
- **Pipeline Debugging Order** — Run failing step in isolation with known-good inputs first, then add output logging to the preceding step, then check environment (PATH, working directory, file permissions) if isolation succeeds but cron fails; systematic isolation beats random guessing
- **Non-Determinism Sources** — Time-dependent behavior (pass datetime as parameter), accumulated state (write complete snapshots not appends), external data changes (cache at pipeline start), random seeds (pin with `random.seed(42)`); track down source before trying to fix results
- **Mental Models** — Contract model (explicit pre/post/error conditions), idempotency-first, explicit state model, boundary model (validate at every external boundary), minimal surface area, reversibility model; these six models govern all production-quality automation design decisions
- **Failure Story Patterns** — Wrong timezone deletion (timestamps not durations), alert storm from memory state loss (persist to file), infinite loops from vague stopping conditions (explicit criteria), silent schema failures (validate at input boundaries), cron concurrency (lock files), production config deletion (allowlists + git); each teaches a specific defensive pattern
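Two of the non-determinism fixes above, datetime as an explicit parameter and a pinned seed, can be shown in one small function. This is an illustrative sketch with hypothetical names, not code from the appendix:

```python
import random
from datetime import datetime, timezone

def pick_daily_sample(items, n, run_time=None, seed=42):
    """Deterministic daily sampling.

    Time enters as an explicit parameter (defaulting to now only at
    the top level), and the RNG is a locally seeded instance rather
    than the global random state, so rerunning the pipeline for the
    same day reproduces the same sample.
    """
    run_time = run_time or datetime.now(timezone.utc)
    day_key = run_time.strftime("%Y-%m-%d")
    rng = random.Random(f"{seed}-{day_key}")  # seed derived from the day
    return rng.sample(items, min(n, len(items)))
```

Passing `run_time` explicitly is what makes the bug reproducible: when a run misbehaves, you can replay yesterday's exact behavior by passing yesterday's datetime instead of guessing what "now" was.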
Related Concepts
- [[kelly-handbook-ch5-error-handling]] for error handling and debugging techniques
- [[kelly-handbook-ch6-scheduling-and-cron]] for cron best practices
- [[kelly-handbook-ch7-multi-agent]] for multi-agent coordination and failure recovery
- [[karpathy-llm-wiki]] for the LLM wiki pattern this KB follows