How to Run AI Coding Agents Continuously for Days
- Decompose all work into atomic, file-scoped task files with explicit acceptance criteria and constrained file lists.
- Create a CLAUDE.md persistent memory file committed to the repo with architectural rules, active tasks, and do-not-touch zones.
- Write an orchestration script that feeds tasks sequentially to the Claude Code CLI in isolated sessions.
- Gate every task completion with automated tests, linting, and type checks before allowing a commit.
- Commit after every completed task so rollback is always a single git revert.
- Retry failed tasks automatically with a cap, then move persistent failures to a review queue.
- Update CLAUDE.md as part of each task’s definition of done so project memory grows over time.
- Review completed work periodically as a human to catch intent drift that automated tests miss.
Why AI Coding Agents Fall Apart After an Hour
Autonomous AI coding agents promise to ship features while developers sleep. The reality is less glamorous. In the author’s experience, most agents begin losing context after roughly an hour of continuous work, hallucinating file structures that do not exist, rewriting working code, or drifting so far from the original objective that their output becomes a liability rather than a contribution. Running AI coding agents continuously for days requires more than a good prompt.
Simon Last, co-founder of Notion, publicly documented a methodology for running AI coding agents continuously for 13 days on real production work. The approach caught attention across the developer community not because it relied on some secret model capability, but because it treated the agent as an unreliable worker and built disciplined infrastructure around it: automated rollback, test gates, and persistent memory. This article distills that methodology into a reproducible system architecture for long-running agentic coding sessions using Claude Code and related tools like Cursor.
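System requirements: Bash ≥4.x (the orchestration script uses mapfile; macOS ships bash 3.2 by default, so install a newer version via brew install bash), Git ≥2.23 (needed for git restore), Node.js 20+, a timeout command (part of GNU coreutils; macOS does not ship one by default, and brew install coreutils may expose it as gtimeout), and the Claude Code CLI installed and authenticated. Run claude --version to confirm your CLI is working. If you have not yet set up authentication, run claude login (confirm the exact command with claude --help) or set the ANTHROPIC_API_KEY environment variable. Verify that claude is on your $PATH before proceeding.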
Prerequisites: working familiarity with the Claude Code CLI, experience with agentic workflows, and comfort with project-level AI configuration files.
Why AI Coding Agents Lose the Plot
Context Window Saturation
Every AI coding agent operates within a finite context window. As a session extends, the volume of code, conversation history, and intermediate reasoning accumulates until it exceeds what the model can hold. At that point, the model summarizes or drops older context entirely. The problem compounds: a small piece of lost context — say the rationale behind an architectural decision made 40 minutes ago — cascades into downstream errors. The agent makes choices that contradict earlier work, and each subsequent decision builds on an increasingly corrupted foundation. Note that context handling behavior varies by model and implementation; consult Anthropic’s documentation for Claude-specific details.
Goal Drift and Hallucination Accumulation
Without external grounding, agents invent subtasks that no one requested, forget the original objective mid-session, or confidently rewrite working code based on hallucinated requirements. This is the “telephone game” effect — each retelling introduces small distortions that compound across iterations. Across a long conversation, the agent’s internal representation of the goal mutates incrementally. In Simon Last’s logged sessions, goal drift became apparent by the two-hour mark, though timing varies by task complexity. By that point, the agent may be solving an entirely different problem than the one it was given, with no awareness that it has drifted.
No Persistent Memory by Default
Human developers rely on issue trackers, architectural decision records, git history, and handwritten notes to maintain continuity across days of work. Most agent setups provide none of this. Each interaction is ephemeral. When a session ends — whether by hitting the context window limit or explicit session termination — every lesson, every codebase insight, every convention the agent picked up vanishes. Unless developers deliberately build persistence mechanisms, the next session starts from near-zero.
The Core Architecture for Multi-Day Agent Runs
Principle 1: Break Work into Atomic, File-Scoped Tasks
Never give the agent a sprawling goal. Instead, structure all work as small, self-contained task files written in markdown, each with explicit acceptance criteria. Every task should be completable within one focused session. As a starting heuristic (calibrate per-project), scope tasks to roughly 30 to 60 minutes of agent work. Scope each task to specific files, with clear boundaries on what the agent may and may not modify.
A sample task file illustrates the structure:
# Task 017: Refactor Auth Middleware to Support JWT Rotation
## Objective
Refactor `src/middleware/auth.ts` to accept multiple valid JWT signing keys,
enabling zero-downtime key rotation.
## Constraints
- Only modify `src/middleware/auth.ts` and `src/config/auth-keys.ts`
- Do NOT modify any test files
- Maintain backward compatibility with existing single-key configuration
## Files in Scope
- src/middleware/auth.ts
- src/config/auth-keys.ts
## Definition of Done
- All existing tests in `tests/middleware/auth.test.ts` pass without modification
- New configuration accepts an array of signing keys
- Middleware validates tokens against all active keys in order
## Rollback
- `git revert HEAD` restores previous single-key behavior
(assumes this task's changes were committed as a single atomic commit)
This level of specificity prevents the agent from interpreting the task broadly and wandering into unrelated refactors.
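A task queue only works if every file follows the same structure, so it can help to check files before they enter the queue. The following is a minimal sketch, assuming the section headings used in the sample task file above; validate_task_file is a hypothetical helper name, not part of any tool.

```bash
# Hypothetical helper: verify that a task file contains the sections used in
# the sample task above. Prints each missing section and returns non-zero
# if any are absent.
validate_task_file() {
  local task_file="$1" missing=0 section
  for section in "Objective" "Constraints" "Files in Scope" "Definition of Done"; do
    # -x matches the whole line, -F treats the heading as a literal string.
    if ! grep -qxF "## ${section}" "$task_file"; then
      echo "missing section: ${section}"
      missing=1
    fi
  done
  return "$missing"
}

# Example usage (path is illustrative):
# validate_task_file ./tasks/pending/017-jwt-rotation.md || echo "fix the task file first"
```

Rejecting malformed task files up front is cheaper than discovering the problem after an agent session has already spent its turns on an under-specified task.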
Principle 2: Use CLAUDE.md as Persistent System Memory
CLAUDE.md is the project-level instruction file that Claude Code reads at the start of every session (see Claude Code’s official documentation to confirm this behavior for your installed version). It persists across sessions because it is committed to the repository, making it the natural location for everything the agent needs to know but cannot remember on its own. The file should contain architectural decisions, coding standards, known pitfalls, protected zones, and current sprint context.
# CLAUDE.md — Project Persistent Memory
## Project Context
- Monorepo: TypeScript backend (Node 20) + React frontend
- ORM: Drizzle with PostgreSQL 16
- Auth: Custom JWT middleware (see src/middleware/auth.ts)
## Active Task Queue
- Task 017: Refactor auth middleware for JWT rotation
- Task 018: Add rate limiting to public API endpoints
- Task 019: Migrate user preferences to new schema
## Completed Tasks Log
<!-- Example entries: replace with your actual task summaries and dates -->
- Task 016: Extracted database connection pooling to shared module (2024-01-15)
- Task 015: Fixed N+1 query in dashboard endpoint (2024-01-14)
## Architectural Invariants
- All database queries go through the repository layer — never call Drizzle directly from routes
- Error responses follow RFC 7807 Problem Details format (a standard format for HTTP API error responses)
- No default exports — use named exports exclusively
## Do Not Touch
- `src/core/billing/` — active rewrite by another team, conflicts guaranteed
- `migrations/` — only human-authored migration files
## Agent Behavioral Rules
- Run `npm test` before considering any task complete
- Commit messages follow Conventional Commits format
- If unsure about scope, stop and ask rather than guessing
Every fresh session begins with the agent reading this file cold, which provides continuity without relying on conversational memory.
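Because CLAUDE.md is plain markdown, scripts can read it as well as the agent. As a hedged sketch, the helper below extracts the paths listed under "Do Not Touch" so that tooling can reuse them instead of keeping a second hard-coded list. It assumes each bullet wraps its path in backticks, as in the sample above; protected_paths is a hypothetical name.

```bash
# Hypothetical helper: print the paths listed under "## Do Not Touch".
# Assumes each bullet starts with a backtick-quoted path, as in the sample.
protected_paths() {
  local memory_file="${1:-CLAUDE.md}"
  awk '
    /^## Do Not Touch/ { in_section = 1; next }
    /^## /             { in_section = 0 }
    in_section && /^- `/ {
      split($0, parts, "`")   # parts[2] holds the text inside the first backtick pair
      print parts[2]
    }
  ' "$memory_file"
}

# Example usage in a bash script:
# mapfile -t protected < <(protected_paths CLAUDE.md)
```

Keeping the protected list in one place means a change to CLAUDE.md takes effect for both the agent and any script that enforces the same boundaries.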
Principle 3: Checkpoint and Commit Obsessively
What happens when you skip checkpointing? The agent drifts for 90 minutes, touches fourteen files, and your only recovery option is git diff archaeology on a tangled mess. The agent should never be more than one atomic change away from a clean rollback point. Enforce a git commit after every completed subtask so that when something goes wrong (and it will), recovery is a single git revert. This discipline also creates a detailed audit trail that humans can review asynchronously.
Implementing the Task Loop System
The Orchestration Script
The system that ties everything together is a lightweight orchestration script that feeds tasks to Claude Code sequentially. It reads from a task queue directory, invokes the agent per task, validates output, commits results, and advances to the next task. Failed tasks get moved to a separate directory with an error log for human review.
Important: Before running this script, verify the Claude Code CLI flag names against your installed version by running claude --help. The flags --print, --allowedTools, and --max-turns are used below; confirm they appear exactly as written in your CLI’s help output. Also confirm that Claude Code CLI accepts task content via stdin redirect by running: echo 'test task' | claude --print and checking for expected behavior. Consult claude --help for the correct input mechanism if stdin does not work.
Important: This script must be run on a dedicated git branch, not on main or master. The retry loop discards uncommitted changes on failure, and commits are made automatically. Running on a dedicated branch ensures you can safely discard or merge the results.
#!/usr/bin/env bash
set -euo pipefail

TASK_DIR="./tasks/pending"
DONE_DIR="./tasks/done"
FAIL_DIR="./tasks/failed"
LOG_DIR="./tasks"
LOG_FILE="$LOG_DIR/orchestration.log"
MAX_RETRIES=2   # total attempts per task
RETRY_DELAY=30  # seconds to wait after a failed attempt
PROTECTED_PATHS=("src/core/billing/" "migrations/")

mkdir -p "$DONE_DIR" "$FAIL_DIR" "$LOG_DIR" "./tasks/logs"

log() { echo "[$(date)] $*" >> "$LOG_FILE"; }

# Discard the agent's changes. ./tasks is excluded so that git clean does not
# delete untracked task files and logs; keep ./tasks untracked or gitignored.
reset_worktree() {
  git restore --staged --worktree -- . ':(exclude)tasks'
  git clean -fd -e tasks/
}

# Pre-flight checks: required tools, dedicated branch, clean tree, green tests.
for cmd in claude git npm timeout; do
  if ! command -v "$cmd" &> /dev/null; then
    echo "ERROR: $cmd not found on PATH. Install it (and authenticate claude) first." >&2
    exit 1
  fi
done
current_branch=$(git rev-parse --abbrev-ref HEAD)
if [[ "$current_branch" == "main" || "$current_branch" == "master" ]]; then
  echo "ERROR: refusing to run on $current_branch. Switch to a dedicated branch." >&2
  exit 1
fi
if ! git diff --quiet || ! git diff --cached --quiet; then
  echo "ERROR: dirty working tree. Commit or stash changes before running orchestration." >&2
  exit 1
fi
if ! npm test > /dev/null 2>&1; then
  echo "[$(date)] ERROR: Pre-flight 'npm test' failed. Fix the test suite before running orchestration." | tee -a "$LOG_FILE" >&2
  exit 1
fi

for task_file in "$TASK_DIR"/*.md; do
  [ -f "$task_file" ] || continue
  task_name=$(basename "$task_file")
  if [[ ! "$task_name" =~ ^[a-zA-Z0-9_.-]+$ ]]; then
    log "ERROR: Task filename contains unsafe characters: $task_name, skipping"
    continue
  fi
  log "Starting: $task_name"

  # Collect the bullet entries under "## Files in Scope".
  mapfile -t scoped_files < <(
    awk '/^## Files in Scope/{found=1; next} /^##/{found=0} found && /^- /{print substr($0,3)}' \
      "$task_file"
  )
  if [ ${#scoped_files[@]} -eq 0 ]; then
    log "ERROR: No files in scope parsed from $task_name, moving to failed"
    mv "$task_file" "$FAIL_DIR/"
    continue
  fi

  # Refuse tasks whose declared scope reaches into a protected path.
  scope_violation=false
  for f in "${scoped_files[@]}"; do
    for p in "${PROTECTED_PATHS[@]}"; do
      if [[ "$f" == "$p"* ]]; then
        log "ERROR: Scoped file $f matches protected path $p, skipping task $task_name"
        scope_violation=true
        break 2
      fi
    done
  done
  if [ "$scope_violation" = true ]; then
    mv "$task_file" "$FAIL_DIR/"
    continue
  fi

  retries=0
  success=false
  TASK_LOG="./tasks/logs/${task_name%.md}.out"
  while [ "$retries" -lt "$MAX_RETRIES" ]; do
    if ! timeout 300 claude --print \
        --allowedTools "Edit,Write,Bash" \
        --max-turns 25 \
        < "$task_file" > "$TASK_LOG" 2>> "$LOG_FILE"; then
      log "WARN: claude failed or timed out for $task_name"
    elif [ ! -s "$TASK_LOG" ]; then
      log "WARN: claude produced no output for $task_name (possible stdin delivery failure)"
    else
      # Stage only the declared files, then drop every other change so the
      # tests run against exactly what will be committed.
      for f in "${scoped_files[@]}"; do
        git add -- "$f" 2>> "$LOG_FILE" || log "WARN: could not stage $f"
      done
      git restore --worktree -- . ':(exclude)tasks'
      git clean -fd -e tasks/
      if git diff --cached --quiet; then
        log "WARN: no in-scope changes produced for $task_name"
      elif ! git diff --cached --quiet -- "${PROTECTED_PATHS[@]}"; then
        log "ERROR: Protected files staged, aborting commit for $task_name"
      elif npm test >> "$LOG_FILE" 2>&1; then
        git commit -m "feat: complete ${task_name%.md} (automated agent run)"
        mv "$task_file" "$DONE_DIR/"
        log "Completed: $task_name"
        success=true
        break
      else
        log "WARN: npm test failed for $task_name"
      fi
    fi
    reset_worktree
    retries=$((retries + 1))
    log "Retry $retries/$MAX_RETRIES for: $task_name (sleeping ${RETRY_DELAY}s)"
    sleep "$RETRY_DELAY"
  done

  if [ "$success" = false ]; then
    mv "$task_file" "$FAIL_DIR/"
    log "Failed after retries: $task_name"
  fi
done
log "Orchestration complete."
The script uses Claude Code’s --print flag for non-interactive output, --allowedTools to restrict the agent to editing, writing, and bash execution, and --max-turns to cap each session’s length. Each claude invocation is wrapped with timeout to prevent a hung CLI process from blocking the loop indefinitely. These constraints prevent runaway sessions.
Session Boundaries as a Feature, Not a Bug
Intentionally killing and restarting agent sessions after each task is a feature of this architecture, not a limitation. Each fresh context window means the agent reads CLAUDE.md and the specific task file without any accumulated conversational noise from previous tasks. Shorter individual sessions enable much longer total productive runs because each session starts clean.
Validation Gates Between Tasks
Automated validation between tasks catches problems before they compound. Automated tests catch regressions, but they cannot verify architectural intent — human reviewers do that through periodic review. If validation fails, the pipeline does not proceed. The task gets retried or flagged for human review.
The validation logic is embedded directly in the orchestration script above (the npm test check followed by commit-or-retry). Linters, type checkers, and integration tests can all slot into this gate by adding them alongside the npm test invocation. The key principle is that no task’s output is committed, let alone merged into the main branch, without passing automated verification.
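One way to add more checks to the gate is a small runner that executes each check in order and stops at the first failure. This is a sketch under stated assumptions: run_gates is a hypothetical helper, and the commands in the usage comment assume your package.json defines a lint script and that TypeScript is installed.

```bash
# Hypothetical gate runner: run each check in order, stop at the first failure.
# Each argument is a full command line, executed in its own bash process.
run_gates() {
  local gate
  for gate in "$@"; do
    echo "gate: ${gate}"
    if ! bash -c "$gate"; then
      echo "gate failed: ${gate}" >&2
      return 1
    fi
  done
}

# Example usage (commands are assumptions about your project setup):
# run_gates "npm test" "npm run lint" "npx tsc --noEmit"
```

Ordering the cheapest checks first keeps failed attempts short, which matters when the same gate runs after every task for days.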
Comparison: Typical Agent Limits vs. Extended Runs with Survival Techniques
| Dimension | Typical Agent Run | Extended Multi-Day Run |
|---|---|---|
| Max effective duration | ~1 hour | 13+ days (per Simon Last’s documented run) |
| Context management | Single conversation window | Per-task fresh context + CLAUDE.md |
| Task granularity | Vague, multi-step prompts | Atomic, file-scoped task files |
| Memory persistence | None (in-context only) | CLAUDE.md + git history + task logs |
| Error recovery | Manual intervention | Auto-retry + failed task queue |
| Commit frequency | End of session (maybe) | After every completed task |
| Validation | Manual review | Automated gates (tests, lint, types) |
| Drift detection | Human notices (eventually) | Validation failures halt pipeline |
| Rollback capability | Manual undo | git revert one atomic commit |
| Human oversight needed | Constant babysitting | Periodic review of completed queue |
The extended approach does not make the agent more capable. The agent’s capabilities remain identical; what changes is the infrastructure that constrains, validates, and recovers from the agent’s inevitable mistakes.
Advanced Techniques for Resilience
Self-Updating CLAUDE.md
The most powerful extension of this architecture is making the agent responsible for updating CLAUDE.md as part of each task’s definition of done. This creates a living project memory that grows more useful over time.
## On Completion (append to task file instructions)
After completing this task:
1. Add an entry to CLAUDE.md under "Completed Tasks Log" with the date and summary
2. If you discovered any caveats or gotchas, add them under "Known Issues" (create the section if it does not exist yet)
3. If architectural decisions were made, document them under "Architectural Invariants"
4. Update "Active Task Queue" to remove this task
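Over days of operation, CLAUDE.md accumulates institutional knowledge that makes subsequent tasks more accurate. The agent effectively builds its own onboarding document. Because the orchestration script stages only the files declared under “Files in Scope,” list CLAUDE.md there for any task that is expected to update it; otherwise the update is never committed.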
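If you would rather not rely on the agent to format the log entry consistently, an alternative is to have the orchestration layer write it. The sketch below is that swapped-in variant, not the agent-driven update described above. It assumes CLAUDE.md contains a "## Completed Tasks Log" heading as in the earlier sample, and log_completed_task is a hypothetical helper.

```bash
# Hypothetical helper: insert a dated entry directly below the
# "## Completed Tasks Log" heading, so the newest entry comes first.
log_completed_task() {
  local memory_file="$1" summary="$2" today tmp
  today=$(date +%Y-%m-%d)
  tmp=$(mktemp)
  # Pass the entry through the environment so awk does not interpret
  # backslashes in the summary text.
  ENTRY="- ${summary} (${today})" awk '
    { print }
    /^## Completed Tasks Log/ { print ENVIRON["ENTRY"] }
  ' "$memory_file" > "$tmp"
  # Overwrite in place to keep the original file permissions.
  cat "$tmp" > "$memory_file"
  rm -f "$tmp"
}

# Example usage:
# log_completed_task CLAUDE.md "Task 017: Refactored auth middleware for JWT rotation"
```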
Parallel Agent Tracks with Dependency Awareness
Multiple agents can run simultaneously on non-overlapping file scopes. The simplest approach partitions tasks by directory or module and runs separate orchestration loops, each with its own task queue. Branch-per-agent strategies prevent merge conflicts: each agent works on a dedicated branch, and a human or CI process handles integration. The details of branch naming conventions, merge conflict resolution, and CI configuration depend on your team’s workflow and are beyond the scope of this article.
File-level lock registries offer one coordination mechanism (not yet validated at scale): if task files explicitly declare their file scope, the orchestration layer can verify that no two concurrent agents claim overlapping files. Conflicts get queued rather than collided. Implementation options include shell-level flock commands or a shared JSON lockfile, though the specifics will vary by project setup.
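Before reaching for locks, the overlap check itself can be sketched in a few lines. The example below compares the "Files in Scope" sections of two task files and reports any shared paths. It assumes the task file layout shown earlier; files_in_scope and scopes_overlap are hypothetical helpers, and an actual lock registry would still be needed to coordinate agents at run time.

```bash
# Hypothetical helper: print the bullet entries under "## Files in Scope".
files_in_scope() {
  awk '/^## Files in Scope/{found=1; next} /^##/{found=0} found && /^- /{print substr($0,3)}' "$1"
}

# Hypothetical helper: print files claimed by both task files.
# Returns 0 when the scopes overlap and 1 when they are disjoint.
scopes_overlap() {
  local shared
  # De-duplicate each list, merge them, and keep only paths that appear twice.
  shared=$({ files_in_scope "$1" | sort -u; files_in_scope "$2" | sort -u; } | sort | uniq -d)
  [ -n "$shared" ] && printf '%s\n' "$shared"
}

# Example usage (paths are illustrative):
# if scopes_overlap tasks/pending/017.md tasks/pending/018.md; then
#   echo "queue these tasks on the same track"
# fi
```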
Human-in-the-Loop Checkpoints
Scheduled mandatory human review every N tasks, such as every 20 completed tasks or every 8 hours, catches the category of problems that automated tests miss. Reviewers examine the git log, verify CLAUDE.md updates are accurate, and sanity-check that architectural direction has not subtly shifted. Automated tests verify behavior; human review verifies intent.
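The count-based half of this rule is simple to automate. As a minimal sketch, the helper below succeeds whenever the number of completed task files is a positive multiple of N; review_due is a hypothetical name, and the time-based trigger (every 8 hours) is not covered here.

```bash
# Hypothetical helper: succeed when the number of completed task files in
# done_dir is a positive multiple of `every`.
review_due() {
  local done_dir="$1" every="$2" count=0 f
  for f in "$done_dir"/*.md; do
    if [ -f "$f" ]; then
      count=$((count + 1))
    fi
  done
  [ "$count" -gt 0 ] && [ $((count % every)) -eq 0 ]
}

# Example usage inside an orchestration loop, after a task completes:
# if review_due ./tasks/done 20; then
#   echo "Review checkpoint reached; pausing for human review."
#   break
# fi
```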
Cursor Integration for Review and Intervention
When the orchestration script flags a task as failed after exhausting retries, Cursor provides a natural interface for human intervention. Developers can open the failed task’s context in Cursor, use its AI-assisted review capabilities to diagnose the failure, fix or clarify the task specification, and requeue it. This keeps the human role focused on exception handling rather than routine oversight. Verify that your version of Cursor supports inline AI-assisted code review, as features may vary across releases.
Common Failure Modes and How to Diagnose Them
The Agent Rewrites Code It Should Not Touch
Every task file must include a “Files in Scope” section, and CLAUDE.md must maintain a “Do Not Touch” list. Without these constraints, the agent will modify files outside its task scope. The --allowedTools flag narrows which tools the agent can use, but it does not by itself limit which files those tools touch, so file scope has to be enforced separately. The orchestration script does this at commit time by staging only declared in-scope files rather than using git add -A.
Tests Pass but Behavior Is Wrong
The agent causes this when it modifies test files to match a broken implementation rather than fixing the implementation to match the tests. Fix it structurally: for implementation tasks, keep test files out of the agent’s allowed write scope. The task file’s constraints section should explicitly state that test modifications are not permitted.
Task Queue Stalls on a Single Failing Task
A task that fails repeatedly typically indicates under-specification or implicit dependencies on work that has not been completed yet. The orchestration script’s max retry limit combined with the move-to-failed-directory pattern prevents a single bad task from blocking the entire pipeline. A skip-and-continue option in the orchestration logic allows downstream tasks to proceed if they have no dependency on the stalled task.
Common Pitfalls
- The claude CLI will fail when rate-limited or when your account quota is exceeded. The orchestration script includes a retry delay between attempts but does not distinguish rate-limit errors from task failures. Monitor your usage during extended runs and increase RETRY_DELAY if you observe frequent quota-related failures.
- If ./tasks/pending/ contains no .md files, the script completes silently with no output. This is expected behavior, but confusing if you are not watching for it.
- git restore . only affects tracked files. The orchestration script uses git clean -fd alongside git restore . to also remove untracked files created by the agent during a failed attempt. Be aware that git clean -fd is destructive: it deletes every untracked, non-ignored file in the working tree, so run it only on a dedicated branch in a checkout that holds no uncommitted work you care about.
- Granting --allowedTools "Edit,Write,Bash" gives the agent the ability to run arbitrary shell commands. Consider the security implications for your environment and restrict further if needed. One approach: wrap the Bash tool behind an allowlist script that only permits a declared set of commands (e.g., npm, npx, node, git, cat, ls, grep, find).
- Each claude invocation is wrapped with timeout 300 (5 minutes). If a task routinely needs more time, increase this value. If the CLI hangs, the timeout ensures the orchestration loop eventually continues.
- Task files without a “Files in Scope” section will be moved to the failed directory without being attempted. The orchestration script parses this section to determine which files to stage.
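The allowlist idea from the list above can be sketched as a single check function. This is an illustration under stated assumptions, not a sandbox: is_allowed_command is a hypothetical helper, it inspects only the first word plus a few shell metacharacters, and allowed tools such as git or find can still execute other programs. How such a check is wired into the agent's tool execution depends on your setup and is not shown.

```bash
# Hypothetical allowlist check for a single command line.
# Returns 0 if the command may run, 1 otherwise.
is_allowed_command() {
  local cmd="$1" first nl
  nl='
'
  # Reject anything that could chain, substitute, or redirect commands.
  case "$cmd" in
    *\;*|*\&*|*\|*|*\<*|*\>*|*\`*|*\$*|*\(*|*\)*|*"$nl"*) return 1 ;;
  esac
  # Compare the first word against the permitted command names.
  first=${cmd%% *}
  case "$first" in
    npm|npx|node|git|cat|ls|grep|find) return 0 ;;
    *) return 1 ;;
  esac
}

# Example usage:
# is_allowed_command "npm test" && echo "permitted"
```

Rejecting metacharacters outright is deliberately blunt; it blocks some legitimate commands, which is usually the right trade-off for an unattended run.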
Key Takeaways: Building Systems, Not Prompts
What enables multi-day agent runs is not a better prompt. It is an engineering system that treats the AI agent as an unreliable worker requiring structure, validation, and persistent memory. The three core principles are atomic file-scoped tasks, persistent memory through CLAUDE.md, and obsessive checkpointing via git commits after every task.
Developers looking to adopt this approach should start with a single overnight run using the orchestration script pattern, validate the results in the morning, and iteratively extend. The Claude Code CLI documentation covers the specific flags and configuration options referenced throughout this system.

