January 30, 2026 · 45 min read · Engineering · AI Architecture · Deep Dive

Engineering Production-Grade Autonomous Agent Orchestration: A Complete Technical Deep Dive

How I built a multi-agent system with ephemeral subagents, file-based state persistence, SAFe methodology integration, and production deployment pipelines—validated against authoritative patterns from Anthropic, OpenAI, Microsoft, and Manus AI.

1. The Problem: Context Window Overflow

It started with a familiar frustration. I had built Nabster—my autonomous AI operations hub—to handle everything from X/Twitter management to candidate pipeline monitoring. It worked beautifully until I asked it to write code.

The problem wasn't capability. Claude Code is extraordinarily capable. The problem was context accumulation. Every file read, every command executed, every iteration on a bug—it all stacked up. Within a single coding session, I'd watch the context window fill: 50K tokens, 100K, 150K. Eventually, the dreaded error:

LLM request rejected: input length and max_tokens exceed context limit

This wasn't sustainable. I needed a coding agent that could:

  1. Work without filling up its own context window
  2. Survive session crashes and resume seamlessly
  3. Maintain state across ephemeral executions
  4. Integrate with proper product management workflows

The Core Insight: The solution wasn't to make agents smarter—it was to make them ephemeral. Spawn fresh, do work, persist state to files, terminate. The files become the memory. The agent becomes disposable.
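The spawn-work-persist-terminate lifecycle can be sketched as a short loop. This is a minimal illustration, not Nabster's actual code; the state file name and shape are hypothetical.

```python
import json
from pathlib import Path

STATE_FILE = Path("progress.json")  # hypothetical state file; the file IS the memory

def run_spawn(task: str) -> None:
    # Spawn fresh: rebuild all context from disk, never from prior memory.
    state = json.loads(STATE_FILE.read_text()) if STATE_FILE.exists() else {"completed": []}
    # Do one increment of work (stubbed here as recording the task).
    state["completed"].append(task)
    # Persist before terminating; the next spawn starts from this file.
    STATE_FILE.write_text(json.dumps(state, indent=2))

run_spawn("read story file")
run_spawn("implement middleware")
```

Two spawns share no memory, yet work accumulates because each one reads and rewrites the same file.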

2. Research Phase: Learning from the Giants

Before writing a single line of configuration, I dove deep into authoritative sources. If I was going to build production-grade agent orchestration, I needed to understand what the leading AI labs had learned.

Anthropic's Multi-Agent Research System

Anthropic published detailed documentation on their Claude Research feature—a multi-agent system that achieved a 90.2% performance improvement over single-agent approaches. Key patterns:

  • Orchestrator-worker pattern: A lead agent coordinates specialized subagents
  • Detailed task specifications: Each subagent needs objective, output format, tool guidance, and task boundaries
  • External memory persistence: Save plans to files before context exceeds limits
  • Lightweight references: Pass file paths between agents, not full content
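A detailed task specification covering those four elements might be passed to a subagent as structured data. The field names below are my own illustration, not Anthropic's schema.

```python
# Hypothetical subagent task spec covering objective, output format,
# tool guidance, and task boundaries.
task_spec = {
    "objective": "Survey rate-limiting approaches for Express APIs",
    "output_format": "Markdown summary written to research/rate-limits.md",
    "tool_guidance": ["web_search", "file_write"],   # tools the subagent may use
    "boundaries": "Research only; do not modify application code",
    "references": ["backlog/ready/STORY-001.json"],  # file paths, not full content
}
```

Note the last field: the orchestrator passes a lightweight file reference, so the subagent reads the story itself instead of receiving it inline.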

Anthropic's claude-progress.txt Pattern

Their documentation on long-running agents revealed an elegant pattern: the claude-progress.txt file.

“Imagine a software project staffed by engineers working in shifts, where each new engineer arrives with no memory of what happened on the previous shift.”

— Anthropic Engineering Blog

The solution: An Initializer agent creates the environment and a progress file. The Coding agent reads the progress file on every spawn, makes incremental changes, commits to git, and updates the progress file. Git history plus progress file equals recovery mechanism.
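In code, the shift handoff reduces to reading the note the previous spawn left and appending your own before you terminate. A sketch under stated assumptions: the file location is illustrative, and the git commit step is elided.

```python
from pathlib import Path

PROGRESS = Path("claude-progress.txt")  # hypothetical location

def start_shift() -> str:
    # A new engineer (spawn) arrives with no memory: read the hand-off note.
    return PROGRESS.read_text() if PROGRESS.exists() else "env initialized\n"

def end_shift(note: str) -> None:
    # Record what this shift accomplished; a git commit would follow here.
    PROGRESS.write_text(start_shift() + note + "\n")

end_shift("added rate limiter middleware; tests passing")
end_shift("fixed review feedback on header names")
```

Git history plus this file is the recovery mechanism: either one alone tells you what happened, together they tell you what to do next.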

Manus AI's Context Engineering

Manus AI published fascinating insights on attention manipulation. Their agents create and continuously update todo.md files—not just for organization, but as a deliberate mechanism to keep objectives in the model's recent attention span.

With an average of 50 tool calls per task, maintaining focus is critical. By “reciting” objectives at the end of the context, Manus reduces goal drift and misalignment. They also treat the file system as external memory—unlimited in size, persistent by nature, directly operable by the agent.

Interestingly, Manus evolved from a simple todo.md approach to a dedicated Planner + Executor architecture—finding that roughly 30% of actions were spent updating the todo list. This validated my own architectural instincts.

Microsoft Azure AI Agent Design Patterns

Microsoft's Azure Architecture Center provided a comprehensive taxonomy of orchestration patterns:

| Pattern | Use Case | Trade-offs |
|--------------|---------------------------------|--------------------------------|
| Sequential | Clear stage dependencies | Simple but higher latency |
| Concurrent | Independent subtasks | Higher throughput, needs merge |
| Handoff | Dynamic delegation | Flexible but risk of loops |
| Hierarchical | Manager coordinates specialists | Clear ownership, more hops |
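The sequential-versus-concurrent trade-off is easy to see in miniature: independent subtasks can be fanned out in parallel, at the cost of an explicit merge step. A toy sketch, not Azure's SDK; `run_agent` stands in for dispatching an ephemeral subagent.

```python
from concurrent.futures import ThreadPoolExecutor

def run_agent(subtask: str) -> str:
    # Stand-in for dispatching an ephemeral subagent and awaiting its result.
    return f"result for {subtask}"

subtasks = ["audit logging", "rate limiting", "input validation"]

# Concurrent pattern: higher throughput, but results must be merged.
with ThreadPoolExecutor() as pool:
    results = list(pool.map(run_agent, subtasks))

merged = "\n".join(results)  # the merge step the table warns about
```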

They also emphasized checkpoint features for recovery, circuit breakers to prevent cascading failures, and graceful degradation when agents fail.

SagaLLM: Academic Rigor

A VLDB 2025 paper on multi-agent coordination provided the final piece: structured handoffs.

“Free-text handoffs are the main source of context loss. Treat inter-agent transfer like a public API.”

— SagaLLM, VLDB 2025

The recommendation: Use JSON Schema-based structured outputs for all handoffs. Validate contract conformance, dependency satisfaction, and cross-agent consistency.
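The "public API" discipline can be approximated even without a full JSON Schema library: validate required fields and types before accepting a handoff, and reject anything malformed. A minimal stdlib-only sketch; the contract fields are illustrative.

```python
import json

# Minimal contract for a PM -> Coder story handoff (illustrative fields).
STORY_CONTRACT = {"story_id": str, "title": str, "acceptance_criteria": list}

def validate_handoff(payload: str) -> dict:
    # Treat the transfer like a public API: parse, then check the contract.
    story = json.loads(payload)
    for field, expected_type in STORY_CONTRACT.items():
        if not isinstance(story.get(field), expected_type):
            raise ValueError(f"handoff violates contract: {field}")
    return story

story = validate_handoff(json.dumps({
    "story_id": "STORY-001",
    "title": "Implement API Rate Limiter Middleware",
    "acceptance_criteria": ["allow <= 100 req/min", "429 above limit"],
}))
```

A real implementation would use full JSON Schema validation, but even this much catches the free-text drift the paper warns about.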

3. Architecture Evolution: From Dev to PM+Coder

My first design was “Dev Nabster”—a single coding agent with file-based state. It worked. Tests passed. Recovery from crashes succeeded. But something was fundamentally missing.

The problem: I was giving vague requirements and expecting perfect code. “Add a rate limiter” doesn't specify limits, storage mechanism, error responses, or header formats. The agent was guessing. Sometimes correctly, often not.

The solution emerged from product management principles: separate planning from execution.

The Two-Agent Architecture

Main Nabster (always running)
    │
    └── PM Nabster (ephemeral)
            │
            └── Coder Nabster (ephemeral)

PM Nabster owns requirements. It asks clarifying questions, creates dev-ready stories with acceptance criteria, reviews completed work, handles deployment, and verifies in production.

Coder Nabster owns implementation. It receives stories with clear criteria, implements exactly what's specified, writes tests, and submits for review. If rejected, it fixes and resubmits.

This mirrors Manus AI's evolution from todo.md to Planner+Executor. The separation isn't arbitrary—it's a recognition that planning and coding require different modes of thinking.

4. SAFe Methodology: WSJF and Track Selection

Not every request deserves the same process. A critical bug fix shouldn't go through the same ceremony as a multi-week feature. PM Nabster implements WSJF (Weighted Shortest Job First) from SAFe to intelligently route work.

The WSJF Assessment

WSJF SCORING:
- Business Value: [1-5]
- Time Criticality: [1-5]
- Risk/Opportunity: [1-5]
- Size: [XS/S/M/L/XL]

Score = (Value + Urgency + Risk) / Size
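As a function, the scoring is trivial; the only judgment call is mapping T-shirt sizes to a numeric divisor. The XS-to-XL mapping below is my assumption, chosen so that S = 2 matches the rate-limiter worked example later in this post.

```python
# Hypothetical size-to-divisor mapping; S = 2 matches the worked example.
SIZE_POINTS = {"XS": 1, "S": 2, "M": 3, "L": 4, "XL": 5}

def wsjf(value: int, urgency: int, risk: int, size: str) -> float:
    # WSJF = cost of delay / job size; higher scores get worked first.
    return (value + urgency + risk) / SIZE_POINTS[size]

print(wsjf(3, 2, 2, "S"))  # → 3.5, the rate limiter story's score
```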

Three Tracks

HOTFIX Track

Quick triage → Immediate coding → Fast review → Ship. For production fires and showstoppers.

STANDARD Track

Light refinement (2-3 questions) → Story with criteria → Code → Full review → Ship.

PROJECT Track

Discovery → Planning → Milestones → Stories → Review loops → Stakeholder check-ins.

This right-sizing ensures we're not over-engineering simple fixes or under-planning complex features.

5. File-Based State: The Memory Architecture

The key insight from all the research: agents are ephemeral, files are memory. Here's the complete state architecture:

PM Nabster's State Files

/home/clawdbot/pm-nabster/
├── SOUL.md              # Identity, principles, methodology
├── AGENTS.md            # Operating rules, protocols
├── progress.md          # Current state, file references
├── intake/              # Original requests (verbatim)
│   └── REQ-2026-01-30-001.json
├── sessions/            # Q&A history, decisions
│   └── 2026-01-30-REQ-001.md
├── backlog/
│   ├── ready/           # Stories ready for Coder
│   ├── in-progress/     # Stories being implemented
│   └── done/            # Completed with verification
├── checkpoints/         # Recovery snapshots
└── templates/           # Story, review templates

The Context Recovery Protocol

When PM Nabster spawns, it follows a strict protocol:

  1. Read SOUL.md (identity)
  2. Read progress.md (where am I?)
  3. If mid-task, read referenced files:
     • Intake file (original request)
     • Session file (Q&A history)
     • Story file (if exists)
  4. Continue from where the previous spawn stopped

The critical rule: Never restart from scratch. Never assume context. Always read the files.
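The protocol reduces to a deterministic read sequence. A sketch under stated assumptions: the `ref:` line convention for lightweight file references is hypothetical, and a temp directory stands in for the real workspace.

```python
import tempfile
from pathlib import Path

def recover_context(workspace: Path) -> dict:
    # Spawn protocol: identity first, then state; never assume prior context.
    ctx = {
        "soul": (workspace / "SOUL.md").read_text(),
        "progress": (workspace / "progress.md").read_text(),
    }
    # progress.md names the files for the in-flight task ("ref:" is a
    # hypothetical convention for lightweight file references).
    for line in ctx["progress"].splitlines():
        if line.startswith("ref:"):
            ref = line.removeprefix("ref:").strip()
            ctx[ref] = (workspace / ref).read_text()
    return ctx

# Demo workspace standing in for /home/clawdbot/pm-nabster/
ws = Path(tempfile.mkdtemp())
(ws / "SOUL.md").write_text("PM identity")
(ws / "progress.md").write_text("mid-task\nref: session.md\n")
(ws / "session.md").write_text("Q&A history")

ctx = recover_context(ws)
```

Everything the spawn knows comes out of `ctx`; nothing is carried over from the previous process.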

Attention Management

Following Manus AI's pattern, both agents implement attention management:

Before any major action:

1. Re-read current objectives from progress.md
2. Explicitly state: "Current objective: [X]. Next action: [Y]"
3. This keeps goals in recent attention and prevents drift

6. Production Deployment Pipeline

A critical realization: “code complete” is not “done.” PM Nabster owns the full lifecycle:

  1. Commit
  2. Push
  3. Build
  4. Deploy
  5. Verify

Production Verification Protocol

This is where most automation stops—and where we go further. PM Nabster actually hits the live endpoints to verify functionality:

# For a rate limiter feature:

# Test 1: Endpoint responds
curl -I https://production.url/api/hello
# Verify: HTTP 200, X-RateLimit headers present

# Test 2: Rate limiting works
for i in {1..105}; do
  STATUS=$(curl -s -o /dev/null -w "%{http_code}" ...)
  echo "Request $i: $STATUS"
done
# Verify: Requests 1-100 return 200
# Verify: Requests 101+ return 429

All verification results are documented with evidence:

{
  "story_id": "STORY-001",
  "verification_status": "PASSED",
  "test_results": {
    "endpoint_works": "PASSED - HTTP 200",
    "rate_headers_present": "PASSED - X-RateLimit-*",
    "allows_100_requests": "PASSED - 100x HTTP 200",
    "blocks_101_plus": "PASSED - 5x HTTP 429"
  },
  "production_url": "https://...",
  "commit": "972a8b1"
}

7. Live Testing: Rate Limiter End-to-End

Theory is nothing without practice. Here's the complete flow from a real test:

The Request

“Add a rate limiter to the API”

PM Nabster's WSJF Assessment

Business Value: 3 (security/stability)
Time Criticality: 2 (not urgent)
Risk/Opportunity: 2 (prevents abuse)
Size: S (middleware pattern)

Score: (3+2+2)/2 = 3.5 → STANDARD Track

Clarifying Questions

PM Nabster asked:

  • Is in-memory storage acceptable, or do you need Redis/persistent storage?
  • What's the rate limit? (requests per minute per IP)
  • Is Express.js acceptable for the example server?

Stakeholder answers: In-memory is fine. 100 requests/minute/IP. Express is fine.

The Story

STORY-001: Implement API Rate Limiter Middleware

Acceptance Criteria:
1. Given a client IP makes requests,
   When count <= 100 in last minute,
   Then request should be allowed

2. Given a client IP has made 100 requests,
   When they make another,
   Then return HTTP 429 with retryAfter

3. Given a client was rate limited,
   When 1 minute passes,
   Then their count resets

4. Given multiple client IPs,
   When each makes requests,
   Then each has independent limit

5. Given any request,
   Then X-RateLimit-Remaining header included

Coder Nabster's Implementation

Coder received the story and produced:

  • src/middleware/rateLimiter.js - Core middleware
  • src/middleware/rateLimiter.test.js - 26 comprehensive tests
  • src/server.js - Example Express server

PM Review

PM Nabster verified:

  • ✓ All 5 acceptance criteria met
  • ✓ All 4 edge cases handled
  • ✓ 26 tests passing
  • ✓ No security issues
  • ✓ VERDICT: APPROVED

Production Verification

Server started, ngrok tunnel created, PM hit the live endpoint:

# Request 1
HTTP/2 200
x-ratelimit-remaining: 99

# Request 100
HTTP/2 200
x-ratelimit-remaining: 0

# Request 101
HTTP/2 429
{"error":"Too Many Requests","retryAfter":12}

Result: STORY-001 deployed and verified in production. All acceptance criteria confirmed working on live infrastructure.

8. Best Practices Audit

After building the system, I audited it against the authoritative sources. The alignment was strong:

| Best Practice | Source | Our Implementation |
|--------------------------------|---------------------|---------------------------------|
| Orchestrator-worker pattern | Anthropic | ✓ Main → PM → Coder |
| File-based state persistence | Anthropic, Manus | ✓ intake/, sessions/, backlog/ |
| Planner + Executor separation | Manus | ✓ PM + Coder |
| Attention management (todo.md) | Manus | ✓ progress.md + recitation |
| Structured JSON handoffs | SagaLLM, OpenAI | ✓ Story JSON with schema |
| Circuit breakers | Microsoft | ✓ 3-failure escalation |
| Checkpointing | Microsoft | ✓ checkpoints/ directory |
| Graceful degradation | Microsoft, Anthropic | ✓ Failure protocols |

Based on the audit, I added improvements: explicit attention management instructions, circuit breaker rules for repeated failures, schema validation requirements, and context limit awareness.
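The circuit-breaker rule can be as simple as a failure counter that escalates instead of retrying forever. A minimal sketch of the 3-failure escalation; the class and method names are my own.

```python
class CircuitBreaker:
    """Stop retrying after repeated failures and escalate to the parent agent."""

    def __init__(self, threshold: int = 3):  # three failures, per our rule
        self.threshold = threshold
        self.failures = 0

    def record(self, success: bool) -> str:
        if success:
            self.failures = 0            # any success closes the circuit
            return "ok"
        self.failures += 1
        if self.failures >= self.threshold:
            # Escalate rather than looping on a broken task and burning context.
            return "escalate"
        return "retry"

cb = CircuitBreaker()
```

The point is not sophistication but a hard stop: an agent that fails the same step three times hands the problem up the hierarchy instead of cascading.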

9. The Agent Creation Standard

With the patterns proven, I needed to ensure future agents would follow the same rigor. I created a comprehensive standard document at:

/home/clawdbot/nabster/standards/AGENT-CREATION-STANDARD.md

The standard mandates:

Required Workspace Structure

/home/clawdbot/[agent-name]/
├── SOUL.md              # Identity (REQUIRED)
├── AGENTS.md            # Operating rules (REQUIRED)
├── progress.md          # State for continuity (REQUIRED)
├── checkpoints/         # Recovery snapshots
└── [domain-specific]/   # Role-specific directories

Required SOUL.md Sections

  • Identity statement (who, what, NOT what)
  • Core principles (3-5)
  • Hierarchy position
  • Critical rules

Required AGENTS.md Protocols

  • On Every Spawn protocol
  • Attention management
  • Failure & recovery protocol
  • Context recovery protocol
  • Autonomy levels
  • Circuit breaker rules

Testing Requirements

Before deploying any new agent:

  • Fresh spawn test: Verify it reads SOUL.md, writes progress.md
  • Recovery test: Set mid-task state, verify continuation
  • Failure test: Create blocker, verify graceful handling
  • Handoff test: Verify context transfers correctly

10. The Chain Rule: Ensuring Compliance Forever

A standard is useless if agents don't follow it. The question: how do we guarantee every future agent uses the standard?

The answer: The Chain Rule. Every agent that spawns another agent must include this in the spawn prompt:

STANDING RULE: If you ever create a NEW agent type, you MUST first read
/home/clawdbot/nabster/standards/AGENT-CREATION-STANDARD.md and follow it
completely. Pass this rule to any agent you spawn.

This creates an unbroken chain:

Main Nabster spawns PM Nabster
    ↓ includes standing rule
PM Nabster spawns Coder Nabster
    ↓ includes standing rule
Coder Nabster spawns [future agent]
    ↓ includes standing rule
...forever
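Mechanically, the chain is just prompt concatenation at every spawn: the rule text rides along with every task. A sketch; the `build_spawn_prompt` helper is hypothetical, the rule text is quoted from above.

```python
STANDING_RULE = (
    "STANDING RULE: If you ever create a NEW agent type, you MUST first read "
    "/home/clawdbot/nabster/standards/AGENT-CREATION-STANDARD.md and follow it "
    "completely. Pass this rule to any agent you spawn."
)

def build_spawn_prompt(task_prompt: str) -> str:
    # Every spawn prompt carries the rule, so the chain never breaks.
    return f"{task_prompt}\n\n{STANDING_RULE}"

prompt = build_spawn_prompt("Implement STORY-001 per the acceptance criteria.")
```

Because the rule instructs each agent to pass itself along, one concatenation per spawn is enough to keep the invariant alive indefinitely.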

Reinforcement Points

The rule is embedded in multiple places:

  • Main Nabster's SOUL.md (Principle #7)
  • Main Nabster's AGENTS.md (explicit section)
  • Main Nabster's MEMORY.md (Standing Rules)
  • PM Nabster's SOUL.md (Critical Rule #6)
  • PM Nabster's AGENTS.md (Standing Rule section)
  • Coder Nabster's AGENTS.md (Standing Rule section)
  • The standard document itself

Weekly Audit

Every Saturday, Main Nabster performs an audit: verify all registered agents have the required files and sections. Any gaps are reported and fixed.

## Weekly Agent Audit Report - [Date]

| Agent        | SOUL.md | AGENTS.md | progress.md | Standard |
|--------------|---------|-----------|-------------|----------|
| PM Nabster   | ✓       | ✓         | ✓           | ✓        |
| Coder Nabster| ✓       | ✓         | ✓           | ✓        |
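The audit itself is little more than a per-agent file-existence check. A sketch; a temp directory stands in for the real workspaces, and the required-file list comes from the standard above.

```python
import tempfile
from pathlib import Path

REQUIRED = ("SOUL.md", "AGENTS.md", "progress.md")

def audit_agent(workspace: Path) -> dict[str, bool]:
    # An agent is compliant when every required file exists in its workspace.
    return {name: (workspace / name).exists() for name in REQUIRED}

# Demo workspace standing in for a registered agent's directory.
ws = Path(tempfile.mkdtemp())
for name in ("SOUL.md", "AGENTS.md"):
    (ws / name).write_text("stub")

report = audit_agent(ws)  # progress.md missing -> flagged in the report
```

A fuller audit would also grep each file for the required sections, but even this catches the most common failure: an agent created without the standard's skeleton.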

11. Conclusion: What We Built

In a single session, we architected and implemented a production-grade multi-agent orchestration system:

1. Ephemeral Agents with File-Based Memory
   Agents spawn fresh, read state from files, persist before termination. Context dies, memory lives.

2. PM + Coder Separation
   Planning and execution as distinct roles. Clear handoffs via structured JSON stories.

3. SAFe Methodology Integration
   WSJF scoring routes work to appropriate tracks. Right-sized process for every request.

4. Full Deployment Pipeline
   Commit, push, build, deploy, verify in production. Evidence documented.

5. Agent Creation Standard
   Templates, checklists, and the Chain Rule ensure every future agent follows the patterns.

The system is now live. I can ask Nabster to build any feature, and it flows through PM for refinement, to Coder for implementation, back to PM for review, through deployment, and into production with verification.

More importantly, the patterns are documented and enforced. This isn't a one-off solution—it's infrastructure for building reliable autonomous systems at scale.

The Meta-Lesson

The best AI systems aren't the ones with the most capabilities—they're the ones with the clearest boundaries. By making agents ephemeral and files permanent, by separating planning from execution, by right-sizing process to complexity, we built something that's both powerful and predictable. That's the goal.

Sources & References