Auto-Self-Healing Agents
Repo: mohd-vasim/ai-engineering → auto-self-healing-agents
Long-running agent systems crash. A worker hits an unhandled exception, the process dies, and any workflow that depends on it stalls until an engineer notices. In production, that means paged alerts at 3am for bugs that could be resolved in seconds. The Auto-Healing Agent Resuscitation pattern solves this with an external supervisor that watches workers through heartbeats and restarts the ones that go quiet.
The Problem
When agents are deployed as persistent processes — daemons, long-lived workers, microservices — they share the same failure modes as any other long-running process: unhandled exceptions, OOM kills, dependency corruption, segfaults. The agent's process disappears, and the system needs it back.
Without an automated recovery layer:
- The agent stays offline until an operator notices and intervenes
- Workflows that depend on the agent stall, breaking SLAs
- On-call engineers get paged for problems a script could fix in two lines
Naive fixes don't work either. A tight `while True: agent.start()` restart loop burns CPU and floods the logs when an agent has a persistent bug. A restart that loses state is just a different bug.
The Pattern: Auto-Healing Agent Resuscitation
An external supervisor runs alongside the worker pool and watches each worker's health through a heartbeat — a timestamp the worker writes to shared state (Redis) on a fixed cadence. The supervisor compares now - last_heartbeat against a timeout window. If a worker's heartbeat goes stale, the supervisor restarts it from its original definition.
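The staleness comparison described above can be sketched in a few lines. This is a minimal illustration, not the repo's code: the name `is_agent_healthy` appears in the project, but the signature, the injectable `now` argument, and the "no heartbeat means unhealthy" rule are assumptions made here.

```python
import time
from typing import Optional


def is_agent_healthy(
    last_heartbeat: Optional[float],
    timeout_seconds: float,
    now: Optional[float] = None,
) -> bool:
    """Return True if the last heartbeat falls inside the timeout window."""
    if last_heartbeat is None:
        # Assumption: a worker that never reported is treated as down.
        return False
    if now is None:
        now = time.time()
    return (now - last_heartbeat) <= timeout_seconds


# Heartbeat 3 seconds old with a 5-second window: still healthy.
print(is_agent_healthy(last_heartbeat=100.0, timeout_seconds=5.0, now=103.0))  # True
# Heartbeat 10 seconds old with the same window: stale.
print(is_agent_healthy(last_heartbeat=100.0, timeout_seconds=5.0, now=110.0))  # False
```

Passing `now` explicitly keeps the check deterministic in tests; in a running supervisor it would default to the wall clock.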
Three things make this work in practice:
- Indirect observation: the supervisor never calls into the worker. It only reads heartbeats from Redis. This means a crashed worker process (which can't answer calls anyway) is still observable.
- Crash-loop backoff: if an agent keeps crashing, the supervisor waits progressively longer between restart attempts. The first failure restarts immediately, the second waits `base × factor¹`, the third waits `base × factor²`, capped at `max_wait`. This prevents resource exhaustion while still allowing recovery from transient failures.
- Pluggable restart backend: the actual restart mechanism is abstracted. In this implementation, it's a fresh in-process thread. In production, it could be a subprocess, a Kubernetes liveness probe, or an ECS task restart.
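The backoff schedule in the list above can be written out as a small class. This is a sketch under assumptions: `wait_seconds(consecutive_failures)` and `max_wait` are named in the project, while the default values for `base`, `factor`, and `max_wait` below are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Backoff:
    base: float = 1.0       # seconds; hypothetical default
    factor: float = 2.0     # hypothetical default
    max_wait: float = 60.0  # seconds; hypothetical default

    def wait_seconds(self, consecutive_failures: int) -> float:
        """First failure restarts immediately; failure n waits base * factor**(n-1), capped."""
        if consecutive_failures <= 1:
            return 0.0
        wait = self.base
        # Multiply step by step and stop at the cap, so very large failure
        # counts cannot overflow a float.
        for _ in range(consecutive_failures - 1):
            wait *= self.factor
            if wait >= self.max_wait:
                return self.max_wait
        return wait


backoff = Backoff()
print([backoff.wait_seconds(n) for n in range(1, 9)])
# [0.0, 2.0, 4.0, 8.0, 16.0, 32.0, 60.0, 60.0]
```

With these defaults the seventh consecutive failure would wait 64 seconds uncapped, so the cap takes over from that point.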
The Components
| Component | Responsibility |
|---|---|
| Supervisor | Owns the monitoring loop, calls is_agent_healthy for each worker, decides when to restart. |
| HeartbeatStore | Typed wrapper around Redis — reads/writes Heartbeat and HealthState records, maintains a capped action log. |
| Backoff | Computes wait_seconds(consecutive_failures) with exponential growth and a max_wait cap. |
| RestartBackend | Pluggable strategy for the actual restart — in-process thread here, could be subprocess, container, or K8s liveness probe in production. |
| WorkerAgent | Long-lived worker that emits heartbeats and exposes a liveness probe. The demo ships a DataProcessingAgent that can be configured to crash deterministically or randomly. |
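One way to express the pluggable `RestartBackend` row is a small protocol with an in-process thread implementation. The method name `restart`, its parameters, and the `ThreadRestartBackend` class are hypothetical; the project may shape this interface differently.

```python
import threading
from typing import Callable, Dict, Protocol


class RestartBackend(Protocol):
    """Anything that can bring a worker back, given its id and entry point."""

    def restart(self, agent_id: str, target: Callable[[], None]) -> None: ...


class ThreadRestartBackend:
    """Restarts a worker by running its entry point in a fresh daemon thread."""

    def __init__(self) -> None:
        self.threads: Dict[str, threading.Thread] = {}

    def restart(self, agent_id: str, target: Callable[[], None]) -> None:
        thread = threading.Thread(target=target, name=f"worker-{agent_id}", daemon=True)
        self.threads[agent_id] = thread
        thread.start()


# Usage: the "worker" here just sets an event so the restart is observable.
started = threading.Event()
backend = ThreadRestartBackend()
backend.restart("agent-1", started.set)
started.wait(timeout=1.0)
print(started.is_set())  # True
```

A subprocess or container backend would implement the same `restart` method, so the supervisor would not need to know which one it is calling.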
The state in Redis is small and typed — every entry is a Pydantic model (Heartbeat, HealthState, SupervisorAction). Heartbeats are written by workers, health state by the supervisor, and the action log is a Redis list capped at 100 entries via LPUSH + LTRIM.
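The capped action log relies on a common Redis idiom: push the newest entry onto the head of the list, then trim to the first 100 positions. The sketch below uses a tiny in-memory stand-in (`FakeRedis`, hypothetical) that mimics only the `LPUSH` and `LTRIM` behaviour needed here with non-negative indexes, so it runs without a server. The key name and the `append_action` helper are also assumptions.

```python
from typing import Dict, List

MAX_LOG_ENTRIES = 100
ACTION_LOG_KEY = "supervisor:actions"  # hypothetical key name


class FakeRedis:
    """In-memory stand-in covering only the list semantics used below."""

    def __init__(self) -> None:
        self.lists: Dict[str, List[str]] = {}

    def lpush(self, key: str, *values: str) -> int:
        items = self.lists.setdefault(key, [])
        for value in values:
            items.insert(0, value)  # LPUSH adds at the head of the list
        return len(items)

    def ltrim(self, key: str, start: int, stop: int) -> None:
        # LTRIM keeps the inclusive range [start, stop]; non-negative indexes only here.
        self.lists[key] = self.lists.get(key, [])[start : stop + 1]


def append_action(client: FakeRedis, entry: str) -> None:
    """Record the newest action and drop anything beyond the cap."""
    client.lpush(ACTION_LOG_KEY, entry)
    client.ltrim(ACTION_LOG_KEY, 0, MAX_LOG_ENTRIES - 1)


client = FakeRedis()
for i in range(250):
    append_action(client, f"action-{i}")

log = client.lists[ACTION_LOG_KEY]
print(len(log))  # 100
print(log[0])    # action-249
print(log[-1])   # action-150
```

Because the trim runs on every write, the list never grows past the cap, and the newest entry is always at index 0.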
The 5-Step Recovery Flow
A full cycle runs from healthy operation through crash, detection, and recovery. The 5-step sequence diagram is in docs/ARCHITECTURE.md.
The whole cycle runs without human intervention. The force_crash(agent_id) method on the supervisor is purely a demo helper — production crashes come from the worker's own unhandled exceptions.
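To make the cycle concrete, here is a deterministic simulation of one plausible reading of it: a worker heartbeats, crashes, the supervisor notices the stale heartbeat, and a restart brings heartbeats back. Everything here is a simplified assumption (the `SupervisorSketch` class, the in-memory heartbeat dict standing in for Redis, the tick-based clock), and backoff is left out to keep the flow visible.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class SupervisorSketch:
    timeout_seconds: float
    restart: Callable[[str], None]
    # In-memory stand-in for the heartbeat timestamps kept in Redis.
    heartbeats: Dict[str, float] = field(default_factory=dict)
    # (time, agent_id) pairs, one per restart the supervisor triggered.
    actions: List[Tuple[float, str]] = field(default_factory=list)

    def check_and_resuscitate(self, now: float) -> None:
        for agent_id, last_seen in list(self.heartbeats.items()):
            if now - last_seen > self.timeout_seconds:
                self.actions.append((now, agent_id))
                self.restart(agent_id)


alive = {"agent-1": True}


def restart(agent_id: str) -> None:
    # A real backend would start a new worker; here we just flip a flag.
    alive[agent_id] = True


supervisor = SupervisorSketch(timeout_seconds=3.0, restart=restart)

for tick in range(10):
    now = float(tick)
    if tick == 2:
        alive["agent-1"] = False  # simulated crash: heartbeats stop
    if alive["agent-1"]:
        supervisor.heartbeats["agent-1"] = now  # worker writes its heartbeat
    supervisor.check_and_resuscitate(now)

print(supervisor.actions)  # [(5.0, 'agent-1')]
```

The last heartbeat lands at tick 1, so the gap first exceeds the 3-second window at tick 5, which is when the single restart fires.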
Key Design Decisions
Why heartbeats in Redis instead of in-process function calls? A crashed process can't answer function calls. The supervisor has to observe workers through shared state that survives the worker's death. Redis gives the supervisor a place to read "this worker last reported healthy at T" even after the worker's process is gone.
Why exponential backoff? A tight restart loop is worse than no auto-healing at all. If an agent has a deterministic crash bug, restarting it every 100ms burns CPU, floods logs, and never recovers. Exponential backoff gives transient failures room to clear while preventing persistent bugs from causing a self-inflicted outage.
Why a separate supervisor process? The supervisor must outlive any worker it manages. If the supervisor runs in the same process as the workers, a process-wide crash takes both down. The supervisor needs its own lifecycle.
Why cap the restart count? Even with backoff, an agent that keeps crashing has a bug. Pair max_restarts with an alert (PagerDuty, Slack) so the system doesn't silently thrash forever on a code defect the auto-healer can't fix.
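A restart cap can be as simple as a counter check before each restart. In this sketch, `handle_failure` and the `restart` and `alert` callbacks are hypothetical names; a real alert callback would post to PagerDuty or Slack rather than append to a list.

```python
from typing import Callable, List


def handle_failure(
    agent_id: str,
    restart_count: int,
    max_restarts: int,
    restart: Callable[[str], None],
    alert: Callable[[str], None],
) -> int:
    """Restart the agent or escalate; returns the updated restart count."""
    if restart_count >= max_restarts:
        # The cap is reached: stop restarting and hand the problem to a human.
        alert(f"{agent_id} reached max_restarts={max_restarts}; manual fix needed")
        return restart_count
    restart(agent_id)
    return restart_count + 1


restarted: List[str] = []
alerts: List[str] = []

count = 0
for _ in range(5):  # five consecutive failures of the same agent
    count = handle_failure("agent-1", count, 3, restarted.append, alerts.append)

print(len(restarted))  # 3
print(len(alerts))     # 2
```

This version alerts on every failure past the cap; a production version would likely deduplicate so a single defect produces a single page.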
Why typed Pydantic models for heartbeats? A heartbeat is a contract. If a worker writes {"ts": 1234} and the supervisor reads {"timestamp": 1234}, the supervisor silently treats every heartbeat as malformed. Pydantic turns that into an explicit validation error at the boundary.
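The field-name mismatch above is easy to reproduce with Pydantic v2. The `Heartbeat` model below is a guess at the shape (the `agent_id` and `timestamp` fields are assumptions, not the repo's schema); the point is only that a payload using `ts` fails loudly at validation time.

```python
from pydantic import BaseModel, ValidationError


class Heartbeat(BaseModel):
    # Hypothetical fields; the project's real model may carry more.
    agent_id: str
    timestamp: float


# A payload that honours the contract validates cleanly.
good = Heartbeat.model_validate({"agent_id": "agent-1", "timestamp": 1234})
print(good.timestamp)  # 1234.0

# A payload that uses "ts" instead of "timestamp" is rejected at the boundary.
missing = []
try:
    Heartbeat.model_validate({"agent_id": "agent-1", "ts": 1234})
except ValidationError as exc:
    missing = [err["loc"] for err in exc.errors()]

print(missing)  # [('timestamp',)]
```

By default Pydantic ignores the unknown `ts` key and reports the required `timestamp` field as missing, which is exactly the signal the supervisor needs.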
Tech Stack
- LangChain: agent abstractions (`WorkerAgent` interface) for pluggable agent implementations.
- Pydantic v2: typed `Heartbeat`, `HealthState`, `SupervisorAction` models for everything that crosses the Redis boundary.
- Upstash Redis: shared state store for heartbeats, health state, and the action log (`LPUSH` + `LTRIM`).
- Streamlit: interactive dashboard with agent cards, an action table, and Plotly metrics charts for the restart counts.
- MLflow: `@mlflow.trace` decorators on `is_agent_healthy` and `check_and_resuscitate`, plus `log_metric` for failure and restart counters.
How to Run
```bash
git clone https://github.com/mohd-vasim/ai-engineering.git
cd ai-engineering/auto-self-healing-agents
uv sync

# .env
UPSTASH_REDIS_REST_URL=...
UPSTASH_REDIS_REST_TOKEN=...
MLFLOW_TRACKING_URI=databricks
DATABRICKS_TOKEN=...
DATABRICKS_HOST=...

# Run the Streamlit dashboard
uv run streamlit run app/streamlit_app.py
```

In the dashboard you can:
- Spawn agents with configurable `crash_after_n_heartbeats` and `crash_probability`
- Click "Force Crash" on any agent and watch the supervisor detect the gap, apply backoff, and resuscitate
- Inspect the per-agent health state and the full action log
What's in the Repo
- `app/core/supervisor.py`: the `Supervisor`, `DataProcessingAgent`, `Backoff`, and `check_and_resuscitate` logic.
- `app/core/redis_store.py`: the `HeartbeatStore` wrapper around Upstash Redis with typed Pydantic models.
- `app/streamlit_app.py` + `app/components/`: the interactive dashboard (agent cards, action table, metrics charts).
- `docs/ARCHITECTURE.md`: the full Mermaid diagram set (system context, 5-step sequence, monitoring state machine, component diagram, heartbeat sequence, backoff timeline, Redis key layout, and failure/recovery state machine).
- `PLAN.md`: the build plan and progress tracker.
- `tests/`: pytest suite covering health checks, backoff calculation, supervisor recovery, and the demo flow.
Limitations & What's Next
This implementation handles process-level recovery only. State persistence across restarts (so a restarted agent picks up where the previous one left off) is a separate pattern — Incremental Checkpointing — and is explicitly out of scope.
The supervisor itself is a single point of failure. For production HA, run multiple supervisors with leader election (etcd, ZooKeeper).
Possible next iterations:
- Incremental checkpointing: persist agent state to Redis on each step, restore on restart.
- Distributed supervision: leader election across multiple supervisor instances.
- Container orchestrator backend: Kubernetes liveness probe + restart policy as the `RestartBackend`.
- Alerting: PagerDuty / Slack notifications when an agent exceeds `max_restarts`.
- Metrics & dashboards: Prometheus / OpenTelemetry integration for the failure and restart counters.