Auto-Self-Healing Agents

Tags: multi-agent, self-healing, supervisor, langchain, streamlit, redis

Repo: mohd-vasim/ai-engineering → auto-self-healing-agents

Long-running agent systems crash. A worker hits an unhandled exception, the process dies, and any workflow that depends on it stalls until an engineer notices. In production, that means paged alerts at 3am for bugs that could be resolved in seconds. The Auto-Healing Agent Resuscitation pattern solves this with an external supervisor that watches workers through heartbeats and restarts the ones that go quiet.

The Problem

When agents are deployed as persistent processes — daemons, long-lived workers, microservices — they share the same failure modes as any other long-running process: unhandled exceptions, OOM kills, dependency corruption, segfaults. The agent's process disappears, and the system needs it back.

Without an automated recovery layer:

  • The agent stays offline until an operator notices and intervenes
  • Workflows that depend on the agent stall, breaking SLAs
  • On-call engineers get paged for problems a script could fix in two lines

Naive fixes don't work either. A tight while True: agent.start() restart loop burns CPU and floods the logs when an agent has a persistent bug, and a restart that loses state is just a different bug.

The Pattern: Auto-Healing Agent Resuscitation

An external supervisor runs alongside the worker pool and watches each worker's health through a heartbeat — a timestamp the worker writes to shared state (Redis) on a fixed cadence. The supervisor compares now - last_heartbeat against a timeout window. If a worker's heartbeat goes stale, the supervisor restarts it from its original definition.
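
The staleness comparison can be sketched in a few lines. This is a minimal illustration, not the repository's code: the Heartbeat fields, the function name is_heartbeat_fresh, and the timeout values are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Heartbeat:
    # Hypothetical shape: the real model may carry more fields.
    agent_id: str
    timestamp: float  # Unix seconds at which the worker last reported in


def is_heartbeat_fresh(hb: Optional[Heartbeat], now: float, timeout_s: float) -> bool:
    """Return True when a heartbeat exists and is no older than the timeout window."""
    if hb is None:  # the worker never reported, or the key is missing
        return False
    return (now - hb.timestamp) <= timeout_s


hb = Heartbeat("DataProcessor-1", timestamp=100.0)
print(is_heartbeat_fresh(hb, now=101.0, timeout_s=10.0))  # age 1s, inside the window
print(is_heartbeat_fresh(hb, now=130.0, timeout_s=10.0))  # age 30s, stale
```

Passing now in as an argument, rather than reading the clock inside the function, keeps the check deterministic and easy to test.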

Three things make this work in practice:

  • Indirect observation — the supervisor never calls into the worker. It only reads heartbeats from Redis. This means a crashed worker process (which can't answer calls anyway) is still observable.
  • Crash-loop backoff — if an agent keeps crashing, the supervisor waits progressively longer between restart attempts. The first failure restarts immediately, the second waits base × factor¹, the third waits base × factor², capped at max_wait. This prevents resource exhaustion while still allowing recovery from transient failures.
  • Pluggable restart backend — the actual restart mechanism is abstracted. In this implementation, it's a fresh in-process thread. In production, it could be a subprocess, an ECS task restart, or a Kubernetes container restart triggered by a failing liveness probe.
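
The backoff schedule described above can be written as one small function. This is a sketch under stated assumptions: the name wait_seconds comes from the component list, but the default values for base, factor, and max_wait are illustrative and not taken from the repository.

```python
def wait_seconds(
    consecutive_failures: int,
    base: float = 1.0,       # assumed default, in seconds
    factor: float = 2.0,     # assumed default
    max_wait: float = 60.0,  # assumed default, in seconds
) -> float:
    """Seconds to wait before the next restart attempt."""
    if consecutive_failures <= 1:
        return 0.0  # the first failure restarts immediately
    # The n-th consecutive failure waits base * factor**(n - 1), capped at max_wait.
    return min(base * factor ** (consecutive_failures - 1), max_wait)


print([wait_seconds(n) for n in range(1, 8)])
# [0.0, 2.0, 4.0, 8.0, 16.0, 32.0, 60.0]
```

With these assumed defaults the seventh consecutive failure would wait 64 seconds uncapped, so the cap is what holds it at 60.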

[Figure: System Context, Supervisor Observes Workers Indirectly. Workers and supervisor never talk directly; all coordination goes through Redis heartbeats. The worker pool (DataProcessor-1 through DataProcessor-N, long-lived threads) writes heartbeats to Upstash Redis under ash:hb:{agent_id}; the supervisor process (monitoring loop, decision logic, exponential backoff) reads them, keeps health state under ash:state:{agent_id}, and restarts unhealthy workers; traces and restart metrics go to MLflow on Databricks via @mlflow.trace spans and log_metric counters.]

The Components

| Component | Responsibility |
|---|---|
| Supervisor | Owns the monitoring loop, calls is_agent_healthy for each worker, and decides when to restart. |
| HeartbeatStore | Typed wrapper around Redis: reads and writes Heartbeat and HealthState records and maintains a capped action log. |
| Backoff | Computes wait_seconds(consecutive_failures) with exponential growth and a max_wait cap. |
| RestartBackend | Pluggable strategy for the actual restart: an in-process thread here; in production it could be a subprocess, a container restart, or a Kubernetes liveness probe paired with a restart policy. |
| WorkerAgent | Long-lived worker that emits heartbeats and exposes a liveness probe. The demo ships a DataProcessingAgent that can be configured to crash deterministically or randomly. |
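
One way to express the pluggable restart strategy is a small Protocol with a thread-based implementation. This is a hedged sketch: the restart method, its signature, and the ThreadRestartBackend class are hypothetical names chosen for illustration, not the repository's interface.

```python
import threading
from typing import Callable, Dict, Protocol


class RestartBackend(Protocol):
    """Anything that can bring a worker back, given a callable that runs it."""

    def restart(self, agent_id: str, run: Callable[[], None]) -> None: ...


class ThreadRestartBackend:
    """Restarts a worker by running its entry point in a fresh daemon thread."""

    def __init__(self) -> None:
        self.threads: Dict[str, threading.Thread] = {}

    def restart(self, agent_id: str, run: Callable[[], None]) -> None:
        thread = threading.Thread(target=run, name=agent_id, daemon=True)
        self.threads[agent_id] = thread  # replaces the reference to the dead thread
        thread.start()


started = threading.Event()
backend: RestartBackend = ThreadRestartBackend()
backend.restart("DataProcessor-1", started.set)  # the "worker" here just sets a flag
print(started.wait(timeout=2.0))
```

A subprocess or container backend would implement the same restart method, so the supervisor's decision logic does not need to change when the mechanism does.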

The state in Redis is small and typed — every entry is a Pydantic model (Heartbeat, HealthState, SupervisorAction). Heartbeats are written by workers, health state by the supervisor, and the action log is a Redis list capped at 100 entries via LPUSH + LTRIM.
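
The LPUSH + LTRIM idiom for a capped log can be sketched as follows. To keep the example runnable without a Redis server, it uses a tiny in-memory stand-in that mimics the lpush, ltrim, and lrange methods found on redis-py style clients; the key name ash:actions and the helper append_action are assumptions, not names from the repository.

```python
import json

ACTION_LOG_KEY = "ash:actions"  # hypothetical key name
MAX_ACTIONS = 100               # cap stated in the text


def append_action(client, action: dict) -> None:
    """Push the newest entry to the head, then trim the list to the cap."""
    client.lpush(ACTION_LOG_KEY, json.dumps(action))
    client.ltrim(ACTION_LOG_KEY, 0, MAX_ACTIONS - 1)


class InMemoryRedis:
    """Minimal stand-in for a Redis client; handles non-negative indices only."""

    def __init__(self) -> None:
        self.lists = {}

    def lpush(self, name, *values):
        items = self.lists.setdefault(name, [])
        for value in values:
            items.insert(0, value)  # LPUSH prepends
        return len(items)

    def ltrim(self, name, start, end):
        self.lists[name] = self.lists.get(name, [])[start:end + 1]  # inclusive end
        return True

    def lrange(self, name, start, end):
        return self.lists.get(name, [])[start:end + 1]


client = InMemoryRedis()
for seq in range(150):
    append_action(client, {"seq": seq, "action": "restart"})

log = client.lrange(ACTION_LOG_KEY, 0, MAX_ACTIONS - 1)
print(len(log), json.loads(log[0])["seq"], json.loads(log[-1])["seq"])
# 100 149 50
```

Because the trim runs on every write, the list never grows past the cap and the newest entry is always at index 0.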

The 5-Step Recovery Flow

A full cycle from healthy operation through crash, detection, and recovery:

[Figure: One Crash, Five Recovery Steps. From healthy operation to automatic restart, with no human in the loop.]

  • Step 1, Normal: the worker writes its heartbeat to ash:hb:DataProcessor-1 (t=0s); the supervisor reads it at age=1s and marks the worker HEALTHY.
  • Step 2, Crash: the process dies and no more heartbeats arrive; the heartbeat in Redis is unchanged and grows stale.
  • Step 3, Detect: the supervisor reads the stale heartbeat (t=0s, age=30s); age > timeout, so the worker is UNHEALTHY.
  • Step 4, Resuscitate: the supervisor waits base × factorⁿ, then calls agent_factory().start() to spawn a fresh process, and increments total_restarts in ash:state:DataProcessor-1.
  • Step 5, Recover: the new worker writes a fresh heartbeat (t=40s); the supervisor reads it at age=0s, marks the worker HEALTHY, and resets consecutive_failures to 0.

The whole cycle runs without human intervention. The force_crash(agent_id) method on the supervisor is purely a demo helper — production crashes come from the worker's own unhandled exceptions.
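
The cycle can be simulated deterministically with plain dictionaries standing in for Redis and a hand-set clock. This is a sketch of the flow only: supervise_once, its parameters, and the HealthState fields are hypothetical stand-ins for what the repository calls check_and_resuscitate, and the 10-second timeout is an assumed value.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class HealthState:
    consecutive_failures: int = 0
    total_restarts: int = 0


def supervise_once(
    agent_id: str,
    heartbeats: Dict[str, float],    # stand-in for the ash:hb:{agent_id} keys
    states: Dict[str, HealthState],  # stand-in for the ash:state:{agent_id} keys
    now: float,
    timeout_s: float,
    restart: Callable[[str], None],
    backoff: Callable[[int], float] = lambda failures: 0.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    state = states.setdefault(agent_id, HealthState())
    last: Optional[float] = heartbeats.get(agent_id)
    if last is not None and now - last <= timeout_s:
        state.consecutive_failures = 0          # Step 5: recovery clears the streak
        return "healthy"
    state.consecutive_failures += 1             # Step 3: stale or missing heartbeat
    sleep(backoff(state.consecutive_failures))  # Step 4: wait, then restart
    state.total_restarts += 1
    restart(agent_id)
    return "restarted"


heartbeats = {"DataProcessor-1": 0.0}  # Step 1: heartbeat written at t=0
states: Dict[str, HealthState] = {}
clock = {"now": 1.0}


def restart(agent_id: str) -> None:
    # Stand-in for agent_factory().start(): the new worker reports in right away.
    heartbeats[agent_id] = clock["now"]


def tick() -> str:
    return supervise_once("DataProcessor-1", heartbeats, states, clock["now"], 10.0, restart)


print(tick())        # healthy (age 1s)
clock["now"] = 30.0  # Step 2: the worker died, so the heartbeat is now 30s old
print(tick())        # restarted
clock["now"] = 31.0
print(tick())        # healthy (fresh heartbeat from the new worker)
```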

Key Design Decisions

Why heartbeats in Redis instead of in-process function calls? A crashed process can't answer function calls. The supervisor has to observe workers through shared state that survives the worker's death. Redis gives the supervisor a place to read "this worker last reported healthy at T" even after the worker's process is gone.

Why exponential backoff? A tight restart loop is worse than no auto-healing at all. If an agent has a deterministic crash bug, restarting it every 100ms burns CPU, floods logs, and never recovers. Exponential backoff gives transient failures room to clear while preventing persistent bugs from causing a self-inflicted outage.

Why a separate supervisor process? The supervisor must outlive any worker it manages. If the supervisor runs in the same process as the workers, a process-wide crash takes both down. The supervisor needs its own lifecycle.

Why cap the restart count? Even with backoff, an agent that keeps crashing has a bug. Pair max_restarts with an alert (PagerDuty, Slack) so the system doesn't silently thrash forever on a code defect the auto-healer can't fix.
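
A restart cap paired with an alert hook might look like the sketch below. The function restart_or_escalate and its callback parameters are hypothetical; in a real deployment the alert callback would call into PagerDuty or Slack, which is not shown here.

```python
from typing import Callable, List


def restart_or_escalate(
    agent_id: str,
    total_restarts: int,
    max_restarts: int,
    restart: Callable[[str], None],
    alert: Callable[[str], None],
) -> bool:
    """Restart while under the cap; at or past it, alert and leave the agent down."""
    if total_restarts >= max_restarts:
        alert(f"{agent_id} reached max_restarts={max_restarts}; manual fix needed")
        return False
    restart(agent_id)
    return True


restarted: List[str] = []
alerts: List[str] = []
print(restart_or_escalate("DataProcessor-1", 2, 3, restarted.append, alerts.append))  # True
print(restart_or_escalate("DataProcessor-1", 3, 3, restarted.append, alerts.append))  # False
print(alerts)
```

As written, the alert fires on every call past the cap, so a real integration would likely deduplicate notifications.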

Why typed Pydantic models for heartbeats? A heartbeat is a contract. If a worker writes {"ts": 1234} and the supervisor reads {"timestamp": 1234}, the supervisor silently treats every heartbeat as malformed. Pydantic turns that into an explicit validation error at the boundary.
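
The field-name mismatch described above can be reproduced with a few lines of Pydantic v2. The Heartbeat fields here are assumed for illustration; the repository's model may define more or different fields.

```python
from pydantic import BaseModel, ValidationError


class Heartbeat(BaseModel):
    # Hypothetical shape of the heartbeat contract.
    agent_id: str
    timestamp: float


# A payload that honours the contract parses cleanly.
good = Heartbeat.model_validate({"agent_id": "DataProcessor-1", "timestamp": 1234})
print(good.timestamp)

# A worker that writes "ts" instead of "timestamp" fails loudly at the boundary.
missing_fields = []
try:
    Heartbeat.model_validate({"agent_id": "DataProcessor-1", "ts": 1234})
except ValidationError as exc:
    missing_fields = [err["loc"] for err in exc.errors() if err["type"] == "missing"]

print(missing_fields)
```

The unknown ts key is ignored under Pydantic's default settings, so the only reported error is the missing timestamp field.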

Tech Stack

  • LangChain — agent abstractions (WorkerAgent interface) for pluggable agent implementations.
  • Pydantic v2 — typed Heartbeat, HealthState, SupervisorAction models for everything that crosses the Redis boundary.
  • Upstash Redis — shared state store for heartbeats, health state, and the action log (LPUSH + LTRIM).
  • Streamlit — interactive dashboard with agent cards, an action table, and Plotly metrics charts for the restart counts.
  • MLflow — @mlflow.trace decorators on is_agent_healthy and check_and_resuscitate, plus log_metric for failure and restart counters.

How to Run

```bash
git clone https://github.com/mohd-vasim/ai-engineering.git
cd ai-engineering/auto-self-healing-agents
uv sync

# .env
UPSTASH_REDIS_REST_URL=...
UPSTASH_REDIS_REST_TOKEN=...
MLFLOW_TRACKING_URI=databricks
DATABRICKS_TOKEN=...
DATABRICKS_HOST=...

# Run the Streamlit dashboard
uv run streamlit run app/streamlit_app.py
```

In the dashboard you can:

  • Spawn agents with configurable crash_after_n_heartbeats and crash_probability
  • Click "Force Crash" on any agent and watch the supervisor detect the gap, apply backoff, and resuscitate
  • Inspect the per-agent health state and the full action log

What's in the Repo

  • app/core/supervisor.py — the Supervisor, DataProcessingAgent, Backoff, and check_and_resuscitate logic.
  • app/core/redis_store.py — the HeartbeatStore wrapper around Upstash Redis with typed Pydantic models.
  • app/streamlit_app.py + app/components/ — the interactive dashboard (agent cards, action table, metrics charts).
  • docs/ARCHITECTURE.md — the full Mermaid diagram set: system context, 5-step sequence, monitoring state machine, component diagram, heartbeat sequence, backoff timeline, Redis key layout, and failure/recovery state machine.
  • PLAN.md — the build plan and progress tracker.
  • tests/ — pytest suite covering health checks, backoff calculation, supervisor recovery, and the demo flow.

Limitations & What's Next

This implementation handles process-level recovery only. State persistence across restarts (so a restarted agent picks up where the previous one left off) is a separate pattern — Incremental Checkpointing — and is explicitly out of scope.

The supervisor itself is a single point of failure. For production HA, run multiple supervisors with leader election (etcd, ZooKeeper).

Possible next iterations:

  • Incremental checkpointing — persist agent state to Redis on each step, restore on restart.
  • Distributed supervision — leader election across multiple supervisor instances.
  • Container orchestrator backend — Kubernetes liveness probe + restart policy as the RestartBackend.
  • Alerting — PagerDuty / Slack notifications when an agent exceeds max_restarts.
  • Metrics & dashboards — Prometheus / OpenTelemetry integration for the failure and restart counters.