Back to blog

~/blog

Hermes Agent's "Self-Improving" Backbone: What It Actually Does (And What It Doesn't)

Jun 28, 20268 min readBy Mohammed Vasim
ai-engineeringagent-architecturehermes-agentself-improving-agentsllm-agents

Every few months, a new agent framework ships with "self-improving" stamped on the README. Most of the time, nobody asks the obvious follow-up: self-improving how, exactly?

Hermes Agent by Nous Research is the most serious attempt at this idea in the open-source space. It's a real codebase — ~500K lines, production deployments, 40+ built-in tools, 15+ platform integrations. It deserves a real look, not a README skim.

Here's what the architecture actually does.


What Hermes Claims vs What It Actually Does

Hermes markets itself as a self-improving AI agent. When you read past the homepage, the "self-improvement" cashes out as four mechanisms:

  1. Skill auto-generation — after complex tasks, the agent writes reusable markdown skill files
  2. Persistent memory — MEMORY.md and USER.md, capped and frozen per session
  3. Offline RL fine-tuning — Tinker/Atropos pipeline producing LoRA adapters
  4. Pull-based updateshermes update pulling from upstream

Three of these are genuinely useful. One of them is a trap — at least if you're using paid models like Claude, GPT-4, or Gemini.


The Two Tracks Inside Hermes

This is the diagram nobody draws when explaining Hermes:

Hermes Agent: Two Separate Tracks What actually improves — and what doesn't — when you use paid models TRACK 1 · Paid model path (Claude, GPT, Gemini) TRACK 2 · RL pipeline (local GPU only) User message any platform / CLI Memory retrieval MEMORY.md · USER.md · FTS5 Skill injection agentskills.io markdown docs Paid LLM API Claude · GPT-4 · Gemini · OpenRouter (weights never modified) Response + skill write-back if ≥5 tool calls → auto-generate skill updates memory ✓ Improves: memory · skills (prompt layer only) Trajectory collector JSONL export from agent runs Atropos API server rollout groups · reward scoring Tinker trainer GRPO · LoRA · local GPU required Local open-source model Qwen · Llama · Hermes-3 runs on vLLM / SGLang on your GPU LoRA adapter applied only affects the local model ✗ Irrelevant if using Claude / GPT / any paid API trajectories exported NO LINK LoRA ≠ paid API The Alchemist · mohdvasim.com

There are two completely independent systems running side by side.

Track 1 — The reasoning agent (where your paid model lives)

Every conversation goes through this path:

  1. User message arrives via CLI, Telegram, Slack, or any other gateway
  2. Relevant memory is retrieved from MEMORY.md and USER.md via FTS5 full-text search
  3. Applicable skills are injected into the system prompt
  4. The paid LLM API processes everything and produces a response
  5. If the task involved 5+ tool calls, the agent auto-generates or updates a skill file

This loop runs every session. It's the only thing that happens when you use Claude or GPT through Hermes.

Track 2 — The RL pipeline (completely separate)

This is Tinker/Atropos. It:

  • Collects execution trajectories as JSONL
  • Runs GRPO (Group Relative Policy Optimization) on them
  • Trains LoRA adapters
  • Applies those adapters to a locally-hosted open-source model running on vLLM or SGLang

The target model here is Qwen, Llama, Hermes-3, or any other open-weight model running on your own GPU. Not Claude. Not GPT. Not Gemini.


Why RL Is Irrelevant If You're Using Paid APIs

This is the core thing most Hermes write-ups miss entirely.

LoRA adapters are weight modifications. They are fine-tuned delta weights that sit on top of a base model. You apply them at inference time on a model you have direct access to — one running locally on your hardware via vLLM or SGLang.

You cannot apply a LoRA adapter to the Claude API. Or GPT-4. Or Gemini. You have no access to those model weights. There is no injection point.

So when Hermes trains a LoRA adapter from your conversation trajectories, that adapter improves a local open-source model — not the model doing the actual reasoning. The loop looks like this:

text
Your paid model (Claude/GPT) → collects trajectories
         ↓
Tinker-Atropos pipeline trains a LOCAL model (Qwen/Llama)
         ↓
LoRA adapter applied to LOCAL model only
         ↓
Your paid model: completely unchanged

If you don't have a local GPU setup running vLLM or SGLang, the entire RL pipeline is dead weight. You're not benefiting from it at all.

And there's another catch: training is never self-triggered. You explicitly run rl_start_training. The agent provides the data; a human presses the button. This is an offline pipeline wearing the skin of a tool, not a recursive self-improver.


What Actually Improves for Paid Model Users

If you're running Hermes with Claude, GPT-4, or any paid API, the real self-improvement mechanism is exactly two things:

1. Skills (the real workhorse)

After a complex task (5+ tool calls), Hermes captures the trajectory as a markdown skill file. Skills are stored in ~/.hermes/skills/ and follow the agentskills.io open standard — the same format used by Claude Code and Cursor.

Skills are data, not code. They are markdown files with YAML frontmatter, inline shell snippets, and parameter slots. They influence the prompt but don't modify the agent itself. You can open any skill in a text editor, read it, edit it, or delete it. That auditability matters.

Over time, this skill library compounds. Two developers running Hermes on identical hardware with identical models will have meaningfully different agents after six months — because their skill libraries reflect six months of their individual workflows. That's a real, concrete claim worth taking seriously.

2. Memory (bounded and disciplined)

Two files:

  • MEMORY.md — capped at ~2,200 characters (about 800 tokens)
  • USER.md — capped at ~1,375 characters (about 500 tokens)

The caps are tight on purpose. Cheap memory is a trap. Every token of recalled context is also a token of injection surface and a token that can't be cached.

What makes Hermes's memory implementation genuinely well-engineered:

  • File locking via fcntl — two agent processes don't clobber each other's writes
  • Frozen-snapshot pattern — memory is captured once at session start and doesn't mutate mid-session, preserving Anthropic prefix cache
  • Injection scanner on writes — memory content is scanned for prompt-injection patterns before being persisted
  • One external provider at a time — only one of Honcho, Mem0, Hindsight, etc. can run alongside built-in memory, preventing tool-schema bloat

The frozen-snapshot pattern in particular is the kind of thing most teams discover only after their billing triples. Mutating memory mid-session kills prefix caching. Hermes's architecture prevents this by design.


The Honest Assessment

Hermes is genuinely well-engineered — not because of the self-improvement marketing, but because of the operational discipline around the two mechanisms that actually matter:

  • Skills are a serious implementation of accumulated operational knowledge
  • Memory is cache-aware, injection-resistant, and bounded by design

The RL pipeline is real research infrastructure. But it's research infrastructure for teams training open-source models on their own compute. If that's you, great. If you're using Claude or GPT as most production teams are, the RL story doesn't apply at all.

What Hermes explicitly does not do for paid model users:

  • No autonomous source-code modification
  • No automatic prompt rewriting based on feedback
  • No self-grading loop (trajectories are stored as JSONL for external analysis, not consumed internally)
  • No weight modification of any kind

Read the marketing as: the agent meaningfully gets better over time without you hand-editing its Python source. That means skills plus bounded memory — and that's still a valuable system. It just isn't the recursive self-improver the framing implies.


Why This Matters for Agent Engineering

The Hermes case study surfaces a question that is underexplored in the field right now: what exactly is the harness, and how do you improve it?

Hermes answers: improve the skill and memory layer. The prompt gets better through accumulated workflow knowledge. That's real, but it's a narrow slice of what a harness is.

A full agent harness includes the system prompt, yes — but also tool definitions, memory configuration, orchestration policies, loop-breaking rules, failure recovery procedures, and runtime control logic. None of that evolves in Hermes. It stays static. Only the content injected into the prompt changes.

The more interesting question is whether an agent could safely propose modifications to its own tools, its own orchestration policies, its own failure recovery — not just its memory and skills. That would be qualitatively different from what Hermes is doing today. Whether it's even safe to let it try is a separate problem worth sitting with.

Comments (0)

No comments yet. Be the first to comment!

Leave a comment