2026-05-30 — Dan Billings

Persistent memory for a local LLM: Honcho on Hermes Agent

This is a post about giving an AI agent a real memory system. Not a chat history buffer. Not RAG over old transcripts. An actual persistent memory that builds a model of you over time, notices patterns, and can answer questions like "what do I keep getting stuck on?" without you having to remember to ask.

The project is Hermes Agent pointed at a self-hosted Honcho instance on danarch, my RTX 3070 Linux box. DeepSeek v4 Pro is the workhorse model driving the agent — cheap enough that I don't think about token cost, capable enough that I don't think about capability gaps. This is about plumbing those pieces together and what broke along the way.

Why Honcho

Honcho is an open-source memory server from honcho.co. It has a key idea that I haven't seen done well elsewhere: dialectical reasoning. Most memory systems just dump facts into a vector store and retrieve the top-K at query time. Honcho runs a background worker (the "deriver") that does deduction and induction. It looks at everything the agent observes and asks: what follows from this? What generalizations can I form? What contradictions do I need to reconcile?

The word choice is deliberate. Aristotle's dialectic was reasoning from generally accepted opinions toward something more stable; the thesis, antithesis, synthesis triad usually associated with Hegel came much later. Honcho calls it "dreaming": the deriver wakes up periodically, looks at the accumulated observations, and reasons through them. The output is conclusions: durable statements about you that survive across sessions. Some are facts ("Dan works at Foundation Medicine"). Some are patterns ("Dan prefers concise responses and gets annoyed at filler"). Some are corrections of older conclusions that turned out to be wrong.

An agent with working memory can do things a stateless agent cannot: remember your preferences without you restating them, track projects across sessions without a TODO file, notice when you're stuck on the same class of problem and preempt it. It's the myosin motor in a working muscle.

The alternative is context-stuffing: shove everything into the prompt until you hit the limit, summarize, repeat. That's a lossy compression scheme, not memory. Honcho gives you actual retrieval and actual inference over what it retrieves.
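For contrast, here is roughly what the "dump facts into a vector store and retrieve the top-K" approach amounts to. This is a minimal sketch with made-up three-dimensional embeddings, not Honcho's code; the function names and the toy store are mine. The point is that the result is always a subset of what was stored, and nothing new gets inferred.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, store, k=3):
    """Return the k stored facts most similar to the query vector.

    `store` is a list of (fact_text, embedding) pairs.
    """
    ranked = sorted(store, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

# Toy embeddings, invented for illustration only.
store = [
    ("Dan uses Arch Linux", [0.9, 0.1, 0.0]),
    ("Dan writes Scala 3", [0.1, 0.9, 0.0]),
    ("Dan prefers concise responses", [0.0, 0.2, 0.9]),
]
print(top_k([1.0, 0.0, 0.0], store, k=2))
```

A deriver-style system would sit on top of something like this and write new entries back into the store, which is the part plain retrieval never does.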

The setup

The pieces:

danarch: Arch Linux, RTX 4090 (24 GB VRAM), 64 GB RAM. Already running llama-server for Gemma 4 12B on port 8080.
Honcho: FastAPI server on port 8000, deriver worker, Postgres with pgvector for embedding storage.
nomic-embed: llama-server instance on port 8081 serving nomic-embed-text-v1.5 at Q4_K_M quantization for message and document embeddings.
Hermes Agent: running on dans-mac-mini, pointed at danarch:8000 for Honcho, danwin:8080 for LLM inference.

All of this is managed through ansible-scala — typed Scala 3 playbooks, not shell scripts. The playbook for danarch is danarch4090.scala, the Honcho module is Honcho.scala. If you want the exact systemd units and config files, they're in there. For an explanation of the typed DSL architecture, see Type-Safe Home Cluster.

What broke: the interesting parts

Getting this running wasn't a clean systemctl start. There were four real problems, each with a lesson.

1. VRAM contention: the embedding model and the LLM on the same GPU

The RTX 3070 Ti has 8 GB of VRAM. Qwen3-8B at 65K context with Q4_K_M quantization uses all of it. When I tried to run nomic-embed on the same GPU with -ngl 99 (offload all layers), CUDA OOMed immediately:

cudaMalloc failed: out of memory

The fix wasn't to reduce Qwen3's context. The fix was to run nomic-embed on CPU: -ngl 0. nomic-embed-text-v1.5 is a 137M parameter model. For embedding workloads, which are one-shot forward passes at short context (2048 tokens), it's fast enough on CPU. The alternative — cutting Qwen3's context to free VRAM — would have made the main LLM worse. Don't optimize the auxiliary system at the expense of the primary one.
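The CPU-only launch can be sketched as a small command builder. Treat this as an illustration rather than the exact unit from the playbook: the model path is hypothetical, and the --embeddings flag is what current llama.cpp builds accept as far as I know, so check llama-server --help on your version. The port, context size, and -ngl 0 come from the setup described above.

```python
import shlex

def embed_server_argv(model_path, port=8081, ctx=2048, gpu_layers=0):
    """Build a llama-server command line for a CPU-only embedding instance."""
    return [
        "llama-server",
        "-m", model_path,
        "--port", str(port),
        "-c", str(ctx),            # short context is enough for embeddings
        "-ngl", str(gpu_layers),   # 0 keeps every layer on the CPU
        "--embeddings",            # assumed flag: serve embeddings only
    ]

# Hypothetical model path; substitute wherever your GGUF file lives.
argv = embed_server_argv("/models/nomic-embed-text-v1.5.Q4_K_M.gguf")
print(shlex.join(argv))
```

Keeping the GPU layer count as a parameter makes it easy to flip back to GPU offload later if the main LLM ever moves to another machine.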

2. Embedding dimension mismatch: 768 ≠ 1536

Honcho defaults to 1536-dimensional vectors because it assumes OpenAI's text-embedding-ada-002. nomic-embed produces 768-dimensional vectors. If you don't tell Honcho the real dimension, the best case is that it rejects the vectors outright; the worst case would be storing 768 real floats alongside garbage. In this case it rejected them:

Embedding dimension mismatch for openai:nomic-embed-text-v1.5. Expected 1536, got 768.

Fix: set EMBEDDING_VECTOR_DIMENSIONS=768 in Honcho's .env file, then run the included migration script to alter the pgvector columns:

uv run python scripts/configure_embeddings.py --yes

This drops the HNSW indices, alters both documents.embedding and message_embeddings.embedding from vector(1536) to vector(768), and recreates the indices. It also refuses to run if any non-null embeddings already exist — you migrate the schema first, then populate it.
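To make those steps concrete, here is a sketch of the kind of SQL such a migration would issue. It is not the contents of configure_embeddings.py: the index names are hypothetical, and vector_cosine_ops is my assumption about the distance metric. Only the two table names and the 1536 to 768 change come from the description above.

```python
def guard_sql(table):
    """Query that should return 0 before the column type may change."""
    return f"SELECT count(*) FROM {table} WHERE embedding IS NOT NULL;"

def migration_sql(dim, targets):
    """Return SQL statements that resize pgvector columns.

    `targets` maps table name -> HNSW index name (names are hypothetical).
    """
    stmts = []
    for table, index in targets.items():
        stmts.append(f"DROP INDEX IF EXISTS {index};")
        stmts.append(
            f"ALTER TABLE {table} ALTER COLUMN embedding TYPE vector({dim});"
        )
        stmts.append(
            f"CREATE INDEX {index} ON {table} "
            "USING hnsw (embedding vector_cosine_ops);"
        )
    return stmts

targets = {
    "documents": "documents_embedding_hnsw",            # hypothetical index name
    "message_embeddings": "message_embeddings_hnsw",    # hypothetical index name
}
for stmt in migration_sql(768, targets):
    print(stmt)
```

The guard query is why the order matters: a vector(1536) column that already holds rows cannot be narrowed to vector(768) in place, so the schema change has to happen while the columns are still empty.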

You only learn this by reading the error message from the conclusions endpoint. The health check passes, and nothing complains until something tries to write a conclusion and the dimension check fires.
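A probe like the following would have caught the mismatch before Honcho did. It is a sketch that assumes the embedding server exposes an OpenAI-compatible /v1/embeddings endpoint (llama-server does in the builds I've used) and that the response has the usual data[0].embedding shape. The function names are mine.

```python
import json
import urllib.request

EXPECTED_DIM = 768  # should match EMBEDDING_VECTOR_DIMENSIONS in Honcho's .env

def fetch_embedding(base_url, text, model="nomic-embed-text-v1.5"):
    """POST to an OpenAI-compatible /v1/embeddings endpoint and decode the reply."""
    req = urllib.request.Request(
        f"{base_url}/v1/embeddings",
        data=json.dumps({"input": text, "model": model}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)

def check_dim(response, expected=EXPECTED_DIM):
    """Raise if the first returned vector has the wrong length."""
    got = len(response["data"][0]["embedding"])
    if got != expected:
        raise ValueError(f"expected {expected} dimensions, got {got}")
    return got
```

Usage would be something like check_dim(fetch_embedding("http://localhost:8081", "hello")), run once after the embedding server starts and before Honcho is pointed at it.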

3. Port conflicts from the old manual process

Before systemd, Honcho was a nohup'd fastapi run that I'd started by hand and forgotten about. When the systemd unit tried to bind port 8000, it failed with an opaque exit code 3. The old process (PID 3194611) was still running. Same deal with the deriver — there was a duplicate running from the old nohup that the systemd unit didn't know about.

Lesson: if you're migrating from ad-hoc to systemd, kill the ad-hoc processes first. Not a deep lesson, but the kind that costs you 20 minutes when you skip it.
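A pre-flight check along these lines is cheap insurance. It is a minimal sketch using only the standard library: it tells you whether something is already listening on a port, not which process owns it.

```python
import socket

def port_in_use(port, host="127.0.0.1"):
    """Return True if something already accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(1.0)
        # connect_ex returns 0 on success instead of raising an exception
        return s.connect_ex((host, port)) == 0

if __name__ == "__main__":
    # Ports from this setup: Honcho API and the embedding server.
    for port in (8000, 8081):
        state = "BUSY" if port_in_use(port) else "free"
        print(f"port {port}: {state}")
```

If a port shows as busy before the unit has ever started, there is a leftover process to find and kill first.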

4. The Hermes Honcho tools are targeting old API paths

Honcho recently upgraded from v2 to v3. The Hermes Agent built-in tools (honcho_conclude, honcho_search, etc.) are still hitting the old endpoints. They return "Failed to save conclusion" with no useful error body. In the meantime, the direct v3 API works fine:

POST /v3/workspaces/{id}/conclusions

This is the one that needs an upstream fix in Hermes. For now, I wrote conclusions via curl and let the deriver pick them up.
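For reference, the same request can be built without curl. Only the path comes from the v3 API as I used it; the workspace id and the payload shape below are placeholders, so check Honcho's v3 API reference for the real schema before sending anything.

```python
import json
import urllib.request

def conclusion_request(base_url, workspace_id, payload):
    """Build (but do not send) a POST to the v3 conclusions endpoint."""
    return urllib.request.Request(
        f"{base_url}/v3/workspaces/{workspace_id}/conclusions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Hypothetical workspace id and payload fields, for illustration only.
req = conclusion_request(
    "http://danarch:8000",
    "my-workspace",
    {"content": "Dan prefers concise responses"},
)
print(req.get_method(), req.full_url)
```

Sending it is a single urllib.request.urlopen(req) call; keeping construction separate from sending makes it easy to inspect the URL and body first.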

The DeepSeek piece

I'm running this through DeepSeek v4 Pro. Not because it's the smartest model available: it's not, and Claude and the latest OpenAI models still edge it out on hard reasoning tasks. But for the kind of work an agent does (decomposition, tool selection, file editing, summarizing, deciding what's worth remembering) it's more than good enough. And the token cost is low enough that I don't have to think about it.

Most of the work in setting up a system like this isn't a reasoning challenge. It's a coordination challenge: remembering what port something is on, checking logs, editing config files, testing, iterating. An expensive model doesn't help with that. A cheap model that reliably does what you ask does.

What the deriver actually produces

After writing the initial conclusions (fleet layout, coding style, work background, infrastructure), the deriver starts building on them. Early examples:

"Dan runs a fleet of local AI inference machines" → deriver notes the pattern (consumer GPUs, open-source models, self-hosted) and stores it as a durable fact.
"Dan's Scala 3 style: Iron refinements, 2-value enums, no booleans" → deriver picks this up and the agent starts defaulting to that style without being asked.

The dialectic layer is what makes this different from a vector search. The deriver doesn't just retrieve relevant facts — it synthesizes new ones. "Dan uses x for y" might be a direct observation. "Dan prefers native tooling over containers across all his systems" is an induction the deriver forms after seeing enough examples.

This is still early. The deriver needs a few sessions of material to work with before the inductions get interesting. But the architecture — store observations, reason over them, produce durable conclusions — is the right shape.

The playbook

If you want to reproduce this, the relevant ansible-scala files are:

src/main/scala/ansible/examples/danarch4090.scala
src/main/scala/ansible/examples/dumpdanarch4090.scala
src/main/scala/ansible/modules/honcho/Honcho.scala

It handles: cloning the repo, setting up the venv, writing the .env with the correct embedding dimensions, creating three systemd units (honcho-server, honcho-deriver, nomic-embed), downloading the embedding model, enabling everything, and configuring pgvector. The EMBEDDING_VECTOR_DIMENSIONS=768 fix is at commit 83cbbef.

Running it:

git clone https://github.com/danbills/ansible-scala /home/dan/ansible-scala
cd /home/dan/ansible-scala
sbt "runMain ansible.examples.Danarch4090"

The playbook targets localhost — it runs on the machine it's configuring. All you need is sbt, Scala 3, and a Postgres instance with the pgvector extension.

What's next

The obvious next step is making the Hermes Honcho tools work with the v3 API so I'm not writing conclusions via curl. After that: letting the deriver accumulate enough material that the agent starts anticipating rather than just responding. And possibly feeding the session transcripts back into Honcho so the deriver can analyze the agent's own behavior — recursive self-improvement as a side effect of having memory.

The cheap model makes this feasible. When you're paying pennies per thousand tokens, you can afford to let the deriver run arbitrary deduction queries in the background, and you can afford to let the agent introspect on past sessions. The compute budget stops being the limiting factor.

Repository: danbills/ansible-scala
