
How we built Engrava: from cognitive-architecture research to a production library

Sovantica · 11 min read

A deterministic dreaming cycle, a graph store that runs in your Python process, and the research behind why agent memory needs both.

You’re building an agent. It answers questions across many sessions. By session three it’s forgotten what it learned in session one. You stick a vector DB in front of it. Now it remembers a blurry average of everything, retrieved by cosine distance. Users notice it contradicts itself. It can’t tell a three-week-old preference from a stale throwaway.

You try a graph database. Neo4j, real schema, relationships. Now you’re running a separate service with its own auth model. Your install steps doubled. Your latency doubled. You’re querying Cypher from Python.

You try one of the managed memory SaaS services. You hand your agent’s entire thought-stream to a third party. Your data sits on someone else’s disk. Pricing is per-operation or per-token, so costs scale with the thing your agent is doing constantly.

The current top of the stack — Mem0 — has an open issue (#4573) where users report 97.8% of extracted memories are junk. The reason isn’t bad engineering. It’s that LLM-based extraction asks “Is this worth remembering?” and the LLM always says yes. The signal gets drowned by confident noise.

So agent memory today is one of: too flat to capture structure, too heavy to embed, too rented to keep, or too noisy to trust.

We wanted something else. Local. Structured. Deterministic. Able to forget.

How dreaming works

Every thought you store in Engrava has a priority. The priority isn’t set once at write time. It’s recomputed every time the dreaming cycle runs. The cycle is a single deterministic pass over the thoughts in the store, and it takes four signals.

Recency. How long since this thought was last read or updated. Not a hard TTL — a decay curve. Fresh things get a boost; stale things don’t.

Frequency. How many times the agent has actually touched this thought in retrieval. A fact the agent references twenty times a day matters more than a fact it hasn’t looked at in a month.

Confidence. Stored alongside every thought as a field in the range 0–1. Set when the thought is created, updated when the agent re-observes the same fact. High confidence strengthens the signal. Low confidence fades faster.

Emotional charge. The Agent Affect Signals (AAS) — 16 algorithmic flags such as surprise, contradiction, and importance — produced by the store, not by an LLM. A thought that was surprising when written gets weighted heavier than a neutral one.

The four signals combine into a single priority score. The score passes through three gates:

dreaming:
  enabled: true
  signals: [recency, frequency, confidence, emotional_charge]
  promote_threshold: 0.75  # scores above → promoted to ACTIVE
  fade_threshold: 0.2      # scores below → faded (lower visibility)
  archive_threshold: 0.05  # scores far below → archived

Above the promote threshold, a thought is promoted: it sits in the hot set, shows up earlier in hybrid search, gets higher weight in relevance scoring. Below the fade threshold, a thought fades — still retrievable, still searchable, but deprioritized. If it never comes back up through frequency or confidence updates, it drifts toward archive. Below the archive threshold, a thought is archived: it stops appearing in the default search surface but stays in the store with its audit trail intact. You can always restore it.

What this is not: no LLM, no network calls, no embedding re-computation. The dreaming cycle is a linear pass over rows in SQLite with deterministic arithmetic. Same inputs, same outputs, every time. On a store with 100,000 thoughts, it runs in under a second on a laptop.
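To make the shape of that pass concrete, here is a minimal sketch of the scoring-and-gating arithmetic. The weights, the decay half-life, and the frequency saturation below are illustrative assumptions — not Engrava’s actual formula — but the thresholds match the YAML above, and the property that matters is the same: pure arithmetic, same inputs, same outputs.

```python
import math
from dataclasses import dataclass

# Thresholds taken from the YAML config above.
PROMOTE, FADE, ARCHIVE = 0.75, 0.2, 0.05

@dataclass
class Thought:
    age_days: float    # time since last read or update
    reads: int         # how often retrieval touched this thought
    confidence: float  # 0-1, stored with the thought
    charge: float      # 0-1, derived from the affect signals

def priority(t: Thought, half_life_days: float = 14.0) -> float:
    """Combine the four signals into one score.

    The weighting here is hypothetical; the point is that it is
    deterministic arithmetic -- no LLM, no network call.
    """
    recency = math.exp(-t.age_days * math.log(2) / half_life_days)
    frequency = 1 - math.exp(-t.reads / 10)  # saturates instead of growing unbounded
    return 0.35 * recency + 0.25 * frequency + 0.25 * t.confidence + 0.15 * t.charge

def gate(score: float) -> str:
    """Apply the three gates: promote, fade, archive."""
    if score > PROMOTE:
        return "ACTIVE"
    if score < ARCHIVE:
        return "ARCHIVED"
    if score < FADE:
        return "FADED"
    return "NEUTRAL"

hot = Thought(age_days=1, reads=40, confidence=0.9, charge=0.8)
stale = Thought(age_days=120, reads=0, confidence=0.3, charge=0.1)
print(gate(priority(hot)), gate(priority(stale)))  # ACTIVE FADED
```

A dreaming cycle is this computation applied once per row, which is why it stays linear in the size of the store.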

Why “dreaming”? Because that’s what sleep does to memory — consolidation decides what gets promoted, what gets reshaped, what gets dropped. We’ll get to that next.

You configure it in YAML. You version-control your YAML. Your memory policy is code review-able.

Why dreaming exists at all

The dreaming cycle isn’t an aesthetic metaphor. It’s adapted from three concrete research traditions.

Sleep consolidation. Memory in biological systems isn’t flat — it’s layered. Experiences first land in the hippocampus as episodic traces, then, during periods of low sensory input (mostly sleep), selected traces are replayed and propagated to neocortex as semantic knowledge. The replay is selective, not exhaustive. It prefers traces that are surprising, recent, or emotionally weighted. The rest fade. The synthesis paper that names this dynamic most directly is Diekelmann & Born, “The memory function of sleep” (2010) — a review of decades of rodent and human studies showing that sleep selectively stabilizes what matters and drops what doesn’t. The pattern is selective consolidation driven by valence and recency, not write-once storage.

Hippocampal pattern separation. If you store ten similar experiences the same way, they collapse into one averaged memory. You lose the ability to tell them apart. The hippocampus solves this with pattern separation — a mechanism that pushes similar inputs into distinct representations rather than averaging them. Yassa & Stark, “Pattern separation in the hippocampus” (2011), is the canonical synthesis. The engineering lesson: don’t let a vector DB’s nearest-neighbor collapse erase the distinctions your agent actually needs. Keep the structure, even when it’s tempting to compress.

Predictive coding. Why weight memories by surprise? Because brains don’t store everything uniformly — they preferentially encode what violates expectation. Rao & Ballard, “Predictive coding in the visual cortex” (1999), framed this for perception; Andy Clark’s “Whatever next?” (2013) extended it to cognition. The shorthand: the brain is a prediction engine; surprise is the signal that prediction failed and the model needs updating. That’s exactly the semantics we want for an agent. If the world behaved as expected, we don’t need to lift the memory into the hot set. If it didn’t, we do.

Map those three into engineering choices:

  • Selective consolidation → the three-gate promote / fade / archive structure.
  • Pattern separation → thoughts stay distinct as nodes; similarity doesn’t auto-merge them.
  • Predictive coding → the emotional_charge signal rewards surprise over routine.
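The predictive-coding mapping is the easiest to show in miniature. The sketch below is not Engrava’s AAS implementation — the moving-average expectation and the clipping are assumptions for illustration — but it captures the semantics: surprise is deviation from prediction, and routine observations score near zero.

```python
# Hypothetical surprise signal: an exponential moving average stands in
# for "expectation"; the charge is how far a new observation deviates.
def surprise_charge(history: list[float], observed: float, alpha: float = 0.3) -> float:
    """Deviation of `observed` from an EMA over `history`, clipped to [0, 1]."""
    expected = history[0]
    for x in history[1:]:
        expected = alpha * x + (1 - alpha) * expected
    return min(1.0, abs(observed - expected))

routine = surprise_charge([0.5, 0.5, 0.5], 0.5)   # the world behaved as predicted
anomaly = surprise_charge([0.5, 0.5, 0.5], 0.95)  # prediction failed
print(routine, anomaly)  # routine is ~0; anomaly is well above it
```

Under this framing, the routine observation never earns a place in the hot set, and the anomaly does — which is the behavior the Rao & Ballard line of work predicts a well-adapted memory should have.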

None of this requires claiming we built a brain. We didn’t. What we did is take the selection pressures that biological memory evolved against — finite substrate, noisy input, the need to retrieve quickly — and apply the same selection pressures to an on-disk store. The mechanism is simple arithmetic. The inspiration is well-cited. The three papers above are each under 30 pages and remarkably approachable if you want to read the primary sources.

Why we didn’t pick a vector DB

We looked at every layer of the existing stack.

Vector DBs like Chroma, Pinecone, pgvector. Great at “find documents semantically similar to this query.” Blind to structure. A vector DB doesn’t know that thought A caused thought B, or that thought C is a specialization of thought D. The agent ends up with a flat bag of embeddings and reasons over semantic distance alone. Useful for retrieval. Not enough for an agent.

Graph DBs like Neo4j, Graphiti, ArangoDB. Great at structure. The cost is operational. You run a separate service with its own query language (Cypher or similar), its own persistence, its own auth. Your deployment diagram grows a node. Your latency picks up a network round-trip. Self-hosting is possible; embedding the service inside your Python process is not.

Managed memory SaaS like Mem0 and Zep. Lower friction at the top — a few lines of SDK get you started. The cost is architectural: every memory operation is a network call, your agent’s private state sits in someone else’s datacenter, and pricing scales with a thing your agent is doing constantly. Mem0 paywalls graph memory behind a Pro tier at $249/mo. Zep is credit-metered from $25/mo.

Engrava sits between. Graph from day one — first-class edges, MindQL traversal, no external service. Embedded — no separate process, no separate auth model, no separate deployment. The comparison matrix on the landing page makes the shape of this tradeoff explicit.

SQLite-inspired embedded philosophy

SQLite is in most devices you own — every iPhone, every Android, every browser. Its defining decision was embedding: it runs inside the host process, writes to a single file, ships as one library, costs zero operations. It’s also the most-tested database in the world.

We stole the posture.

Engrava is a Python library. pip install engrava. The store is a SQLite file on your disk. There is no Engrava server to deploy, no port to open, no credentials to rotate, no separate auth system to wire up. Your agent imports it the way it imports any other library.

Concretely that means:

  • Zero external infrastructure. No Redis, no Neo4j, no Postgres, no managed vector DB. Your deployment diagram is the same before and after adding Engrava.
  • Zero egress. Data never leaves the host your agent runs on. Use it inside air-gapped environments, regulated ones, or on a single-board computer.
  • Zero credits. No per-operation pricing. You install it once; it runs forever. MIT-licensed. No rug-pull economics.
  • Zero cold starts. Opening a SQLite file takes milliseconds. No connection pool, no handshake, no warmup.
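Those four zeros fall out of the substrate. The sketch below is not Engrava’s real schema or API — the table and column names are made up for illustration — but it shows what “embedded” means mechanically: Python’s standard-library sqlite3, one file (or in-memory database), no server, no handshake.

```python
import sqlite3

# Illustrative only: a toy thought store demonstrating the embedded posture.
conn = sqlite3.connect(":memory:")  # or a path like "agent_memory.db"
conn.execute("""
    CREATE TABLE IF NOT EXISTS thoughts (
        id INTEGER PRIMARY KEY,
        body TEXT NOT NULL,
        confidence REAL DEFAULT 0.5,
        priority REAL DEFAULT 0.5
    )
""")
conn.execute(
    "INSERT INTO thoughts (body, confidence) VALUES (?, ?)",
    ("user prefers dark mode", 0.9),
)
conn.commit()

# A dreaming pass over a store like this is just an UPDATE across rows:
# deterministic arithmetic in-process, no network round-trip.
row = conn.execute("SELECT body, confidence FROM thoughts").fetchone()
print(row)  # ('user prefers dark mode', 0.9)
```

Everything from connect to query happens inside the host process, which is the whole argument: the memory layer adds a file to your disk, not a node to your deployment diagram.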

The tradeoff: Engrava is not a multi-tenant fleet service. If your agent is a horizontally scaled cluster that needs shared memory across machines, you’ll need a different tool — or run a shared Engrava process behind a thin RPC, which is outside current scope. For single-agent workloads and vertical scale, embedded is strictly better.

How we worked: research → requirements → code

We didn’t start with code. We started with questions.

Two years of upstream research on cognitive architectures for long-running agents turned into a formal requirements document: 103 requirements across eight sections — functional, performance, data, architecture, security, ops, UX, and miscellaneous — each traceable back to the source it came from: a paper, an experiment, a failed prototype, a conversation.

Those requirements drove 23 Architecture Decision Records. Each ADR names the decision, the alternatives considered, the reasoning, and the expected cost. ADR-005 is why SQLite is the store and not Postgres. ADR-011 is why the color system is Aurum Journal. ADR-018 is why emotions were replaced by the 16 Agent Affect Signals. If you ever want to know why Engrava is the shape it is, the ADRs give you the audit trail.

Underneath those, 500+ Q&A records — every ambiguity we raised during design, every answer we converged on, every contradiction we left open. When we disagreed internally, we wrote it down. When an LLM reviewer disagreed, we wrote that down too. The Q&A format forced the point: if you can’t state the question precisely, you haven’t earned the right to state the answer.

Then implementation: 269 source files, 2,696 tests covering the behaviors each requirement specified. Every public method in the API maps back to a requirement. Every requirement has at least one test. When a test fails, grep tells you which REQ is broken.

This isn’t a methodology flex — it’s an audit posture. You should be able to ask “why does Engrava do X?” and get an answer that doesn’t bottom out in “someone preferred it that way.” The paper trail runs all the way up. If you want to fork Engrava and take it in a different direction, the trail is your starting point.

The tamper-evident audit trail inside the store — the SHA-256 hash chain that the journal: config block turns on — is the runtime mirror of this build-time posture. Transparency isn’t a marketing word. It’s a pipeline we already had to run to ship the thing.
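The mechanics of a tamper-evident journal are small enough to sketch. The record format and field names below are assumptions for illustration, not Engrava’s actual journal layout; what carries over is the invariant a SHA-256 hash chain enforces — each entry commits to its predecessor, so editing history anywhere breaks verification everywhere after it.

```python
import hashlib
import json

def append_entry(chain: list[dict], event: dict) -> None:
    """Append an event whose hash covers both the event and the previous hash."""
    prev = chain[-1]["hash"] if chain else "0" * 64
    payload = json.dumps({"prev": prev, "event": event}, sort_keys=True)
    chain.append({"prev": prev, "event": event,
                  "hash": hashlib.sha256(payload.encode()).hexdigest()})

def verify(chain: list[dict]) -> bool:
    """Recompute every link; any edit to past entries breaks the chain."""
    prev = "0" * 64
    for entry in chain:
        payload = json.dumps({"prev": prev, "event": entry["event"]}, sort_keys=True)
        if entry["prev"] != prev or entry["hash"] != hashlib.sha256(payload.encode()).hexdigest():
            return False
        prev = entry["hash"]
    return True

journal: list[dict] = []
append_entry(journal, {"op": "create", "thought": "t1"})
append_entry(journal, {"op": "promote", "thought": "t1"})
print(verify(journal))                 # True
journal[0]["event"]["op"] = "delete"   # rewrite history...
print(verify(journal))                 # ...and verification fails: False
```

That failure mode is the feature: an archived or faded thought can be restored, but the record of what happened to it cannot be silently rewritten.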

What’s next

We’re shipping the Free tier first. Engrava v0.2.0 is live on PyPI. Graph, dreaming, hybrid search, audit, MindQL — all in.

What’s coming:

  • Benchmarks — a public rig comparing Engrava against Mem0, Zep, and ChromaDB on retrieval quality and latency. Reproducible, apples-to-apples, not cherry-picked configs.
  • LLM extensions — optional hooks for LLM-assisted relationship extraction and summary consolidation, off by default, for teams that want them.
  • Graph embeddings — vector representations of subgraphs, not just nodes, for retrieval that respects structure.

If you’re building an agent right now, the fastest way to find out whether Engrava fits is to try it.

pip install engrava

Star the repo on GitHub. Open a discussion if you want a feature. Open an issue if something breaks. We read everything.

Tags: engrava · memory · agents · architecture
