
How we built Engrava: from cognitive-architecture research to a production library

Sovantica · 11 min read

A deterministic dreaming cycle, a graph store that runs in your Python process, and the research behind why agent memory needs both.

You’re building an agent. It answers questions across many sessions. By session three it’s forgotten what it learned in session one. You stick a vector DB in front of it. Now it remembers a blurry average of everything, retrieved by cosine distance. Users notice it contradicts itself. It can’t tell a three-week-old preference from a stale throwaway.

You try a graph database. Neo4j, real schema, relationships. Now you’re running a separate service with its own auth model. Your install steps doubled. Your latency doubled. You’re querying Cypher from Python.

You try one of the managed memory SaaS services. You hand your agent’s entire thought-stream to a third party. Your data sits on someone else’s disk. Pricing is per-operation or per-token, so costs scale with the thing your agent is doing constantly.

The current top of the stack — Mem0 — has an open issue (#4573) where users report 97.8% of extracted memories are junk. The reason isn’t bad engineering. It’s that LLM-based extraction asks “Is this worth remembering?” and the LLM always says yes. The signal gets drowned by confident noise.

So agent memory today is one of: too flat to capture structure, too heavy to embed, too rented to keep, or too noisy to trust.

We wanted something else. Local. Structured. Deterministic. Able to forget.

How dreaming works

Every thought you store in Engrava has a priority. The priority isn’t set once at write time. It’s recomputed every time the dreaming cycle runs. The cycle is a single deterministic pass over the thoughts in the store, and it takes four signals.

Recency. How long since this thought was last read or updated. Not a hard TTL — a decay curve. Fresh things get a boost; stale things don’t.

Frequency. How many times the agent has actually touched this thought in retrieval. A fact the agent references twenty times a day matters more than a fact it hasn’t looked at in a month.

Confidence. Stored alongside every thought as a field in the range 0–1. Set when the thought is created, updated when the agent re-observes the same fact. High confidence strengthens the signal. Low confidence fades faster.

Emotional charge. The Agent Affect Signals (AAS) — 16 algorithmic flags such as surprise, contradiction, and importance — produced by the store, not by an LLM. A thought that was surprising when written gets weighted heavier than a neutral one.

The four signals combine into a single priority score. The score passes through three gates:

dreaming:
  enabled: true
  signals: [recency, frequency, confidence, emotional_charge]
  promote_threshold: 0.75  # scores above → promoted to ACTIVE
  fade_threshold: 0.2      # scores below → faded (lower visibility)
  archive_threshold: 0.05  # scores far below → archived

Above the promote threshold, a thought is promoted: it sits in the hot set, shows up earlier in hybrid search, gets higher weight in relevance scoring. Below the fade threshold, a thought fades — still retrievable, still searchable, but deprioritized. If it never comes back up through frequency or confidence updates, it drifts toward archive. Below the archive threshold, a thought is archived: it stops appearing in the default search surface but stays in the store with its audit trail intact. You can always restore it.

What this is not: no LLM, no network calls, no embedding re-computation. The dreaming cycle is a linear pass over rows in SQLite with deterministic arithmetic. Same inputs, same outputs, every time. On a store with 100,000 thoughts, it runs in under a second on a laptop.
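To make the shape of that pass concrete, here is a minimal sketch of the scoring-and-gating arithmetic. The weights, the decay half-life, and the frequency saturation below are illustrative assumptions — not Engrava’s actual formula — but the thresholds match the YAML above, and the property that matters is the same: pure arithmetic, same inputs, same outputs.

```python
import math
from dataclasses import dataclass

# Thresholds taken from the YAML config above.
PROMOTE, FADE, ARCHIVE = 0.75, 0.2, 0.05

@dataclass
class Thought:
    age_days: float    # time since last read or update
    reads: int         # how often retrieval touched this thought
    confidence: float  # 0-1, stored with the thought
    charge: float      # 0-1, derived from the affect signals

def priority(t: Thought, half_life_days: float = 14.0) -> float:
    """Combine the four signals into one score.

    The weighting here is hypothetical; the point is that it is
    deterministic arithmetic -- no LLM, no network call.
    """
    recency = math.exp(-t.age_days * math.log(2) / half_life_days)
    frequency = 1 - math.exp(-t.reads / 10)  # saturates instead of growing unbounded
    return 0.35 * recency + 0.25 * frequency + 0.25 * t.confidence + 0.15 * t.charge

def gate(score: float) -> str:
    """Apply the three gates: promote, fade, archive."""
    if score > PROMOTE:
        return "ACTIVE"
    if score < ARCHIVE:
        return "ARCHIVED"
    if score < FADE:
        return "FADED"
    return "NEUTRAL"

hot = Thought(age_days=1, reads=40, confidence=0.9, charge=0.8)
stale = Thought(age_days=120, reads=0, confidence=0.3, charge=0.1)
print(gate(priority(hot)), gate(priority(stale)))  # ACTIVE FADED
```

A dreaming cycle is this computation applied once per row, which is why it stays linear in the size of the store.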

Why “dreaming”? Because that’s what sleep does to memory — consolidation decides what gets promoted, what gets reshaped, what gets dropped. We’ll get to that next.

You configure it in YAML. You version-control your YAML. Your memory policy is code review-able.

Why dreaming exists at all

The dreaming cycle isn’t an aesthetic metaphor. It’s adapted from three concrete research traditions.

Sleep consolidation. Memory in biological systems isn’t flat — it’s layered. Experiences first land in the hippocampus as episodic traces, then, during periods of low sensory input (mostly sleep), selected traces are replayed and propagated to neocortex as semantic knowledge. The replay is selective, not exhaustive. It prefers traces that are surprising, recent, or emotionally weighted. The rest fade. The synthesis paper that names this dynamic most directly is Diekelmann & Born, “The memory function of sleep” (2010) — a review of decades of rodent and human studies showing that sleep selectively stabilizes what matters and drops what doesn’t. The pattern is selective consolidation driven by valence and recency, not write-once storage.

Hippocampal pattern separation. If you store ten similar experiences the same way, they collapse into one averaged memory. You lose the ability to tell them apart. The hippocampus solves this with pattern separation — a mechanism that pushes similar inputs into distinct representations rather than averaging them. Yassa & Stark, “Pattern separation in the hippocampus” (2011), is the canonical synthesis. The engineering lesson: don’t let a vector DB’s nearest-neighbor collapse erase the distinctions your agent actually needs. Keep the structure, even when it’s tempting to compress.

Predictive coding. Why weight memories by surprise? Because brains don’t store everything uniformly — they preferentially encode what violates expectation. Rao & Ballard, “Predictive coding in the visual cortex” (1999), framed this for perception; Andy Clark’s “Whatever next?” (2013) extended it to cognition. The shorthand: the brain is a prediction engine; surprise is the signal that prediction failed and the model needs updating. That’s exactly the semantics we want for an agent. If the world behaved as expected, we don’t need to lift the memory into the hot set. If it didn’t, we do.

Map those three into engineering choices:

  • Selective consolidation → the three-gate promote / fade / archive structure.
  • Pattern separation → thoughts stay distinct as nodes; similarity doesn’t auto-merge them.
  • Predictive coding → the emotional_charge signal rewards surprise over routine.
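The predictive-coding mapping is the easiest to show in miniature. The sketch below is not Engrava’s AAS implementation — the moving-average expectation and the clipping are assumptions for illustration — but it captures the semantics: surprise is deviation from prediction, and routine observations score near zero.

```python
# Hypothetical surprise signal: an exponential moving average stands in
# for "expectation"; the charge is how far a new observation deviates.
def surprise_charge(history: list[float], observed: float, alpha: float = 0.3) -> float:
    """Deviation of `observed` from an EMA over `history`, clipped to [0, 1]."""
    expected = history[0]
    for x in history[1:]:
        expected = alpha * x + (1 - alpha) * expected
    return min(1.0, abs(observed - expected))

routine = surprise_charge([0.5, 0.5, 0.5], 0.5)   # the world behaved as predicted
anomaly = surprise_charge([0.5, 0.5, 0.5], 0.95)  # prediction failed
print(routine, anomaly)  # routine is ~0; anomaly is well above it
```

Under this framing, the routine observation never earns a place in the hot set, and the anomaly does — which is the behavior the Rao & Ballard line of work predicts a well-adapted memory should have.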

None of this requires claiming we built a brain. We didn’t. What we did is take the selection pressures that biological memory evolved against — finite substrate, noisy input, the need to retrieve quickly — and apply the same selection pressures to an on-disk store. The mechanism is simple arithmetic. The inspiration is well-cited. The three papers above are each under 30 pages and remarkably approachable if you want to read the primary sources.

Why we didn’t pick a vector DB

We looked at every layer of the existing stack.

Vector DBs like Chroma, Pinecone, pgvector. Great at “find documents semantically similar to this query.” Blind to structure. A vector DB doesn’t know that thought A caused thought B, or that thought C is a specialization of thought D. The agent ends up with a flat bag of embeddings and reasons over semantic distance alone. Useful for retrieval. Not enough for an agent.

Graph DBs like Neo4j, Graphiti, ArangoDB. Great at structure. The cost is operational. You run a separate service with its own query language (Cypher or similar), its own persistence, its own auth. Your deployment diagram grows a node. Your latency picks up a network round-trip. Self-hosting is possible; embedding the service inside your Python process is not.

Managed memory SaaS like Mem0 and Zep. Lower friction at the top — a few lines of SDK get you started. The cost is architectural: every memory operation is a network call, your agent’s private state sits in someone else’s datacenter, and pricing scales with a thing your agent is doing constantly. Mem0 paywalls graph memory behind a Pro tier at $249/mo. Zep is credit-metered from $25/mo.

Engrava sits between. Graph from day one — first-class edges, MindQL traversal, no external service. Embedded — no separate process, no separate auth model, no separate deployment. The comparison matrix on the landing page makes the shape of this tradeoff explicit.

SQLite-inspired embedded philosophy

SQLite is in most devices you own — every iPhone, every Android, every browser. Its defining decision was embedding: it runs inside the host process, writes to a single file, ships as one library, costs zero operations. It’s also the most-tested database in the world.

We stole the posture.

Engrava is a Python library. pip install engrava. The store is a SQLite file on your disk. There is no Engrava server to deploy, no port to open, no credentials to rotate, no separate auth system to wire up. Your agent imports it the way it imports any other library.

Concretely that means:

  • Zero external infrastructure. No Redis, no Neo4j, no Postgres, no managed vector DB. Your deployment diagram is the same before and after adding Engrava.
  • Zero egress. Data never leaves the host your agent runs on. Use it inside air-gapped environments, regulated ones, or on a single-board computer.
  • Zero credits. No per-operation pricing. You install it once; it runs forever. MIT-licensed. No rug-pull economics.
  • Zero cold starts. Opening a SQLite file takes milliseconds. No connection pool, no handshake, no warmup.
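Those four zeros fall out of the substrate. The sketch below is not Engrava’s real schema or API — the table and column names are made up for illustration — but it shows what “embedded” means mechanically: Python’s standard-library sqlite3, one file (or in-memory database), no server, no handshake.

```python
import sqlite3

# Illustrative only: a toy thought store demonstrating the embedded posture.
conn = sqlite3.connect(":memory:")  # or a path like "agent_memory.db"
conn.execute("""
    CREATE TABLE IF NOT EXISTS thoughts (
        id INTEGER PRIMARY KEY,
        body TEXT NOT NULL,
        confidence REAL DEFAULT 0.5,
        priority REAL DEFAULT 0.5
    )
""")
conn.execute(
    "INSERT INTO thoughts (body, confidence) VALUES (?, ?)",
    ("user prefers dark mode", 0.9),
)
conn.commit()

# A dreaming pass over a store like this is just an UPDATE across rows:
# deterministic arithmetic in-process, no network round-trip.
row = conn.execute("SELECT body, confidence FROM thoughts").fetchone()
print(row)  # ('user prefers dark mode', 0.9)
```

Everything from connect to query happens inside the host process, which is the whole argument: the memory layer adds a file to your disk, not a node to your deployment diagram.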

The tradeoff: Engrava is not a multi-tenant fleet service. If your agent is a horizontally scaled cluster that needs shared memory across machines, you’ll need a different tool — or run a shared Engrava process behind a thin RPC, which is outside current scope. For single-agent workloads and vertical scale, embedded is strictly better.

How we worked: research → requirements → code

We didn’t start with code. We started with questions.

Two years of upstream research on cognitive architectures for long-running agents turned into a formal requirements document: 103 requirements across eight sections — functional, performance, data, architecture, security, ops, UX, and miscellaneous — each traceable back to the source it came from: a paper, an experiment, a failed prototype, a conversation.

Those requirements drove 23 Architecture Decision Records. Each ADR names the decision, the alternatives considered, the reasoning, and the expected cost. ADR-005 is why SQLite is the store and not Postgres. ADR-011 is why the color system is Aurum Journal. ADR-018 is why emotions were replaced by the 16 Agent Affect Signals. If you ever want to know why Engrava is the shape it is, the ADRs give you the audit trail.

Underneath those, 500+ Q&A records — every ambiguity we raised during design, every answer we converged on, every contradiction we left open. When we disagreed internally, we wrote it down. When an LLM reviewer disagreed, we wrote that down too. The Q&A format forced the point: if you can’t state the question precisely, you haven’t earned the right to state the answer.

Then implementation: 269 source files, 2,696 tests covering the behaviors each requirement specified. Every public method in the API maps back to a requirement. Every requirement has at least one test. When a test fails, grep tells you which REQ is broken.

This isn’t a methodology flex — it’s an audit posture. You should be able to ask “why does Engrava do X?” and get an answer that doesn’t bottom out in “someone preferred it that way.” The paper trail runs all the way up. If you want to fork Engrava and take it in a different direction, the trail is your starting point.

The tamper-evident audit trail inside the store — the SHA-256 hash chain that the journal: config block turns on — is the runtime mirror of this build-time posture. Transparency isn’t a marketing word. It’s a pipeline we already had to run to ship the thing.
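The mechanics of a tamper-evident journal are small enough to sketch. The record format and field names below are assumptions for illustration, not Engrava’s actual journal layout; what carries over is the invariant a SHA-256 hash chain enforces — each entry commits to its predecessor, so editing history anywhere breaks verification everywhere after it.

```python
import hashlib
import json

def append_entry(chain: list[dict], event: dict) -> None:
    """Append an event whose hash covers both the event and the previous hash."""
    prev = chain[-1]["hash"] if chain else "0" * 64
    payload = json.dumps({"prev": prev, "event": event}, sort_keys=True)
    chain.append({"prev": prev, "event": event,
                  "hash": hashlib.sha256(payload.encode()).hexdigest()})

def verify(chain: list[dict]) -> bool:
    """Recompute every link; any edit to past entries breaks the chain."""
    prev = "0" * 64
    for entry in chain:
        payload = json.dumps({"prev": prev, "event": entry["event"]}, sort_keys=True)
        if entry["prev"] != prev or entry["hash"] != hashlib.sha256(payload.encode()).hexdigest():
            return False
        prev = entry["hash"]
    return True

journal: list[dict] = []
append_entry(journal, {"op": "create", "thought": "t1"})
append_entry(journal, {"op": "promote", "thought": "t1"})
print(verify(journal))                 # True
journal[0]["event"]["op"] = "delete"   # rewrite history...
print(verify(journal))                 # ...and verification fails: False
```

That failure mode is the feature: an archived or faded thought can be restored, but the record of what happened to it cannot be silently rewritten.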

What’s next

We’re shipping the Free tier first. Engrava v0.2.0 is live on PyPI. Graph, dreaming, hybrid search, audit, MindQL — all in.

What’s coming:

  • Benchmarks — a public rig comparing Engrava against Mem0, Zep, and ChromaDB on retrieval quality and latency. Reproducible, apples-to-apples, not cherry-picked configs.
  • LLM extensions — optional hooks for LLM-assisted relationship extraction and summary consolidation, off by default, for teams that want them.
  • Graph embeddings — vector representations of subgraphs, not just nodes, for retrieval that respects structure.

If you’re building an agent right now, the fastest way to find out whether Engrava fits is to try it.

pip install engrava

Star the repo on GitHub. Open a discussion if you want a feature. Open an issue if something breaks. We read everything.

Tags: engrava · memory · agents · architecture
