Observability

engrava exposes a snapshot metrics API via await store.metrics(). The returned EngravaMetrics dataclass aggregates thought/edge counts, storage footprint, and a rolling-window search-latency histogram.

Quick Example

import asyncio

import aiosqlite
from engrava import SqliteEngravaCore


async def main() -> None:
    conn = await aiosqlite.connect("engrava.db")
    conn.row_factory = aiosqlite.Row  # rows addressable by column name
    store = SqliteEngravaCore(conn)
    await store.ensure_schema()
    try:
        metrics = await store.metrics()
        print(metrics.thoughts.total)         # total thought count
        print(metrics.edges.by_type)          # edge counts by edge type
        print(metrics.search_latency.p95_ms)  # rolling-window p95 latency (ms)
    finally:
        await conn.close()


asyncio.run(main())

Metrics Fields

store.metrics() returns a stable EngravaMetrics dataclass with:

  • thoughts — counts by type and lifecycle status
  • edges — counts by edge type
  • storage — on-disk footprint for the main SQLite database and WAL
  • search_latency — rolling-window p50/p95/p99 search latency

Configuration

metrics:
  enabled: true
  window_size: 1000

When enabled: false, store.metrics() returns a zero-filled snapshot and does not issue SQL queries.

CLI

engrava info renders the same snapshot contract used by the Python API:

engrava --db mydata.db info
engrava --db mydata.db --format json info

Notes

  • The latency histogram tracks completed public search calls.
  • Nested calls inside search_hybrid() are suppressed, so one hybrid search contributes one latency sample.
  • This snapshot API tracks only aggregate counts and search latency — not individual events.
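
To make the rolling-window behaviour concrete, here is a minimal sketch of how a fixed-size latency window can produce percentiles. It is an illustration only, not Engrava's implementation: the RollingLatency class, its method names, and the nearest-rank percentile method are all assumptions made for this example.

```python
import math
from collections import deque


class RollingLatency:
    """Hypothetical helper: keep the newest `window_size` samples only."""

    def __init__(self, window_size: int = 1000) -> None:
        # deque(maxlen=...) drops the oldest sample once the window is full.
        self._samples = deque(maxlen=window_size)

    def record(self, elapsed_ms: float) -> None:
        """Add one completed search's latency to the window."""
        self._samples.append(elapsed_ms)

    def percentile(self, pct: float) -> float:
        """Nearest-rank percentile; 0.0 when the window is empty."""
        if not self._samples:
            return 0.0
        ordered = sorted(self._samples)
        rank = max(1, math.ceil(pct / 100 * len(ordered)))
        return ordered[rank - 1]
```

Because old samples fall out of the window, a single slow search stops affecting p99 once enough newer searches have completed.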

Production monitoring

store.metrics() is a pull snapshot — there is no built-in exporter. To monitor a deployment, scrape the snapshot on an interval and feed the fields into your metrics system (Prometheus, OpenTelemetry, StatsD, …).

Exporting the snapshot

The snapshot is a plain dataclass, so mapping it to any client is straightforward. A Prometheus example:

from prometheus_client import Gauge

THOUGHTS = Gauge("engrava_thoughts_total", "Total thoughts")
DB_BYTES = Gauge("engrava_db_bytes", "Main database size in bytes")
WAL_BYTES = Gauge("engrava_wal_bytes", "WAL size in bytes")
SEARCH_P95 = Gauge("engrava_search_p95_ms", "Search p95 latency (ms)")
SEARCH_P99 = Gauge("engrava_search_p99_ms", "Search p99 latency (ms)")


async def collect(store) -> None:
    m = await store.metrics()
    THOUGHTS.set(m.thoughts.total)
    DB_BYTES.set(m.storage.db_bytes)
    WAL_BYTES.set(m.storage.wal_bytes)
    SEARCH_P95.set(m.search_latency.p95_ms)
    SEARCH_P99.set(m.search_latency.p99_ms)

The full field set on EngravaMetrics: thoughts (total, by_type, by_status), edges (total, by_type), storage (db_bytes, wal_bytes, vec_index_bytes, total_bytes), and search_latency (sample_count, p50_ms, p95_ms, p99_ms, min_ms, max_ms, mean_ms). The snapshot also carries schema_version and snapshot_timestamp.
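
If you would rather export every numeric field than hand-pick gauges, one option is to flatten the snapshot into name/value pairs. The sketch below assumes the nested groups (thoughts, edges, storage, search_latency) are themselves dataclasses and that by_type / by_status are plain dicts. The flatten helper and the engrava_ prefix are hypothetical names chosen for this example.

```python
from dataclasses import asdict
from typing import Any, Dict


def flatten(snapshot: Any, prefix: str = "engrava") -> Dict[str, float]:
    """Flatten a nested dataclass into {"prefix_group_field": number} pairs."""
    out: Dict[str, float] = {}

    def walk(node: Any, path: str) -> None:
        if isinstance(node, dict):
            for key, value in node.items():
                walk(value, f"{path}_{key}")
        elif isinstance(node, bool):
            return  # bool is an int subclass; skip it rather than export 0/1
        elif isinstance(node, (int, float)):
            out[path] = float(node)
        # Non-numeric leaves (strings, timestamps, None) are skipped.

    # asdict() recursively converts nested dataclasses into plain dicts.
    walk(asdict(snapshot), prefix)
    return out
```

With this shape, a field such as storage.db_bytes would come out under the key engrava_storage_db_bytes, and each entry of a by_type dict gets its own key.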

Scrape cadence

Treat metrics() like any pull endpoint: a 30–60 s scrape interval is typically plenty. Counts and storage change slowly, and the latency histogram is a rolling window (metrics.window_size, default 1000 samples) that already smooths short spikes. Avoid sub-second scrapes — each call runs a few aggregate SQL queries.
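
A minimal sketch of an interval scraper built on asyncio follows. The scrape_loop name, its parameters, and the stop event are assumptions for illustration; the loop only expects a zero-argument async callable that performs one scrape.

```python
import asyncio
import logging
from typing import Awaitable, Callable, Optional

log = logging.getLogger(__name__)


async def scrape_loop(
    collect: Callable[[], Awaitable[None]],
    interval_s: float = 30.0,
    stop: Optional[asyncio.Event] = None,
) -> None:
    """Call `collect()` every `interval_s` seconds until `stop` is set."""
    stop = stop or asyncio.Event()
    while not stop.is_set():
        try:
            await collect()
        except Exception:
            # One failed scrape should not end monitoring; log and retry later.
            log.exception("metrics scrape failed")
        try:
            # Sleep for the interval, but wake immediately on shutdown.
            await asyncio.wait_for(stop.wait(), timeout=interval_s)
        except asyncio.TimeoutError:
            pass
```

Paired with the Prometheus collect(store) function above, this could be started as asyncio.create_task(scrape_loop(lambda: collect(store), interval_s=30.0)).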

What to alert on

Signal | Source field | Alert when…
------ | ------------ | -----------
Storage growth | storage.db_bytes, storage.total_bytes | size approaches your disk budget, or grows unexpectedly fast
WAL not checkpointing | storage.wal_bytes | the WAL keeps growing and never shrinks (checkpoints not happening)
Search latency | search_latency.p95_ms / p99_ms | p95/p99 exceeds your budget
Expired backlog | count_thoughts(include_expired=True) - count_thoughts() | the number of expired-but-not-cleaned thoughts grows (run engrava gc --expired)
Audit integrity | store.journal.verify_integrity() (journaling only) | the chain fails verification (tampering or corruption)

The expired-backlog and audit-integrity signals are not in the metrics snapshot — compute them from the calls shown above on your own cadence.
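
The expired-backlog figure can be computed as the difference between the two counts. This is a sketch that assumes count_thoughts() returns a plain integer; the expired_backlog helper name is hypothetical.

```python
async def expired_backlog(store) -> int:
    """Expired-but-not-cleaned thoughts: all rows minus live rows."""
    with_expired = await store.count_thoughts(include_expired=True)
    live = await store.count_thoughts()
    # The two reads are not atomic, so clamp in case a write lands in between.
    return max(0, with_expired - live)
```

Export the result as a gauge alongside the snapshot fields and alert when it trends upward between gc runs.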

The audit-integrity check applies only when journaling is enabled (journal.enabled: true — see Configuration → journal). With journaling off, store.journal is None, so guard the call:

async def journal_ok(store) -> bool:
    if store.journal is None:
        return True  # journaling disabled — nothing to verify
    result = await store.journal.verify_integrity()
    return result.valid

Health check

For a readiness probe you want a call that actually touches the database. metrics() is not reliable for this: with metrics.enabled: false it returns a zero-filled snapshot without issuing any SQL, so it would report healthy even if the database were unreadable. Use a lightweight real read instead. count_thoughts() always queries the database, regardless of the metrics setting:

async def healthcheck(store) -> bool:
    try:
        await store.count_thoughts()  # issues SQL — confirms DB + schema are readable
    except Exception:
        return False
    return True

(If you know metrics are enabled in your deployment, await store.metrics() works too and additionally returns the live counts.)
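
Readiness probes usually have a deadline, so a hung database should count as unhealthy rather than block the probe. The variant below is a sketch that wraps the same count_thoughts() read in asyncio.wait_for; the function name and the 2-second default are assumptions to adjust for your environment.

```python
import asyncio


async def healthcheck_with_timeout(store, timeout_s: float = 2.0) -> bool:
    """Return False if the read fails or does not finish within `timeout_s`."""
    try:
        await asyncio.wait_for(store.count_thoughts(), timeout=timeout_s)
    except Exception:
        # Covers asyncio.TimeoutError as well as database errors.
        return False
    return True
```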

Logging

The library logs through the standard logging module under the engrava.* namespace: each module uses logging.getLogger(__name__), e.g. engrava.extensions.dreaming or engrava.config. It logs at WARNING (degraded conditions, e.g. sqlite-vec unavailable → numpy fallback), INFO (dreaming progress), and DEBUG (detailed internals). It does not log at ERROR or CRITICAL; failures are raised as typed exceptions for the caller to handle. Configure it like any library logger:

import logging

logging.getLogger("engrava").setLevel(logging.WARNING)  # quiet, production default
# logging.getLogger("engrava").setLevel(logging.INFO)   # see dreaming activity
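
Since WARNING records signal degraded conditions, it can be useful to count them and export the count next to the snapshot. The sketch below uses only the standard logging module; the LevelCounter class is a hypothetical helper, not part of the library.

```python
import logging
from collections import Counter


class LevelCounter(logging.Handler):
    """Hypothetical handler: count WARNING+ records per (logger, level)."""

    def __init__(self) -> None:
        super().__init__(level=logging.WARNING)
        self.counts = Counter()

    def emit(self, record: logging.LogRecord) -> None:
        self.counts[(record.name, record.levelname)] += 1


# Records from child loggers (engrava.config, ...) propagate to "engrava".
counter = LevelCounter()
logging.getLogger("engrava").addHandler(counter)
```

A scrape can then read counter.counts and publish, for example, the number of warnings seen per logger since startup.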

Out of scope

The snapshot is deliberately small. It does not include:

  • write / mutation counters or error counters — track those at your application layer (Engrava raises typed exceptions you can count there);
  • dreaming metrics: run_consolidation() returns a ConsolidationResult (promoted / edges / reflections counts) per run; consume that directly;
  • journal size or per-event audit metrics — the audit history lives in the journal itself, which you query and verify directly, not via the metrics snapshot.
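
For the error counters mentioned in the first bullet, one application-layer approach is to wrap store calls in a context manager that counts exceptions by type and re-raises them. This is a sketch: the ERRORS counter, the count_errors name, and the operation labels are hypothetical, and it makes no assumption about which exception classes Engrava raises.

```python
from collections import Counter
from contextlib import asynccontextmanager

# Keyed by (operation label, exception class name).
ERRORS = Counter()


@asynccontextmanager
async def count_errors(operation: str):
    """Count any exception raised inside the block, then re-raise it."""
    try:
        yield
    except Exception as exc:
        ERRORS[(operation, type(exc).__name__)] += 1
        raise
```

Usage would look like `async with count_errors("search"): ...` around each store call, with ERRORS exported on the same cadence as the snapshot.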