Rust · single binary · bounded memory
greatmemory
A fast, bounded-memory engine for AI agents. Remember, recall, and distill facts — in a single binary that stays in the hundreds of megabytes no matter how much you feed it.
Memory engines have a memory problem
The common pattern in agent-memory tools — local embeddings running in pooled WASM workers — grows multi-gigabyte heaps that are never returned to the operating system. Holding a few megabytes of notes can cost gigabytes of RAM. greatmemory was designed so that bug class cannot exist: one native ONNX model instance, a bounded ingestion queue with backpressure, and a fixed batch size. Peak memory is model weights plus one batch of activations — a constant, independent of how much you ingest. And it is not a promise on a landing page: it is a regression test that runs in CI and fails the build if it ever stops being true.
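The bounded queue, backpressure, and fixed batch size described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not greatmemory's actual code: `QUEUE_CAPACITY`, `BATCH_SIZE`, `embed_batch`, and `ingest` are hypothetical names and values, and the embedding step is a stand-in for the real ONNX model.

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

// Hypothetical values for illustration; the real limits are not stated here.
const QUEUE_CAPACITY: usize = 8;
const BATCH_SIZE: usize = 4;

/// Stand-in for the single model instance: one fake vector per chunk.
fn embed_batch(batch: &[String]) -> Vec<Vec<f32>> {
    batch.iter().map(|chunk| vec![chunk.len() as f32]).collect()
}

/// Pushes `chunks` through a bounded queue to one consumer thread.
/// Returns (embeddings produced, largest batch embedded at once).
fn ingest(chunks: Vec<String>) -> (usize, usize) {
    let (tx, rx) = sync_channel::<String>(QUEUE_CAPACITY);

    let consumer = thread::spawn(move || {
        let mut total = 0usize;
        let mut largest = 0usize;
        let mut batch: Vec<String> = Vec::with_capacity(BATCH_SIZE);
        for chunk in rx {
            batch.push(chunk);
            if batch.len() == BATCH_SIZE {
                largest = largest.max(batch.len());
                total += embed_batch(&batch).len();
                batch.clear(); // reuse the same allocation for the next batch
            }
        }
        // Flush the final partial batch once the producer is done.
        if !batch.is_empty() {
            largest = largest.max(batch.len());
            total += embed_batch(&batch).len();
        }
        (total, largest)
    });

    for chunk in chunks {
        // `send` blocks while the queue is full: this is the backpressure.
        tx.send(chunk).expect("consumer hung up");
    }
    drop(tx); // closing the channel lets the consumer loop end
    consumer.join().expect("consumer panicked")
}
```

In this sketch, at most `QUEUE_CAPACITY` chunks wait in the queue and at most `BATCH_SIZE` chunks are embedded at once, so the working set does not depend on how many chunks are ingested in total.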
Measured, not claimed
Both charts below are drawn from real test runs — no synthetic numbers. The full methodology, raw output, and a one-line command to reproduce them on your own machine are under the charts.
Memory stays flat while data grows
Resident memory while ingesting 51.2 MB across 1,000 documents (31,000 chunks) through the full pipeline, with real BGE-small embeddings and real SQLite. RSS at 2.6 MB ingested: ~536 MB. At 51.2 MB: ~541 MB. Peak over the whole run: ~588 MB.
Footprint at a glance
Idle, under full load, and the ceiling CI enforces on every push.
Evidence & how to reproduce
Full methodology →
- The line chart is the verbatim output of crates/gm-cli/tests/ram_regression.rs run on 2026-06-12 (Apple M1 Pro, 16 GB, macOS). The corpus is generated from a fixed seed, so every run ingests the identical 51.2 MB.
- The idle figure is rss_bytes reported by the server's own GET /v1/stats endpoint after boot in the published Docker image.
- The same test runs in CI on every push and fails the build if peak RSS crosses 1 GB.
Reproduce it yourself, from the greatmemory source tree:
cargo test -p gm-cli --release -- --ignored ram_regression --nocapture
Show the raw test output behind these charts
[ram_regression] docs= 50 ingested= 2.6 MB rss= 535.9 MB peak= 535.9 MB
[ram_regression] docs= 100 ingested= 5.1 MB rss= 530.5 MB peak= 535.9 MB
[ram_regression] docs= 150 ingested= 7.7 MB rss= 529.3 MB peak= 535.9 MB
[ram_regression] docs= 200 ingested= 10.2 MB rss= 535.8 MB peak= 535.9 MB
[ram_regression] docs= 250 ingested= 12.8 MB rss= 512.6 MB peak= 535.9 MB
[ram_regression] docs= 300 ingested= 15.4 MB rss= 578.1 MB peak= 578.1 MB
[ram_regression] docs= 350 ingested= 17.9 MB rss= 545.5 MB peak= 578.1 MB
[ram_regression] docs= 400 ingested= 20.5 MB rss= 547.7 MB peak= 578.1 MB
[ram_regression] docs= 450 ingested= 23.1 MB rss= 520.9 MB peak= 578.1 MB
[ram_regression] docs= 500 ingested= 25.6 MB rss= 545.8 MB peak= 578.1 MB
[ram_regression] docs= 550 ingested= 28.2 MB rss= 550.6 MB peak= 578.1 MB
[ram_regression] docs= 600 ingested= 30.7 MB rss= 543.4 MB peak= 578.1 MB
[ram_regression] docs= 650 ingested= 33.3 MB rss= 556.5 MB peak= 578.1 MB
[ram_regression] docs= 700 ingested= 35.9 MB rss= 543.7 MB peak= 578.1 MB
[ram_regression] docs= 750 ingested= 38.4 MB rss= 530.4 MB peak= 578.1 MB
[ram_regression] docs= 800 ingested= 41.0 MB rss= 526.9 MB peak= 578.1 MB
[ram_regression] docs= 850 ingested= 43.6 MB rss= 494.9 MB peak= 578.1 MB
[ram_regression] docs= 900 ingested= 46.1 MB rss= 560.0 MB peak= 578.1 MB
[ram_regression] docs= 950 ingested= 48.7 MB rss= 588.1 MB peak= 588.1 MB
[ram_regression] docs=1000 ingested= 51.2 MB rss= 541.3 MB peak= 588.1 MB
[ram_regression] SUMMARY: ingested 51.2 MB across 1000 docs -> 31000 chunks; peak RSS 588.1 MB (limit 1024 MB)
test ram_regression_50mb_ingest_stays_under_1gb ... ok
test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 2619.96s
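The pass/fail rule behind this output (track the running peak, fail if it crosses the limit) can be sketched as a small check over log lines in the format shown. This is not the contents of ram_regression.rs; `parse_rss_mb` and `check_peak` are hypothetical helpers written only to illustrate the logic.

```rust
/// Extracts the `rss=` value (in MB) from one log line, if present.
fn parse_rss_mb(line: &str) -> Option<f64> {
    // Take whatever follows the first "rss=" and parse its first token.
    let rest = line.split("rss=").nth(1)?;
    rest.split_whitespace().next()?.parse().ok()
}

/// Returns the peak RSS in MB across all parseable lines,
/// or an error message if that peak exceeds `limit_mb`.
fn check_peak(log: &str, limit_mb: f64) -> Result<f64, String> {
    let peak = log
        .lines()
        .filter_map(parse_rss_mb)
        .fold(0.0_f64, f64::max);
    if peak > limit_mb {
        Err(format!("peak RSS {peak} MB exceeds limit {limit_mb} MB"))
    } else {
        Ok(peak)
    }
}
```

Lines without an `rss=` field, such as the SUMMARY line, are skipped rather than treated as errors.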
Everything an agent needs to remember
Hybrid retrieval
Vector similarity + BM25 full-text + extracted facts, fused with reciprocal rank fusion and boosted by recency.
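Reciprocal rank fusion combines several ranked lists by summing 1 / (k + rank) for each item across the lists. The sketch below shows that step only. The constant k = 60 is a commonly used value and an assumption here, `rrf` is a hypothetical function name, and the recency boost is left out because its formula is not described on this page.

```rust
use std::collections::HashMap;

/// RRF smoothing constant; 60 is a common choice, assumed for illustration.
const K: f64 = 60.0;

/// Fuses ranked lists of document ids (best first) into one list,
/// sorted by descending RRF score, with ties broken by id.
fn rrf(rankings: &[Vec<&str>]) -> Vec<(String, f64)> {
    let mut scores: HashMap<String, f64> = HashMap::new();
    for ranking in rankings {
        for (i, id) in ranking.iter().enumerate() {
            // Ranks are 1-based, so the top item contributes 1 / (K + 1).
            *scores.entry(id.to_string()).or_insert(0.0) += 1.0 / (K + (i + 1) as f64);
        }
    }
    let mut fused: Vec<(String, f64)> = scores.into_iter().collect();
    fused.sort_by(|a, b| b.1.total_cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    fused
}
```

Because only ranks enter the score, the three retrievers do not need comparable score scales, which is the usual reason to pick RRF for fusing vector and full-text results.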
Fact distillation
An optional LLM (Ollama or any OpenAI-compatible API) extracts subject–predicate–object facts; contradictions supersede with a full audit trail.
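One way to picture supersession with an audit trail is an append-only list of facts in which an older fact is never deleted, only marked with the id of the fact that replaced it. The sketch below assumes that a contradiction means "same subject and predicate, different object". That rule, and the names `Fact`, `FactStore`, `assert_fact`, and `current`, are illustrative assumptions, not greatmemory's schema.

```rust
#[derive(Debug, Clone, PartialEq)]
struct Fact {
    subject: String,
    predicate: String,
    object: String,
    /// Index of the fact that replaced this one, if any (the audit trail).
    superseded_by: Option<usize>,
}

#[derive(Default)]
struct FactStore {
    facts: Vec<Fact>, // append-only: superseded facts stay in place
}

impl FactStore {
    /// Records a fact and returns its index. Active facts with the same
    /// subject and predicate but a different object are marked superseded.
    fn assert_fact(&mut self, subject: &str, predicate: &str, object: &str) -> usize {
        // Re-asserting an identical active fact is a no-op.
        if let Some(i) = self.facts.iter().position(|f| {
            f.superseded_by.is_none()
                && f.subject == subject
                && f.predicate == predicate
                && f.object == object
        }) {
            return i;
        }
        let new_id = self.facts.len();
        for f in self.facts.iter_mut() {
            if f.superseded_by.is_none() && f.subject == subject && f.predicate == predicate {
                f.superseded_by = Some(new_id);
            }
        }
        self.facts.push(Fact {
            subject: subject.to_string(),
            predicate: predicate.to_string(),
            object: object.to_string(),
            superseded_by: None,
        });
        new_id
    }

    /// Returns the object of the active fact for (subject, predicate).
    fn current(&self, subject: &str, predicate: &str) -> Option<&str> {
        self.facts
            .iter()
            .find(|f| f.superseded_by.is_none() && f.subject == subject && f.predicate == predicate)
            .map(|f| f.object.as_str())
    }
}
```

Following the `superseded_by` links from any old fact leads to the current one, so the history of a changed value can be replayed.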
MCP built in
remember, recall, get_context, get_profile, forget — over stdio for Claude Code, Codex, OpenClaw, and Hermes, or streamable HTTP.
Local-first, server-ready
SQLite by default — the whole store is one file. Postgres + pgvector for servers via a single GM_DB URL.
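Choosing a backend from a single URL could look roughly like the sketch below. Only the `GM_DB` variable name comes from this page. The accepted URL schemes, the default file name, and the `Backend` and `select_backend` names are assumptions made for illustration, so check the project's own documentation for the real formats.

```rust
/// Hypothetical default store path; the real default is not stated here.
const DEFAULT_SQLITE_PATH: &str = "greatmemory.db";

#[derive(Debug, PartialEq)]
enum Backend {
    Sqlite { path: String },
    Postgres { url: String },
}

/// Picks a backend from the value of GM_DB (None when the variable is unset).
/// The URL schemes accepted here are assumed, not taken from greatmemory.
fn select_backend(gm_db: Option<&str>) -> Result<Backend, String> {
    let Some(url) = gm_db else {
        // No URL configured: fall back to the single-file local store.
        return Ok(Backend::Sqlite { path: DEFAULT_SQLITE_PATH.to_string() });
    };
    if url.starts_with("postgres://") || url.starts_with("postgresql://") {
        Ok(Backend::Postgres { url: url.to_string() })
    } else if let Some(path) = url.strip_prefix("sqlite://") {
        Ok(Backend::Sqlite { path: path.to_string() })
    } else {
        Err(format!("unsupported GM_DB value: {url}"))
    }
}
```

A caller would typically pass `std::env::var("GM_DB").ok().as_deref()` so that an unset variable selects the local SQLite default.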
REST + OpenAPI
The spec is generated from the serving code at /v1/openapi.json, so generated clients can never drift.
Private by default
Binds to loopback, no accounts, no cloud. Your memories live in a file you own. Offline after one model download.
Running in sixty seconds
# from the greatmemory source tree
cargo install --path crates/gm-cli
gmem serve # local-first: loopback + SQLite
gmem add "Staging runs Postgres 17 with pgvector."
gmem search "what does staging run?"
claude mcp add greatmemory -- gmem mcp # persistent memory for Claude Code
Prefer containers? docker compose up in the repo gives you the same server in a 158 MB image.
greatmemory is built and maintained by Venkatesh Balakumar at Amzu Information Technology Ltd — born out of watching a 5 MB note collection eat nearly 8 GB of RAM, and deciding the fix was a rewrite in Rust with the memory bound enforced by CI rather than promised in a changelog.