Rust · single binary · bounded memory

greatmemory

A fast, bounded-memory engine for AI agents. Remember, recall, and distill facts — in a single binary that stays in the hundreds of megabytes no matter how much you feed it.

  • 588.1 MB: peak RSS under full ingest load
  • 19 MB: idle server footprint (Docker)
  • 158 MB: container image
  • 1 binary: server, MCP, and CLI

Memory engines have a memory problem

The common pattern in agent-memory tools — local embeddings running in pooled WASM workers — grows multi-gigabyte heaps that are never returned to the operating system. Holding a few megabytes of notes can cost gigabytes of RAM. greatmemory was designed so that bug class cannot exist: one native ONNX model instance, a bounded ingestion queue with backpressure, and a fixed batch size. Peak memory is model weights plus one batch of activations — a constant, independent of how much you ingest. And it is not a promise on a landing page: it is a regression test that runs in CI and fails the build if it ever stops being true.
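The bounded-queue idea can be sketched with the standard library alone. This is a minimal illustration, not greatmemory's actual pipeline: `QUEUE_CAPACITY`, `BATCH_SIZE`, `embed_batch`, `consume`, and `ingest` are hypothetical names and values, and the "model" is a stand-in that returns chunk lengths. What it shows is the shape of the guarantee: the producer blocks when the queue is full, and the worker never holds more than one fixed-size batch.

```rust
use std::sync::mpsc::{sync_channel, Receiver};
use std::thread;

/// Hypothetical limits, chosen only for illustration.
const QUEUE_CAPACITY: usize = 8;
const BATCH_SIZE: usize = 4;

/// Stand-in for the single model instance: one "embedding" per chunk
/// (here just the chunk length).
fn embed_batch(batch: &[String]) -> Vec<usize> {
    batch.iter().map(|chunk| chunk.len()).collect()
}

/// Drains the queue in fixed-size batches.
/// Returns (chunks embedded, largest batch ever held).
fn consume(rx: Receiver<String>) -> (usize, usize) {
    let mut batch: Vec<String> = Vec::with_capacity(BATCH_SIZE);
    let mut embedded = 0;
    let mut largest_batch = 0;
    for chunk in rx {
        batch.push(chunk);
        if batch.len() == BATCH_SIZE {
            largest_batch = largest_batch.max(batch.len());
            embedded += embed_batch(&batch).len();
            batch.clear(); // reuse the same allocation for the next batch
        }
    }
    if !batch.is_empty() {
        // Flush the final partial batch once the channel is closed.
        largest_batch = largest_batch.max(batch.len());
        embedded += embed_batch(&batch).len();
    }
    (embedded, largest_batch)
}

/// Pushes `total_chunks` through a bounded queue to one worker thread.
fn ingest(total_chunks: usize) -> (usize, usize) {
    let (tx, rx) = sync_channel::<String>(QUEUE_CAPACITY);
    let worker = thread::spawn(move || consume(rx));
    for i in 0..total_chunks {
        // `send` blocks while the queue is full: this is the backpressure.
        tx.send(format!("chunk {i}")).expect("worker hung up");
    }
    drop(tx); // closing the channel lets the worker finish
    worker.join().expect("worker panicked")
}
```

In this sketch, calling `ingest(10)` and `ingest(1000)` both report a largest batch of 4: at most `QUEUE_CAPACITY` chunks wait in the queue and at most `BATCH_SIZE` chunks sit in front of the model, however many chunks are fed in.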

Measured, not claimed

Both charts below are drawn from real test runs — no synthetic numbers. The full methodology, raw output, and a one-line command to reproduce them on your own machine are under the charts.

Memory stays flat while data grows

Resident memory while ingesting 51.2 MB across 1,000 documents (31,000 chunks) through the full pipeline — real BGE-small embeddings, real SQLite. RSS at 2.6 MB ingested: ~536 MB. At 51.2 MB: ~541 MB.

Footprint at a glance

Idle, under full load, and the ceiling CI enforces on every push.

Evidence & how to reproduce

Full methodology →
  • The line chart is the verbatim output of crates/gm-cli/tests/ram_regression.rs run on 2026-06-12 (Apple M1 Pro, 16 GB, macOS). The corpus is generated from a fixed seed, so every run ingests the identical 51.2 MB.
  • The idle figure is rss_bytes reported by the server's own GET /v1/stats endpoint after boot in the published Docker image.
  • The same test runs in CI on every push and fails the build if peak RSS crosses 1 GB (the 1024 MB limit reported in the test summary).
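The guard described in the last bullet can be pictured as a running maximum compared against a ceiling. The sketch below is an assumption about the general shape of such a check, not the contents of ram_regression.rs: `PeakTracker`, `peak_of`, and `LIMIT_MB` are hypothetical names, and the RSS samples are passed in as plain numbers rather than read from the operating system.

```rust
/// Ceiling in MB; 1024 matches the "limit 1024 MB" line in the test summary.
const LIMIT_MB: f64 = 1024.0;

/// Hypothetical helper: remembers the highest RSS sample seen so far.
struct PeakTracker {
    peak_mb: f64,
}

impl PeakTracker {
    fn new() -> Self {
        Self { peak_mb: 0.0 }
    }

    /// Records one RSS sample, in MB.
    fn record(&mut self, rss_mb: f64) {
        self.peak_mb = self.peak_mb.max(rss_mb);
    }

    /// Ok(peak) while the peak is within the limit, Err(message) once it is exceeded.
    fn check(&self, limit_mb: f64) -> Result<f64, String> {
        if self.peak_mb <= limit_mb {
            Ok(self.peak_mb)
        } else {
            Err(format!(
                "peak RSS {:.1} MB exceeds limit {:.1} MB",
                self.peak_mb, limit_mb
            ))
        }
    }
}

/// Feeds a series of samples through a fresh tracker.
fn peak_of(samples: &[f64]) -> PeakTracker {
    let mut tracker = PeakTracker::new();
    for &sample in samples {
        tracker.record(sample);
    }
    tracker
}
```

Fed four of the samples from the published run (535.9, 578.1, 588.1, 541.3), `check(LIMIT_MB)` returns `Ok(588.1)`. A test built this way would unwrap that result, so any sample above the ceiling turns into a failed build.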

Reproduce it yourself, from the greatmemory source tree:

cargo test -p gm-cli --release -- --ignored ram_regression --nocapture
Raw test output behind these charts:
[ram_regression] docs=  50  ingested=  2.6 MB  rss= 535.9 MB  peak= 535.9 MB
[ram_regression] docs= 100  ingested=  5.1 MB  rss= 530.5 MB  peak= 535.9 MB
[ram_regression] docs= 150  ingested=  7.7 MB  rss= 529.3 MB  peak= 535.9 MB
[ram_regression] docs= 200  ingested= 10.2 MB  rss= 535.8 MB  peak= 535.9 MB
[ram_regression] docs= 250  ingested= 12.8 MB  rss= 512.6 MB  peak= 535.9 MB
[ram_regression] docs= 300  ingested= 15.4 MB  rss= 578.1 MB  peak= 578.1 MB
[ram_regression] docs= 350  ingested= 17.9 MB  rss= 545.5 MB  peak= 578.1 MB
[ram_regression] docs= 400  ingested= 20.5 MB  rss= 547.7 MB  peak= 578.1 MB
[ram_regression] docs= 450  ingested= 23.1 MB  rss= 520.9 MB  peak= 578.1 MB
[ram_regression] docs= 500  ingested= 25.6 MB  rss= 545.8 MB  peak= 578.1 MB
[ram_regression] docs= 550  ingested= 28.2 MB  rss= 550.6 MB  peak= 578.1 MB
[ram_regression] docs= 600  ingested= 30.7 MB  rss= 543.4 MB  peak= 578.1 MB
[ram_regression] docs= 650  ingested= 33.3 MB  rss= 556.5 MB  peak= 578.1 MB
[ram_regression] docs= 700  ingested= 35.9 MB  rss= 543.7 MB  peak= 578.1 MB
[ram_regression] docs= 750  ingested= 38.4 MB  rss= 530.4 MB  peak= 578.1 MB
[ram_regression] docs= 800  ingested= 41.0 MB  rss= 526.9 MB  peak= 578.1 MB
[ram_regression] docs= 850  ingested= 43.6 MB  rss= 494.9 MB  peak= 578.1 MB
[ram_regression] docs= 900  ingested= 46.1 MB  rss= 560.0 MB  peak= 578.1 MB
[ram_regression] docs= 950  ingested= 48.7 MB  rss= 588.1 MB  peak= 588.1 MB
[ram_regression] docs=1000  ingested= 51.2 MB  rss= 541.3 MB  peak= 588.1 MB
[ram_regression] SUMMARY: ingested 51.2 MB across 1000 docs -> 31000 chunks; peak RSS 588.1 MB (limit 1024 MB)
test ram_regression_50mb_ingest_stays_under_1gb ... ok
test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 2619.96s
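To check the flatness claim against your own run, the progress lines can be parsed with a few lines of standard-library Rust. This is a rough sketch that assumes the `key= value MB` layout shown in the output above; `field` and `rss_range` are hypothetical helpers, not part of greatmemory.

```rust
/// Extracts the number that follows `key=` on a progress line, if present.
fn field(line: &str, key: &str) -> Option<f64> {
    let needle = format!("{key}=");
    let start = line.find(&needle)? + needle.len();
    // Values are right-aligned, so skip the padding after the '='.
    let rest = line[start..].trim_start();
    let end = rest
        .find(|c: char| !(c.is_ascii_digit() || c == '.'))
        .unwrap_or(rest.len());
    rest[..end].parse().ok()
}

/// Returns (lowest, highest) `rss=` sample across all lines of a log.
/// Lines without an `rss=` field, such as the summary, are skipped.
fn rss_range(log: &str) -> Option<(f64, f64)> {
    let samples: Vec<f64> = log
        .lines()
        .filter_map(|line| field(line, "rss"))
        .collect();
    if samples.is_empty() {
        return None;
    }
    let min = samples.iter().copied().fold(f64::INFINITY, f64::min);
    let max = samples.iter().copied().fold(f64::NEG_INFINITY, f64::max);
    Some((min, max))
}
```

Run over the output above, `rss_range` gives a low of 494.9 MB and a high of 588.1 MB, so resident memory moves within a band of roughly 93 MB while the ingested volume grows from 2.6 MB to 51.2 MB.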

Everything an agent needs to remember

Hybrid retrieval

Vector similarity + BM25 full-text + extracted facts, fused with reciprocal rank fusion and boosted by recency.
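Reciprocal rank fusion is simple enough to sketch in full. The version below is illustrative only: the constant k = 60 is the value conventionally used for RRF, not a confirmed greatmemory setting, and the recency boost (exponential decay with a half-life, applied as a multiplier) is an assumed form, since the page does not say how recency is weighted. `rrf`, `recency_boost`, and `fuse` are hypothetical names.

```rust
use std::collections::HashMap;

/// Conventional RRF smoothing constant (an assumption, not greatmemory's setting).
const RRF_K: f64 = 60.0;

/// Reciprocal rank fusion: each list contributes 1 / (k + rank), rank starting at 1.
fn rrf(rankings: &[Vec<&str>]) -> HashMap<String, f64> {
    let mut scores = HashMap::new();
    for ranking in rankings {
        for (idx, id) in ranking.iter().enumerate() {
            *scores.entry(id.to_string()).or_insert(0.0) += 1.0 / (RRF_K + idx as f64 + 1.0);
        }
    }
    scores
}

/// Hypothetical recency boost: 1.0 for a brand-new item, halving every `half_life_days`.
fn recency_boost(age_days: f64, half_life_days: f64) -> f64 {
    0.5_f64.powf(age_days / half_life_days)
}

/// Fuses the ranked lists, applies the boost, and sorts best-first.
/// Ids missing from `age_days` are treated as brand new; ties break by id.
fn fuse(
    rankings: &[Vec<&str>],
    age_days: &HashMap<&str, f64>,
    half_life_days: f64,
) -> Vec<(String, f64)> {
    let mut fused: Vec<(String, f64)> = rrf(rankings)
        .into_iter()
        .map(|(id, score)| {
            let age = age_days.get(id.as_str()).copied().unwrap_or(0.0);
            (id, score * (1.0 + recency_boost(age, half_life_days)))
        })
        .collect();
    fused.sort_by(|a, b| b.1.total_cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    fused
}
```

Each ranked list would come from one retriever (vector search, BM25, extracted facts). Because RRF uses only rank positions, the three sources need no score normalization before they are combined.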

Fact distillation

An optional LLM (Ollama or any OpenAI-compatible API) extracts subject–predicate–object facts; contradictions supersede with a full audit trail.
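One way to picture "contradictions supersede with a full audit trail" is an append-only list in which an older fact is marked as replaced rather than deleted. The sketch below is an assumption about the idea, not greatmemory's schema: `Fact`, `FactStore`, and their methods are hypothetical, and it treats every predicate as single-valued, so any different object for the same subject and predicate counts as a contradiction.

```rust
/// A subject-predicate-object fact. `superseded_by` points at the newer
/// fact that replaced it, or is None while the fact is current.
#[derive(Debug, Clone, PartialEq)]
struct Fact {
    subject: String,
    predicate: String,
    object: String,
    superseded_by: Option<usize>,
}

/// Append-only store: nothing is deleted, so the list doubles as the audit trail.
#[derive(Default)]
struct FactStore {
    facts: Vec<Fact>,
}

impl FactStore {
    /// Records a fact and returns its index. A contradicting current fact is
    /// marked as superseded; restating the current fact changes nothing.
    fn assert_fact(&mut self, subject: &str, predicate: &str, object: &str) -> usize {
        let active = self.facts.iter().position(|f| {
            f.superseded_by.is_none() && f.subject == subject && f.predicate == predicate
        });
        if let Some(id) = active {
            if self.facts[id].object == object {
                return id; // same fact restated
            }
        }
        let new_id = self.facts.len();
        if let Some(id) = active {
            self.facts[id].superseded_by = Some(new_id);
        }
        self.facts.push(Fact {
            subject: subject.to_string(),
            predicate: predicate.to_string(),
            object: object.to_string(),
            superseded_by: None,
        });
        new_id
    }

    /// The object of the current (non-superseded) fact, if any.
    fn current(&self, subject: &str, predicate: &str) -> Option<&str> {
        self.facts
            .iter()
            .find(|f| f.superseded_by.is_none() && f.subject == subject && f.predicate == predicate)
            .map(|f| f.object.as_str())
    }

    /// Every object ever recorded for this subject and predicate, oldest first.
    fn history(&self, subject: &str, predicate: &str) -> Vec<&str> {
        self.facts
            .iter()
            .filter(|f| f.subject == subject && f.predicate == predicate)
            .map(|f| f.object.as_str())
            .collect()
    }
}
```

For example, asserting ("staging", "runs", "Postgres 16") and later ("staging", "runs", "Postgres 17") leaves `current` returning "Postgres 17", while `history` still lists both values in order.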

MCP built in

remember, recall, get_context, get_profile, forget — over stdio for Claude Code, Codex, OpenClaw, and Hermes, or streamable HTTP.

Local-first, server-ready

SQLite by default — the whole store is one file. Postgres + pgvector for servers via a single GM_DB URL.

REST + OpenAPI

The spec served at /v1/openapi.json is generated from the serving code itself, so clients generated from it cannot drift from the API.

Private by default

Binds to loopback, no accounts, no cloud. Your memories live in a file you own. Offline after one model download.

Running in sixty seconds

# from the greatmemory source tree
cargo install --path crates/gm-cli

gmem serve                      # local-first: loopback + SQLite
gmem add "Staging runs Postgres 17 with pgvector."
gmem search "what does staging run?"

claude mcp add greatmemory -- gmem mcp   # persistent memory for Claude Code

Prefer containers? docker compose up in the repo gives you the same server in a 158 MB image.

greatmemory is built and maintained by Venkatesh Balakumar at Amzu Information Technology Ltd — born out of watching a 5 MB note collection eat nearly 8 GB of RAM, and deciding the fix was a rewrite in Rust with the memory bound enforced by CI rather than promised in a changelog.