greatmemory is a layered Rust workspace compiled into one binary. Three interfaces — REST, MCP, CLI — sit on one engine, and the engine talks to storage through ports, so SQLite and Postgres are interchangeable behind the same behavior.
The layers
        ┌───────────────────────────────────────────┐
        │  gm-cli — one binary: serve │ mcp │ add   │
        │           search │ facts │ status         │
        └─────────────────────┬─────────────────────┘
        ┌─────────────────────┴─────────────────────┐
        │  gm-server — axum HTTP API (/v1) + MCP    │
        │              (stdio & streamable HTTP)    │
        └──┬──────────────┬──────────────────┬──────┘
┌──────────┴───┐  ┌───────┴──────┐  ┌────────┴────────┐
│ gm-ingest    │  │ gm-retrieve  │  │ gm-llm          │
│ chunk, embed,│  │ vector+BM25+ │  │ fact extraction │
│ extract facts│  │ facts, RRF   │  │ (ollama/openai) │
└──────┬───────┘  └───────┬──────┘  └─────────────────┘
┌──────┴───────┐  ┌───────┴──────────────────────────┐
│ gm-embed     │  │ gm-core — domain types + ports   │
│ fastembed /  │  │  (Storage, Embedder, LlmProvider)│
│ ollama/openai│  └───────┬──────────────────────────┘
└──────────────┘  ┌───────┴──────────────────────────┐
                  │ gm-storage-sqlite │ gm-storage-  │
                  │ (sqlite-vec+FTS5) │ postgres     │
                  │                   │ (pgvector)   │
                  └──────────────────────────────────┘
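The port boundary can be illustrated with a small sketch. It is written in Python for brevity; the real ports are Rust traits in gm-core whose signatures are not shown here, so the method names `put` and `search_text` and the `InMemoryStorage` adapter are hypothetical.

```python
from typing import Protocol


class Storage(Protocol):
    """Hypothetical port: the engine depends only on this interface."""

    def put(self, space: str, doc_id: str, text: str) -> None: ...
    def search_text(self, space: str, query: str) -> list[str]: ...


class InMemoryStorage:
    """Illustrative adapter; a SQLite or Postgres adapter would expose the same methods."""

    def __init__(self) -> None:
        self._docs: dict[str, dict[str, str]] = {}

    def put(self, space: str, doc_id: str, text: str) -> None:
        self._docs.setdefault(space, {})[doc_id] = text

    def search_text(self, space: str, query: str) -> list[str]:
        # Naive case-insensitive substring match, scoped to one space.
        needle = query.lower()
        return [doc_id for doc_id, text in self._docs.get(space, {}).items()
                if needle in text.lower()]


class Engine:
    """Written against the port, never against a concrete backend."""

    def __init__(self, storage: Storage) -> None:
        self.storage = storage

    def add(self, space: str, doc_id: str, text: str) -> None:
        self.storage.put(space, doc_id, text)

    def search(self, space: str, query: str) -> list[str]:
        return self.storage.search_text(space, query)
```

Swapping backends then means constructing `Engine` with a different adapter; nothing above the port changes.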
The write path
Adds return immediately; everything expensive happens behind a bounded queue.
POST /v1/memories │ remember │ gmem add
       │
       ▼
┌─────────────┐    ┌──────────────┐    ┌─────────────────────┐
│ store raw   │───▶│ chunk        │───▶│ embed               │
│ document,   │    │ ~400 tokens, │    │ bounded queue (16), │
│ reply 202   │    │ 15% overlap  │    │ fixed batches (32), │
└─────────────┘    └──────────────┘    │ ONE model instance  │
                                       └──────────┬──────────┘
                                                  ▼
┌──────────────────────────┐    ┌────────────────────────────┐
│ async fact extraction    │◀───│ store chunks + vectors     │
│ (only if GM_LLM is set)  │    │ (sqlite-vec / pgvector     │
│ subject/predicate/object │    │  + full-text index)        │
└────────────┬─────────────┘    └────────────────────────────┘
             ▼
┌──────────────────────────────────────────┐
│ contradiction supersession:              │
│   same subject+predicate, new object     │
│   → old fact marked superseded (audited) │
└──────────────────────────────────────────┘
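The supersession rule in the last box can be sketched as follows. This is an illustration of the stated rule only (same subject and predicate, new object), not greatmemory's implementation; the `Fact` fields, the `add_fact` helper, and the audit-log format are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Fact:
    id: int
    subject: str
    predicate: str
    obj: str
    superseded_by: Optional[int] = None  # id of the newer fact, if any


def add_fact(facts: list[Fact], audit: list[str], new: Fact) -> None:
    """Append `new`, superseding live facts that share its subject+predicate but differ in object."""
    for old in facts:
        same_key = (old.subject, old.predicate) == (new.subject, new.predicate)
        # Restating the same object is not a contradiction, so nothing is superseded.
        if same_key and old.superseded_by is None and old.obj != new.obj:
            old.superseded_by = new.id
            audit.append(f"fact {old.id} superseded by {new.id}")
    facts.append(new)
```

The old fact is marked rather than deleted, which is what keeps the change auditable.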
The bounded queue and fixed batch size are why peak memory is a constant — model weights plus one batch of activations — regardless of how much you ingest. Fact extraction is fully asynchronous: a slow or offline LLM never blocks adds.
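To make the numbers concrete, here is a minimal Python sketch of overlapping chunking, fixed-size batching, and a bounded queue. The constants come from the diagram; the function names, the token representation (a list of strings), and the use of Python's `queue.Queue` are assumptions for illustration, not the actual gm-ingest code.

```python
import queue
from typing import Iterator

CHUNK_TOKENS = 400     # ~400 tokens per chunk
OVERLAP_PERCENT = 15   # 15% overlap between neighbouring chunks
QUEUE_CAPACITY = 16    # bounded queue size
BATCH_SIZE = 32        # embedding batch size


def chunk(tokens: list[str], size: int = CHUNK_TOKENS,
          overlap_percent: int = OVERLAP_PERCENT) -> list[list[str]]:
    """Split tokens into windows of `size`; each window repeats the tail of the previous one."""
    overlap_tokens = size * overlap_percent // 100  # 60 with the defaults
    step = max(1, size - overlap_tokens)            # 340 with the defaults
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks


def batches(items: list, size: int = BATCH_SIZE) -> Iterator[list]:
    """Yield batches of at most `size` items, so the embedder holds one batch at a time."""
    for i in range(0, len(items), size):
        yield items[i:i + size]


# A bounded queue applies back-pressure: once it is full, producers wait (or are
# refused) instead of letting pending work grow without limit.
pending: "queue.Queue[list]" = queue.Queue(maxsize=QUEUE_CAPACITY)
```

With these limits, the work in flight is capped at 16 queued items plus one batch of 32 in the model, however large the ingest.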
The read path
One query fans out to three retrieval lanes, which are fused and assembled.
POST /v1/search │ recall │ get_context
       │
       ├──────────────────┬──────────────────┐
       ▼                  ▼                  ▼
┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│ vector lane │    │ BM25 lane   │    │ fact lane   │
│ embedding   │    │ full-text   │    │ extracted   │
│ similarity  │    │ keyword     │    │ facts       │
└──────┬──────┘    └──────┬──────┘    └──────┬──────┘
       └──────────────────┼──────────────────┘
                          ▼
             ┌─────────────────────────┐
             │ reciprocal rank fusion  │
             └────────────┬────────────┘
                          ▼
             ┌─────────────────────────┐
             │      recency boost      │
             └────────────┬────────────┘
                          ▼
    mode: recall ──▶ scored chunks + facts (JSON)
    mode: context ─▶ assembled block: facts first,
                     then chunks, trimmed to the
                     token budget, with citations
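The fusion and assembly stages can be sketched in a few lines of Python. Reciprocal rank fusion is shown in its usual form (each lane adds 1 / (k + rank)); the constant k = 60, the shape of the recency boost, the whitespace token count, and the citation format are all assumptions, since this page does not specify them.

```python
from collections import defaultdict

RRF_K = 60  # commonly used constant; greatmemory's actual value is not stated here


def rrf(lanes: list[list[str]], k: int = RRF_K) -> dict[str, float]:
    """Fuse ranked id lists: each lane contributes 1 / (k + rank), rank starting at 1."""
    scores: dict[str, float] = defaultdict(float)
    for lane in lanes:
        for rank, item_id in enumerate(lane, start=1):
            scores[item_id] += 1.0 / (k + rank)
    return dict(scores)


def recency_boost(score: float, age_days: float, half_life_days: float = 30.0,
                  weight: float = 0.2) -> float:
    """Hypothetical boost: up to +20% for new items, halving every `half_life_days`."""
    return score * (1.0 + weight * 0.5 ** (age_days / half_life_days))


def assemble_context(facts: list[str], chunks: list[tuple[str, str]], budget: int) -> str:
    """Facts first, then (source_id, text) chunks with citations, stopping at the budget."""
    candidates = [f"FACT: {f}" for f in facts]
    candidates += [f"{text} [{src}]" for src, text in chunks]
    lines, used = [], 0
    for line in candidates:
        cost = len(line.split())  # crude whitespace token count, for illustration only
        if used + cost > budget:
            break
        lines.append(line)
        used += cost
    return "\n".join(lines)
```

Because RRF works on ranks rather than raw scores, the three lanes need no score normalization before they are combined.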
Integrating greatmemory
Into existing services
Run greatmemory as a sidecar container (or one shared instance per environment) and call the REST API — three endpoints cover the whole loop: POST /v1/memories, POST /v1/search, GET /v1/profile. Working patterns:
- One space per service (space: "billing", space: "support-bot"), so each service's memories stay isolated while sharing one engine and one database.
- One API key per service via GM_API_KEYS=gm_billing_...,gm_support_... so that rotating one consumer never touches the others. (Keys authenticate; they don't scope spaces in v0.1. See the multi-tenant notes below.)
- Generated clients: point your OpenAPI generator at GET /v1/openapi.json; the spec is generated from the serving code, so it cannot drift.
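As a rough sketch of the per-service pattern, the snippet below builds (without sending) a POST /v1/memories request using only the Python standard library. The base URL, the JSON field names `space` and `content`, the Bearer authorization scheme, and the placeholder key are assumptions; check the OpenAPI spec for the real request shape.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080"  # assumed sidecar address


def add_memory_request(api_key: str, space: str, text: str) -> urllib.request.Request:
    """Build (but do not send) a POST /v1/memories request for one service's space."""
    # Field names are assumed for illustration.
    body = json.dumps({"space": space, "content": text}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/v1/memories",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",  # auth scheme assumed
        },
    )


# Placeholder key and space for a hypothetical billing service.
req = add_memory_request("gm_billing_example", "billing",
                         "Customer 17 prefers annual invoices")
# urllib.request.urlopen(req) would send it; per the write path, expect a 202 reply.
```

Keeping the key and space as per-service configuration means the same helper serves every consumer.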
Into new services and agents
- REST, SDK-less: the API is small enough to call with fetch/requests directly; copy-paste clients are in the custom agents guide.
- Put get_context in your prompt-assembly step: before each model call, POST /v1/search with mode: "context" and a token budget, and inject the returned block into the system prompt. Facts come first in the block, so the model sees distilled truth before raw memories.
- Profile at session start: fetch GET /v1/profile once when a session opens and pin it as standing context; use get_context per question after that.
- MCP-native agents skip HTTP entirely: register gmem mcp (stdio) and let the agent call remember/recall/get_context itself. See the MCP reference.
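The prompt-assembly pattern might look like the sketch below. The HTTP call is injected as a `post` callable so the example stays self-contained; the payload field names `query` and `token_budget` and the `context` response field are assumptions, not the documented schema.

```python
from typing import Callable

# A transport takes (path, payload) and returns the decoded JSON reply.
Transport = Callable[[str, dict], dict]


def build_system_prompt(post: Transport, space: str, question: str,
                        base_prompt: str, token_budget: int = 800) -> str:
    """Fetch a context block for `question` and append it to the base system prompt."""
    reply = post("/v1/search", {
        "space": space,
        "query": question,              # field name assumed
        "mode": "context",
        "token_budget": token_budget,   # field name assumed
    })
    context = reply.get("context", "")  # response field assumed
    if not context:
        return base_prompt  # nothing relevant stored yet
    return f"{base_prompt}\n\nRelevant memory:\n{context}"
```

In a real agent, `post` would wrap your HTTP client of choice; in tests it can be a stub that returns a canned block.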
Multi-tenant
- One space per user or tenant (space: "user-42") is the isolation mechanism: memories, search results, facts, and profiles are all space-scoped, so retrieval can never mix tenants, as long as your backend sets the space on every call.
- Key scoping note: in v0.1, any valid API key can read any space. Spaces isolate data, keys authenticate callers, and the mapping from authenticated user to space is your backend's job. Don't hand end users a server key; proxy their requests through your service, which injects the right space. Per-key space scoping arrives with the key-management work in a later release.
- For many tenants on one instance, prefer Postgres + pgvector (GM_DB) over SQLite, and watch rss_bytes on /v1/stats: bounded memory means it stays flat as tenants and data grow.
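Since keys do not scope spaces in v0.1, the backend proxy is where isolation is enforced. A minimal sketch, assuming a hypothetical `proxy_payload` helper in your own service and the "user-42" naming shown above:

```python
def space_for(user_id: str) -> str:
    """Map an authenticated user to their space (naming follows the "user-42" example)."""
    return f"user-{user_id}"


def proxy_payload(authenticated_user_id: str, client_payload: dict) -> dict:
    """Return the payload to forward upstream, with the space set by the backend.

    Any `space` supplied by the client is discarded, so a user cannot read
    another tenant's memories by editing the request.
    """
    forwarded = {k: v for k, v in client_payload.items() if k != "space"}
    forwarded["space"] = space_for(authenticated_user_id)
    return forwarded
```

The user id must come from your own session or token validation, never from the request body.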