Architecture & integration — greatmemory docs

greatmemory is a layered Rust workspace compiled into one binary. Three interfaces — REST, MCP, CLI — sit on one engine, and the engine talks to storage through ports, so SQLite and Postgres are interchangeable behind the same behavior.

The layers

                ┌───────────────────────────────────────────┐
                │   gm-cli — one binary: serve │ mcp │ add  │
                │            search │ facts │ status        │
                └─────────────────────┬─────────────────────┘
                ┌─────────────────────┴─────────────────────┐
                │   gm-server — axum HTTP API (/v1) + MCP   │
                │        (stdio & streamable HTTP)          │
                └──┬──────────────┬──────────────────┬──────┘
        ┌──────────┴───┐  ┌───────┴──────┐  ┌────────┴────────┐
        │  gm-ingest   │  │ gm-retrieve  │  │     gm-llm      │
        │ chunk, embed,│  │ vector+BM25+ │  │ fact extraction │
        │ extract facts│  │ facts, RRF   │  │ (ollama/openai) │
        └──────┬───────┘  └───────┬──────┘  └─────────────────┘
        ┌──────┴───────┐  ┌───────┴──────────────────────────┐
        │   gm-embed   │  │  gm-core — domain types + ports  │
        │ fastembed /  │  │  (Storage, Embedder, LlmProvider)│
        │ ollama/openai│  └───────┬──────────────────────────┘
        └──────────────┘  ┌───────┴──────────────────────────┐
                          │ gm-storage-sqlite │ gm-storage-  │
                          │ (sqlite-vec+FTS5) │ postgres     │
                          │                   │ (pgvector)   │
                          └──────────────────────────────────┘

The write path

Adds return immediately; everything expensive happens behind a bounded queue.

 POST /v1/memories │ remember │ gmem add
        │
        ▼
 ┌─────────────┐    ┌──────────────┐    ┌─────────────────────┐
 │ store raw   │───▶│ chunk        │───▶│ embed               │
 │ document,   │    │ ~400 tokens, │    │ bounded queue (16), │
 │ reply 202   │    │ 15% overlap  │    │ fixed batches (32), │
 └─────────────┘    └──────────────┘    │ ONE model instance  │
                                        └──────────┬──────────┘
                                                   ▼
 ┌──────────────────────────┐    ┌────────────────────────────┐
 │ async fact extraction    │◀───│ store chunks + vectors     │
 │ (only if GM_LLM is set)  │    │ (sqlite-vec / pgvector     │
 │ subject/predicate/object │    │  + full-text index)        │
 └────────────┬─────────────┘    └────────────────────────────┘
              ▼
 ┌──────────────────────────────────────────┐
 │ contradiction supersession:              │
 │ same subject+predicate, new object       │
 │ → old fact marked superseded (audited)   │
 └──────────────────────────────────────────┘

The bounded queue and fixed batch size are why peak memory is a constant — model weights plus one batch of activations — regardless of how much you ingest. Fact extraction is fully asynchronous: a slow or offline LLM never blocks adds.

The read path

One query fans out to three retrieval lanes, which are fused and assembled.

 POST /v1/search │ recall │ get_context
        │
        ├──────────────────┬──────────────────┐
        ▼                  ▼                  ▼
 ┌─────────────┐    ┌─────────────┐    ┌─────────────┐
 │ vector lane │    │  BM25 lane  │    │  fact lane  │
 │ embedding   │    │ full-text   │    │ extracted   │
 │ similarity  │    │ keyword     │    │ facts       │
 └──────┬──────┘    └──────┬──────┘    └──────┬──────┘
        └──────────────────┼──────────────────┘
                           ▼
              ┌─────────────────────────┐
              │ reciprocal rank fusion  │
              └────────────┬────────────┘
                           ▼
              ┌─────────────────────────┐
              │     recency boost       │
              └────────────┬────────────┘
                           ▼
        mode: recall ──▶ scored chunks + facts (JSON)
        mode: context ─▶ assembled block: facts first,
                         then chunks, trimmed to the
                         token budget, with citations

Integrating greatmemory

Into existing services

Run greatmemory as a sidecar container (or one shared instance per environment) and call the REST API — three endpoints cover the whole loop: POST /v1/memories, POST /v1/search, GET /v1/profile. Working patterns:

One space per service (space: "billing", space: "support-bot"), so each service's memories stay isolated while sharing one engine and one database.
One API key per service via GM_API_KEYS=gm_billing_...,gm_support_... — rotating one consumer never touches the others. (Keys authenticate; they don't scope spaces in v0.1 — see multi-tenant notes below.)
Generated clients: point your OpenAPI generator at GET /v1/openapi.json; the spec is generated from the serving code, so it cannot drift.

Into new services and agents

REST, SDK-less: the API is small enough to call with fetch/requests directly — copy-paste clients in the custom agents guide.
Put get_context in your prompt-assembly step: before each model call, POST /v1/search with mode: "context" and a token budget, and inject the returned block into the system prompt. Facts come first in the block, so the model sees distilled truth before raw memories.
Profile at session start: fetch GET /v1/profile once when a session opens and pin it as standing context; use get_context per question after that.
MCP-native agents skip HTTP entirely: register gmem mcp (stdio) and let the agent call remember / recall / get_context itself — see the MCP reference.

Multi-tenant

One space per user or tenant (space: "user-42") is the isolation mechanism: memories, search results, facts, and profiles are all space-scoped, so retrieval can never mix tenants — as long as your backend sets the space on every call.
Key scoping note: in v0.1, any valid API key can read any space — spaces isolate data, keys authenticate callers, and the mapping from authenticated user to space is your backend's job. Don't hand end users a server key; proxy their requests through your service, which injects the right space. Per-key space scoping arrives with the key-management work in a later release.
For many tenants on one instance, prefer Postgres + pgvector (GM_DB) over SQLite, and watch rss_bytes on /v1/stats — bounded memory means it stays flat as tenants and data grow.