An AI agent without durable memory is a brilliant intern with amnesia. It ships a clean fix on Monday, forgets the codebase by Tuesday, and re-introduces the same regression on Wednesday. At CodeCourier we kept hitting that wall - until we stopped treating memory as a feature and started treating it as the core substrate. This post is the engineering deep dive on how our Contexts layer works: the embedding strategy, the hybrid retrieval pipeline, the scoping model, the eval framework, and the surprising lessons we collected along the way.
I am writing this builder-to-builder. If you are designing AI agent memory for a coding agent, a support agent, or an internal copilot, the same architectural questions will ambush you. We will name them, and show the answers we shipped to production.
1. What is durable context for an AI agent?
Durable agent context is the persistent, retrievable, scoped knowledge that survives across agent runs and gets selectively injected into a model's context window at inference time. Unlike the conversation buffer (which dies when the session ends) and unlike fine-tuning (which freezes knowledge into weights), durable context is hot, editable, addressable state that an agent can read, cite, and update without retraining.
At CodeCourier, a Context is the unit of that durable memory. Concretely, it is a versioned bundle of fragments - paragraphs, code samples, ADRs, runbooks, past Issue Session summaries, security policies - addressable by ID, scoped to an organization, repository, persona, or session, and indexed for hybrid retrieval (lexical plus semantic, with a reranker). When a new Issue Session spins up inside a sandbox, the right Contexts are selected, ranked, compressed, and pinned to the prompt automatically.
The shorthand we use internally: weights are for skills, context is for facts. If the knowledge changes faster than your next training run, it belongs in the context layer.
2. Why naive RAG breaks for engineering agents
The naive recipe - chunk everything, embed it, store vectors, do top-K cosine search, append to prompt - works fine for a marketing-doc chatbot. It collapses fast for an engineering agent. Five failure modes we measured in the wild:
- Semantic drift on code-adjacent queries. Dense embeddings rate "authentication" and "authorization" as near-twins. The agent retrieves the wrong policy and confidently writes a broken middleware. In one golden set, naive cosine retrieval scored recall@10 = 0.61 on ambiguous auth queries; our hybrid pipeline lifted it to 0.87.
- Chunk boundaries shred semantics. A 512-token chunk slices through a function body, leaving the agent with the second half of a SQL transaction and no idea what was being rolled back.
- Top-K is opaque. When an agent hallucinates, you need a postmortem. Pure vector search leaves no audit trail beyond "these eight vectors were closest" - which is unfalsifiable and unfixable.
- No scope safety. One tenant's embeddings leak into another tenant's retrieval because the index is global. The CISO finds out. You have a bad week.
- Recency rot. The embedding for a six-month-old ADR is just as "close" as the one written yesterday that superseded it. Without freshness signals, the agent cites the corpse.
Naive RAG is a demo. A real RAG architecture for agents needs hybrid retrieval, structured scoping, freshness, citations, and an eval loop that catches regressions before they ship.
3. Our architecture at a glance
Picture three vertical lanes. On the left, an ingestion lane takes in source documents - markdown, code, transcripts, prior agent outputs - and emits normalized fragments. In the middle, an index lane writes those fragments into a BM25 inverted index and a dense vector index in parallel, with shared fragment IDs. On the right, a retrieval lane serves agent queries: it fans out lexical and dense searches, fuses the candidates, reranks them with a cross-encoder, and returns a small, citable bundle that fits the prompt budget.
End to end, the pipeline runs in seven ordered stages:
- Normalize - strip noise, canonicalize whitespace, detect language, attach source metadata (repo, path, author, timestamp).
- Split - code-aware chunking with overlap; we use AST boundaries for code, heading boundaries for prose.
- Embed - generate dense vectors per fragment; also produce a sparse lexical signature.
- Index - write to dense store (HNSW) and BM25 shard, keyed by fragment ID, with scope tags.
- Retrieve - for each query, run BM25 top-100 and dense top-100 in parallel.
- Fuse and rerank - Reciprocal Rank Fusion merges, then a cross-encoder reranker scores the top-50.
- Compose - apply scope filters, freshness boosts, citation graph, manual pins; compress and pin to the agent prompt.
Every stage is independently observable. Every fragment that ends up in a prompt is traceable back to the source document, the chunker version, the embedding model version, and the rerank score. That auditability is what makes the system debuggable in production - and it is what we wish naive RAG implementations had on day one.
4. Embedding strategy - model, chunking, dedup
Three decisions dominate embedding quality: which model, how to split, and how to dedup.
Model choice
We use a 1024-dimensional general-purpose embedding for prose and a code-specialized embedding for source files, stored in separate namespaces. Mixing modalities into one namespace degraded recall by roughly 9 points on our internal eval; the code embedding pulls function-shaped content closer in its own space, while the prose model handles ADRs and policies better. We re-embed on every model upgrade and keep the prior generation live during a 14-day shadow window.
Code-aware chunking
Generic 512-token windows shred functions. We split code along AST boundaries - function, class, top-level statement - and attach a header carrying the file path and enclosing symbol. Prose splits on heading hierarchy. Both modes keep a 64-token overlap to preserve cross-boundary semantics. Median fragment size lands at 320 tokens; p95 at 780.
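As a rough illustration of AST-boundary splitting, here is a minimal sketch for Python source using the standard `ast` module. The `Fragment` type, the `chunk_python_source` name, and the `path::symbol` header format are hypothetical, not the production chunker; the sketch also skips the 64-token overlap and emits every top-level statement (imports included) as its own fragment.

```python
import ast
from dataclasses import dataclass


@dataclass
class Fragment:
    header: str  # file path plus enclosing symbol (hypothetical format)
    body: str    # exact source text of the node, never paraphrased


def chunk_python_source(path: str, source: str) -> list[Fragment]:
    """Emit one fragment per top-level statement of a Python module."""
    tree = ast.parse(source)
    lines = source.splitlines()
    fragments = []
    for node in tree.body:
        # Functions and classes carry a name; other statements do not.
        symbol = getattr(node, "name", "<module>")
        # Start at the first decorator, if any, so it stays with its function.
        decorators = getattr(node, "decorator_list", [])
        start = min([node.lineno] + [d.lineno for d in decorators])
        body = "\n".join(lines[start - 1 : node.end_lineno])
        fragments.append(Fragment(header=f"{path}::{symbol}", body=body))
    return fragments
```

Because each fragment is a whole syntactic unit, a function body is never cut in half, which is the failure the fixed-window splitter produces.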
Dedup and canonical fragments
Engineering corpora are full of near-duplicates: the same code sample pasted into three runbooks, the same paragraph mirrored from a wiki into a README. We compute a SimHash signature per fragment and collapse near-duplicates (Hamming distance ≤ 3) to a single canonical fragment with multiple source pointers. This trimmed our index by 34% on a representative tenant and improved reranker precision because the cross-encoder no longer wastes capacity on twins.
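A minimal sketch of the dedup step, assuming a 64-bit SimHash over lowercase word tokens. The tokenizer, the BLAKE2b token hash, and the function names are illustrative assumptions; only the Hamming threshold of 3 comes from the text above.

```python
import hashlib
import re


def simhash(text: str, bits: int = 64) -> int:
    """SimHash over lowercase word tokens (bits must be <= 64 here)."""
    counts = [0] * bits
    for token in re.findall(r"\w+", text.lower()):
        digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        for i in range(bits):
            # Each token votes +1 or -1 on every bit position.
            counts[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if counts[i] > 0)


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two signatures."""
    return bin(a ^ b).count("1")


def collapse_near_duplicates(
    fragments: dict[str, str], max_distance: int = 3
) -> dict[str, list[str]]:
    """Map each canonical fragment ID to every ID collapsed into it."""
    canonical: dict[str, int] = {}
    groups: dict[str, list[str]] = {}
    for frag_id, text in fragments.items():
        sig = simhash(text)
        match = next(
            (cid for cid, csig in canonical.items() if hamming(sig, csig) <= max_distance),
            None,
        )
        if match is None:
            canonical[frag_id] = sig
            groups[frag_id] = [frag_id]
        else:
            groups[match].append(frag_id)
    return groups
```

The linear scan over canonical signatures is quadratic in the worst case; it is fine for a sketch, but a real index would likely bucket signatures by bit bands to avoid comparing every pair.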
5. Hybrid retrieval - BM25 + dense + reranker
Hybrid retrieval is the load-bearing wall of a serious agent knowledge base. Dense embeddings catch paraphrase; BM25 catches exact identifiers - function names, error codes, table columns. Neither alone is enough.
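The fusion step that merges the lexical and dense candidate lists can be sketched with Reciprocal Rank Fusion. This is a generic version of the technique, not our production code: the function name is hypothetical, and the smoothing constant `k = 60` is the value commonly used in the RRF literature, assumed here rather than taken from our configuration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(
    rankings: list[list[str]], k: int = 60
) -> list[tuple[str, float]]:
    """Fuse ranked lists of fragment IDs; each list is ordered best first."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, frag_id in enumerate(ranking, start=1):
            # A fragment earns 1 / (k + rank) from every list it appears in.
            scores[frag_id] += 1.0 / (k + rank)
    # Highest fused score first; break ties by ID for deterministic output.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


bm25_hits = ["frag_a", "frag_b", "frag_c"]
dense_hits = ["frag_b", "frag_d", "frag_a"]
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
print([frag_id for frag_id, _ in fused])
# -> ['frag_b', 'frag_a', 'frag_d', 'frag_c']
```

RRF only looks at ranks, so BM25 scores and cosine similarities never need to be normalized onto a common scale before merging.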
Our retrieval call, in pseudo-schema:
POST /contexts/retrieve
{
  "query": "why does the rate limiter drop requests on burst?",
  "scope": {
    "org_id": "org_7Hk2",
    "repo_id": "repo_courier-api",
    "persona_id": "persona_backend-sre",
    "session_id": "iss_2026-03-19-A91"
  },
  "budget_tokens": 3200,
  "k_lexical": 100,
  "k_dense": 100,
  "rerank_top": 50,
  "freshness_halflife_days": 90,
  "min_score": 0.42
}

200 OK
{
  "fragments": [
    {
      "id": "frag_8f21",
      "doc_id": "adr_rate-limit-v4",
      "version": 4,
      "score": 0.913,
      "tokens": 412,
      "source": "runbooks/rate-limit.md#burst",
      "updated_at": "2026-02-11T08:14:00Z"
    }
  ],
  "trace_id": "trc_8c1d"
}
Numbers we hold ourselves to on the production tier:
- recall@10 = 0.87 on the engineering golden set (1,200 labeled queries), up from 0.61 with cosine-only.
- MRR@10 = 0.74 after cross-encoder rerank.
- p50 retrieval latency = 88 ms, p95 = 240 ms for budget ≤ 4k tokens, including rerank.
- Hallucination rate on the agent's final answer (judged by an LLM-as-judge with human spot checks) fell from 14.1% to 3.2% after the reranker shipped.
The reranker is a small cross-encoder. It scores 50 candidates and we keep whatever fits the token budget after compression. Compression is dumb on purpose - we drop boilerplate headers, keep the heading path, and never paraphrase the body. Agents cite exact text or they cite nothing.
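One way to picture the "keep whatever fits the token budget" step is a greedy pass in score order. This is a sketch under assumptions: the `Scored` type and `pack_to_budget` name are hypothetical, and skipping an oversized fragment to try smaller ones (rather than stopping at the first misfit) is a design choice made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Scored:
    id: str
    score: float  # reranker score
    tokens: int   # size after compression


def pack_to_budget(
    candidates: list[Scored], budget_tokens: int, min_score: float = 0.0
) -> list[Scored]:
    """Greedily keep the highest-scoring fragments that fit the budget."""
    kept: list[Scored] = []
    used = 0
    for frag in sorted(candidates, key=lambda f: f.score, reverse=True):
        if frag.score < min_score:
            break  # everything after this point scores even lower
        if used + frag.tokens <= budget_tokens:
            kept.append(frag)
            used += frag.tokens
        # Otherwise skip this fragment and try the next, smaller one.
    return kept
```

With `budget_tokens=3200` and `min_score=0.42`, as in the request above, a 3,000-token fragment that no longer fits is skipped while a later 500-token fragment can still be packed.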
6. Scoping and permissions - repo, org, persona, session
Scope safety is non-negotiable. A single embedding index that serves multiple tenants is a data-exfiltration incident waiting for a calendar invite. We attach scope tags to every fragment at write time and enforce them at query time, not by post-filtering but by query-time index partitioning.
Four scopes, evaluated bottom-up, from narrowest to broadest:
- Session - ephemeral fragments produced during the current Issue Session, only visible inside that session.
- Persona - fragments tied to a specific Persona (e.g. backend SRE, frontend accessibility reviewer); they bias retrieval toward the persona's domain.
- Repository - code-tied knowledge, scoped to a repo and inheriting that repo's access policy.
- Organization - cross-repo knowledge such as security policy and brand voice, gated by org membership.
At retrieve time the requesting principal's ACL set is intersected with each candidate fragment's scope set. A fragment with no overlap is never even scored. We treat scope as a correctness property, not a feature flag, and we test it the way we test auth code. The discipline is documented in our security overview.
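The "never even scored" property can be sketched as a gate that runs before any scoring call. The tag format (`repo:...`, `org:...`), the `ScopedFragment` type, and the `score_visible` function are hypothetical. The sketch treats any overlap between the fragment's scope set and the principal's ACL set as sufficient, as described above; production enforcement happens at the index-partition level rather than in a Python loop.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ScopedFragment:
    id: str
    scopes: frozenset[str]  # scope tags attached at write time


def score_visible(
    fragments: list[ScopedFragment],
    principal_acl: frozenset[str],
    scorer: Callable[[ScopedFragment], float],
) -> list[tuple[str, float]]:
    """Score only fragments whose scope set overlaps the principal's ACL."""
    results = []
    for frag in fragments:
        if not (frag.scopes & principal_acl):
            continue  # no overlap: skipped before the scorer is ever called
        results.append((frag.id, scorer(frag)))
    return results
```

Putting the check ahead of the scorer means a scope bug shows up as a missing result rather than as a leaked fragment that happened to score low, which makes it testable in the same style as auth code.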
7. Ranking and freshness - recency, citations, pins
Relevance is a function of similarity, recency, authority, and operator intent. The final ranker is a weighted blend:
score(f) = w_r * rerank(f, q)
         + w_f * exp(-ln(2) * age_days(f) / halflife)
         + w_c * citation_pagerank(f)
         + w_p * is_pinned(f, scope)
Weights are tuned per persona. A security reviewer's Contexts skew toward authority (citation graph and pins); a frontend agent fixing an open issue skews toward freshness. The citation graph is built offline: when a fragment links to or is cited by other fragments, it accumulates PageRank-like authority. Manual pins always win - a tech lead can pin a fragment to the top of a scope, and the system honors it without argument.
Recency uses a half-life decay, not a hard cutoff. A two-year-old ADR can still surface if nothing fresher exists; a one-week-old ADR on the same topic will simply dominate. The half-life is configurable; default is 90 days for engineering content and 365 days for policy.
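A small sketch of the blend, assuming the decay is a true half-life: the freshness term halves every `halflife` days, which is the same as exp(-ln 2 * age / halflife). The weight values are hypothetical, not our tuned per-persona weights, and all inputs are assumed to be normalized to [0, 1]. The pin weight is deliberately larger than the other three weights combined, which is one simple way to make pins always win.

```python
def freshness(age_days: float, halflife_days: float) -> float:
    """Return 1.0 for a brand-new fragment, halving every halflife_days."""
    return 0.5 ** (age_days / halflife_days)


def blended_score(
    rerank: float,         # cross-encoder score, assumed in [0, 1]
    age_days: float,
    citation_rank: float,  # PageRank-like authority, assumed in [0, 1]
    pinned: bool,
    *,
    halflife_days: float = 90.0,
    w_r: float = 0.6,      # hypothetical weights for illustration
    w_f: float = 0.2,
    w_c: float = 0.1,
    w_p: float = 1.0,      # exceeds w_r + w_f + w_c, so pins dominate
) -> float:
    return (
        w_r * rerank
        + w_f * freshness(age_days, halflife_days)
        + w_c * citation_rank
        + w_p * (1.0 if pinned else 0.0)
    )


print(freshness(90, 90))   # -> 0.5
print(freshness(180, 90))  # -> 0.25
```

Under this decay a two-year-old ADR with the 90-day default keeps a small but nonzero freshness term, so it can still surface when nothing fresher competes with it.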
8. Eval framework - how we measure retrieval quality
Retrieval quality is the only number that matters, and it is the easiest to fool yourself about. Our eval framework rests on three layers, each more expensive and more honest than the last.
- Golden sets. 1,200 labeled queries with human-curated ideal fragments. Run on every change to the pipeline. Recall@k, MRR@k, nDCG@k, and per-scope breakdowns.
- Regression replays. Real production queries (scrubbed and consented) replayed against the candidate pipeline; we diff the retrieved fragments and flag any query whose top-3 changes by more than a token-overlap threshold.
- End-to-end agent evals. Full agent runs in isolated sandboxes against scored tasks; we judge final outputs with an LLM-as-judge and human spot checks. This is the only layer that measures the thing we actually ship.
Every PR that touches the retrieval path must beat the prior pipeline on the golden set or land with a written waiver. Most engineering teams under-invest here. We have shipped versions that looked great on a demo and lost three points of recall in production. We do not ship them anymore.
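For readers building their own golden-set harness, the two headline metrics can be sketched in a few lines. The function names and the shape of the golden set (query mapped to a set of ideal fragment IDs) are assumptions for illustration; `retrieve` stands in for whatever pipeline is under test.

```python
from typing import Callable


def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 10) -> float:
    """Fraction of the ideal fragments that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank_at_k(retrieved: list[str], relevant: set[str], k: int = 10) -> float:
    """1 / rank of the first relevant result in the top k, else 0."""
    for rank, frag_id in enumerate(retrieved[:k], start=1):
        if frag_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(
    golden: dict[str, set[str]],
    retrieve: Callable[[str], list[str]],
    k: int = 10,
) -> dict[str, float]:
    """Average both metrics over every labeled query in the golden set."""
    recalls, rranks = [], []
    for query, relevant in golden.items():
        retrieved = retrieve(query)
        recalls.append(recall_at_k(retrieved, relevant, k))
        rranks.append(reciprocal_rank_at_k(retrieved, relevant, k))
    n = len(golden) or 1  # avoid division by zero on an empty golden set
    return {"recall@k": sum(recalls) / n, "mrr@k": sum(rranks) / n}
```

A merge gate then reduces to comparing the candidate pipeline's `evaluate` output with the stored baseline and failing the check when either number drops.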
9. Surprising design decisions
Write less, link more
Customers tried to dump full Notion exports and entire Slack archives into Contexts. Recall went down. The reranker drowned in near-duplicates and the agent got distracted. We rewrote the UI to enforce minimal fragments - the smallest unit that captures one decision - and added on-demand link resolution. Median fragment length dropped from 1,400 tokens to 320. Hallucination rate dropped with it.
Structure beats prose
Fragments authored as bulleted decision lists outperform the same information rendered as flowing prose. Our hypothesis: the agent treats structure as cheap-to-parse scaffolding and spends more attention on the content. We bake this into the authoring guidance in our guides.
Embeddings are a tiebreaker, not a primary signal
Counterintuitive for anyone raised on pure vector RAG, but BM25 + scope filtering does most of the heavy lifting. The dense embedding settles ties and rescues paraphrase queries. If we had to ship one without the other, we would keep BM25.
Negative contexts work
We added negative fragments - explicit anti-patterns and known footguns - and saw a measurable drop in repeat regressions. On one customer's bug class (silent retries masking 5xx errors), repeat incidents fell 41% after the anti-pattern fragment was pinned to the repo scope.
Owners or rot
Every fragment has a named human owner. When the system flags a fragment as stale - referenced often but not edited in six months - the owner gets a nudge through their workflow. Without owners, the index degrades into zombie documentation; with owners, it compounds.
10. Naive RAG vs Contexts - comparison
| Dimension | Naive RAG | CodeCourier Contexts |
|---|---|---|
| Retrieval | Dense cosine only | BM25 + dense + cross-encoder rerank |
| recall@10 (golden set) | 0.61 | 0.87 |
| p95 latency | ~120 ms (no rerank) | ~240 ms (with rerank) |
| Hallucination rate (final answer) | 14.1% | 3.2% |
| Scope safety | Post-filter (leaky) | Index partitioning + ACL intersection |
| Freshness | None | Half-life decay + manual pins |
| Auditability | Vector IDs only | Fragment ID + version + source + rerank score |
| Authoring discipline | Dump and pray | Minimal fragments + owners + versions |
The wins compound. A more precise retrieval pipeline makes the prompt smaller, which lowers inference cost and tail latency, which lets us run more steps per Issue Session, which produces better agent outputs, which feed back into the Contexts as canonical fragments.
11. Open questions and what is next
We are not done. Three threads are actively in flight:
- Cross-team Contexts - letting a platform team publish a Context that downstream teams subscribe to without forking a copy. It is the most-requested feature in the queue and the hardest to get right without breaking scope guarantees.
- Persona-conditioned reranking - using the active Persona as an additional reranker feature, so the same query returns different fragments for an SRE and an accessibility reviewer.
- Workflow-aware retrieval - tying Contexts into the workflow builder so a step that edits Stripe code automatically pulls in the payments runbook without anyone wiring it up.
If you want a deeper look at the platform itself, the CodeCourier homepage is the fastest tour, and we keep practical patterns documented in the engineering blog. If you are building your own agent memory layer and want to compare notes, get in touch via our contact page.
12. FAQ
What is AI agent memory?
AI agent memory is the persistent, retrievable knowledge an agent uses across sessions. It is distinct from the conversation buffer (which is ephemeral) and from model weights (which are static). Practical agent memory combines durable context (facts, decisions, code), retrieval (BM25 + dense + rerank), and a scoping model that controls who sees what.
Why not just fine-tune the model on our codebase?
Fine-tuning bakes knowledge into weights, which is the wrong place for facts that change weekly. It is also expensive, slow to iterate, hard to audit, and impossible to scope per tenant. Durable context wins on cost, freshness, and governance. Use fine-tuning for skills (style, format, behavior), not for facts.
Why is hybrid retrieval better than pure vector search?
Dense embeddings excel at paraphrase but miss exact identifiers; BM25 nails identifiers but misses paraphrase. Fusing them with Reciprocal Rank Fusion and a cross-encoder reranker lifted our recall@10 from 0.61 to 0.87 on a labeled engineering golden set.
How big should a context fragment be?
Smaller than you think. Our median is 320 tokens, p95 is 780. Small fragments rerank more accurately, compose better, and leave more budget for the agent's own reasoning.
How do you keep tenants from leaking into each other?
Scope is enforced at query time via index partitioning and ACL intersection - not by post-filtering. Fragments without scope overlap with the requesting principal are never scored. We test scope the way we test authentication code.
How do you measure retrieval quality?
Three layers: a 1,200-query golden set with recall, MRR, and nDCG; replays of scrubbed production queries with diff detection; and end-to-end agent runs scored by an LLM-as-judge with human spot checks. Every retrieval-path PR must beat the prior pipeline or land with an explicit waiver.
Do embeddings or BM25 matter more?
BM25 carries more weight in our system than most teams expect. If we had to ship one, we would ship BM25 + scoping. The dense embedding earns its keep on paraphrase queries and on code-vs-prose tie-breaking.
What is a negative context, and why does it help?
A negative context is a fragment that encodes an anti-pattern or known footgun. Pinned to a scope, it tells the agent what not to do. We have measured drops of roughly 30 to 40% in repeat regressions on specific bug classes after pinning the right anti-pattern fragment.
The context layer is the part of an agent platform that turns one-off cleverness into compounding institutional intelligence. Build it like infrastructure - versioned, scoped, owned, and small - and your agents stop being interns with amnesia.