Your agent is brilliant for exactly one conversation. Then it forgets you.

It forgets that your company runs on Postgres, not MySQL. It re-asks the question it asked yesterday. It contradicts a decision it helped you make last week. To the person on the other side — a customer, a colleague, your support team — it does not feel like talking to something intelligent. It feels like talking to someone with no recollection of ever having met you.

This is not a bug in any particular model. It is how large language models work. And once you understand why they forget, the fix stops looking like a prompt-engineering trick and starts looking like infrastructure you have to build or buy — the same way you would never ask your app to "remember" customer data without a database behind it.

The blank slate problem

Here is the part that surprises most people, including some engineers: the model itself has no memory at all.

As the engineer Simon Willison puts it in his guide on how coding agents work, "LLMs are stateless: every time they execute a prompt they start from the same blank slate." The vendors say the same in their own docs. OpenAI's API guide notes that "each text generation request is independent and stateless." Anthropic describes the context window — the chunk of text the model can read at once — as a "working memory" for the model, not a permanent one.

So how does ChatGPT seem to follow a conversation? The application re-sends the entire transcript with every message. The "memory" you experience in a chat window is an illusion maintained by the software around the model, which replays everything that came before on each turn. Stop replaying it, and the model knows nothing about you.
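
To make the replay concrete, here is a minimal sketch of what a chat application does around a stateless model. `ChatSession` and `fakeModel` are illustrative names, not any vendor's API, and the model call is synchronous only to keep the sketch short; a real call would be an async network request.

```typescript
type Role = "user" | "assistant";

interface Message {
  role: Role;
  content: string;
}

// Hypothetical stand-in for a model API call. It sees ONLY the messages
// passed in on this call and keeps nothing between calls.
type CompleteFn = (messages: Message[]) => string;

class ChatSession {
  private transcript: Message[] = [];
  private complete: CompleteFn;

  constructor(complete: CompleteFn) {
    this.complete = complete;
  }

  send(userText: string): string {
    this.transcript.push({ role: "user", content: userText });
    // The whole transcript is sent again on every turn.
    const reply = this.complete(this.transcript.slice());
    this.transcript.push({ role: "assistant", content: reply });
    return reply;
  }
}

// Fake model: reports how many messages it was shown on this call.
const fakeModel: CompleteFn = (messages) =>
  `I can see ${messages.length} message(s).`;

const session = new ChatSession(fakeModel);
const first = session.send("We run Postgres.");
const second = session.send("Which database do we use?");

// A brand-new session starts from the blank slate again.
const fresh = new ChatSession(fakeModel).send("Which database do we use?");
```

The second call sees three messages only because the application sent them again. The fresh session sees one.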

That works for a single session. It falls apart the moment you want an agent that remembers across days, users, and projects. You cannot replay everything forever — the context window has a ceiling, and stuffing it full backfires anyway. Anthropic's own guidance is blunt: context "must be treated as a finite resource with diminishing marginal returns." Their docs even name the failure mode — context rot, where accuracy and recall degrade as the window fills up.

The takeaway: the context window is short-term memory, the model's scratchpad for the current task. Long-term memory — durable knowledge of who a user is and what has happened before — has to live somewhere else.

Why this is a business problem, not just a technical one

It is tempting to file "the agent forgets things" under minor annoyances. It is not.

An agent without memory cannot build a relationship. Every support conversation starts from zero, so customers re-explain their setup every time — the digital equivalent of being transferred to a new rep who hasn't read the ticket. A sales agent can't recall what a prospect objected to last call. A coding assistant relearns your stack, your conventions, and your "we tried that, it didn't work" history on every session. The work gets redone, trust erodes, and the thing meant to feel personal feels generic.

There is a cost angle too. If your only tool for continuity is "paste more history into the prompt," you pay for those tokens on every call and hit the quality cliff of context rot as conversations grow. Memory done right is cheaper than memory faked through brute-forced context, because you retrieve only the handful of things that matter for the task at hand.
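
A rough back-of-the-envelope sketch shows why replay gets expensive. It assumes every turn adds the same number of tokens and that recall injects a fixed budget of memory tokens per call. The numbers are illustrative, not measurements, and the sketch ignores the cost of extraction and recall themselves.

```typescript
// Total input tokens over a conversation when every call replays the
// full history. Call i sends i turns, so the total is
// t * (1 + 2 + ... + n) = t * n * (n + 1) / 2.
function replayedInputTokens(turns: number, tokensPerTurn: number): number {
  return (tokensPerTurn * turns * (turns + 1)) / 2;
}

// Total input tokens when each call sends only the new turn plus a
// fixed budget of recalled memories.
function recalledInputTokens(
  turns: number,
  tokensPerTurn: number,
  memoryBudget: number,
): number {
  return turns * (tokensPerTurn + memoryBudget);
}

const turns = 50;          // illustrative
const tokensPerTurn = 200; // illustrative
const memoryBudget = 300;  // illustrative

const replayTotal = replayedInputTokens(turns, tokensPerTurn); // 255000
const recallTotal = recalledInputTokens(turns, tokensPerTurn, memoryBudget); // 25000
```

Replay grows with the square of the conversation length, while recall grows linearly. That is the gap the paragraph above describes.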

So "how do we give our agent memory?" is really a question about whether your agent can be a product people come back to, or just a clever demo.

"Just use a vector database" — and why that's only half the answer

The standard first move is retrieval-augmented generation, or RAG: embed your documents, store the vectors, and at query time pull back the most semantically similar chunks to feed the model. It is a genuinely good technique, and if you build agents you should understand it.

But RAG over a pile of documents is retrieval, not memory. The difference matters.

Retrieval answers "what does my knowledge base say about X?" Memory answers "what do I know about this user, and what has changed since I last spoke to them?" A vector search over your help docs will never notice that a customer upgraded their plan, that a fact you stored three months ago is now wrong, or that you were asked to forget something. Documents are static. Memory is a living record that grows, gets corrected, and occasionally needs to be erased.

Put differently: retrieval is about your content; memory is about your relationship. You can build the second on the same vector-search machinery as the first, but you have to add several things plain RAG leaves out.

The actual ingredients of agent memory

Strip the marketing away and durable agent memory comes down to five capabilities — worth getting right whether you build it yourself or adopt something off the shelf.

1. Extraction. Raw transcripts are not memory — too long and too noisy to store wholesale. You distill them into discrete, reusable statements: "User's production database is Postgres 16." "User prefers terse answers, no preamble." Turning a messy chat into a few clean facts worth keeping is most of the art.

2. Durable storage with structure. Those statements go somewhere outside the model that survives restarts. The useful structure is lightweight: a type (is this a fact, a stated preference, a hard rule, or a lesson learned?) and a scope (which user, project, or workspace does this belong to?). Scoping is what stops one customer's memories from leaking into another's.

3. Semantic recall, ideally hybrid. At the start of a task, the agent fetches what's relevant — by meaning, not just keyword match, so "the DB" finds the note about Postgres. But pure vector search misses exact strings (error codes, names, IDs) that keyword search nails, so the strongest setups blend both. A common way to merge the two ranked lists is Reciprocal Rank Fusion.

4. Handling change and contradiction. This is where naive systems quietly rot. People change jobs, switch tools, reverse decisions. Memory has to represent when a fact is true and let new information supersede old, rather than confidently serving a stale answer. A fact with a validity window — true from this date, no longer after that one — is far safer than a flat store that only knows the last thing written.

5. Governance: audit and forgetting. If your agent remembers personal information, you need to answer two questions on demand: why does it believe this? and can you delete it? An audit trail (where a memory came from, when, via which agent) answers the first. A real forget operation — not just hiding a row, but honoring a deletion request — answers the second. For anyone handling customer data, these are not nice-to-haves.
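
Items 2, 4, and 5 can be sketched together as a small in-memory store. This is a toy under stated assumptions, not a production design: the `MemoryStore` class, its field names, and the sample data are all hypothetical, and a real system would persist records outside the process. Whether the audit log itself has to be scrubbed on deletion is a policy question this sketch leaves open.

```typescript
type MemoryType = "fact" | "preference" | "rule" | "lesson";
type AuditAction = "write" | "supersede" | "forget";

interface MemoryRecord {
  id: number;
  scope: string;         // which user or project this belongs to
  type: MemoryType;
  key: string;           // what the memory is about
  value: string;
  validFrom: Date;
  validTo: Date | null;  // null means "still believed true"
  source: string;        // provenance, for the audit trail
}

interface AuditEntry {
  at: Date;
  action: AuditAction;
  memoryId: number;
  detail: string;
}

class MemoryStore {
  private records: MemoryRecord[] = [];
  private nextId = 1;
  readonly audit: AuditEntry[] = [];

  write(scope: string, type: MemoryType, key: string, value: string,
        source: string, at: Date = new Date()): MemoryRecord {
    // Close the validity window of the current record for this scope and key.
    for (const r of this.records) {
      if (r.scope === scope && r.key === key && r.validTo === null) {
        r.validTo = at;
        this.audit.push({ at, action: "supersede", memoryId: r.id, detail: `superseded via ${source}` });
      }
    }
    const record: MemoryRecord = {
      id: this.nextId++, scope, type, key, value, validFrom: at, validTo: null, source,
    };
    this.records.push(record);
    this.audit.push({ at, action: "write", memoryId: record.id, detail: `written via ${source}` });
    return record;
  }

  // Memories in one scope that were valid at the given moment.
  validAt(scope: string, at: Date = new Date()): MemoryRecord[] {
    const t = at.getTime();
    return this.records.filter(
      (r) => r.scope === scope && r.validFrom.getTime() <= t &&
             (r.validTo === null || t < r.validTo.getTime()),
    );
  }

  // Hard delete: the rows are removed, not hidden. The audit entry
  // records that a deletion happened without repeating the value.
  forget(scope: string, key: string, at: Date = new Date()): number {
    const doomed = this.records.filter((r) => r.scope === scope && r.key === key);
    this.records = this.records.filter((r) => !(r.scope === scope && r.key === key));
    for (const r of doomed) {
      this.audit.push({ at, action: "forget", memoryId: r.id, detail: "deleted on request" });
    }
    return doomed.length;
  }
}

// Illustrative data only.
const store = new MemoryStore();
store.write("customer-4821", "fact", "production_db", "MySQL 8", "onboarding call", new Date("2024-01-10"));
store.write("customer-4821", "fact", "production_db", "Postgres 16", "support chat", new Date("2024-06-01"));
store.write("customer-9000", "fact", "production_db", "SQLite", "signup form", new Date("2024-03-01"));

const today = store.validAt("customer-4821", new Date("2024-07-01")).map((r) => r.value);
const backThen = store.validAt("customer-4821", new Date("2024-02-01")).map((r) => r.value);
const removed = store.forget("customer-4821", "production_db");
```

The newer fact supersedes the older one without erasing the history, the other customer's record never shows up in the first customer's scope, and `forget` removes both versions while the audit log still shows that a deletion took place.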

None of these five is a model capability. They are all systems work around the model. That is the whole point: memory is infrastructure.
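
The rank merging mentioned under hybrid recall is one of the more mechanical pieces of that systems work. Below is a minimal sketch of Reciprocal Rank Fusion: each item scores the sum of 1 / (k + rank) across the lists it appears in. The constant k = 60 is a commonly used default, and the memory IDs are made up for illustration.

```typescript
// Merge several ranked lists (best first) into one ranking.
// score(id) = sum over lists of 1 / (k + rank), with rank starting at 1.
function reciprocalRankFusion(rankedLists: string[][], k: number = 60): string[] {
  const scores = new Map<string, number>();
  for (const list of rankedLists) {
    list.forEach((id, index) => {
      const rank = index + 1;
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  return Array.from(scores.entries())
    .sort((a, b) => b[1] - a[1]) // highest fused score first
    .map((entry) => entry[0]);
}

// Hypothetical results from the two retrievers.
const vectorHits = ["m-postgres", "m-backups", "m-terse"];
const keywordHits = ["m-err-42501", "m-postgres"];

const fused = reciprocalRankFusion([vectorHits, keywordHits]);
// fused: ["m-postgres", "m-err-42501", "m-backups", "m-terse"]
```

The memory found by both retrievers rises to the top, and the exact-match hit that vector search missed still lands near the top. Only ranks are used, so the two retrievers' raw scores never need to be on the same scale.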

How a developer actually adds it

You have two honest paths.

Build it. Stand up a vector store, write the extraction prompts, design a schema with types and scopes and validity windows, wire up hybrid recall, add an audit log and a delete path, then keep all of it tuned as your data grows. It is very doable — but it is a real project, closer to building a small database product than to adding a feature. If memory is your core differentiator, build it.

Use a hosted memory layer. If memory is plumbing rather than your product, a service that exposes these five capabilities behind an API will get you there faster. The integration pattern is simple: after a conversation, send the transcript to be distilled into memories; before the next task, recall the relevant ones and inject them into the prompt.
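
That two-step pattern can be sketched against a hypothetical interface. `MemoryLayer`, `ToyMemoryLayer`, and `buildPrompt` are names invented for this example, not any product's API. The toy implementation "distills" by keeping lines marked `FACT:` and recalls by crude word overlap, where a real service would use a model for extraction and hybrid search for recall.

```typescript
// Hypothetical shape of a memory service, for illustration only.
interface MemoryLayer {
  remember(container: string, transcript: string): Promise<void>;
  recall(container: string, query: string): Promise<string[]>;
}

class ToyMemoryLayer implements MemoryLayer {
  private byContainer = new Map<string, string[]>();

  // Step 1, after a conversation: distill the transcript into memories.
  async remember(container: string, transcript: string): Promise<void> {
    const facts = transcript
      .split("\n")
      .map((line) => line.trim())
      .filter((line) => line.startsWith("FACT:"))
      .map((line) => line.slice("FACT:".length).trim());
    const existing = this.byContainer.get(container) ?? [];
    this.byContainer.set(container, existing.concat(facts));
  }

  // Step 2, before the next task: fetch what looks relevant.
  async recall(container: string, query: string): Promise<string[]> {
    const words = query.toLowerCase().split(/\W+/).filter((w) => w.length > 3);
    const stored = this.byContainer.get(container) ?? [];
    return stored.filter((fact) => words.some((w) => fact.toLowerCase().includes(w)));
  }
}

// Inject recalled memories into the prompt for the next task.
async function buildPrompt(memory: MemoryLayer, container: string, task: string): Promise<string> {
  const memories = await memory.recall(container, task);
  const block = memories.length > 0
    ? memories.map((m) => `- ${m}`).join("\n")
    : "- (nothing relevant stored)";
  return `Known about this user:\n${block}\n\nTask: ${task}`;
}

async function demo(): Promise<string> {
  const memory = new ToyMemoryLayer();
  await memory.remember(
    "customer-4821",
    "hi there\nFACT: Production database is Postgres 16\nthanks!",
  );
  return buildPrompt(memory, "customer-4821", "Which database does this customer use?");
}
```

Calling `demo()` resolves to a prompt that carries the stored fact ahead of the task. The application code only ever touches `remember` and `recall`, which is why the build-versus-buy decision can be deferred behind an interface like this one.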

This is the category we work in, so let me be specific and honest about how AgentPrizm approaches it, as one example of the shape these tools take. It is reachable two ways — a plain REST API (the agent-facing routes live under /api/v1/agent/*) and a Model Context Protocol server. MCP, the open standard Anthropic introduced in late 2024 and since adopted across the ecosystem, lets a compatible agent connect with just a URL and a key — no glue code to write.

Memories are scoped with containers (per user, per project, whatever boundary you draw) and sorted into six types — fact, lesson, directive, preference, contact, bookmark — kept deliberately small so the categories stay obvious. Recall is hybrid, blending semantic vector search with keyword matching so exact strings don't slip through. Facts carry validity windows, so a changed answer supersedes the old one instead of contradicting it. And because it's customer data, every memory has an audit trail and a real forget operation for honoring deletion requests.

A minimal recall call looks roughly like this:

// Before the next task, pull what's relevant for this user.
const memories = await client.recall({
  query: "what database does this customer use?",
  containers: ["customer-4821"],
});
// Inject `memories` into your prompt. The model does the rest.

The mechanics aren't magic — they're the five ingredients above, run as a service so you don't maintain them. The docs cover every endpoint, and pricing starts at a free tier so you can wire it up before committing.

The pragmatic takeaway

Stop thinking about the context window as your agent's memory. It is the model's short-term scratchpad, finite by design, and nothing in it survives once the call ends. That is not a limitation to engineer around with ever-longer prompts; it is a signal that memory belongs in its own layer.

Real long-term memory for AI agents needs five things: extraction from conversations, durable scoped storage, hybrid semantic recall, a way to handle change and contradiction, and governance you can stand behind. Build it yourself if memory is your product. Reach for a hosted layer if it's the plumbing under one.

Either way, the agents worth building are the ones that remember you tomorrow. The model gives you intelligence. Memory turns intelligence into a relationship.

How to Give an AI Agent Long-Term Memory: A Developer's Guide