Every time we talk to a team about giving their agent memory, someone makes the same reasonable-sounding objection: "Context windows are huge now. We can fit a million tokens. Why do we need a separate memory layer — can't we just send the whole history every time?"
It is a fair question. The windows really are enormous, and the temptation to treat one as the agent's brain is strong. But a big context window and a memory layer are not the same tool, and you cannot substitute one for the other any more than you can replace your hard drive with more RAM. They solve different problems. Let me explain the difference in plain terms, then show where each one actually belongs.
First, two words, defined simply. A token is a chunk of text — roughly three-quarters of a word. The context window is the maximum number of tokens the model can read in a single request: the prompt, the documents you paste in, the conversation so far, everything. When people say "a million-token context window," they mean the model can take in about 750,000 words at once. That is genuinely a lot, and for some jobs it is exactly what you want.
So let me be fair before I bust anything: big context windows are great. Dropping an entire contract, a long log file, or a whole codebase module into one request and asking questions about it is a real superpower. If the thing you care about fits in the window and arrives in one shot, a large window is the right tool. Nobody should give that up.
The trouble starts when you try to make the window do memory's job. Here is where it falls down.
The window is per-call and ephemeral
This is the one that sinks the whole idea, and it is the simplest to understand. The context window exists only for the duration of a single request. The model does not "remember" what was in the last window. It has no memory of your previous conversation at all — none. The illusion of memory in a chat is created entirely by your application re-sending the prior turns each time.
So the moment a session ends, it is gone. Close the chat, restart the agent, come back tomorrow — the window is empty again. Whatever the user told it last week, whatever the agent learned solving last month's ticket, vanished unless you saved it somewhere and decided to put it back. That "somewhere" and that "decide what to put back" — that is memory. The window is a desk; memory is the filing cabinet. A bigger desk does not file anything for you.
For an executive, this is the bottom line: a large context window does not persist across sessions. It is not a database. If your product needs to know a returning customer's preferences, or your coding agent should recall how it solved a problem two sprints ago, the window alone will never deliver that, no matter how big it gets.
You re-send — and re-pay for — the whole history every turn
Models are billed per token, on input as well as output. The context window is input. So if your strategy is "stuff the entire history into the window every time," you are paying to re-send that entire history on every single turn.
Walk through what that means. Turn one, you send a little. Turn fifty, you are re-sending forty-nine turns of accumulated history just to ask the fiftieth question — and you will do it again on turn fifty-one, and again on fifty-two. Cost grows with the square of the conversation's length, because each turn re-pays for everything before it. A long-running agent that "just keeps everything in context" gets quietly, relentlessly more expensive the longer it runs. The demo is cheap. The deployment is not.
A memory layer flips this. Instead of re-sending everything, you retrieve the handful of relevant facts and send only those. You pay for the few hundred tokens that matter, not the hundred thousand that don't.
Reliability degrades as the window fills
Even setting cost and persistence aside, the assumption that "if it fits, the model will use it" is wrong — and this is the part developers most often miss.
Anthropic, in its guide on context engineering for agents, names the effect directly: "as the number of tokens in the context window increases, the model's ability to accurately recall information from that context decreases." They call it context rot. The reason is architectural, not a bug to be patched away. Anthropic frames it as a budget: "Like humans, who have limited working memory capacity, LLMs have an 'attention budget' that they draw on when parsing large volumes of context." Every token you add spends some of that budget. As they put it, "Context, therefore, must be treated as a finite resource with diminishing marginal returns." (Effective context engineering for AI agents, Anthropic)
This is not a single vendor's opinion. The foundational research is the 2023 paper "Lost in the Middle" by Nelson Liu and colleagues at Stanford, which tested how models actually use long inputs. Their finding: "performance is often highest when relevant information occurs at the beginning or end of the input context, and significantly degrades when models must access relevant information in the middle of long contexts, even for explicitly long-context models." (Lost in the Middle, Liu et al., 2023)
Read that last clause again — even for explicitly long-context models. Having a big window does not mean the model reliably reads all of it. Bury the one fact that matters in the middle of 200,000 tokens of chat history, and the model is measurably likely to miss it. So "just put everything in the window" does not even reliably achieve recall, the one thing it is supposed to be good at. You can pay for a million tokens and still have the model overlook the sentence you needed.
The takeaway, which Anthropic states as the goal of context engineering, is to find "the smallest possible set of high-signal tokens that maximize the likelihood of some desired outcome." Smallest, high-signal. That is the opposite of dumping the whole history in and hoping.
No structure, no governance
The last gap is the one that turns up the day a real company depends on the system. A context window is an undifferentiated blob of text. It has no notion of what kind of thing each statement is, when it was true, whose it is, or where it came from.
That means the window gives you no answer to the questions that matter in production:
- Validity. "Rachel is a Senior PM" was true in March and wrong after her promotion. A raw window will re-feed the stale version forever. Memory needs to know when a fact stops being true.
- Scoping. One customer's data must not bleed into another's. A shared window has no walls. Memory needs containers that keep tenants and projects isolated.
- Forgetting. Someone asks you to delete what the agent knows about them. You cannot "delete" from a context window — there is nothing to delete; it is rebuilt from scratch each call. You need a real store with a real delete path.
- Audit. Why did the agent know that, and where did it come from? A blob of text has no provenance. Memory does.
None of that is something a bigger window will ever provide, because it is a different kind of capability entirely. Size does not buy structure.
They are complementary, not competing
Here is the right mental model. The context window is the working space the model thinks in. Memory is the system that decides what deserves to be in that working space right now. They are not rivals; memory's whole job is to make the window count — to put the smallest, highest-signal set of tokens in front of the model and leave the other 999,000 out.
That is what we built AgentPrizm to do. Your agent stores what it learns as discrete memories across six types — facts, lessons, directives, preferences, contacts, and bookmarks — each scoped to a container so tenants and projects stay isolated, and carrying a validity window so stale facts can expire instead of haunting every prompt. On the way back in, recall is hybrid: semantic search to catch meaning plus keyword matching to catch the exact order ID or error code that pure embeddings miss. And because the point is to feed the window well, AgentPrizm hands back a token-budgeted context block — the relevant memories, ranked and trimmed to fit the space you give it — rather than a firehose. Every memory keeps an audit trail, and forgetting is a real, supported operation.
So when you hear "context windows are huge now, why do we need memory" — the honest answer is that the size of the window was never the problem memory solves. The window is ephemeral, it re-bills the whole history each turn, it degrades at recall as it fills, and it has no structure to govern. A memory layer is what makes a big window worth having. Use the large window for what it is great at — reasoning over a lot of material in one shot — and let memory decide, cheaply and reliably, what that material should be.
If you want to see what that costs in practice, our pricing starts free, and the docs walk through wiring recall into your agent in a few lines.