A senior engineer can stand up "agent memory" in an afternoon. Embed the text, drop the vectors in a table, query by cosine similarity whenever a new request comes in. It demos beautifully. Everyone in the room nods. The decision feels made.
The decision is not made. You have built the tip of an iceberg and mistaken it for the iceberg.
The honest question is never "can we build agent memory?" Of course you can — it is a few lines against a vector index, and Anthropic explicitly recommends starting exactly that simply, using "the simplest solution possible, and only increasing complexity when needed" (Building effective agents, Anthropic). The real question, the one a CTO actually has to answer, is whether the finished version of this is something your team should own for the next three years. That is a different question, and it has a different answer.
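For concreteness, the afternoon version looks roughly like the sketch below. Everything in it is an assumption made for illustration: `toy_embed` is a bag-of-words stand-in for a real embedding model, and `NaiveMemory` is a hypothetical in-process "table", not any particular vector index.

```python
import math
from collections import Counter


def toy_embed(text):
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class NaiveMemory:
    """Hypothetical demo store: embed on write, rank by cosine on read."""

    def __init__(self):
        self.rows = []  # the "table": (text, vector) pairs

    def add(self, text):
        self.rows.append((text, toy_embed(text)))

    def recall(self, query, k=3):
        q = toy_embed(query)
        ranked = sorted(self.rows, key=lambda row: cosine(q, row[1]), reverse=True)
        return [text for text, _ in ranked[:k]]
```

Note what is missing: nothing decides what to store, nothing notices duplicates or contradictions, nothing expires, and nothing can be explained or deleted on request.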
The iceberg under "just store embeddings"
Storing and retrieving an embedding is the demo. Here is the part that does not fit in the demo, roughly in the order you discover you need it.
Extraction. A conversation is not a memory. Somewhere you have to decide what is worth remembering out of a wall of chat — the customer's billing preference, yes; their "lol ok thanks," no. That is a judgment call you are now making thousands of times a day, programmatically, and getting wrong in both directions: remembering noise, forgetting signal.
Deduplication and contradiction. The same fact arrives five times in slightly different words. Do you store five copies and let recall return near-duplicates that crowd out everything else? And the harder case: the new memory contradicts an old one. The user said Postgres last month; today they say MySQL. A naive store keeps both, and now recall hands your agent two confident, opposing facts and no way to choose.
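A minimal sketch of the write path this implies, under two loud assumptions: extraction has already reduced the text to a (subject, attribute, value) triple, which is the hard part, and the policy is "newest statement wins", which is only one of several defensible choices. `Fact` and `FactStore` are hypothetical names for illustration.

```python
from dataclasses import dataclass


@dataclass
class Fact:
    subject: str
    attribute: str
    value: str
    superseded: bool = False  # kept for history, hidden from recall


class FactStore:
    def __init__(self):
        self.facts = []

    def write(self, subject, attribute, value):
        """Store a fact and report what happened: 'new', 'duplicate', or 'superseded'."""
        live = next(
            (f for f in self.facts
             if not f.superseded and (f.subject, f.attribute) == (subject, attribute)),
            None,
        )
        # Same fact in slightly different form: keep one copy.
        if live is not None and live.value.strip().lower() == value.strip().lower():
            return "duplicate"
        # Contradiction: assumed policy is that the newer statement wins.
        if live is not None:
            live.superseded = True
        self.facts.append(Fact(subject, attribute, value))
        return "superseded" if live is not None else "new"

    def current(self, subject, attribute):
        """Return the single live value for this key, or None."""
        for f in self.facts:
            if not f.superseded and (f.subject, f.attribute) == (subject, attribute):
                return f.value
        return None
```

With this in place, writing "Postgres" and later "MySQL" for the same user leaves recall with one answer instead of two opposing ones. Real duplicates rarely differ only by case and whitespace, so a production version would need a fuzzier equality check than the one assumed here.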
Validity and decay. Some facts are true forever; most have a shelf life. "Rachel is a Senior PM" was true when you learned it and is wrong after her promotion. Memory that cannot represent when something was true will, given enough time, poison decisions with faithfully-recalled stale data. Building this means validity windows, supersession logic, and a cleanup job that does not nuke the wrong things.
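One way to represent "when something was true" is a validity window on each fact, sketched below. The names (`TimedFact`, `supersede`, `as_of`) and the half-open window convention are assumptions for illustration, and the new title in the usage note is invented.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class TimedFact:
    key: str
    value: str
    valid_from: datetime
    valid_to: Optional[datetime] = None  # None means "still true as far as we know"


def supersede(history, key, value, at):
    """Close any open window for `key` at time `at`, then open a new one."""
    for fact in history:
        if fact.key == key and fact.valid_to is None:
            fact.valid_to = at
    history.append(TimedFact(key, value, valid_from=at))


def as_of(history, key, when):
    """Return the value valid for `key` at `when` (window is [valid_from, valid_to))."""
    for fact in history:
        still_open = fact.valid_to is None or when < fact.valid_to
        if fact.key == key and fact.valid_from <= when and still_open:
            return fact.value
    return None
```

Recording "Senior PM" from January 2023 and then a hypothetical "Director" from June 2024 means a query as of mid-2023 still answers "Senior PM", while a query as of today answers "Director". The old fact is closed, not deleted, which is what lets a cleanup job act on expired windows rather than guess.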
Hybrid recall. Pure semantic search misses exact tokens — an error code, an order ID, a person's last name. Pure keyword search misses meaning. Production recall needs both, blended and ranked, and the blend is where most of the actual quality lives. This is the part everyone underestimates most.
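As one illustration of blending, the sketch below merges a keyword ranking and a semantic ranking with reciprocal rank fusion (RRF), a common rank-merging technique. It is a swapped-in example, not a claim about how any particular product blends. The semantic ranking is assumed to come from a vector index and is passed in as a plain list; `keyword_rank` is a deliberately crude exact-token matcher.

```python
def keyword_rank(query, docs):
    """Rank doc ids by count of exact query tokens they contain; drop zero-hit docs."""
    terms = set(query.lower().split())
    hits = {doc_id: len(terms & set(text.lower().split())) for doc_id, text in docs.items()}
    return [d for d in sorted(hits, key=lambda d: (-hits[d], d)) if hits[d] > 0]


def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists: each list contributes 1 / (k + rank) per document."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


docs = {
    "m1": "checkout failed with error E4012",
    "m2": "payment page keeps crashing",
    "m3": "user likes dark mode",
}
# Assumed output of a vector index for the query below: it prefers the
# paraphrase (m2) and ranks the exact error-code match (m1) second.
semantic = ["m2", "m1", "m3"]
keyword = keyword_rank("error E4012", docs)
fused = reciprocal_rank_fusion([semantic, keyword])
```

Here the exact-token hit lifts `m1` above `m2` in `fused`, which is the case pure semantic search gets wrong. The constant `k` and the choice of fusion method are tuning decisions, and that tuning is where the quality work goes.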
Audit and right-to-forget. The moment a real company depends on this, someone asks: why did the agent know that, where did it come from, and can we delete it on request? Now you need provenance on every memory, a real delete path (soft and hard), and a trail you can show a customer or a regulator. None of that is a vector problem.
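A minimal sketch of what provenance plus a two-stage delete path might look like. `Memory` and `AuditedStore` are hypothetical names, and the in-memory dictionary stands in for whatever database you actually use.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Memory:
    id: str
    text: str
    source: str  # provenance: the conversation, ticket, or document it came from
    deleted_at: Optional[datetime] = None


class AuditedStore:
    def __init__(self):
        self.memories = {}
        self.audit_log = []  # append-only: (timestamp, action, memory_id)

    def _log(self, action, memory_id):
        self.audit_log.append((datetime.now(timezone.utc), action, memory_id))

    def add(self, memory):
        self.memories[memory.id] = memory
        self._log("add", memory.id)

    def soft_delete(self, memory_id):
        """Hide from recall but keep the row, so the delete can be reviewed or undone."""
        self.memories[memory_id].deleted_at = datetime.now(timezone.utc)
        self._log("soft_delete", memory_id)

    def hard_delete(self, memory_id):
        """Remove the content itself; only the id survives, in the audit log."""
        del self.memories[memory_id]
        self._log("hard_delete", memory_id)

    def recall(self):
        return [m for m in self.memories.values() if m.deleted_at is None]
```

In a real system the hard delete also has to reach the vector index, caches, and backups, which this sketch does not attempt.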
Scaling and on-call. Vector indexes tend to get slower and harder to keep accurate as they grow, and recall latency creeps into your agent's response time. Memory is now load-bearing: if it is down, your agent gets amnesia mid-task. So it needs the same uptime as anything in the request path, which means monitoring, alerting, and a human who gets paged at 3 a.m. That human is on your team now.
Notice how little of that is "store embeddings." Storage was the easy 5%. The other 95% — extraction, conflict resolution, lifecycle, ranking quality, governance, operations — is exactly the part that never showed up in the prototype that convinced everyone this was a weekend project.
This is the textbook shape of undifferentiated heavy lifting
There is a useful, old frame for this decision. In 2006 Jeff Bezos popularized the idea of "undifferentiated heavy lifting" — the necessary backend work that every company has to do but that does not make any one company's product better. AWS's own framing was blunt: developers "routinely spend 70% of their time" on this kind of backend muck — hosting, bandwidth, scaling, hardware — leaving only the remainder for the work that actually differentiates the business (We Build Muck, So You Don't Have To, AWS).
Agent memory plumbing has the same shape. Your customers do not buy your product because your deduplication logic is elegant or your recall blend is well-tuned. They cannot even see it. They buy your product because of what your agent does — the sales follow-ups it drafts, the tickets it resolves, the code it ships. The memory layer is upstream infrastructure that makes those things better, in exactly the way a database makes a web app better. Nobody's moat is their database.
That is the core test. Is this layer something your customers would pay you more for if you did it better than anyone else? For a tiny number of companies the answer is yes. For most, the answer is that a working memory layer is table stakes and a perfect one is invisible.
When building is genuinely the right call
The build-vs-buy conversation gets dishonest fast, usually in the direction of "just buy ours." So here is the honest other side. Building your own agent memory is the correct decision when one of these is true.
Memory is your product. If you are building a memory company, where recall quality, the extraction model, and conflict resolution are what customers evaluate you on, then of course you build it. Outsourcing your core competency is malpractice. This is the clearest case and it is rare by definition.
You have genuinely unusual constraints. Air-gapped deployment. Data residency rules no vendor can satisfy. A latency budget so tight any network hop breaks the experience. Embeddings you must generate from a proprietary in-house model. If your requirements live outside what any provider offers, the buy option is off the table and the calculus is simple.
The thing you need is genuinely small and stays small. Not every agent needs the full iceberg. If your agent remembers a handful of user preferences and will never need contradiction handling, audit trails, or right-to-forget, the full machinery is overkill and a thin homegrown layer is the simpler solution — again, exactly what Anthropic recommends defaulting to. Buying a governed platform to store five preferences is its own kind of over-engineering. Just re-check the answer as you grow, because "small and stays small" has a way of quietly becoming false.
You want to learn it deeply. Sometimes the point is the education — a team that intends to go deep on retrieval long-term may rightly choose to build the first version themselves, eyes open to the cost, to earn the expertise. That is a legitimate strategic reason. Just name it as the reason, so the cost is a deliberate investment and not a surprise.
If none of those describe you, the lifting is undifferentiated, and you are about to spend your scarcest resource — senior engineering attention — on something your customers will never thank you for.
When buying wins
Buying wins in the common case: memory is infrastructure your agent rides on, your moat is the product on top, and you would rather your best engineers spend their quarter on the thing customers actually evaluate you on. Anthropic's own guidance on agents lands in the same place from the other direction — add complexity "only when it demonstrably improves outcomes," because "success in the LLM space isn't about building the most sophisticated system. It's about building the right system for your needs" (Building effective agents, Anthropic). A bespoke memory subsystem is sophistication you have to justify, and for most teams the justification is not there.
There is also a quieter argument for buying: the problem is more solved than it looks, which is precisely why you do not want to re-solve it. The industry has converged on memory as a first-class agent concern. Anthropic describes "agentic memory" where "the agent regularly writes notes persisted to memory outside of the context window" and pulls them back later, and has shipped a memory tool so agents can "build up knowledge bases over time, maintain project state across sessions, and reference previous work without keeping everything in context" (Effective context engineering for AI agents, Anthropic). The same shift gave us "context engineering," which Simon Willison endorses by way of Andrej Karpathy's definition: "the delicate art and science of filling the context window with just the right information" (Simon Willison, Context engineering). Getting the right information into the window at the right time is the actual hard part of agents now. You can spend your team's time solving it, or spend it using a solution.
Where AgentPrizm fits — and where it doesn't
Full disclosure: we make the buy option, so weigh this accordingly. AgentPrizm is a hosted memory layer for agents, reachable over a REST API or an MCP server. It handles the iceberg above: six typed memory kinds instead of one undifferentiated blob, hybrid semantic-plus-keyword recall, container scoping to keep one agent's memories out of another's, validity windows so facts can expire instead of rotting, audit trails on every memory, and a real forget path. There is a free tier, so the cost of finding out whether buying is right for you is your time, not your budget — see pricing and the docs.
And in the same spirit of honesty: AgentPrizm is a hosted service. If you need memory inside an air-gapped network, or you are a memory company yourself, or your needs are genuinely tiny, those are the build cases above, and a hosted layer is the wrong tool. We would rather you build the right thing than buy the wrong one.
The decision is not about capability. You can build agent memory; the demo proves it in an afternoon. The decision is about where you want your best people spending the next three years. If memory is your product, build it and own it completely. If memory is the road your product drives on, let someone else maintain the road — and go build the thing only you can build.