Memory & Context Engineering

AI agents are built around a fundamental constraint: the LLM has no memory. Every call is stateless — the model reads a prompt, produces an output, and forgets everything. The next call starts from zero.

Whatever the model knows about the conversation, the task, and the world must be packed into the context window before every single call. If it is not in the window, it does not exist.

Short-term memory holds what happened in the current conversation. Long-term memory retrieves knowledge from past conversations and external stores. Context engineering is the practice of deciding what goes into the window, how it is formatted, and what gets dropped when space runs out. Get this right and the agent behaves like it remembers everything. Get it wrong and it forgets what you told it two turns ago, hallucinates facts, or drowns in irrelevant detail.

The Context Window Is the Interface #

The context window is both a technical limitation and the entire interface between your code and the model's reasoning. Everything the model can think about must fit inside it: the system prompt, conversation history, tool schemas, retrieved documents, previous tool results, and instructions for the current step. There is nothing else. The model cannot peek at a database, scan a log file, or recall a previous session. It sees only what you put in front of it.

This means building an agent is, in large part, a context assembly problem. Each time the orchestration loop calls the model, your code constructs a prompt from scratch — selecting, formatting, and ordering the pieces the model needs for this step. The quality of that assembly directly determines the quality of the model's output.
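
To make the "from scratch" point concrete, here is a minimal sketch. `assemble_prompt` and the bracketed section labels are hypothetical names chosen for illustration, not part of any model API. The only claim is that the prompt depends on nothing except what the caller passes in.

```python
def assemble_prompt(system: str, history: list[str], step: str) -> str:
    """Build the full prompt for one model call from explicit inputs only."""
    sections = [
        "[system]\n" + system,
        "[history]\n" + "\n".join(history),
        "[current step]\n" + step,
    ]
    return "\n\n".join(sections)


# Two calls share nothing unless the caller passes it in again.
first = assemble_prompt("Be concise.", [], "My name is Dana.")
second = assemble_prompt("Be concise.", [], "What is my name?")
third = assemble_prompt("Be concise.", ["user: My name is Dana."], "What is my name?")
```

In `second` the name never reaches the model, so it cannot answer. In `third` it can, only because the earlier turn was placed back into the history.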

┌─────────────────────────────────────────────┐
│              Context Window                 │
│                                             │
│  ┌─────────────────────────────────────┐    │
│  │         System Prompt               │    │
│  │  (persona, rules, constraints)      │    │
│  └─────────────────────────────────────┘    │
│  ┌─────────────────────────────────────┐    │
│  │         Tool Schemas                │    │
│  │  (available tools + descriptions)   │    │
│  └─────────────────────────────────────┘    │
│  ┌─────────────────────────────────────┐    │
│  │      Retrieved Context              │    │
│  │  (long-term memory, documents,      │    │
│  │   relevant past interactions)       │    │
│  └─────────────────────────────────────┘    │
│  ┌─────────────────────────────────────┐    │
│  │      Conversation History           │    │
│  │  (messages, tool calls, results)    │    │
│  └─────────────────────────────────────┘    │
│  ┌─────────────────────────────────────┐    │
│  │      Current Step                   │    │
│  │  (user goal or next-step prompt)    │    │
│  └─────────────────────────────────────┘    │
│                                             │
└─────────────────────────────────────────────┘

Each block competes for space. A generous system prompt leaves less room for conversation history. A long tool result pushes out retrieved context. You are always making trade-offs, and the right trade-off depends on the step.
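
The competition for space can be made visible with some simple accounting. The sketch below is an illustration under stated assumptions: `estimate_tokens` uses a rough four-characters-per-token guess rather than a real tokenizer, and `room_for_history` and `reserve_output` are hypothetical names.

```python
def estimate_tokens(text: str) -> int:
    """Rough estimate: about 4 characters per token (an assumption, not a tokenizer)."""
    return max(1, len(text) // 4) if text else 0


def room_for_history(window: int, fixed_sections: dict[str, str], reserve_output: int) -> int:
    """Tokens left for conversation history after the other blocks are placed."""
    used = sum(estimate_tokens(text) for text in fixed_sections.values())
    return max(0, window - reserve_output - used)


sections = {"system": "x" * 4000, "tools": "y" * 2000, "retrieved": "z" * 8000}
lean = room_for_history(8000, sections, reserve_output=1000)

# Doubling the system prompt takes the difference straight out of history.
sections["system"] = "x" * 8000
generous = room_for_history(8000, sections, reserve_output=1000)
```

With these made-up sizes, doubling the system prompt shrinks the history allowance by exactly the tokens the prompt gained.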

Short-Term Memory #

Short-term memory is what happened in this conversation — the messages exchanged, the tool calls made, and the results returned. It is the agent's working memory, the scratchpad that lets it reason across multiple turns instead of starting fresh each time. We covered the basics in the building blocks overview — here we go deeper into the engineering.

In most implementations, short-term memory is a list of message objects that grows as the conversation progresses. Each turn appends entries: the user's message, the model's response (either a tool call or a final answer), and any tool results. The orchestration loop feeds this list into the context window before each model call.

state = {
    "messages": [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "What were last quarter's sales for ACME?"},
        {"role": "assistant", "tool_call": {"name": "query_sales", "args": {"company": "ACME", "period": "Q1-2026"}}},
        {"role": "tool", "name": "query_sales", "content": '{"total": 4200000, "units": 18400}'},
        {"role": "assistant", "content": "ACME's Q1 2026 sales totaled $4.2M across 18,400 units."},
        {"role": "user", "content": "How does that compare to the same quarter last year?"},
    ]
}

This list is the agent's memory of the conversation. On every model call, you feed some or all of it into the context window. The model reads it, understands what has already happened, and decides the next step.

The problem is that this list only grows. A ten-turn conversation with tool calls can easily reach thousands of tokens. A fifty-turn session with verbose tool results can fill the entire context window before the model even gets to the current question. You need strategies for managing this growth, and the right strategy depends on what the conversation looks like.

Managing the Window #

When the conversation history outgrows the context window, something has to give. There are several approaches, each with different failure modes.

Sliding window. Keep the most recent N messages and drop everything older. This is simple and cheap, but it means the agent forgets the beginning of long conversations. If the user set an important constraint ten turns ago, it is gone. Sliding windows work for conversations where recent context is all that matters — quick Q&A, interactive debugging — but they fail for tasks that build on earlier decisions.
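
A sliding window can be sketched in a few lines. This version assumes the message shape used earlier (dicts with a `role` key) and makes two choices that are mine rather than requirements: system messages are always kept, and the window never starts on a tool result whose tool call was dropped.

```python
def sliding_window(messages: list[dict], max_messages: int) -> list[dict]:
    """Keep system messages plus the most recent max_messages other messages."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    kept = rest[-max_messages:] if max_messages > 0 else []

    # Avoid starting the window on a tool result whose tool call was dropped.
    while kept and kept[0]["role"] == "tool":
        kept = kept[1:]
    return system + kept


history = [
    {"role": "system", "content": "You are a sales analyst."},
    {"role": "user", "content": "What were last quarter's sales for ACME?"},
    {"role": "assistant", "tool_call": {"name": "query_sales"}},
    {"role": "tool", "name": "query_sales", "content": '{"total": 4200000}'},
    {"role": "assistant", "content": "ACME's sales totaled $4.2M."},
    {"role": "user", "content": "How does that compare to last year?"},
]
windowed = sliding_window(history, max_messages=3)
```

Here the window of three would have begun with the orphaned tool result, so it is trimmed to the final answer and the follow-up question.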

Summarization. Before dropping old messages, summarize them into a condensed block. The summary replaces the original messages in the context window, preserving key facts and decisions while freeing space.

def compact_history(messages: list[dict], max_tokens: int, keep_recent: int = 6) -> list[dict]:
    if count_tokens(messages) <= max_tokens:
        return messages  # already within budget, no compaction needed

    # Assumes system messages sit at the front of the list.
    system = [m for m in messages if m["role"] == "system"]
    body = messages[len(system):]
    older, recent = body[:-keep_recent], body[-keep_recent:]  # keep last few turns verbatim
    if not older:
        return messages  # nothing old enough to summarize

    summary_prompt = (
        "Summarize the following conversation history. "
        "Preserve all key decisions, facts, tool results, "
        "and user constraints. Be concise.\n\n"
        + format_messages(older)
    )
    # call_model is assumed to return a response object with a .text attribute.
    summary = call_model(summary_prompt).text

    return system + [{"role": "system", "content": f"Previous conversation summary:\n{summary}"}] + recent

Summarization preserves more information than a sliding window, but it costs a model call, adds latency, and can lose detail. A summary that drops the nuance of why a decision was made can lead the agent to revisit or contradict that decision later. Every summarization step is a lossy compression, and the loss accumulates over a long conversation.

Selective pruning. Instead of summarizing everything or dropping by position, selectively remove messages that are no longer relevant. Intermediate tool calls that returned errors and were retried successfully, verbose tool results that have already been incorporated into the model's answer, and exploratory turns that led nowhere can often be removed without losing anything important. This is more surgical than summarization, but it requires heuristics about what is "safe" to drop — and getting those heuristics wrong means silently losing context the model needed.
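
One pruning heuristic from the paragraph above, dropping failed tool calls that were later retried successfully, might look like the sketch below. It assumes tool messages carry a hypothetical `error` flag and that each tool result directly follows its tool call. Neither is guaranteed by any particular API.

```python
def prune_failed_retries(messages: list[dict]) -> list[dict]:
    """Drop failed tool results (and their calls) that a later success superseded."""
    # Index of the last successful result per tool name.
    last_success: dict[str, int] = {}
    for i, m in enumerate(messages):
        if m["role"] == "tool" and not m.get("error"):
            last_success[m["name"]] = i

    drop: set[int] = set()
    for i, m in enumerate(messages):
        superseded = last_success.get(m.get("name"), -1) > i
        if m["role"] == "tool" and m.get("error") and superseded:
            drop.add(i)
            # Also drop the tool call that produced the failed result.
            if i > 0 and "tool_call" in messages[i - 1]:
                drop.add(i - 1)
    return [m for i, m in enumerate(messages) if i not in drop]


history = [
    {"role": "user", "content": "What were ACME's sales?"},
    {"role": "assistant", "tool_call": {"name": "query_sales"}},
    {"role": "tool", "name": "query_sales", "content": "timeout", "error": True},
    {"role": "assistant", "tool_call": {"name": "query_sales"}},
    {"role": "tool", "name": "query_sales", "content": '{"total": 4200000}'},
]
pruned = prune_failed_retries(history)
```

A failure with no later success is deliberately kept, since the model may still need to know that the call did not work.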

Token budgeting. Assign a token budget to each section of the context window: a fixed budget for the system prompt, a budget for tool schemas, a budget for retrieved context, and the remainder for conversation history. When a section exceeds its budget, apply the appropriate compression strategy — summarize history, truncate tool results, or reduce the number of retrieved documents.

BUDGET = {
    "system_prompt": 800,
    "tool_schemas": 600,
    "retrieved_context": 2000,
    "history": 4000,
    "current_step": 500,
}

def build_context(system: str, tools: list, retrieved: list, history: list, step: str) -> str:
    sections = [
        truncate(system, BUDGET["system_prompt"]),
        truncate(format_tools(tools), BUDGET["tool_schemas"]),
        truncate(format_retrieved(retrieved), BUDGET["retrieved_context"]),
        format_messages(compact_history(history, BUDGET["history"])),
        truncate(step, BUDGET["current_step"]),
    ]
    return assemble(sections)
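
The code above leaves `truncate` undefined. One possible version is sketched here, assuming a rough four-characters-per-token estimate in place of a real tokenizer. The `chars_per_token` parameter and the truncation marker are illustrative choices.

```python
def truncate(text: str, max_tokens: int, chars_per_token: int = 4) -> str:
    """Cut text to roughly max_tokens, marking the cut so the model can see it."""
    limit = max_tokens * chars_per_token
    if len(text) <= limit:
        return text
    marker = "\n... (truncated)"
    return text[: max(0, limit - len(marker))] + marker


short = truncate("a" * 10, max_tokens=5)
long = truncate("a" * 1000, max_tokens=50)
```

Marking the cut matters: without it, the model has no way to tell a complete tool result from one that was silently clipped.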

In practice, most production agents combine these techniques. They keep recent turns verbatim, summarize older ones, drop failed intermediate steps, and enforce hard budgets per section. The specific mix depends on the task — a coding agent that needs to track file changes across many steps has different retention needs than a customer support agent handling a single question. The goal is to preserve what the model needs for the next step.

Long-Term Memory #

Short-term memory dies when the conversation ends. Long-term memory survives across conversations and sessions — it is how an agent "knows" things it learned last week or remembers a user's preferences from a previous interaction. Without it, every conversation starts cold, and the agent has no way to build on past work.

Long-term memory requires an external store. The conversation history lives in your runtime; long-term memory lives in a database, a vector store, or both. The agent writes to it selectively during or after conversations, and reads from it at the start of each new conversation or when it needs facts that are not in the current context. The distinction matters for the architecture: short-term memory is a data structure in your process, long-term memory is a service you query.

There are several types of information worth persisting across sessions.

User preferences and profile data. If a user tells the agent "I prefer metric units" or "always use formal tone," that should survive beyond the current session. Storing it means the agent does not need to be told again.

Learned facts. During a conversation, the agent might learn that "Project Atlas uses PostgreSQL 15" or "the deployment pipeline runs on Jenkins." These are facts that could be relevant in future conversations about the same project.

Episodic memory. Summaries of past interactions — what was discussed, what decisions were made, what tools were used. This lets the agent reference previous conversations without replaying them in full.

Procedural memory. Patterns the agent has learned about how to accomplish tasks — which tools work best for which queries, which approaches tend to fail, which phrasings produce better results from specific APIs.
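
These four categories map naturally onto a tagged record. The following is a sketch of one possible schema. `MemoryKind`, `MemoryRecord`, and all field names are hypothetical, and a real store would likely add an embedding and an expiry field.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Optional


class MemoryKind(Enum):
    PREFERENCE = "preference"  # user preferences and profile data
    FACT = "fact"              # learned facts about projects or systems
    EPISODE = "episode"        # summaries of past interactions
    PROCEDURE = "procedure"    # patterns for accomplishing tasks


@dataclass
class MemoryRecord:
    user_id: str
    kind: MemoryKind
    text: str
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    source_thread: Optional[str] = None  # conversation the memory came from


record = MemoryRecord("user-1", MemoryKind.PREFERENCE, "Prefers metric units")
```

Tagging the kind at write time lets retrieval filter by category, for example loading preferences at the start of every conversation but fetching episodes only on demand.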

The Retrieval Pipeline #

Writing to memory is easy. The hard part is retrieval — getting the right information back at the right time without flooding the context window with irrelevant results. A noisy retrieval step is often worse than no retrieval at all, because it displaces more valuable content.

A typical retrieval pipeline for long-term memory looks like this:

  current goal + conversation state
              │
              ▼
     ┌───────────────────┐
     │  Generate query   │  (extract key terms, entities,
     │  from context     │   intent from current step)
     └────────┬──────────┘
              │
              ▼
     ┌───────────────────┐
     │  Retrieve         │  (vector search, keyword search,
     │  candidates       │   or hybrid — return top-k)
     └────────┬──────────┘
              │
              ▼
     ┌───────────────────┐
     │  Rerank and       │  (score relevance against the
     │  filter           │   current step, not just the query)
     └────────┬──────────┘
              │
              ▼
     ┌───────────────────┐
     │  Format and       │  (pack into the retrieved context
     │  inject           │   section of the context window)
     └───────────────────┘

Each stage involves trade-offs. Retrieving too many candidates floods the context with noise, pushing out more valuable information. Retrieving too few risks missing something critical. Reranking improves precision but adds latency — a reranker model call is not free. And the format matters: a wall of unstructured text is harder for the model to use than clearly labeled, structured entries.

def retrieve_long_term_context(goal: str, user_id: str, top_k: int = 5, min_score: float = 0.4) -> list[dict]:
    # Over-fetch candidates so the reranker has something to choose from.
    query_embedding = embed(goal)
    candidates = vector_store.search(query_embedding, user_id=user_id, limit=top_k * 3)

    # Rerank based on relevance to the current goal
    scored = [(c, rerank_score(goal, c["text"])) for c in candidates]
    scored.sort(key=lambda x: x[1], reverse=True)

    # Keep at most top_k results, and only those above the relevance threshold.
    return [c for c, score in scored[:top_k] if score > min_score]

The threshold on the relevance score matters. The 0.4 cutoff in the code above is illustrative; the right value depends on how your reranker distributes its scores. Returning a memory with a low score is worse than returning nothing, because a weakly relevant memory costs context window space and can mislead the model more than a gap would. When in doubt, retrieve less.

What to Keep and What to Drop #

Not everything the agent encounters should be written to long-term memory. An agent that indiscriminately stores every conversation turn, every tool result, and every intermediate reasoning step will build a noisy memory store that degrades retrieval quality over time. The retrieval problem gets worse as the store grows, because more candidates compete for the top-k slots.

Good memory management requires explicit write triggers — rules that decide when something is worth persisting. These can be heuristic, model-driven, or both.

Heuristic triggers:

  • User explicitly states a preference or correction ("I prefer CSV exports")
  • A new entity or relationship is established ("Project Atlas is owned by the infrastructure team")
  • A multi-step task completes successfully — store the summary and outcome
  • A tool call pattern repeatedly succeeds or fails — store the pattern
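
The first trigger in the list above can be approximated with plain pattern matching. The sketch below is deliberately naive: the phrase list and the `should_persist` name are assumptions for illustration, and real preference statements are far more varied than three patterns can capture.

```python
import re

# Hypothetical phrases that often signal a lasting preference or correction.
PREFERENCE_PATTERNS = [
    re.compile(r"\bI prefer\b", re.IGNORECASE),
    re.compile(r"\balways use\b", re.IGNORECASE),
    re.compile(r"\bfrom now on\b", re.IGNORECASE),
]


def should_persist(message: dict) -> bool:
    """Return True when a user message looks like a stated preference."""
    if message.get("role") != "user":
        return False
    text = message.get("content", "")
    return any(pattern.search(text) for pattern in PREFERENCE_PATTERNS)


hit = should_persist({"role": "user", "content": "I prefer CSV exports"})
miss = should_persist({"role": "user", "content": "What were last quarter's sales?"})
```

A cheap check like this works best as a first filter that decides which messages are worth passing to a more careful extraction step.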

Model-driven triggers: At the end of a conversation or after a significant event, ask the model what is worth remembering.

def extract_memories(conversation: list[dict]) -> list[str]:
    prompt = (
        "Review this conversation and extract facts worth remembering "
        "for future conversations. Include: user preferences, key decisions, "
        "project facts, and successful patterns. Return a JSON array of strings. "
        "Do not include ephemeral details or intermediate reasoning.\n\n"
        + format_messages(conversation)
    )
    response = call_model(prompt)
    return parse_json(response.text)

Expiration is the other half of memory management. Facts go stale. A user's preferred programming language may change. A project's database version gets upgraded. Long-term memory needs a mechanism for aging out entries: explicit TTLs, a confidence decay that reduces an entry's retrieval score over time, or a periodic review in which the agent (or a background process) evaluates whether stored facts are still valid.
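
Confidence decay is the easiest of these to show in code. The sketch below assumes exponential decay with a configurable half-life. The 90-day default and the `decayed_score` name are illustrative, not recommendations.

```python
import math
from datetime import datetime, timedelta, timezone


def decayed_score(score: float, stored_at: datetime, now: datetime,
                  half_life_days: float = 90.0) -> float:
    """Halve the retrieval score for every half_life_days since the entry was stored."""
    age_days = max(0.0, (now - stored_at).total_seconds() / 86400)
    return score * math.pow(0.5, age_days / half_life_days)


now = datetime(2026, 1, 1, tzinfo=timezone.utc)
fresh = decayed_score(0.8, stored_at=now, now=now)
aged = decayed_score(0.8, stored_at=now - timedelta(days=90), now=now)
```

Applied before a relevance threshold, this lets old entries fade out of retrieval gradually instead of disappearing at a hard TTL.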

Deduplication matters too. If the agent learns the same fact in ten different conversations, you do not want ten near-identical entries in the memory store competing for retrieval slots. A deduplication step at write time — checking whether a sufficiently similar entry already exists — keeps the store clean.
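
A write-time deduplication check can be sketched with cosine similarity over embeddings. Everything here is an assumption for illustration: the store is a plain list, embeddings are passed in by the caller, and the 0.95 threshold is arbitrary.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def write_if_new(store: list[dict], text: str, embedding: list[float],
                 threshold: float = 0.95) -> bool:
    """Append the entry unless a near-duplicate exists. Returns True if written."""
    for entry in store:
        if cosine(entry["embedding"], embedding) >= threshold:
            return False
    store.append({"text": text, "embedding": embedding})
    return True


store: list[dict] = []
wrote_first = write_if_new(store, "Atlas uses PostgreSQL 15", [1.0, 0.0])
wrote_dupe = write_if_new(store, "Project Atlas runs on PostgreSQL 15", [0.99, 0.01])
wrote_other = write_if_new(store, "Deploys run on Jenkins", [0.0, 1.0])
```

A linear scan is fine for a sketch. With a real vector store, the same check becomes a top-1 similarity search before the insert.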

Context Engineering #

Context engineering is the practice of assembling the best possible input for each model call. It goes beyond memory management — it encompasses how you structure the prompt, what format you use to present information, and how you order the pieces within the window.

The standard approach uses the chat-style message format that model APIs expect: an array of system, user, assistant, and tool messages. This works well for conversational agents, but it is not the only option. For agents doing complex multi-step work, you can get better results by constructing a single structured context that packs everything the model needs in a format optimized for the task.

def build_structured_context(thread: list[dict]) -> str:
    parts = []
    for event in thread:
        tag = event["type"]
        data = event["data"] if isinstance(event["data"], str) else to_yaml(event["data"])
        parts.append(f"<{tag}>\n{data}\n</{tag}>")
    return "\n\n".join(parts)

This XML-style tagging approach has several advantages. It makes boundaries between events explicit — the model can clearly see where a tool result ends and the next action begins. It lets you control information density by formatting events in whatever structure works best (YAML, JSON, plain text). And it gives you freedom to restructure the context between steps — removing resolved errors, collapsing redundant entries, or promoting important facts to the top.

Ordering within the context window affects model attention. Models tend to weight information at the beginning and end of the context more heavily than what sits in the middle (a phenomenon sometimes called the "lost in the middle" effect). This means your most important context — the system prompt, the current step, and the most relevant retrieved facts — should be at the edges. Conversation history and lower-priority context go in the middle.
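
One way to act on this is to sort sections by priority before assembly. The sketch below is a simple illustration with hypothetical names: retrieved facts arrive as `(score, text)` pairs, the single best fact is promoted next to the system prompt, and the current step is always last.

```python
def order_for_attention(system: str, retrieved: list[tuple[float, str]],
                        history: list[str], step: str) -> list[str]:
    """Place high-priority content at the edges and the rest in the middle."""
    ranked = sorted(retrieved, key=lambda pair: pair[0], reverse=True)
    top = [text for _, text in ranked[:1]]   # most relevant fact, near the start
    rest = [text for _, text in ranked[1:]]  # remaining facts, in the middle
    return [system, *top, *rest, *history, step]


ordered = order_for_attention(
    system="You are a sales analyst.",
    retrieved=[(0.55, "ACME renewed in 2025"), (0.91, "ACME fiscal year starts in April")],
    history=["user: What were last quarter's sales?", "assistant: $4.2M."],
    step="Compare to the same quarter last year.",
)
```

How strongly position matters varies by model and context length, so an ordering like this is worth testing against your own workload rather than assuming.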

Information density matters as much as what you include. Consider a tool that returns a 500-line JSON payload. Dumping the raw response into the context window wastes tokens and buries the relevant data. Extracting the key fields and formatting them concisely gives the model the same information in a fraction of the space.

import json


def format_tool_result(tool_name: str, raw_result: dict) -> str:
    if tool_name == "query_sales":
        # Extract only what the model needs
        return (
            f"Sales results: total=${raw_result['total']:,.0f}, "
            f"units={raw_result['units']:,}"
        )
    # Fallback: truncate raw result
    text = json.dumps(raw_result, indent=2)
    if len(text) > 2000:
        return text[:2000] + "\n... (truncated)"
    return text

This is the same instinct behind good API design — give the consumer exactly what they need, nothing more. The model is the consumer of your context window.

Stateless Agents and External State #

One pattern worth highlighting is to treat the agent as a stateless function that reads its entire state from an external store at the start of each step and writes it back when done. The agent itself holds no in-memory state between invocations.

┌────────────┐      ┌──────────────┐     ┌─────────────┐
│  Trigger   │────▶ │  Load state  │────▶│  Run agent  │
│ (message,  │      │  from store  │     │  (stateless │
│  webhook,  │      │              │     │   function) │
│  cron)     │      └──────────────┘     └───────┬─────┘
└────────────┘                                   │
                                                 ▼
                                        ┌──────────────┐
                                        │  Save state  │
                                        │  to store    │
                                        └──────────────┘

This has practical consequences for memory. The conversation history, the agent's execution trace, and any intermediate state live in a database — not in the runtime's memory. Each invocation loads the relevant state, runs one or more model calls, and writes the updated state back.

The benefit is resilience. If the agent process crashes mid-task, you lose only the work done since the last save, because everything else is already in the store. You can restart the agent, load the last saved state, and resume from there. You can also scale horizontally: any instance of the agent can pick up any conversation, because the state is external.

def handle_event(event: dict, thread_id: str):
    # Load full state from external store
    thread = db.load_thread(thread_id)
    thread.events.append(event)

    # Run the agent loop (stateless: all context comes from thread)
    while True:
        # Rebuild the context from the stored events on every iteration.
        context = build_structured_context(thread.events)
        next_step = determine_next_step(context)
        thread.events.append({"type": next_step.intent, "data": next_step})

        if next_step.intent == "done":
            break
        if next_step.intent == "request_human_input":
            db.save_thread(thread)  # persist and pause
            notify_human(next_step)
            return

        result = execute_step(next_step)
        thread.events.append({"type": f"{next_step.intent}_result", "data": result})

    db.save_thread(thread)
    return thread.events[-1]

This pattern also makes it natural to implement pause and resume. When the agent needs human input or must wait for a long-running task, it saves its state and exits the loop. When the response arrives — via a webhook, a message queue, or a manual trigger — the runtime loads the state and continues. The agent does not need to hold a process open waiting for the response.
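
Pause and resume reduce to two small operations on the store. In the sketch below a module-level dict stands in for the database, and `pause`, `resume`, and the `waiting_for` field are hypothetical names used only for illustration.

```python
STORE: dict[str, dict] = {}  # stands in for a database table keyed by thread id


def pause(thread_id: str, events: list[dict], waiting_for: str) -> None:
    """Persist the thread and record what it is waiting on, then let the process exit."""
    STORE[thread_id] = {"events": list(events), "waiting_for": waiting_for}


def resume(thread_id: str, response: str) -> list[dict]:
    """Load the saved thread, append the response, and hand the events back to the loop."""
    thread = STORE[thread_id]
    thread["events"].append({"type": "human_response", "data": response})
    thread["waiting_for"] = None
    return thread["events"]


pause("t1", [{"type": "request_human_input", "data": "Approve the refund?"}], "human_input")
# ... any amount of time passes; no process needs to stay alive ...
events = resume("t1", "yes")
```

Because `resume` needs only the thread id and the response, it can run in a different process or on a different machine from the one that called `pause`.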

Trade-offs #

Memory and context engineering are full of trade-offs that compound with each other.

Summarization vs. fidelity. Summaries save tokens but lose nuance. A summary that says "the user discussed deployment options" drops the specific detail that they rejected Kubernetes for cost reasons. The loss compounds across a long conversation, and you cannot get the nuance back once it is gone.

Retrieval latency vs. relevance. A reranker improves retrieval quality but adds a model call per retrieval step. For agents that retrieve context every turn, this overhead compounds. Simpler retrieval (pure vector search, no reranking) is faster but returns noisier results.

Memory store size vs. retrieval quality. The more you store, the harder it is to find what matters. A clean, curated memory store with hundreds of entries tends to outperform a noisy one with tens of thousands, because fewer weak candidates compete for the top-k slots. Write triggers and deduplication are not luxuries; they are what keep retrieval from degrading over time.

Context density vs. robustness. A tightly packed context window leaves no room for error. If an unexpected tool result is slightly larger than anticipated, something gets truncated. Leaving headroom in your token budgets is more robust but means less information per call.

Custom formats vs. compatibility. Building your own context format can improve information density and model performance, but it locks you into maintaining that format and testing how different models handle it. Standard message-based formats work reliably across providers, even if they are less token-efficient.

The right mix depends on the agent's workload. A coding agent that operates on large files across dozens of steps needs aggressive context management — summarizing aggressively, pruning completed steps, and budgeting tokens carefully. A support agent handling single-turn questions with some personalization needs a lighter touch — recent history plus a few retrieved user preferences.

Conclusion #

Memory is what turns a stateless model into an agent that can carry context across turns and sessions. Context engineering is the practice of assembling the right information, in the right format, at the right time — and it is arguably the highest-leverage skill in agent development.

Key takeaways:

  • The context window is the only interface between your code and the model — if it is not in the window, the model does not know about it
  • Short-term memory is the conversation history; managing it means choosing what to keep, summarize, or drop as the conversation grows
  • Long-term memory requires an external store, explicit write triggers, expiration rules, and deduplication to stay useful over time
  • Retrieval quality degrades as the memory store grows — curate aggressively and set relevance thresholds
  • Context engineering is more than memory management — it includes formatting, ordering, information density, and token budgeting
  • Treating the agent as a stateless function with external state gives you resilience, horizontal scaling, and natural pause/resume
  • Every memory decision is a trade-off between fidelity and efficiency — the right balance depends on the task