Building a session history manager for LLM agents
How to build a production-grade session history manager for LLM agents
Where This Comes From
I’ve been building an agentic solution that involves multiple sub-agents orchestrated together - think a main planning agent that delegates to specialised sub-agents for RAG retrieval, tool execution, and code generation. Each sub-agent adds its own system prompt, tool schemas, retrieved documents, and generated code blocks to the conversation. A single user request can fan out into dozens of internal messages before a final answer surfaces.
It worked beautifully in demos. Then real users started having actual conversations.
Within 15–20 turns, the context window was full. Users started seeing this in production:
context limit exceeded
The cause: every tool call, every chunk of RAG context, every block of generated code, and every sub-agent handoff was silently piling up in the session history.
I took a step back and built a dedicated context management engine to solve this for good. This post captures the key learnings, common pitfalls, and practical patterns I discovered.
The Problem (and Why “400K Context” Is Misleading)
Let’s take the model we’re actually using: GPT-5.3-codex. It advertises a 400K total context length. That sounds massive, but here’s the breakdown:
| Spec | Tokens |
|---|---|
| Input context limit | 272,000 |
| Output (max completion) | 128,000 |
| Total (input + output) | 400,000 |
That 400K headline number is not what you get to play with. The output budget is reserved for the model’s response. Your session history, system prompt, tool schemas, RAG context - all of it must fit inside the 272K input limit. That’s the number that matters for session management.
And 272K still sounds like a lot - until you’re running a multi-agent pipeline. Here’s what a typical turn looks like under the hood:
| Message | Role | Approx. tokens |
|---|---|---|
| User asks a question | user | ~50 |
| Planning agent reasons about which sub-agent to call | assistant | ~200 |
| RAG sub-agent retrieves 3 document chunks | tool | ~2,000 |
| Code-gen sub-agent produces a solution | assistant | ~1,500 |
| Tool call to execute/validate the code | tool | ~500 |
| Final synthesised answer | assistant | ~300 |
That’s ~4,500 tokens for a single turn. Multiply by 40-50 turns in an extended working session and you’ve blown past 200K tokens without the user typing more than a few sentences. Context growth tracks the internal traffic, not what the user types: sub-agent messages, RAG payloads, and generated code make up almost all of every turn.
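To make the arithmetic concrete, here is a small sketch that sums the approximate per-message costs from the table and works out how many such turns fit. The figures are the rough estimates quoted above, not measurements, and turns with larger RAG payloads or longer generated code would reach the limit sooner.

```python
# Approximate per-message token costs, taken from the table above.
TURN_COSTS = {
    "user question": 50,
    "planning": 200,
    "rag chunks": 2_000,
    "generated code": 1_500,
    "tool execution": 500,
    "final answer": 300,
}
INPUT_LIMIT = 272_000  # input context limit quoted earlier

tokens_per_turn = sum(TURN_COSTS.values())

# Ceiling division: the first turn count at which the total reaches the target.
turns_to_200k = -(-200_000 // tokens_per_turn)
turns_to_limit = -(-INPUT_LIMIT // tokens_per_turn)

print(tokens_per_turn, turns_to_200k, turns_to_limit)
```

Note that this ignores the system prompt, tool schemas, and output reservation, all of which shrink the usable window further.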
Pitfall #1: Ignoring the Hidden Token Consumers
Most developers look at the spec sheet - “400K context!” - and assume they have that much room for chat. You don’t. The input limit is 272K, and even that is eroded by things you never see in the chat:
| Hidden consumer | Typical cost |
|---|---|
| System prompt | 1,000 - 1,500 tokens |
| Tool/function schemas (multiple agents) | 1,000 - 5,000 tokens |
| Reserved output tokens | up to 128,000 tokens |
With GPT-5.3-codex, if you reserve even a modest 16K for output and burn 3K on tool schemas across your sub-agents, your real budget is ~252K, not 272K.
# The ACTUAL budget formula
input_context_limit = 272_000
system_tokens = count_tokens(system_prompt)
reserved = max_output_tokens + tool_schema_overhead
remaining = input_context_limit - system_tokens - reserved
budget = int(remaining * safety_fraction) # 0.80 recommended
Lesson: Always compute your budget against the input context limit, not the total. And never assume the full input window is yours either - subtract system prompts, tool schemas, and output reservations first.
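The formula can be wrapped in a small helper. This is a minimal sketch: the function name `compute_budget` and the clamp to zero are assumptions for illustration, and the example numbers reuse the 16K output reservation and 3K tool-schema overhead mentioned above together with a 1,500-token system prompt.

```python
def compute_budget(
    input_context_limit: int,
    system_tokens: int,
    max_output_tokens: int,
    tool_schema_overhead: int,
    safety_fraction: float = 0.80,
) -> int:
    """Return the token budget left for session history (0 if nothing is left)."""
    reserved = max_output_tokens + tool_schema_overhead
    remaining = input_context_limit - system_tokens - reserved
    # Clamp so callers can test `budget <= 0` without handling negatives.
    return max(0, int(remaining * safety_fraction))


budget = compute_budget(
    input_context_limit=272_000,
    system_tokens=1_500,
    max_output_tokens=16_000,
    tool_schema_overhead=3_000,
)
print(budget)
```

With these inputs, 251,500 tokens remain after the reservations, and the 0.80 fraction leaves roughly 201K for history.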
Pitfall #2: Trimming at the Wrong Boundary
The naive approach is to just drop the oldest messages until you’re under budget. But chat histories aren’t a flat list of independent messages - they contain logical groups that must stay together:
user: "What's the weather in Paris?"
assistant: [tool_call: get_weather("Paris")] ← function call
tool: "22°C and sunny" ← function result
assistant: "It's 22°C and sunny in Paris!" ← final reply
If you trim between the tool call and the tool result, the model sees an orphaned function call with no response - and it hallucinates or errors out.
The fix: align your trim boundary to the next role == "user" message. This guarantees you never split a tool-call sequence:
# Walk forward to the next "user" message
aligned = None
for j in range(keep_from, len(items)):
    if items[j].get("role") == "user":
        aligned = j
        break
if aligned is None:
    # No safe boundary exists - nuke the session
    await session.clear_session()
    return
keep_from = aligned
Lesson: Never trim mid-turn. Always snap to a role boundary.
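The snippet above starts from an existing `keep_from`. One way to obtain it is to walk backwards from the newest message until the budget is spent, then snap forward. The sketch below assumes items are plain dicts with a `role` key, and `find_keep_index` and the per-item `count_tokens` callable are hypothetical names, not part of any SDK.

```python
from typing import Callable, Optional


def find_keep_index(
    items: list[dict],
    budget: int,
    count_tokens: Callable[[dict], int],
) -> Optional[int]:
    """Return the index of the first item to keep, or None if no safe boundary exists."""
    total = 0
    keep_from = len(items)
    # Walk backwards from the newest item while the running total fits the budget.
    for i in range(len(items) - 1, -1, -1):
        cost = count_tokens(items[i])
        if total + cost > budget:
            break
        total += cost
        keep_from = i

    # Snap forward to the next "user" item so no tool-call sequence is split.
    for j in range(keep_from, len(items)):
        if items[j].get("role") == "user":
            return j
    return None


history = [
    {"role": "user", "content": "What's the weather in Paris?"},
    {"role": "assistant", "content": "[tool_call: get_weather]"},
    {"role": "tool", "content": "22°C and sunny"},
    {"role": "assistant", "content": "It's 22°C and sunny in Paris!"},
    {"role": "user", "content": "And in Rome?"},
    {"role": "assistant", "content": "Let me check."},
]
flat_cost = lambda item: 10  # pretend every item costs 10 tokens

idx_partial = find_keep_index(history, budget=45, count_tokens=flat_cost)
idx_all = find_keep_index(history, budget=1_000, count_tokens=flat_cost)
idx_none = find_keep_index(history, budget=15, count_tokens=flat_cost)
print(idx_partial, idx_all, idx_none)
```

With a budget of 45, the raw cut would land on the orphaned tool result at index 2, so the boundary moves forward to the user message at index 4. With a budget of 15, only the last assistant message fits and no safe boundary exists.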
Pitfall #3: Losing Context Silently
Trimming old messages solves the token problem - but destroys context. If the user mentioned their name, a project requirement, or a design decision 40 messages ago, that information is just gone.
The fix is to summarize before you trim:
# Use .get() so items without a role or content (e.g. tool-call records)
# don't raise KeyError or render as "None".
conversation_text = "\n".join(
    f"{item.get('role', 'unknown')}: {item.get('content') or ''}"
    for item in trimmed_items
)
response = await client.chat.completions.create(
    model=summary_model,
    messages=[{
        "role": "user",
        "content": (
            "Summarise this conversation excerpt in 2-4 sentences, "
            "preserving all key facts, decisions, and information "
            "the user may refer back to.\n\n"
            f"{conversation_text}"
        ),
    }],
    max_tokens=300,
    temperature=0.3,
)
summary = response.choices[0].message.content.strip()
Then inject the summary back into the session as the very first message:
await session.clear_session()
new_items = [{
    "role": "user",
    "content": f"[Previous conversation summary]: {summary}",
}]
new_items.extend(kept_items)
await session.add_items(new_items)
Your system prompt should explicitly tell the model to trust these summaries:
When you see a '[Previous conversation summary]' message,
treat it as a reliable recap of the earlier conversation.
Use it to maintain continuity.
Lesson: Trim the tokens, but compress the knowledge. A 3-sentence summary is worth 10,000 trimmed tokens.
Pitfall #4: Letting Summarization Failures Kill the Session
Summarization uses an LLM call, and LLM calls can fail - network issues, rate limits, content filters. If your summarization crashes and you haven’t trimmed, the next agent call will hit the context limit anyway.
Decide what to trim first, attempt the summary second, and rewrite the session either way. Summarization is best-effort:
summary_text = None
try:
    summary_text = await summarize(trimmed_items)
except Exception:
    logger.exception("Summarization failed - trimming without summary")
# Proceed with or without summary
await rewrite_session(summary_text, kept_items)
Lesson: Graceful degradation > fragile correctness.
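A slow summarization call can stall the turn just as badly as a failed one, so it may be worth bounding it with a timeout as well. The sketch below is one way to do that with `asyncio.wait_for`. The timeout is an addition not covered above, and `summarize_best_effort`, `timeout_s`, and the stub summarizers are hypothetical names used only for illustration.

```python
import asyncio
import logging
from typing import Awaitable, Callable, Optional

logger = logging.getLogger(__name__)


async def summarize_best_effort(
    summarize: Callable[[list[dict]], Awaitable[str]],
    trimmed_items: list[dict],
    timeout_s: float = 10.0,
) -> Optional[str]:
    """Return a summary, or None if the call fails or exceeds timeout_s."""
    try:
        return await asyncio.wait_for(summarize(trimmed_items), timeout=timeout_s)
    except asyncio.TimeoutError:
        logger.warning("Summarization timed out after %.1fs", timeout_s)
    except Exception:
        logger.exception("Summarization failed - trimming without summary")
    return None


async def _demo() -> tuple:
    async def ok(items):
        return "summary"

    async def boom(items):
        raise RuntimeError("rate limited")

    async def slow(items):
        await asyncio.sleep(1)
        return "late"

    return (
        await summarize_best_effort(ok, []),
        await summarize_best_effort(boom, []),
        await summarize_best_effort(slow, [], timeout_s=0.01),
    )


results = asyncio.run(_demo())
print(results)
```

In all three cases the caller gets either a string or `None` and can go on to rewrite the session.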
Good Practice: Make the Budget Visible
Once I added a React-based session visualizer connected to the backend via WebSocket, debugging became 10x easier. I could literally watch the token budget shrink in real time:
# In your context manager, emit events via a callback:
if event_callback:
    await event_callback({
        "type": "budget_info",
        "context_window": context_window,
        "history_tokens": history_tokens,
        "budget": budget,
    })
I’d strongly recommend building a lightweight dashboard when developing agents. Watching the numbers change as you chat makes budget problems obvious before they become runtime errors.
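On the receiving side, the callback can be as simple as a function that pushes events onto a queue for a WebSocket handler to drain. This is a sketch under assumptions: `make_event_callback` and the derived `used_pct` field are hypothetical, and the WebSocket transport itself is left out.

```python
import asyncio


def make_event_callback(queue: asyncio.Queue):
    """Return an async callback that enriches budget events and queues them."""

    async def event_callback(event: dict) -> None:
        if event.get("type") == "budget_info" and event.get("budget", 0) > 0:
            # Share of the budget currently consumed by history, in percent.
            used = 100 * event["history_tokens"] / event["budget"]
            event = {**event, "used_pct": round(used, 1)}
        await queue.put(event)

    return event_callback


async def _demo() -> dict:
    queue: asyncio.Queue = asyncio.Queue()
    callback = make_event_callback(queue)
    await callback({
        "type": "budget_info",
        "context_window": 272_000,
        "history_tokens": 50_300,
        "budget": 201_200,
    })
    # A WebSocket handler would loop on queue.get() and forward each event.
    return await queue.get()


event = asyncio.run(_demo())
print(event["used_pct"])
```

Keeping the context manager unaware of the transport means the same callback hook can feed a dashboard, a log file, or a test.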
Good Practice: Apply a Safety Fraction
Even with precise token counting, there’s always a margin of error - tiktoken counts don’t perfectly match the model’s internal tokenizer, and API overhead adds a few tokens per message.
A simple 0.80× safety fraction on the remaining budget gives you a comfortable 20% buffer:
budget = int(remaining * 0.80)
This single line has prevented more crashes than any other piece of code in the project.
The Complete Algorithm
Here’s the full strategy distilled into 9 steps:
- Count tokens for system prompt + full session history
- Reserve output tokens + tool-schema overhead
- Apply safety fraction (0.80) to the remaining budget
- Keep the newest messages that fit within budget
- Align trim boundary forward to the next role == "user" message
- Summarize the trimmed messages using an LLM (best-effort)
- Inject the summary as the first message in the rewritten session
- Clear session entirely if budget ≤ 0 or no safe boundary exists
- Continue trimming even if summarization fails
Run this function before every agent turn, and your agent can keep going through very long conversations without hitting the context ceiling. The trade-off is that older detail survives only as a summary.
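The nine steps above can be sketched end to end in one function. This is an illustration under stated assumptions, not the POC code: `InMemorySession` is a hypothetical stand-in that only mirrors the method names used in this post, `count_tokens` is a crude four-characters-per-token estimate that should be replaced with a real tokenizer, and `manage_context` is a made-up name.

```python
import asyncio


class InMemorySession:
    """Hypothetical stand-in exposing the session methods used in this post."""

    def __init__(self, items=None):
        self.items = list(items or [])

    async def get_items(self):
        return list(self.items)

    async def add_items(self, items):
        self.items.extend(items)

    async def clear_session(self):
        self.items.clear()


def count_tokens(text: str) -> int:
    # Crude assumption: ~4 characters per token. Swap in a real tokenizer.
    return max(1, len(text) // 4)


async def manage_context(session, system_prompt, summarize, *,
                         input_context_limit=272_000, max_output_tokens=16_000,
                         tool_schema_overhead=3_000, safety_fraction=0.80) -> str:
    items = await session.get_items()
    costs = [count_tokens(str(it.get("content") or "")) for it in items]

    # Steps 1-3: count, reserve, apply the safety fraction.
    reserved = max_output_tokens + tool_schema_overhead
    remaining = input_context_limit - count_tokens(system_prompt) - reserved
    budget = int(remaining * safety_fraction)
    if budget <= 0:  # Step 8: nothing can fit.
        await session.clear_session()
        return "cleared"
    if sum(costs) <= budget:
        return "unchanged"

    # Step 4: keep the newest items that fit within the budget.
    total, keep_from = 0, len(items)
    for i in range(len(items) - 1, -1, -1):
        if total + costs[i] > budget:
            break
        total += costs[i]
        keep_from = i

    # Step 5: align forward to the next "user" item.
    aligned = next((j for j in range(keep_from, len(items))
                    if items[j].get("role") == "user"), None)
    if aligned is None:  # Step 8: no safe boundary.
        await session.clear_session()
        return "cleared"
    trimmed, kept = items[:aligned], items[aligned:]

    # Steps 6 and 9: summarize best-effort, keep going on failure.
    summary = None
    try:
        summary = await summarize(trimmed)
    except Exception:
        pass  # real code would log the failure here

    # Step 7: rewrite the session with the summary first.
    new_items = []
    if summary:
        new_items.append({"role": "user",
                          "content": f"[Previous conversation summary]: {summary}"})
    new_items.extend(kept)
    await session.clear_session()
    await session.add_items(new_items)
    return "trimmed"


async def _stub_summarize(items):
    return f"{len(items)} earlier messages"


# Six 400-character messages (~100 tokens each) against a tiny 500-token window.
demo_session = InMemorySession(
    {"role": role, "content": "x" * 400}
    for role in ["user", "assistant"] * 3
)
outcome = asyncio.run(manage_context(
    demo_session, "You are helpful.", _stub_summarize,
    input_context_limit=500, max_output_tokens=100, tool_schema_overhead=0,
))
print(outcome, len(demo_session.items))
```

In the demo the budget works out to 316 tokens, so three messages fit, the boundary snaps forward to the last user message, and the session ends up as one summary plus the final user/assistant pair. The sketch does not count the summary's own tokens against the budget, which a production version would want to do.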
TL;DR
| Pitfall | Fix |
|---|---|
| Assuming the full context window is available | Subtract system prompt, tool schemas, and output reservation |
| Trimming mid-turn (splitting tool calls) | Snap trim boundary to role == "user" |
| Losing important context when trimming | Summarize trimmed messages before discarding |
| Summarization failure crashing the session | Treat summarization as best-effort; always trim |
| Debugging budget issues blindly | Build a real-time visualizer dashboard |
If you’re building with the OpenAI Agents SDK (or any LLM framework), these patterns will save you from the most common - and most frustrating - production failures.
The full POC code (Python backend + React visualizer) is available on my GitHub.