Context Windows and Memory
This is where most confusion about chatbot “memory” comes from.
LLMs don’t have human memory. They operate on a fixed-size input buffer called a context window.
---
1. What context windows are
A context window is the maximum number of tokens the model can “see” at once.
Everything the model uses to generate an answer must fit inside this window:
System prompt + conversation + user input + tool outputs
If something is outside the window, it does not exist to the model.
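
The fit check above can be sketched in a few lines. This is a minimal illustration, not how any particular system does it: `count_tokens` is a hypothetical stand-in that counts whitespace-separated words, whereas real tokenizers split text differently, and the window size of 32 is an arbitrary assumption.

```python
def count_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def fits_in_window(parts: dict[str, str], window_size: int) -> tuple[bool, int]:
    """Return whether all prompt parts fit together, plus the total token count."""
    total = sum(count_tokens(text) for text in parts.values())
    return total <= window_size, total


# Everything the model will see, assembled from the four sources named above.
parts = {
    "system_prompt": "You are a helpful assistant.",
    "conversation": "User: Hi. Assistant: Hello, how can I help?",
    "user_input": "What is a context window?",
    "tool_outputs": "search_result: A context window is a token limit.",
}

ok, total = fits_in_window(parts, window_size=32)
print(ok, total)  # True 26
```

With a smaller `window_size`, the same call reports that the parts no longer fit, which is the point at which something has to be dropped.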
---
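2. Token limits
Context windows are measured in tokens, not words. A token is a chunk of text, often a word fragment, so token counts usually run higher than word counts.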
Typical ranges:
- Small models: a few thousand tokens
- Modern LLMs: tens of thousands
- Frontier systems: 100k tokens or more
But regardless of size:
There is always a hard cutoff
Once full, older tokens must be removed or compressed.
---
3. Sliding attention windows
When the conversation exceeds capacity, systems use strategies like sliding windows.
This means:
- Keep the most recent tokens
- Drop or compress older tokens
- Shift the visible span forward as new messages arrive
So the system effectively “moves” the model’s frame of view along the conversation.
[OLD CHAT] → dropped
[MID CHAT] → compressed or removed
[NEW CHAT] → fully visible
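
The simplest form of this, keeping only the most recent tokens, might look like the sketch below. The token list is a made-up example and `sliding_window` is a hypothetical helper. Real systems may compress older content rather than drop it outright.

```python
def sliding_window(tokens: list[str], window_size: int) -> list[str]:
    """Keep only the most recent `window_size` tokens; older ones are dropped."""
    if window_size <= 0:
        return []
    return tokens[-window_size:]


history = "old1 old2 mid1 mid2 new1 new2".split()
visible = sliding_window(history, window_size=4)
print(visible)  # ['mid1', 'mid2', 'new1', 'new2']
```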
---
4. Conversation truncation
When the window fills up:
- Old messages are cut off
- Only recent context remains
- Important early details can disappear
This is not forgetting in a human sense — it is literal data loss from the input.
If it’s not in the prompt anymore, it cannot be used.
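
To make the data loss concrete, here is a small sketch of message-level truncation. It assumes one common policy (always keep the system prompt, then keep as many of the newest messages as the budget allows). The word-count tokenizer, the budget of 18, and the messages are all illustrative assumptions.

```python
def count_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def truncate(system_prompt: str, messages: list[str], budget: int) -> list[str]:
    """Keep the system prompt, then as many of the newest messages as fit."""
    remaining = budget - count_tokens(system_prompt)
    kept: list[str] = []
    for message in reversed(messages):  # walk from newest to oldest
        cost = count_tokens(message)
        if cost > remaining:
            break  # this message and everything older is cut off
        kept.append(message)
        remaining -= cost
    return [system_prompt] + kept[::-1]  # restore chronological order


messages = [
    "User: My name is Ada.",
    "Assistant: Nice to meet you, Ada.",
    "User: Tell me about tokens.",
    "Assistant: Tokens are pieces of text.",
    "User: What is my name?",
]

prompt = truncate("Be concise.", messages, budget=18)
for line in prompt:
    print(line)
```

Under these assumptions the two oldest messages no longer fit, so the final question arrives at the model with no mention of the name anywhere in its input.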
---
5. Why models "forget"
Models do not have persistent memory during inference.
They "forget" because:
- They never stored the conversation internally
- They only operate on the current input tokens
- Old context falls outside the window
So forgetting is not failure — it is architecture.
No context = no access = no memory
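
A toy illustration of this statelessness follows. `fake_model` is a hypothetical stand-in, not a real model or API: it is a pure function of its input, so anything the caller does not resend is simply unavailable.

```python
def fake_model(prompt: list[str]) -> str:
    """Hypothetical stateless model: it can only use what is in `prompt`."""
    prefix = "User: My name is "
    for message in prompt:
        if message.startswith(prefix):
            name = message.removeprefix(prefix).rstrip(".")
            return f"Your name is {name}."
    return "I don't know your name."


# Call 1: the fact is present in the input.
first = fake_model(["User: My name is Ada.", "User: What is my name?"])
print(first)  # Your name is Ada.

# Call 2: the earlier message was not resent, so nothing carries over.
second = fake_model(["User: What is my name?"])
print(second)  # I don't know your name.
```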
---
6. Persistent memory systems
Some systems add external memory layers on top of the model.
These may include:
- Databases of user facts
- Vector embeddings of past conversations
- Summarised long-term memory stores
At runtime:
User query → memory retrieval → inject into context window
So memory is not inside the model — it is external and re-injected.
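
The retrieve-then-inject flow can be sketched as below. `MemoryStore` and `build_prompt` are hypothetical names, and matching on shared words is a deliberately naive assumption. Production systems more often use embeddings or summaries, as listed above.

```python
import re


def words(text: str) -> set[str]:
    """Lowercased alphabetic words, with punctuation stripped."""
    return set(re.findall(r"[a-z]+", text.lower()))


class MemoryStore:
    """Hypothetical external store of user facts, keyed by user id."""

    def __init__(self) -> None:
        self._facts: dict[str, list[str]] = {}

    def save(self, user_id: str, fact: str) -> None:
        self._facts.setdefault(user_id, []).append(fact)

    def retrieve(self, user_id: str, query: str) -> list[str]:
        """Return stored facts that share at least one word with the query."""
        query_words = words(query)
        return [f for f in self._facts.get(user_id, []) if query_words & words(f)]


def build_prompt(store: MemoryStore, user_id: str, query: str) -> str:
    """Inject retrieved memories ahead of the user's query."""
    lines = [f"Known fact: {m}" for m in store.retrieve(user_id, query)]
    lines.append(f"User: {query}")
    return "\n".join(lines)


store = MemoryStore()
store.save("u1", "The user prefers Python")
store.save("u1", "The user lives in Oslo")

print(build_prompt(store, "u1", "Which language should I learn, Python or Go?"))
# Known fact: The user prefers Python
# User: Which language should I learn, Python or Go?
```

The model itself is unchanged here. Only the text placed in its context window differs from one request to the next.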
---
7. RAG vs memory
These are often confused but are different systems.
RAG (Retrieval-Augmented Generation):
- Fetches external documents (web, database, files)
- Typically uses embeddings to find relevant info
- Injects retrieved content into context window
Purpose:
Ground answers in external knowledge
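
A minimal sketch of the retrieval step follows. The `embed` function here is a toy bag-of-words count vector, used only as an assumption so the example runs without external libraries. Real RAG pipelines generally use learned embedding models and a vector index. The documents and query are made up.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words count vector, not a learned embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, documents: list[str], k: int = 1) -> list[str]:
    """Return the k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]


documents = [
    "A context window is the maximum number of tokens a model can see.",
    "Bananas are rich in potassium.",
    "Vector embeddings map text to points in a numeric space.",
]
query = "How many tokens fit in a context window?"

top = retrieve(query, documents, k=1)
# Inject the retrieved text into the prompt ahead of the question.
prompt = "Context:\n" + "\n".join(top) + "\n\nQuestion: " + query
print(prompt)
```

The retrieved passage still has to fit in the context window alongside everything else, so `k` is effectively bounded by the token budget.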
---
Memory systems:
- Store user-specific or session-specific information
- Designed for continuity across sessions
- Often summarised or selectively saved
Purpose:
Remember user-specific context over time
---
Key difference
RAG = knowledge retrieval (external knowledge source)
Memory = user/context retention (personal continuity)
---
8. Key Insight
A chatbot does not have a mind that remembers.
It has:
A fixed-size working window + optional external retrieval systems
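Everything it “knows” about the conversation during a response must reach the context window in one of three ways:
- Be part of the current prompt and conversation
- Be retrieved via RAG and injected
- Be retrieved from memory and injected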
Otherwise, it does not exist for that moment of computation.