Author Topic: Context Windows and Memory (Read 327 times)

Chip · on: May 27, 2026, 09:49:58 PM

https://www.youtube.com/embed/D6eMrUxEuQw

Context Windows and Memory

This is where most confusion about chatbot “memory” comes from.

LLMs don’t have memory in the human sense. They operate on a fixed-size input buffer called a context window.

---

1. What context windows are

A context window is the maximum number of tokens the model can “see” at once.

Everything the model uses to generate an answer, plus the answer itself as it is generated, must fit inside this window:

Code:

System prompt + conversation + user input + tool outputs + generated reply

If something is outside the window, it does not exist to the model.
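A minimal Python sketch of that idea. It assumes a crude whitespace split as a stand-in for a real tokenizer, and `count_tokens`, `build_context` and `max_output_tokens` are hypothetical names chosen for illustration. The model only ever receives the assembled string, so a message left out of it is simply absent.

```python
def count_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: counts whitespace-separated words.
    Real tokenizers split text into subword pieces, so real counts differ."""
    return len(text.split())


def build_context(system: str, conversation: list[str], user_input: str,
                  tool_outputs: list[str], limit: int,
                  max_output_tokens: int = 0) -> str:
    """Join every input the model will see and check it fits the window.

    max_output_tokens reserves room for the reply, which occupies the same window.
    """
    context = "\n".join([system, *conversation, user_input, *tool_outputs])
    needed = count_tokens(context) + max_output_tokens
    if needed > limit:
        # A real system would drop or compress older content here; this sketch just refuses.
        raise ValueError(f"needs {needed} tokens but the window holds {limit}")
    return context
```

Anything not passed into `build_context` never reaches the model, which is the whole point of the sentence above.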

---

2. Token limits

Context windows are measured in tokens, not words.
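To see why tokens and words differ, here is a toy greedy longest-match tokenizer over a tiny hypothetical vocabulary. Real tokenizers (for example BPE-based ones) learn vocabularies of tens of thousands of pieces or more from data, and usually fold spaces into neighbouring pieces; here spaces are their own token for simplicity. The effect is similar: common words are one token, rarer words split into several.

```python
# Hypothetical toy vocabulary; real vocabularies are learned from data.
VOCAB = {"the", "cat", "un", "believ", "able", "token", "s", " "}


def tokenize(text: str) -> list[str]:
    """Greedy longest-match split into vocabulary pieces.
    Characters not covered by the vocabulary become single-character tokens."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest vocabulary piece that matches at position i.
        for end in range(len(text), i, -1):
            piece = text[i:end]
            if piece in VOCAB:
                tokens.append(piece)
                i = end
                break
        else:
            # No vocabulary piece matches: fall back to one character.
            tokens.append(text[i])
            i += 1
    return tokens
```

For example, `tokenize("the unbelievable cat")` gives `["the", " ", "un", "believ", "able", " ", "cat"]`: three words, seven tokens.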

Typical ranges:

Small models: a few thousand tokens
Modern LLMs: tens of thousands
Frontier systems: 100k+ tokens (sometimes more)

But regardless of size:

Code:

There is always a hard cutoff

Once full, older tokens must be removed or compressed.

---

3. Sliding attention windows

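When the conversation exceeds capacity, the application around the model typically applies a sliding window to the conversation. (This is separate from sliding-window attention, an architectural variant in which each token attends only to a fixed number of preceding tokens.)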

This means:

Keep the most recent tokens
Drop or compress older tokens
Shift attention focus forward

So the model effectively “moves” its attention frame along the conversation.

Code:

[OLD CHAT] → dropped
[MID CHAT] → compressed or removed
[NEW CHAT] → fully visible
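The three zones in that diagram can be sketched as a policy over a list of messages. The zone sizes and the `summarize` helper are hypothetical: a real system might ask the model itself to write the summary, whereas this stand-in just keeps the first few words of each message.

```python
def summarize(messages: list[str], words_per_message: int = 3) -> str:
    """Hypothetical compressor: keeps only the first few words of each message."""
    shortened = [" ".join(m.split()[:words_per_message]) for m in messages]
    return "[summary] " + " | ".join(shortened)


def slide_window(history: list[str], keep_recent: int, compress_mid: int) -> list[str]:
    """Drop the oldest messages, compress a middle band, keep the newest verbatim."""
    recent_start = max(0, len(history) - keep_recent)
    mid_start = max(0, recent_start - compress_mid)
    # history[:mid_start] is the OLD zone: it is simply not returned (dropped).
    mid = history[mid_start:recent_start]   # MID zone: compressed
    recent = history[recent_start:]         # NEW zone: fully visible
    return ([summarize(mid)] if mid else []) + recent
```

With five messages, `keep_recent=2` and `compress_mid=2`, the first message vanishes, the next two collapse into one summary line, and the last two pass through unchanged.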

---

4. Conversation truncation

When the window fills up:

Old messages are cut off
Only recent context remains
Important early details can disappear

This is not forgetting in a human sense — it is literal data loss from the input.

If it’s not in the prompt anymore, it cannot be used.
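A minimal sketch of plain truncation, again using word counts as a hypothetical stand-in for tokens: walk backwards from the newest message and stop when the budget runs out. A detail stated early (here, the user's name) silently drops out of the prompt.

```python
def truncate(history: list[str], budget: int) -> list[str]:
    """Keep the newest messages whose combined (word-count) size fits the budget."""
    kept: list[str] = []
    used = 0
    for message in reversed(history):
        cost = len(message.split())  # crude token estimate
        if used + cost > budget:
            break  # everything older than this point is cut off
        kept.append(message)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


history = [
    "user: my name is Ana",
    "assistant: nice to meet you Ana",
    "user: tell me about tokens",
    "assistant: tokens are pieces of text",
    "user: what is my name?",
]
window = truncate(history, budget=15)
```

Here `window` holds only the last two messages, so the model is asked for a name that appears nowhere in its input. Stopping at the first message that does not fit (rather than skipping it and continuing) keeps the surviving context contiguous.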

---

5. Why models "forget"

Models do not have persistent memory during inference: their weights are fixed, and nothing from a conversation is written back into them.

They "forget" because:

They never stored the conversation internally
They only operate on the current input tokens
Old context falls outside the window

So forgetting is not failure — it is architecture.

Code:

No context = no access = no memory
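The same point from the caller's side: a chat model behaves like a function of its input and nothing else. In this sketch `toy_model` is a hypothetical stand-in for a real model that can only "remember" a name if it appears in the text it is given, and `ChatSession` is a hypothetical client wrapper. Continuity comes entirely from the client re-sending the history on every turn.

```python
import re


def toy_model(context: str) -> str:
    """Hypothetical stateless 'model': its answer depends only on the context passed in."""
    match = re.search(r"my name is (\w+)", context)
    return f"Your name is {match.group(1)}." if match else "I don't know your name."


class ChatSession:
    """Client-side wrapper: the history lives here, not inside the model."""

    def __init__(self) -> None:
        self.history: list[str] = []

    def send(self, message: str) -> str:
        self.history.append(f"user: {message}")
        # The full history is re-sent on every call; the model keeps nothing between calls.
        reply = toy_model("\n".join(self.history))
        self.history.append(f"assistant: {reply}")
        return reply
```

Inside a session, asking "what is my name?" after "my name is Ana" works; calling `toy_model("user: what is my name?")` on its own does not, because the earlier turn is not in its input.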

---

6. Persistent memory systems

Some systems add external memory layers on top of the model.

These may include:

Databases of user facts
Vector embeddings of past conversations
Summarised long-term memory stores

At runtime:

Code:

User query → memory retrieval → inject into context window

So memory is not inside the model — it is external and re-injected.
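A minimal sketch of that pipeline, assuming a simple word-overlap score as a stand-in for vector embeddings (`MemoryStore`, `retrieve` and `build_prompt` are hypothetical names). The model is untouched; the only thing that changes is the text placed in front of it.

```python
import re


def words(text: str) -> set[str]:
    """Lowercased word set; a crude stand-in for an embedding."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


class MemoryStore:
    """External memory: facts live here, outside the model."""

    def __init__(self) -> None:
        self.facts: list[str] = []

    def add(self, fact: str) -> None:
        self.facts.append(fact)

    def retrieve(self, query: str, k: int = 2) -> list[str]:
        """Return up to k facts sharing the most words with the query."""
        q = words(query)
        scored = [(len(q & words(f)), f) for f in self.facts]
        scored = [item for item in scored if item[0] > 0]
        scored.sort(key=lambda item: item[0], reverse=True)
        return [fact for _, fact in scored[:k]]


def build_prompt(store: MemoryStore, query: str) -> str:
    """User query -> memory retrieval -> inject into the context window."""
    memory_block = "\n".join(f"- {m}" for m in store.retrieve(query))
    return f"Known facts about the user:\n{memory_block}\n\nUser: {query}"
```

Only facts that score above zero are injected, so unrelated memories do not consume window space.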

---

7. RAG vs memory

These are often confused but are different systems.

RAG (Retrieval-Augmented Generation):

Fetches external documents (web, database, files)
Often uses embeddings to find relevant info
Injects retrieved content into context window

Purpose:

Code:

Ground answers in external knowledge
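A toy RAG step, assuming bag-of-words count vectors and cosine similarity in place of learned embeddings (the function names and sample documents are hypothetical). Real systems use an embedding model and usually a vector index, but the flow is the same: embed the query, rank documents by similarity, and paste the best ones into the prompt.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Bag-of-words count vector; a crude stand-in for a learned embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def rag_prompt(query: str, documents: list[str], k: int = 1) -> str:
    """Retrieve the k most similar documents and inject them ahead of the question."""
    q = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    sources = "\n".join(ranked[:k])
    return f"Answer using only these sources:\n{sources}\n\nQuestion: {query}"
```

Unlike the memory sketch, nothing here is about the user: the documents are a shared knowledge source, and the query only selects which of them enter the window.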

---

Memory systems:

Store user-specific or session-specific information
Designed for continuity across sessions
Often summarised or selectively saved

Purpose:

Code:

Remember user-specific context over time
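Selective saving can be sketched as a filter that decides which statements are worth keeping, plus storage that outlives the session. The cue phrases, the per-user JSON file layout and the function names are all hypothetical; a real system might instead ask the model to decide what to remember, or store a summary.

```python
import json
from pathlib import Path

# Hypothetical cues for statements worth remembering across sessions.
CUES = ("my name is", "i prefer", "i am allergic to", "call me")


def extract_memories(messages: list[str]) -> list[str]:
    """Keep only statements that look like durable personal facts."""
    return [m for m in messages if any(cue in m.lower() for cue in CUES)]


def save_memories(path: Path, user_id: str, new_facts: list[str]) -> None:
    """Merge new facts into a per-user JSON file that persists between sessions."""
    data = json.loads(path.read_text()) if path.exists() else {}
    facts = data.setdefault(user_id, [])
    for fact in new_facts:
        if fact not in facts:  # avoid storing duplicates
            facts.append(fact)
    path.write_text(json.dumps(data, indent=2))


def load_memories(path: Path, user_id: str) -> list[str]:
    """Read back what was saved, e.g. at the start of the next session."""
    if not path.exists():
        return []
    return json.loads(path.read_text()).get(user_id, [])
```

Small talk such as "hi there" is never written to disk; only the filtered facts survive to be re-injected later.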

---

Key difference

Code:

RAG = knowledge retrieval (external knowledge source)
Memory = user/context retention (personal continuity)

---

8. Key Insight

A chatbot does not have a mind that remembers.

It has:

Code:

A fixed-size working window + optional external retrieval systems

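Everything it “knows” about you and the current conversation during a response must either:

Be inside the context window
Be retrieved via RAG
Be injected as memory

Otherwise, it does not exist for that moment of computation. (General knowledge learned during training is stored in the weights, but that is not a record of any conversation.)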
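Putting the pieces together, each response is computed from one assembled context. This sketch (all names hypothetical, word counts standing in for tokens) fills a fixed budget in priority order: system prompt, injected memories, retrieved documents and the new message first, then as much recent history as still fits. Whatever does not make it into the returned string does not exist for that call.

```python
def assemble(system: str, memories: list[str], retrieved: list[str],
             history: list[str], user_input: str, budget: int) -> str:
    """Fill a fixed word budget in priority order; oldest history is dropped first."""
    def cost(text: str) -> int:
        return len(text.split())  # crude stand-in for a token count

    fixed = [system, *memories, *retrieved, user_input]
    remaining = budget - sum(cost(p) for p in fixed)
    if remaining < 0:
        raise ValueError("fixed parts alone exceed the window")

    recent: list[str] = []
    for message in reversed(history):  # newest first
        if cost(message) > remaining:
            break
        recent.insert(0, message)  # keep chronological order
        remaining -= cost(message)

    return "\n".join([system, *memories, *retrieved, *recent, user_input])
```

Giving memories and retrieved documents priority over old chat turns is one possible design choice; a system could just as well reserve a fixed share of the window for each source.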

dopetalk

dopetalk does not endorse any advertised product nor does it accept any liability for its use or misuse

Need help or a chat?

If you need any help or a chat then IM/PM or email me, Chip
