Attention Mechanisms and Self-Attention
Attention is the core mechanism that makes transformers work.
It replaces older sequential processing with a system where every token can directly interact with every other token.
---
1. The Core Idea of Attention
Instead of reading text one token at a time, the model does the following:
- Look at all tokens at once
- Decide which tokens matter for each other token
- Weigh those relationships dynamically
So each token asks:
"What other tokens in this sequence are relevant to me?"
---
2. Query / Key / Value Vectors
Each token is transformed into three vectors:
- Query (Q) → what this token is looking for
- Key (K) → what this token offers
- Value (V) → the information content of this token
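As a minimal sketch of that transformation, assume each token starts as an embedding vector and that Q, K and V are produced by multiplying it with three separate weight matrices. The embedding and the matrices `W_q`, `W_k`, `W_v` below are made-up toy values; in a trained model they would be learned.

```python
def matvec(matrix, vector):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

# Hypothetical 2-dimensional embedding for a single token.
embedding = [1.0, 2.0]

# Hypothetical projection matrices (illustrative values, not learned ones).
W_q = [[1.0, 0.0], [0.0, 1.0]]
W_k = [[0.5, 0.0], [0.0, 0.5]]
W_v = [[0.0, 1.0], [1.0, 0.0]]

query = matvec(W_q, embedding)  # what this token is looking for
key = matvec(W_k, embedding)    # what this token offers
value = matvec(W_v, embedding)  # the information content of this token
```

The same three matrices are applied to every token, so each token ends up with its own query, key and value.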
---
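How interaction works
For each pair of tokens:
Attention score = Query · Key (in practice usually scaled by 1/√d_k, where d_k is the key dimension)
Then, once the scores are normalized into weights:
Output representation = weighted sum of Values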
So:
- Query asks a question
- Key is matched against the query to determine relevance
- Value supplies information
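The score-then-weighted-sum recipe above can be sketched for a single query token as follows. This is an illustrative version of the commonly used scaled dot-product form, where scores are divided by the square root of the vector dimension and normalized with softmax; the function names and toy vectors are assumptions made for the example.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Return (weights, output) for one query against all keys and values."""
    scale = math.sqrt(len(query))
    # Attention score = Query . Key, scaled by the square root of the dimension.
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    # Output = weighted sum of the value vectors, one component at a time.
    output = [
        sum(w * value[i] for w, value in zip(weights, values))
        for i in range(len(values[0]))
    ]
    return weights, output

# Hypothetical toy vectors: the query matches the first key more closely.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]
weights, output = attend(query, keys, values)
```

Because the query lines up with the first key, the first value vector contributes more to the output than the second.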
---
3. Attention Weighting
The model calculates how important each token is to every other token.
Example sentence:
"The cat sat on the mat because it was tired"
For the token "it":
- High weight → "cat"
- Low weight → "mat"
- Low weight → "sat"
These weights are normalized (usually via softmax):
For each token, its attention weights over all tokens sum to 1.0
So each token distributes its focus across others.
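To make the normalization concrete, here is a small sketch for the token "it" from the example sentence. The raw scores are invented for illustration and do not come from any real model; only the three tokens listed above are included.

```python
import math

# Hypothetical raw scores for the query token "it" (made up for illustration).
raw_scores = {"cat": 3.0, "mat": 0.5, "sat": 0.2}

# Softmax: exponentiate each score, then divide by the total.
max_score = max(raw_scores.values())  # subtracted for numerical stability
exps = {tok: math.exp(s - max_score) for tok, s in raw_scores.items()}
total = sum(exps.values())
weights = {tok: e / total for tok, e in exps.items()}

for tok, w in weights.items():
    print(f"{tok}: {w:.3f}")
```

Whatever the raw scores are, the resulting weights are positive and add up to 1, so a higher weight on "cat" necessarily leaves less for "mat" and "sat".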
---
4. Token Relationships
Attention allows explicit modelling of relationships:
- Pronoun resolution ("it" → "cat")
- Subject-object links
- Long-range dependencies
- Syntax structure
Unlike in older models, relationships are not built up through sequential steps; they are direct pairwise connections.
---
5. Context Windows
Attention operates within a fixed window of tokens.
This is called the context window.
Example:
The model can attend only to the most recent N tokens
Implications:
- Model "memory" is limited to context size
- Older tokens outside the window are inaccessible
- Long documents may be truncated or chunked
Modern models use very large windows (thousands to millions of tokens in some systems).
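The two usual ways of fitting a long input into a fixed window, truncating and chunking, can be sketched as below. The helper names `truncate_to_window` and `chunk` are hypothetical, and splitting on whitespace stands in for a real tokenizer.

```python
def truncate_to_window(tokens, window_size):
    """Keep only the most recent `window_size` tokens."""
    if window_size <= 0:
        return []
    return tokens[-window_size:]

def chunk(tokens, window_size):
    """Split tokens into consecutive pieces of at most `window_size` tokens."""
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    return [tokens[i:i + window_size] for i in range(0, len(tokens), window_size)]

# Whitespace splitting is a stand-in for a real tokenizer.
tokens = "The cat sat on the mat because it was tired".split()

visible = truncate_to_window(tokens, 4)  # older tokens are dropped
pieces = chunk(tokens, 4)                # each piece fits the window on its own
```

With truncation, a window of 4 tokens would drop "cat" before the model reaches "it", so the pronoun could no longer be linked to it.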
---
6. Parallel Processing vs Recurrence
RNNs / LSTMs (old approach):
Token 1 → Token 2 → Token 3 → Token 4
(sequential processing)
Problems:
- Slow (no parallelism across time steps)
- Hard to retain long-range dependencies
- Vanishing-gradient issues
---
Transformers (attention-based):
All tokens processed simultaneously
Benefits:
- Massive parallelism (GPU efficient)
- No sequential bottleneck
- Direct long-range connections
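The difference is one of data dependence, which the following sketch tries to show. The recurrent loop cannot compute a state until the previous one exists, whereas every pairwise score depends only on the inputs. The update rule and vectors are hypothetical, and plain Python still runs both sequentially; the point is only which computations could be done independently on parallel hardware.

```python
def recurrent_pass(inputs, step):
    """RNN-style processing: each state depends on the previous state."""
    state = 0.0
    states = []
    for x in inputs:
        state = step(state, x)  # cannot start before the previous step ends
        states.append(state)
    return states

def pairwise_scores(vectors):
    """Attention-style processing: every entry depends only on the inputs."""
    return [
        [sum(a * b for a, b in zip(q, k)) for k in vectors]
        for q in vectors
    ]

# Hypothetical update rule: keep half of the old state and add the new input.
states = recurrent_pass([1.0, 2.0, 3.0], lambda s, x: 0.5 * s + x)

# Hypothetical token vectors; all nine scores are independent of one another.
scores = pairwise_scores([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
```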
---
7. Why Transformers Replaced RNNs/LSTMs
Transformers won because attention solves key limitations:
1. Long-range dependency problem
RNNs struggle to connect distant tokens.
Attention connects everything directly.
2. Parallel computation
Transformers process entire sequences at once.
3. Better scaling
Performance improves predictably with more data and parameters.
4. Stable training
Fewer vanishing-gradient issues than recurrent models.
5. Expressive power
Every token can interact with every other token in one step.
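A rough way to see why distance hurts recurrent models: if, as a simplifying assumption, each recurrent step scales a gradient signal by a constant factor below 1, the signal shrinks exponentially with the number of steps between two tokens. With attention, two tokens are one step apart regardless of their distance. The factor 0.9 is an arbitrary illustrative value.

```python
def gradient_scale_after(steps, per_step_factor):
    """Scale of a signal after passing through `steps` multiplicative steps."""
    return per_step_factor ** steps

# Recurrent path: tokens 100 positions apart are 100 steps apart.
recurrent_scale = gradient_scale_after(100, 0.9)

# Attention path: any two tokens in the window are a single step apart.
attention_scale = gradient_scale_after(1, 0.9)

print(f"recurrent: {recurrent_scale:.2e}, attention: {attention_scale:.2f}")
```

Real networks do not apply one fixed factor per step, so this only conveys the trend, not actual gradient magnitudes.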
---
8. Intuition: What Attention Is Doing
You can think of attention as:
Each word in a sentence voting on which other words matter most
Or more formally:
A context-sensitive, weighted information-routing system
Each layer refines these relationships repeatedly.
---
9. Key Insight
Attention replaces sequence with relationships.
So instead of:
A → B → C → D (chain processing)
you get:
every pair among A, B, C, D linked directly (fully connected dynamic graph)
This shift is a large part of what lets modern language models stay coherent across long contexts.
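The contrast between the chain and the fully connected graph can be counted directly. In this small sketch the four token names are just the placeholders used above: the chain only links neighbours, while the full graph links every pair.

```python
from itertools import combinations

tokens = ["A", "B", "C", "D"]

# Chain processing: each token is linked only to the next one.
chain_links = list(zip(tokens, tokens[1:]))

# Fully connected graph: every pair of tokens is linked directly.
all_pairs = list(combinations(tokens, 2))

print(len(chain_links), "chain links vs", len(all_pairs), "direct pairs")
```

In the chain, A reaches D only through B and C; in the full graph, the pair (A, D) is itself a direct link.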