Bridge: From a basic neural network to a Transformer
1. Take your simple model (feed-forward neural network)
Structure
Input → Layer → Layer → Layer → Output
How it works
- Each layer applies a fixed transformation:
h = activation(Wx + b)
- Information flows strictly forward
- Each step only sees the output of the previous step
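A minimal sketch of the layer rule above, assuming tanh as the activation and made-up toy weights (the names `dense`, `feed_forward`, and `layers` are illustrative, not from any library):

```python
import math

def dense(x, W, b, activation=math.tanh):
    """One layer: h = activation(W x + b), for a single input vector x."""
    return [activation(sum(w * xi for w, xi in zip(row, x)) + bi)
            for row, bi in zip(W, b)]

def feed_forward(x, layers):
    """Pass x strictly forward; each layer sees only the previous output."""
    for W, b in layers:
        x = dense(x, W, b)
    return x

# Hypothetical toy weights: 2 inputs -> 2 hidden units -> 1 output.
layers = [
    ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]),
    ([[1.0, 1.0]], [0.0]),
]
y = feed_forward([0.0, 0.0], layers)
```

The weights in `layers` are the same for every input, which is the fixed mixing the next section describes.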
Key limitation
- Inputs are mixed by weights that are fixed once training ends
- Relationships between tokens/features are implicit and static: the same mixing is applied whatever the input
- No mechanism for directly selecting what matters from other inputs
---
2. What changes in a Transformer
The Transformer still stacks layers, but augments:
- “stacked feature extraction”
with:
- “explicit interaction between all elements”
---
Core new mechanism: Attention
Instead of only:
h = activation(Wx + b)
we compute:
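Attention(Q, K, V) = softmax(Q · Kᵀ / √d_k) · V
Where:
- Q = Query (what this token is looking for)
- K = Key (what each token offers)
- V = Value (the information carried)
- d_k = dimension of the key vectors (the scaling keeps the softmax from saturating)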
---
3. Transformer data flow
Step 1: Input tokens
"The cat sat on the mat"
Converted into embeddings:
Token → Vector
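As a rough illustration of Token → Vector, the sketch below assumes whitespace tokenization and a tiny hand-written embedding table. Real models typically use subword tokenizers and learned tables; every value and name here is made up.

```python
# Hypothetical vocabulary and 3-dimensional embedding table (values made up).
vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4}
embedding_table = [
    [0.1, 0.0, 0.2],   # the
    [0.9, 0.3, 0.0],   # cat
    [0.2, 0.8, 0.1],   # sat
    [0.0, 0.1, 0.7],   # on
    [0.8, 0.2, 0.1],   # mat
]

def embed(sentence):
    """Lowercase, split on whitespace, and look up one vector per token."""
    token_ids = [vocab[word] for word in sentence.lower().split()]
    return [embedding_table[i] for i in token_ids]

X = embed("The cat sat on the mat")   # 6 tokens, each a 3-dimensional vector
```

Both occurrences of "the" map to the same vector at this stage; any difference between them has to come from later layers.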
---
Step 2: Create Q, K, V
Each token embedding is multiplied by three learned weight matrices, one per role:
Q = X · W_Q (what I want)
K = X · W_K (what I am)
V = X · W_V (what I contain)
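The three projections can be sketched as plain matrix multiplications. The matrices `W_q`, `W_k`, `W_v` below are hypothetical stand-ins for learned weights, and `matmul` is a small helper written for this example.

```python
def matmul(A, B):
    """Multiply an (n x m) matrix by an (m x p) matrix, both as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

# X: 2 tokens, each a 2-dimensional embedding (made-up values).
X = [[1.0, 0.0],
     [0.0, 2.0]]

# Hypothetical learned projection matrices (d_model = 2, d_k = 2).
W_q = [[1.0, 0.0], [0.0, 1.0]]
W_k = [[0.0, 1.0], [1.0, 0.0]]
W_v = [[2.0, 0.0], [0.0, 2.0]]

Q = matmul(X, W_q)   # what each token is looking for
K = matmul(X, W_k)   # what each token offers
V = matmul(X, W_v)   # the information each token carries
```

The same input row produces three different vectors because each role has its own weights.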
---
Step 3: Attention computation
Each token compares its query with the key of every token, itself included:
similarity = Q · Kᵀ / √d_k   (d_k = key dimension)
weights = softmax(similarity)
output = Σ (weights × V)
Result:
- Each token becomes a weighted mixture of the value vectors of all tokens
- Information is dynamically shared across the whole sequence
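The three lines of the computation can be sketched in plain Python as follows. This assumes the usual scaled dot-product form (division by the square root of the key dimension), a single head, and no masking. The inputs are toy values.

```python
import math

def softmax(scores):
    """Turn a list of scores into positive weights that sum to 1."""
    m = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention over lists of row vectors."""
    d_k = len(K[0])
    outputs, all_weights = [], []
    for q in Q:
        # Compare this token's query with every token's key (itself included).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
out, weights = attention(Q, K, V)
```

Here the first query matches the first key best, so the first output leans toward the first value vector while still containing part of the second.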
---
Step 4: Multi-head attention
Instead of one attention mechanism:
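- multiple heads run in parallel, each with its own Q, K, V projections
- their outputs are concatenated and passed through a final linear projection
Each head can specialise, for example in: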
- syntax relationships
- semantic meaning
- long-range dependencies
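A sketch of how several heads could be combined, assuming each head has its own projection triple and the head outputs are concatenated and mixed by an output matrix. The names `heads`, `head_a`, `head_b`, and `W_o` are illustrative, and the weights are hand-picked so that each head looks at a different input feature.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def attention(Q, K, V):
    d_k = len(K[0])
    out = []
    for q in Q:
        w = softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K])
        out.append([sum(wi * v[j] for wi, v in zip(w, V)) for j in range(len(V[0]))])
    return out

def multi_head_attention(X, heads, W_o):
    """heads: list of (W_q, W_k, W_v) triples, one per head."""
    per_head = [attention(matmul(X, W_q), matmul(X, W_k), matmul(X, W_v))
                for W_q, W_k, W_v in heads]
    # Concatenate the head outputs token by token, then mix them with W_o.
    concat = [[value for head in per_head for value in head[t]]
              for t in range(len(X))]
    return matmul(concat, W_o)

X = [[1.0, 0.0], [0.0, 1.0]]
head_a = ([[1.0], [0.0]], [[1.0], [0.0]], [[1.0], [0.0]])  # attends via feature 0
head_b = ([[0.0], [1.0]], [[0.0], [1.0]], [[0.0], [1.0]])  # attends via feature 1
W_o = [[1.0, 0.0], [0.0, 1.0]]                              # identity, for clarity
out = multi_head_attention(X, [head_a, head_b], W_o)
```

In a trained model the specialisation is learned rather than hand-set as it is here.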
---
Step 5: Feed-forward layer
Still present, but now applied to each context-aware vector independently:
FFN(x) = W2 · activation(W1x + b1) + b2
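A small sketch of the feed-forward step, assuming the common formulation with two linear maps, biases, ReLU as the activation, and no activation on the output. All weights are made-up toy values.

```python
def relu(z):
    return max(0.0, z)

def linear(x, W, b):
    """Compute W x + b for one vector x; W is a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def ffn(x, W1, b1, W2, b2):
    """Position-wise feed-forward: expand, apply the nonlinearity, project back."""
    hidden = [relu(h) for h in linear(x, W1, b1)]
    return linear(hidden, W2, b2)      # no activation on the output

# Hypothetical weights: d_model = 2, hidden width = 3.
W1 = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
b1 = [0.0, 0.0, 0.0]
W2 = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
b2 = [0.5, -0.5]

# The same weights are applied to every token independently.
tokens = [[1.0, 2.0], [-3.0, 1.0]]
outputs = [ffn(t, W1, b1, W2, b2) for t in tokens]
```

No token sees another token in this step; all cross-token mixing happens in attention.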
---
Step 6: Residual connections + normalization
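Residual connection (wrapped around both the attention and the feed-forward sublayer)
x = x + Attention(x)
x = x + FFN(x)
Layer normalization
- rescales each token vector to zero mean and unit variance
- stabilises training by keeping activations in a consistent range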
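A sketch of the two pieces together, assuming one common ordering (add the residual, then normalize) and leaving out the learned scale and shift that layer normalization usually carries. `sublayer_out` is a made-up stand-in for the output of attention or the feed-forward layer.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one vector to zero mean and unit variance (no learned scale/shift)."""
    mean = sum(x) / len(x)
    var = sum((xi - mean) ** 2 for xi in x) / len(x)
    return [(xi - mean) / math.sqrt(var + eps) for xi in x]

def add_and_norm(x, sublayer_out):
    """Residual connection followed by layer normalization."""
    return layer_norm([a + b for a, b in zip(x, sublayer_out)])

x = [1.0, 2.0, 3.0, 4.0]
sublayer_out = [0.5, -0.5, 0.5, -0.5]   # stand-in for Attention(x) or FFN(x)
y = add_and_norm(x, sublayer_out)
```

Some architectures normalize before the sublayer instead; the ordering here is a choice made for this example.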
---
4. Full Transformer pipeline
Tokens
↓
Embeddings (+ positional information, since attention alone ignores token order)
↓
Self-Attention
↓
Feed-Forward Network
↓
(repeat N layers)
↓
Output logits
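The pipeline above can be sketched end to end in plain Python. This is a toy under several assumptions: random untrained weights, a single attention head, no masking, no positional information, no biases, and residual-then-normalize ordering. All names (`params`, `transformer`, `rand_matrix`) are illustrative.

```python
import math
import random

random.seed(0)

def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def layer_norm(x, eps=1e-5):
    mean = sum(x) / len(x)
    var = sum((xi - mean) ** 2 for xi in x) / len(x)
    return [(xi - mean) / math.sqrt(var + eps) for xi in x]

def add_and_norm(X, Y):
    return [layer_norm([a + b for a, b in zip(x, y)]) for x, y in zip(X, Y)]

def self_attention(X, W_q, W_k, W_v):
    Q, K, V = matmul(X, W_q), matmul(X, W_k), matmul(X, W_v)
    scale = math.sqrt(len(K[0]))
    weights = [softmax([sum(a * b for a, b in zip(q, k)) / scale for k in K])
               for q in Q]
    return matmul(weights, V)

def ffn(X, W1, W2):
    hidden = [[max(0.0, h) for h in row] for row in matmul(X, W1)]
    return matmul(hidden, W2)   # biases omitted for brevity

def transformer(token_ids, params):
    X = [params["embed"][i] for i in token_ids]       # Tokens -> Embeddings
    for layer in params["layers"]:                     # repeat N layers
        X = add_and_norm(X, self_attention(X, *layer["attn"]))
        X = add_and_norm(X, ffn(X, *layer["ffn"]))
    return matmul(X, params["unembed"])                # Output logits

vocab_size, d_model, d_ff, n_layers = 5, 4, 8, 2
params = {
    "embed": rand_matrix(vocab_size, d_model),
    "layers": [
        {"attn": [rand_matrix(d_model, d_model) for _ in range(3)],
         "ffn": [rand_matrix(d_model, d_ff), rand_matrix(d_ff, d_model)]}
        for _ in range(n_layers)
    ],
    "unembed": rand_matrix(d_model, vocab_size),
}
logits = transformer([0, 1, 2, 3, 0, 4], params)   # ids for "the cat sat on the mat"
```

Because this sketch carries no positional information, the two occurrences of "the" (positions 0 and 4) end up with identical logits, which is why real Transformers also encode token position.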
---
5. Key conceptual difference
Feed-forward network
- fixed, layered computation
- mixing weights are the same for every input
- no explicit interaction between inputs
Transformer
- every token interacts with every other token
- relationships computed dynamically per input
- what a token's vector contains depends on its context, not only on layer depth
---
6. One-line summary
Transformers extend:
“stacked transformations of data”
with:
“dynamic, fully connected interaction between all elements at every layer”
Generated by ChatGPT