LLM Training vs Inference
A Systems Programmer's View
The Big Picture
A Large Language Model (LLM) operates in two distinct modes:
- Training — learning from data.
- Inference — using what was learned.
The same neural network architecture is used in both cases.
The difference is what happens after a prediction is made.
Inference (Normal ChatGPT Usage)
Inference is what happens when you ask a question and the model generates a reply.
Step 1 — Input
You type:
Why would the stored weight parameters be so specific?
Step 2 — Tokenisation
The text is broken into tokens:
Why
would
the
stored
weight
parameters
...
Tokens are not necessarily whole words. Common words may be one token, while rare words may become several tokens.
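A minimal sketch of how a common word can stay whole while a rarer word splits into pieces. It assumes a tiny hand-written vocabulary and a greedy longest-match rule; `VOCAB` and `tokenise` are illustrative names, and real tokenisers learn a much larger vocabulary from data.

```python
# Hypothetical toy vocabulary; real tokenisers learn tens of thousands of entries.
VOCAB = {"why", "would", "the", "stored", "weight", "param", "eters", " "}

def tokenise(text):
    """Greedy longest-match split of text into vocabulary pieces."""
    text = text.lower()
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest possible piece first, then shrink.
        for j in range(len(text), i, -1):
            if text[i:j] in VOCAB:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # Unknown character: emit it as its own token.
            tokens.append(text[i])
            i += 1
    return tokens

print(tokenise("the stored weight parameters"))
```

Here "weight" is found whole, while "parameters" is not in the vocabulary and falls apart into "param" and "eters".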
Step 3 — Embeddings
Each token is converted into a vector.
Think:
Token -> Vector (list of numbers)
The model no longer processes words directly. It processes mathematical representations of words.
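The Token -> Vector step can be pictured as a table lookup. This is a sketch only: the `EMBEDDINGS` table, its four dimensions, and its numbers are invented for illustration, whereas a real model learns these values during training.

```python
# Hypothetical 4-dimensional embedding table; real models use far more
# dimensions and learn these numbers during training.
EMBEDDINGS = {
    "why":   [0.12, -0.40, 0.33, 0.05],
    "would": [-0.21, 0.08, 0.19, -0.37],
    "the":   [0.02, 0.15, -0.11, 0.27],
}

def embed(tokens):
    """Replace each token with its vector via a table lookup."""
    return [EMBEDDINGS[t] for t in tokens]

vectors = embed(["why", "would", "the"])
print(len(vectors), "vectors of length", len(vectors[0]))
```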
Step 4 — Transformer Processing
The Transformer examines:
- The current token.
- Previous tokens.
- The entire context window.
Using attention mechanisms, it weighs how relevant each earlier token is and scores every candidate next token by likelihood.
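As a rough illustration of the weighing step, the sketch below implements single-query scaled dot-product attention in plain Python. The vectors are made-up numbers and the function names are illustrative; a real Transformer runs many such operations in parallel with learned projections.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Single-query scaled dot-product attention over earlier positions."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    # Output is the weighted average of the value vectors.
    dim = len(values[0])
    mixed = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, mixed

# Hypothetical 2-dimensional keys and values for three earlier tokens.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
weights, mixed = attend([1.0, 1.0], keys, values)
print(weights, mixed)
```

The third key matches the query best, so it receives the largest weight.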
Step 5 — Generate Next Token
Example:
Because
Step 6 — Feed Back Into Context
The growing conversation now becomes:
Why would the stored weight parameters be so specific?
Because
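Conceptually, the entire sequence is processed again, although practical implementations cache work from earlier tokens rather than recomputing it.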
Step 7 — Repeat
Generate:
the
Then:
network
Then:
...
This continues until the reply is complete.
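Steps 4 to 7 form a loop, sketched below. The `NEXT_TOKEN` table is a hypothetical stand-in for the model (a real model scores every vocabulary entry using the whole context), and `"<end>"` is an assumed end-of-reply marker.

```python
# Hypothetical stand-in for the model: maps the last token to the next one.
NEXT_TOKEN = {"?": "Because", "Because": "the", "the": "network", "network": "<end>"}

def predict_next(context):
    """Toy forward pass; only reads from NEXT_TOKEN, never writes to it."""
    return NEXT_TOKEN.get(context[-1], "<end>")

def generate(prompt_tokens, max_new_tokens=10):
    context = list(prompt_tokens)
    reply = []
    for _ in range(max_new_tokens):
        token = predict_next(context)
        if token == "<end>":
            break
        reply.append(token)
        context.append(token)   # feed the output back into the context
    return reply

prompt = ["Why", "would", "the", "stored", "weight", "parameters",
          "be", "so", "specific", "?"]
print(generate(prompt))
```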
Important Point
During inference:
The weights never change.
The model is effectively read-only.
Inference is simply:
Input
↓
Prediction
↓
Output
No learning occurs.
The forward pass can be described as a compute graph: the full map of how the input was transformed into the final output. Every operation is a node, and each node records what it output during the forward pass and which inputs it depended on.
Backpropagation, which is used only during training, moves backwards through this compute graph using those intermediate values.
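A minimal sketch of such a graph for scalar values, assuming only multiplication and addition. The `Node` class and helper names are illustrative and are not taken from any particular library.

```python
class Node:
    """One operation in the graph: stores its output and its inputs."""
    def __init__(self, value, parents=(), local_grads=()):
        self.value = value              # output recorded during the forward pass
        self.parents = parents          # nodes this one depended on
        self.local_grads = local_grads  # d(self)/d(parent), one per parent
        self.grad = 0.0

def mul(a, b):
    return Node(a.value * b.value, (a, b), (b.value, a.value))

def add(a, b):
    return Node(a.value + b.value, (a, b), (1.0, 1.0))

def backward(output):
    """Visit nodes from output back to inputs, applying the chain rule."""
    order, seen = [], set()
    def visit(node):
        if id(node) not in seen:
            seen.add(id(node))
            for parent in node.parents:
                visit(parent)
            order.append(node)
    visit(output)
    output.grad = 1.0
    for node in reversed(order):
        for parent, local in zip(node.parents, node.local_grads):
            parent.grad += node.grad * local

w, x, b = Node(3.0), Node(2.0), Node(1.0)
y = add(mul(w, x), b)   # forward pass builds the graph
backward(y)             # backward pass fills in .grad
print(y.value, w.grad, x.grad, b.grad)
```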
Training (Learning Phase)
Training starts similarly but contains additional steps.
Step 1 — Input
Training example:
The capital of France is Paris
Step 2 — Tokenisation
The text becomes tokens.
Step 3 — Forward Pass
The model predicts the next token.
For example:
The capital of France is London
Step 4 — Compare Against Known Answer
Expected:
Paris
Predicted:
London
An error exists.
Step 5 — Compute Loss
Loss is a numerical measure of error.
Think:
Loss = How Wrong Was The Prediction?
Example:
Loss = 0.83
Higher loss means larger error.
Lower loss means better prediction.
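One common way to turn "how wrong" into a number is cross-entropy: the negative logarithm of the probability the model gave to the correct token. The probabilities below are invented for illustration, and the text above does not say which loss function it has in mind.

```python
import math

def cross_entropy(probabilities, correct_token):
    """Loss is the negative log of the probability given to the right answer."""
    return -math.log(probabilities[correct_token])

# Hypothetical model outputs for "The capital of France is ..."
favours_london = {"Paris": 0.10, "London": 0.85, "Rome": 0.05}
favours_paris = {"Paris": 0.90, "London": 0.07, "Rome": 0.03}

print(cross_entropy(favours_london, "Paris"))  # larger loss
print(cross_entropy(favours_paris, "Paris"))   # smaller loss
```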
Step 6 — Backpropagation
This is where training differs from inference.
The system asks:
Which weights contributed to this error?
The error signal is propagated backwards through the network.
Step 7 — Calculate Gradients Efficiently via Backpropagation
For each weight:
If this weight was slightly larger,
would the error improve or worsen?
If this weight was slightly smaller,
would the error improve or worsen?
These calculations produce gradients.
A gradient tells the optimiser which direction to move each weight.
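The "slightly larger, slightly smaller" question can be answered literally by nudging a weight, or analytically with the chain rule, which is what backpropagation does. The sketch below compares the two on an assumed one-weight model with squared-error loss; the model and numbers are illustrative only.

```python
def loss(weight, x=2.0, target=10.0):
    """Squared error of a one-weight model: prediction = weight * x."""
    return (weight * x - target) ** 2

def nudge_gradient(weight, eps=1e-6):
    """Nudge the weight up and down and measure how the loss changes."""
    return (loss(weight + eps) - loss(weight - eps)) / (2 * eps)

def analytic_gradient(weight, x=2.0, target=10.0):
    """Same quantity from the chain rule, as backpropagation computes it."""
    return 2 * (weight * x - target) * x

w = 3.0
print(nudge_gradient(w), analytic_gradient(w))
```

Nudging needs two extra loss evaluations per weight, which is impractical for billions of weights. Backpropagation obtains every gradient from a single backward pass.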
Step 8 — Adjust Weights
Tiny changes are made:
Weight A += 0.00001
Weight B -= 0.00003
Weight C += 0.000001
Not huge changes.
Tiny changes.
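A sketch of the simplest update rule, plain gradient descent: each weight moves against its gradient, scaled by a small learning rate. The weights, gradients, and learning rate are hypothetical values chosen so the resulting changes have the same size as those listed above. Optimisers used in practice are more elaborate.

```python
def sgd_step(weights, gradients, learning_rate=1e-5):
    """Move each weight a small step against its gradient."""
    return {name: w - learning_rate * gradients[name] for name, w in weights.items()}

weights = {"A": 0.50, "B": -1.20, "C": 0.03}
gradients = {"A": -1.0, "B": 3.0, "C": -0.1}   # hypothetical values
updated = sgd_step(weights, gradients)
for name in weights:
    print(name, updated[name] - weights[name])
```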
Step 9 — Repeat This Loop Billions of Times
Until convergence:
- Loss stops decreasing.
- The model performs well on unseen validation data.
- Further weight updates no longer meaningfully reduce error on new data.
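One simple way to turn these criteria into a check is to stop when validation loss has not improved for a while. The function below is a sketch under that assumption; `patience` and `min_improvement` are illustrative parameter names and defaults, not settings from any particular training system.

```python
def should_stop(validation_losses, patience=3, min_improvement=1e-3):
    """Stop when the last `patience` checks did not beat the earlier best."""
    if len(validation_losses) <= patience:
        return False
    best_before = min(validation_losses[:-patience])
    best_recent = min(validation_losses[-patience:])
    return best_before - best_recent < min_improvement

print(should_stop([1.0, 0.8, 0.6, 0.5]))        # still improving
print(should_stop([1.0, 0.5, 0.5, 0.5, 0.5]))   # plateau reached
```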
Step 10 — Next Example
The process repeats.
Again.
And again.
And again.
For months.
Why The Adjustments Are Tiny
Large weight changes would cause instability.
The optimiser therefore takes very small steps.
Think of steering a ship.
Small corrections made continuously are usually more accurate than dramatic turns.
The objective is gradual convergence toward lower error.
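The instability can be seen even on an assumed one-weight model. In the sketch below the same training loop is run with a small and a large step size; the model, data, and learning rates are illustrative choices.

```python
def train(learning_rate, steps=20, weight=0.0, x=2.0, target=10.0):
    """Gradient descent on loss = (weight * x - target)^2; returns final loss."""
    for _ in range(steps):
        gradient = 2 * (weight * x - target) * x
        weight -= learning_rate * gradient
    return (weight * x - target) ** 2

# The starting loss is 100.0 in both runs.
print(train(learning_rate=0.01))  # small steps: loss shrinks
print(train(learning_rate=0.30))  # large steps: loss grows
```

With the large step, each update overshoots the target by more than the previous error, so the loss climbs instead of falling.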
Overfitting
Overfitting occurs when the model learns the training examples too specifically (i.e., it memorises the examples instead of learning the rule).
Instead of learning:
The underlying rule
it learns:
The exact examples
Example:
Memorising:
2 x 2 = 4
3 x 3 = 9
4 x 4 = 16
Generalising:
Understanding multiplication itself.
The goal of training is not merely memorisation.
The goal is generalisation.
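The contrast can be sketched with two toy "models" for the multiplication example: one is a lookup table of the training examples, the other applies the rule. Both functions are illustrative only.

```python
# Memorising: a lookup table of the exact training examples.
MEMORISED = {(2, 2): 4, (3, 3): 9, (4, 4): 16}

def memoriser(a, b):
    return MEMORISED.get((a, b))   # None for anything not seen in training

def generaliser(a, b):
    return a * b                   # the underlying rule

for pair in [(3, 3), (5, 7)]:
    print(pair, memoriser(*pair), generaliser(*pair))
```

Both agree on the training example (3, 3), but only the second has an answer for the unseen pair (5, 7).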
The Whole Distribution
Researchers often refer to:
The Distribution
This means:
The entire population of examples the model may encounter.
Not merely the training data.
For language models, this includes:
- Books
- Articles
- Technical manuals
- Emails
- Forum posts
- Conversations
- Documentation
The training set is only a sample of the larger distribution.
The model must perform well across the entire distribution, not merely on examples it has already seen.
Samuel's Checkers vs Modern LLMs
Arthur Samuel's Checkers Program:
Board Position
↓
Evaluation Function
↓
Move Selection
↓
Game Result
↓
Adjust Weights
Modern LLM:
Tokens
↓
Prediction
↓
Error
↓
Adjust Weights
The basic learning loop is remarkably similar.
The major difference is scale.
Samuel's program contained a relatively small number of meaningful parameters.
Modern LLMs contain billions of parameters.
A Systems Programmer's Summary
Training
READ TRAINING RECORD
RUN MODEL
COMPARE OUTPUT TO KNOWN ANSWER
COMPUTE ERROR
UPDATE PARAMETER TABLE
REPEAT
Inference
READ INPUT
RUN MODEL
EMIT OUTPUT
REPEAT UNTIL REPLY COMPLETE
The parameter table is not modified during inference.
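The two pseudocode loops above can be sketched as runnable code on an assumed one-weight model. The names `train_step` and `infer` are illustrative; the point is that only the training path writes to the parameter table.

```python
def predict(weights, x):
    """RUN MODEL: a one-weight linear model."""
    return weights["w"] * x

def train_step(weights, x, target, learning_rate=0.01):
    """READ RECORD, RUN MODEL, COMPARE, COMPUTE ERROR, UPDATE PARAMETER TABLE."""
    error = predict(weights, x) - target
    gradient = 2 * error * x
    weights["w"] -= learning_rate * gradient   # the only line that writes
    return error ** 2

def infer(weights, x):
    """READ INPUT, RUN MODEL, EMIT OUTPUT: weights are only read."""
    return predict(weights, x)

weights = {"w": 0.0}
for _ in range(200):
    train_step(weights, x=2.0, target=10.0)

before = dict(weights)
print(infer(weights, 2.0))
print(weights == before)   # inference left the parameter table untouched
```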
The Single Most Important Difference
Inference asks:
Given these weights, what is the most likely next token?
Training asks:
How should these weights be adjusted so future predictions become less wrong?
Everything else in modern deep learning is built around these two processes.