dopetalk does not endorse any advertised product nor does it accept any liability for its use or misuse


Our Discord Notification Server invitation link is https://discord.gg/jB2qmRrxyD

Author Topic: How Large Language Models Learn — A Simple Explanation  (Read 27 times)

Online Chip (OP)

How The Model Actually Learns

The model is shown a piece of text and asked:

Quote
"What comes next?"

For example:

Quote
The capital of France is ...

Suppose the model predicts:

Quote
London

The correct answer was:

Quote
Paris

The important point is not that the answer was wrong.

The important question is:

Quote
How does the software determine which of billions of weights caused the error?

This is where the real intelligence of the training system lies.

The training software calculates how much every weight contributed to the mistake.

Some weights contributed heavily.

Others contributed only slightly.

Some may have had almost no effect.

The software then works backwards through the network and computes:

  • Which weights helped.
  • Which weights hurt.
  • How much each weight influenced the result.
  • Whether each weight should increase or decrease.
  • How large the adjustment should be.

The weights are then adjusted by tiny amounts.

Not randomly.

Not manually.

Mathematically.

Weights that pushed the prediction towards the correct answer are reinforced.

Weights that pushed the prediction away from the correct answer are weakened.

Working out each weight's share of the error is called backpropagation. Applying the resulting adjustments is the job of the optimiser.
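The backward calculation can be sketched for a toy model with only two weights. The names (w1, w2, x, target), the numbers and the squared-error loss are all assumptions chosen for illustration; a real LLM applies the same chain-rule idea across billions of weights.

```python
# Toy model: prediction = w2 * (w1 * x). All names and values are illustrative.
def forward(w1, w2, x):
    h = w1 * x          # intermediate value, kept for the backward pass
    y = w2 * h          # the prediction
    return h, y

def backward(w1, w2, x, target):
    h, y = forward(w1, w2, x)
    loss = (y - target) ** 2
    dloss_dy = 2 * (y - target)   # how the loss reacts to the prediction
    dloss_dw2 = dloss_dy * h      # chain rule through y = w2 * h
    dloss_dh = dloss_dy * w2
    dloss_dw1 = dloss_dh * x      # chain rule through h = w1 * x
    return loss, dloss_dw1, dloss_dw2

loss, grad_w1, grad_w2 = backward(w1=0.5, w2=2.0, x=3.0, target=6.0)
print(loss, grad_w1, grad_w2)
```

Both gradients come out negative here, which means increasing either weight would reduce the loss. The weight with the larger gradient magnitude had the larger influence on the error.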

The entire network gradually teaches itself by continuously asking:

Quote
Which internal settings produced the error?

What tiny changes would have reduced that error?

Apply those changes.

Try again.

The process repeats billions or trillions of times.

Over time the network converges towards weight values that produce increasingly accurate predictions.

In simple terms:

Quote
The model learns because every mistake becomes information about how the internal numerical relationships should be adjusted.

That adjustment process is where most of the mathematical sophistication of modern AI resides.

Offline smfadmin

LLM Training vs Inference
« Reply #1 on: Yesterday at 04:22:52 PM »

A Systems Programmer's View



The Big Picture

A Large Language Model (LLM) operates in two completely different modes:

  • Training — learning from data.
  • Inference — using what was learned.
The same neural network architecture is used in both cases.

The difference is what happens after a prediction is made.



Inference (Normal ChatGPT Usage)

Inference is what happens when you ask a question and the model generates a reply.

Step 1 — Input

You type:

Code: [Select]
Why would the stored weight parameters be so specific?

Step 2 — Tokenisation

The text is broken into tokens:

Code: [Select]
Why
would
the
stored
weight
parameters
...

Tokens are not necessarily whole words. Common words may be one token, while rare words may become several tokens.
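A rough sketch of how a word can split into several tokens. It uses greedy longest-match against a tiny hypothetical vocabulary; real tokenisers learn their vocabulary from data and use more elaborate schemes, so treat this only as an illustration of the idea.

```python
# Hypothetical vocabulary, invented for this example.
VOCAB = {"the", "weight", "s", "token", "is", "ation", " "}

def tokenise(text, vocab=VOCAB):
    tokens, i = [], 0
    while i < len(text):
        # Try the longest piece first; fall back to a single character.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab or j == i + 1:
                tokens.append(text[i:j])
                i = j
                break
    return tokens

print(tokenise("the weights"))
print(tokenise("tokenisation"))
```

With this vocabulary "the" stays whole, while "tokenisation" is broken into three pieces.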

Step 3 — Embeddings

Each token is converted into a vector.

Think:

Code: [Select]
Token -> Vector (list of numbers)

The model no longer processes words directly. It processes mathematical representations of words.
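As a sketch, the embedding step is just a table lookup. The 4-number vectors below are made up for illustration; in a real model the table is learned during training and each vector has hundreds or thousands of entries.

```python
# Hypothetical embedding table: token -> vector. Values are invented.
EMBEDDINGS = {
    "Why":   [0.12, -0.40, 0.88, 0.05],
    "would": [-0.31, 0.22, 0.10, 0.67],
    "the":   [0.02, 0.01, -0.15, 0.09],
}

def embed(tokens, table=EMBEDDINGS):
    # Replace each token with its list of numbers.
    return [table[token] for token in tokens]

vectors = embed(["Why", "would", "the"])
print(vectors)
```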

Step 4 — Transformer Processing

The Transformer examines:

  • The current token.
  • Previous tokens.
  • The entire context window.

Using attention mechanisms, it weighs how relevant each earlier token is and produces a probability for every possible next token.
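A minimal sketch of one attention step (scaled dot-product attention) in plain Python. The vectors are invented, and this omits the learned projections, multiple heads and stacked layers a real Transformer uses; it only shows how a query is compared against keys to decide how much of each value to mix in.

```python
import math

def softmax(scores):
    # Turn arbitrary scores into positive weights that sum to 1.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    d = len(query)
    # Similarity of the query to each key, scaled by sqrt(dimension).
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    weights = softmax(scores)
    # Weighted mix of the value vectors.
    mixed = [sum(w * v[i] for w, v in zip(weights, values))
             for i in range(len(values[0]))]
    return weights, mixed

weights, mixed = attend(
    query=[1.0, 0.0],
    keys=[[1.0, 0.0], [0.0, 1.0]],
    values=[[10.0, 0.0], [0.0, 10.0]],
)
print(weights, mixed)
```

The first key matches the query more closely, so it receives the larger weight and its value dominates the mix.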

Step 5 — Generate Next Token

Example:

Code: [Select]
Because

Step 6 — Feed Back Into Context

The growing conversation now becomes:

Code: [Select]
Why would the stored weight parameters be so specific?

Because

Conceptually, the entire sequence is processed again. In practice, implementations usually cache the work already done for earlier tokens.

Step 7 — Repeat

Generate:

Code: [Select]
the

Then:

Code: [Select]
network

Then:

Code: [Select]
...

This continues until the reply is complete.
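Steps 5 to 7 can be sketched as a loop. The lookup table below is a hypothetical stand-in for the model, and it only looks at the last token, whereas a real model considers the whole context. The markers "<start>" and "<end>" are also assumptions made for this example.

```python
# Hypothetical stand-in for the model: last token -> next token.
NEXT = {
    "<start>": "Because",
    "Because": "the",
    "the": "network",
    "network": "<end>",
}

def predict_next(context):
    return NEXT.get(context[-1], "<end>")

def generate(prompt_tokens, max_tokens=20):
    context = list(prompt_tokens)
    reply = []
    for _ in range(max_tokens):
        token = predict_next(context)   # one forward pass
        if token == "<end>":
            break
        reply.append(token)
        context.append(token)           # feed back into the context
    return reply

print(generate(["<start>"]))
```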

Important Point

During inference:

The weights never change.

The model is effectively read-only.

Inference is simply:

Code: [Select]
Input
  ↓
Prediction
  ↓
Output

No learning occurs.



A note on the compute graph: it is the full map of how the input was transformed into the final output during the forward pass. Every operation is a node, and each node knows what it output during the forward pass and which inputs it depended on.

Backpropagation moves backwards through the compute graph, reusing those stored intermediate values.
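A minimal sketch of such a graph, assuming only two operations (add and multiply). The class name Node and the function backprop are invented for this example and are not any library's API; real frameworks do the same bookkeeping for every tensor operation.

```python
class Node:
    def __init__(self, value, parents=(), local_grads=()):
        self.value = value              # output stored during the forward pass
        self.parents = parents          # the inputs this node depended on
        self.local_grads = local_grads  # d(this node) / d(each parent)
        self.grad = 0.0                 # filled in by the backward pass

    def __mul__(self, other):
        return Node(self.value * other.value, (self, other),
                    (other.value, self.value))

    def __add__(self, other):
        return Node(self.value + other.value, (self, other), (1.0, 1.0))

def backprop(output):
    # Order the nodes so each one comes after the inputs it depends on.
    order, seen = [], set()
    def visit(node):
        if id(node) not in seen:
            seen.add(id(node))
            for parent in node.parents:
                visit(parent)
            order.append(node)
    visit(output)
    output.grad = 1.0
    # Walk backwards, passing each node's gradient to its parents.
    for node in reversed(order):
        for parent, local in zip(node.parents, node.local_grads):
            parent.grad += local * node.grad

w, x, b = Node(2.0), Node(3.0), Node(1.0)
y = w * x + b
backprop(y)
print(y.value, w.grad, x.grad, b.grad)
```

Here the gradient for w is the stored value of x, which is why the forward-pass values must be kept until the backward pass has finished.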

Training (Learning Phase)

Training starts similarly but contains additional steps.

Step 1 — Input

Training example:

Code: [Select]
The capital of France is Paris

Step 2 — Tokenisation

The text becomes tokens.

Step 3 — Forward Pass

The model predicts the next token.

For example:

Code: [Select]
The capital of France is London

Step 4 — Compare Against Known Answer

Expected:

Code: [Select]
Paris

Predicted:

Code: [Select]
London

An error exists.

Step 5 — Compute Loss

Loss is a numerical measure of error.

Think:

Code: [Select]
Loss = How Wrong Was The Prediction?

Example:

Code: [Select]
Loss = 0.83

Higher loss means larger error.

Lower loss means better prediction.
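One common way to compute this number for next-token prediction is cross-entropy: the negative logarithm of the probability the model gave to the correct token. The probabilities below are invented for illustration.

```python
import math

def cross_entropy(probabilities, correct_token):
    # Low probability on the correct token gives a high loss.
    return -math.log(probabilities[correct_token])

# Hypothetical model output for "The capital of France is ..."
before = {"London": 0.50, "Paris": 0.30, "Berlin": 0.20}
after = {"London": 0.05, "Paris": 0.90, "Berlin": 0.05}

loss_before = cross_entropy(before, "Paris")
loss_after = cross_entropy(after, "Paris")
print(loss_before, loss_after)
```

The loss falls as the model puts more probability on "Paris", and would reach zero only if the model were completely certain of the correct token.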

Step 6 — Backpropagation

This is where training differs from inference.

The system asks:

Quote
Which weights contributed to this error?

The error signal is propagated backwards through the network.

Step 7 — Calculate Gradients Efficiently via Backpropagation

For each weight:

Code: [Select]
If this weight was slightly larger,
would the error improve or worsen?

If this weight was slightly smaller,
would the error improve or worsen?

These calculations produce gradients.

A gradient tells the optimiser which direction to move each weight, and how sensitive the error is to that weight.
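The "slightly larger, slightly smaller" question can be answered literally by nudging a weight and measuring the loss. The sketch below does this for a one-weight toy model with an assumed squared-error loss. Backpropagation reaches the same number analytically in a single backward pass, which matters because nudging billions of weights one at a time would be far too slow.

```python
def loss_fn(weight, x=3.0, target=6.0):
    # Toy model: prediction = weight * x. Values are illustrative.
    return (weight * x - target) ** 2

def numerical_gradient(f, weight, eps=1e-6):
    # Compare the loss with the weight slightly larger and slightly smaller.
    return (f(weight + eps) - f(weight - eps)) / (2 * eps)

grad = numerical_gradient(loss_fn, 1.0)
print(grad)   # close to -18: increasing the weight lowers the loss
```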

Step 8 — Adjust Weights

Tiny changes are made:

Code: [Select]
Weight A += 0.00001

Weight B -= 0.00003

Weight C += 0.000001

Not huge changes.

Tiny changes.
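A sketch of the simplest update rule, plain gradient descent: each weight moves against its gradient, scaled by a small learning rate. The weight names, gradient values and learning rate are assumptions picked so the changes match the sizes shown above. Production optimisers add refinements on top of this rule.

```python
def sgd_step(weights, gradients, learning_rate=0.001):
    # Move each weight a small step in the direction that lowers the loss.
    return {name: w - learning_rate * gradients[name]
            for name, w in weights.items()}

weights = {"A": 0.5, "B": -0.2}
gradients = {"A": -0.01, "B": 0.03}   # invented values

updated = sgd_step(weights, gradients)
print(updated)   # A rises by 0.00001, B falls by 0.00003
```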

Step 9 — Repeat This Loop Billions of Times

Until convergence:

  • Loss stops decreasing.
  • The model performs well on unseen validation data.
  • Further weight updates no longer meaningfully reduce error on new data.
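A small sketch of a loop that stops on these conditions, assuming a one-weight model learning the rule y = 2x. The data, learning rate and tolerance are invented for illustration; the point is that the stopping test looks at validation data the model did not train on.

```python
# Toy data for the rule y = 2x; validation examples are never trained on.
train = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
val = [(4.0, 8.0), (5.0, 10.0)]

def mean_loss(w, data):
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

def mean_grad(w, data):
    return sum(2 * (w * x - y) * x for x, y in data) / len(data)

def train_until_converged(w=0.0, lr=0.01, tolerance=1e-9, max_steps=10_000):
    best = mean_loss(w, val)
    for step in range(1, max_steps + 1):
        w -= lr * mean_grad(w, train)     # one small weight update
        current = mean_loss(w, val)
        if best - current < tolerance:    # no meaningful improvement left
            break
        best = current
    return w, step

w, steps = train_until_converged()
print(w, steps)
```

The loop ends well before max_steps, with w very close to 2.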

Step 10 — Next Example

The process repeats.

Again.

And again.

And again.

For months.



Why The Adjustments Are Tiny

Large weight changes would cause instability.

The optimiser therefore takes very small steps.

Think of steering a ship.

Small corrections made continuously are usually more accurate than dramatic turns.

The objective is gradual convergence toward lower error.
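The instability is easy to reproduce on a toy problem. The sketch below minimises (w - 2)^2 with two assumed learning rates: a small one that creeps toward the answer and a large one that overshoots further on every step.

```python
def descend(lr, steps=20, w=0.0):
    for _ in range(steps):
        gradient = 2 * (w - 2.0)   # derivative of (w - 2)^2
        w -= lr * gradient
    return w

small_steps = descend(lr=0.1)   # settles near 2
large_steps = descend(lr=1.5)   # swings wider each step and diverges
print(small_steps, large_steps)
```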



Overfitting

Overfitting occurs when the model learns the training examples too specifically (i.e. it memorises instead of learning).

Instead of learning:

Quote
The underlying rule

it learns:

Quote
The exact examples

Example:

Memorising:

Code: [Select]
2 x 2 = 4
3 x 3 = 9
4 x 4 = 16

Generalising:

Quote
Understanding multiplication itself.

The goal of training is not merely memorisation.

The goal is generalisation.
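The difference can be sketched with two hypothetical "models": one that stores the training examples and one that applies the rule. Both are perfect on the training data; only one copes with an unseen example.

```python
# The three training examples from above.
MEMORISED = {(2, 2): 4, (3, 3): 9, (4, 4): 16}

def memoriser(a, b):
    # Returns None for anything it did not see during training.
    return MEMORISED.get((a, b))

def generaliser(a, b):
    # Applies the underlying rule.
    return a * b

print(memoriser(3, 3), generaliser(3, 3))   # both answer 9
print(memoriser(5, 6), generaliser(5, 6))   # only the rule copes
```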



The Whole Distribution

Researchers often refer to:

Quote
The Distribution

This means:

Quote
The entire population of examples the model may encounter.

Not merely the training data.

For language models, this includes:

  • Books
  • Articles
  • Technical manuals
  • Emails
  • Forum posts
  • Conversations
  • Documentation

The training set is only a sample of the larger distribution.

The model must perform well across the entire distribution, not merely on examples it has already seen.
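Since the full distribution can never be collected, a usual approximation is to hold back part of the available data and never train on it. A minimal sketch, with the split fraction and seed as assumptions:

```python
import random

def split(examples, validation_fraction=0.2, seed=0):
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)   # fixed seed for a repeatable split
    n_val = int(len(shuffled) * validation_fraction)
    # First part is held back for validation, the rest is for training.
    return shuffled[n_val:], shuffled[:n_val]

examples = [f"document {i}" for i in range(10)]
train_set, val_set = split(examples)
print(len(train_set), len(val_set))
```

Performance on the held-back set serves as an estimate of how the model will behave on examples it has not seen.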



Samuel's Checkers vs Modern LLMs

Arthur Samuel's Checkers Program:

Code: [Select]
Board Position
      ↓
Evaluation Function
      ↓
Move Selection
      ↓
Game Result
      ↓
Adjust Weights

Modern LLM:

Code: [Select]
Tokens
   ↓
Prediction
   ↓
Error
   ↓
Adjust Weights

The basic learning loop is remarkably similar.

The major difference is scale.

Samuel's program contained a relatively small number of meaningful parameters.

Modern LLMs contain billions of parameters.



A Systems Programmer's Summary

Training

Code: [Select]
READ TRAINING RECORD

RUN MODEL

COMPARE OUTPUT TO KNOWN ANSWER

COMPUTE ERROR

UPDATE PARAMETER TABLE

REPEAT

Inference

Code: [Select]
READ INPUT

RUN MODEL

EMIT OUTPUT

REPEAT UNTIL REPLY COMPLETE

The parameter table is not modified during inference.
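The two loops can be sketched with a one-weight toy model. The class name TinyModel, its starting weight and the learning rate are invented for illustration: predict only reads the parameter, while train_step also writes to it.

```python
class TinyModel:
    def __init__(self):
        self.w = 0.5                      # the whole "parameter table"

    def predict(self, x):
        # Inference: reads the weight, never changes it.
        return self.w * x

    def train_step(self, x, target, lr=0.01):
        # Training: predict, measure the error, then update the weight.
        error = self.predict(x) - target
        gradient = 2 * error * x          # derivative of error squared
        self.w -= lr * gradient

model = TinyModel()
w_before = model.w
model.predict(3.0)
w_after_inference = model.w               # unchanged
model.train_step(3.0, target=6.0)
w_after_training = model.w                # moved toward a better value
print(w_before, w_after_inference, w_after_training)
```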



The Single Most Important Difference

Inference asks:

Quote
Given these weights, what is the most likely next token?

Training asks:

Quote
How should these weights be adjusted so future predictions become less wrong?

Everything else in modern deep learning is built around these two processes.
« Last Edit: Yesterday at 05:31:34 PM by Chip »

