dopetalk

Author Topic: How Large Language Models Learn — A Simple Explanation (Read 27 times)

Chip (OP)


How Large Language Models Learn — A Simple Explanation

« on: Yesterday at 03:31:15 PM »

How The Model Actually Learns

The model is shown a piece of text and asked:

Quote

"What comes next?"

For example:

Quote

The capital of France is ...

Suppose the model predicts:

Quote

London

The correct answer was:

Quote

Paris

The important point is not that the answer was wrong.

The important question is:

Quote

How does the software determine which of billions of weights caused the error?

This is where the real intelligence of the training system lies.

The training software calculates how much every weight contributed to the mistake.

Some weights contributed heavily.

Others contributed only slightly.

Some may have had almost no effect.

The software then works backwards through the network and computes, for each weight, which direction it should move and how large the adjustment should be.

The weights are then adjusted by tiny amounts.

Not randomly.

Not manually.

Mathematically.

Weights that pushed the prediction towards the correct answer are reinforced.

Weights that pushed the prediction away from the correct answer are weakened.

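Working backwards to compute each weight's contribution is called backpropagation; applying the resulting adjustments is called gradient descent.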
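As a rough illustration of "working backwards", here is a minimal sketch of a toy network with only two weights. The network shape, the squared-error loss and all the numbers are assumptions made up for this example; a real LLM applies the same chain-rule idea across billions of weights.

```python
# Toy network: x -> (w1 * x) -> h -> (w2 * h) -> y, with loss = (y - target)^2.
# Everything here is an illustrative assumption, not a real LLM layer.

def forward(w1, w2, x):
    h = w1 * x          # first layer
    y = w2 * h          # second layer
    return h, y

def backward(w1, w2, x, target):
    h, y = forward(w1, w2, x)
    dloss_dy = 2.0 * (y - target)   # how the loss changes with the output
    dloss_dw2 = dloss_dy * h        # chain rule through y = w2 * h
    dloss_dh = dloss_dy * w2        # pass the error back to the hidden value
    dloss_dw1 = dloss_dh * x        # chain rule through h = w1 * x
    return dloss_dw1, dloss_dw2

grad_w1, grad_w2 = backward(w1=0.5, w2=-1.0, x=2.0, target=3.0)
print(grad_w1, grad_w2)
```

The sign of each number says which way the weight should move, and the size says how strongly that weight influenced the error.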

The entire network gradually teaches itself by continuously asking:

Quote

Which internal settings produced the error?

What tiny changes would have reduced that error?

Apply those changes.

Try again.

The process repeats across billions or trillions of token predictions.

Over time the network converges towards weight values that produce increasingly accurate predictions.

In simple terms:

Quote

The model learns because every mistake becomes information about how the internal numerical relationships should be adjusted.

That adjustment process is where most of the mathematical sophistication of modern AI resides.


smfadmin


LLM Training vs Inference

« Reply #1 on: Yesterday at 04:22:52 PM »

A Systems Programmer's View

The Big Picture

A Large Language Model (LLM) operates in two completely different modes:

Training — learning from data.
Inference — using what was learned.

The same neural network architecture is used in both cases.

The difference is what happens after a prediction is made.

Inference (Normal ChatGPT Usage)

Inference is what happens when you ask a question and the model generates a reply.

Step 1 — Input

You type:

Why would the stored weight parameters be so specific?

Step 2 — Tokenisation

The text is broken into tokens:

Why
would
the
stored
weight
parameters
...

Tokens are not necessarily whole words. Common words may be one token, while rare words may become several tokens.
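To make the "one token or several" point concrete, here is a minimal sketch of a greedy longest-match splitter. The vocabulary below is invented for the example, and real tokenisers (for instance byte-pair encoding) learn their pieces from data and work differently in detail.

```python
# Hypothetical vocabulary; real tokenisers learn tens of thousands of pieces.
VOCAB = {"why", "would", "the", "stored", "weight", "param", "eters"}

def tokenise(word, vocab=VOCAB):
    """Split one lowercase word into the longest known pieces, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest remaining substring first, then shrink it.
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # Unknown character: emit it alone so the loop always advances.
            pieces.append(word[start])
            start += 1
    return pieces

print(tokenise("weight"))       # a common word stays whole
print(tokenise("parameters"))   # a rarer word is split into pieces
```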

Step 3 — Embeddings

Each token is converted into a vector.

Think:

Token -> Vector (list of numbers)

The model no longer processes words directly. It processes mathematical representations of words.
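In code, an embedding step can be pictured as a table lookup. The sketch below assumes a tiny hand-written table with 4 numbers per token; real models learn these numbers during training and use far longer vectors.

```python
# Hypothetical 4-dimensional embeddings, invented for illustration.
EMBEDDINGS = {
    "why":    [0.12, -0.40, 0.33, 0.05],
    "would":  [-0.21, 0.08, 0.47, -0.30],
    "weight": [0.50, 0.19, -0.11, 0.26],
}

def embed(tokens, table=EMBEDDINGS):
    """Replace each token with its vector (a list of numbers)."""
    return [table[token] for token in tokens]

vectors = embed(["why", "would", "weight"])
for vector in vectors:
    print(vector)
```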

Step 4 — Transformer Processing

The Transformer examines:

The current token.
Previous tokens.
The entire context window.

Using attention mechanisms, it produces a probability for every possible next token, from which the next token is chosen.
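A very stripped-down sketch of the attention idea follows: one query vector scores every token in the context, the scores become weights that sum to 1, and the output is a weighted mix of the tokens' value vectors. The vectors are made up, and the learned projection matrices, multiple heads and layers of a real Transformer are left out.

```python
import math

def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    mixed = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, mixed

# Hypothetical vectors for three context tokens.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
weights, mixed = attend([1.0, 0.0], keys, values)
print(weights)
print(mixed)
```

Tokens whose keys line up with the query get more weight, so they contribute more to the mixed output.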

Step 5 — Generate Next Token

Example:

Because

Step 6 — Feed Back Into Context

The growing conversation now becomes:

Why would the stored weight parameters be so specific?

Because

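Conceptually, the entire sequence is processed again. In practice, implementations cache intermediate results from earlier tokens so that work is not repeated.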

Step 7 — Repeat

Generate:

the

Then:

network

Then:

...

This continues until the reply is complete.
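Steps 5 to 7 amount to a simple loop. The sketch below swaps the real model for a hypothetical lookup table that only looks at the last token, purely to show the feed-back-and-repeat structure; the table contents and the "<end>" marker are assumptions.

```python
# Hypothetical next-token table standing in for a real model.
NEXT_TOKEN = {
    "?": "Because",
    "Because": "the",
    "the": "network",
    "network": "<end>",
}

def predict_next(context):
    """Toy 'model': uses only the last token. A real model uses all of them."""
    return NEXT_TOKEN.get(context[-1], "<end>")

def generate(prompt_tokens, max_new_tokens=10):
    context = list(prompt_tokens)
    reply = []
    for _ in range(max_new_tokens):
        token = predict_next(context)
        if token == "<end>":
            break
        reply.append(token)
        context.append(token)   # feed the new token back into the context
    return reply

reply = generate(["Why", "so", "specific", "?"])
print(reply)
```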

Important Point

During inference:

The weights never change.

The model is effectively read-only.

Inference is simply:

Input
  ↓
Prediction
  ↓
Output

No learning occurs.

During training, by contrast, the framework also records a compute graph: the full map of how the input was transformed into the final output during the forward pass. It contains every operation as a node; each node knows what it output during the forward pass and which inputs it depended on.

Backpropagation moves backwards through the compute graph using those intermediate values.
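For readers who think in data structures, here is a minimal sketch of such a graph. The Node class and backward function are a simplified, hypothetical design (only add and multiply are supported); real frameworks record many more operation types, but the bookkeeping is similar in spirit.

```python
class Node:
    """One operation in the compute graph (a simplified, hypothetical design)."""

    def __init__(self, value, parents=(), local_grads=()):
        self.value = value              # output recorded during the forward pass
        self.parents = parents          # inputs this node depended on
        self.local_grads = local_grads  # d(output)/d(input) for each parent
        self.grad = 0.0                 # filled in during the backward pass

    def __add__(self, other):
        return Node(self.value + other.value, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        return Node(self.value * other.value, (self, other),
                    (other.value, self.value))


def backward(output):
    # Order the nodes so that inputs come before the nodes that use them.
    order, seen = [], set()

    def visit(node):
        if id(node) not in seen:
            seen.add(id(node))
            for parent in node.parents:
                visit(parent)
            order.append(node)

    visit(output)
    output.grad = 1.0
    # Walk the graph in reverse, passing gradients back to each node's inputs.
    for node in reversed(order):
        for parent, local in zip(node.parents, node.local_grads):
            parent.grad += node.grad * local


w, x, b = Node(3.0), Node(2.0), Node(1.0)
y = w * x + b          # forward pass builds the graph
backward(y)            # backward pass walks it in reverse
print(y.value, w.grad, x.grad, b.grad)
```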

Training (Learning Phase)

Training starts similarly but contains additional steps.

Step 1 — Input

Training example:

The capital of France is Paris

Step 2 — Tokenisation

The text becomes tokens.

Step 3 — Forward Pass

The model predicts the next token.

For example:

The capital of France is London

Step 4 — Compare Against Known Answer

Expected:

Paris

Predicted:

London

An error exists.

Step 5 — Compute Loss

Loss is a numerical measure of error.

Think:

Loss = How Wrong Was The Prediction?

Example:

Loss = 0.83

Higher loss means larger error.

Lower loss means better prediction.
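One common way to get such a number is cross-entropy: the negative logarithm of the probability the model gave to the correct token. The sketch below assumes made-up probabilities, chosen so the result lands near the 0.83 used above; the post does not say which loss function produced that figure.

```python
import math

def cross_entropy(predicted_probs, correct_token):
    """Loss is the negative log of the probability given to the right answer."""
    return -math.log(predicted_probs[correct_token])

# Hypothetical probabilities the model assigns to the next token.
probs = {"London": 0.50, "Paris": 0.436, "Berlin": 0.064}
loss = cross_entropy(probs, "Paris")
print(round(loss, 2))

# A more confident correct prediction gives a lower loss.
better = cross_entropy({"London": 0.05, "Paris": 0.90, "Berlin": 0.05}, "Paris")
print(round(better, 2))
```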

Step 6 — Backpropagation

This is where training differs from inference.

The system asks:

Quote

Which weights contributed to this error?

The error signal is propagated backwards through the network.

Step 7 — Calculate Gradients Efficiently via Backpropagation

For each weight:

If this weight was slightly larger,
would the error improve or worsen?

If this weight was slightly smaller,
would the error improve or worsen?

These calculations produce gradients.

A gradient tells the optimiser which direction to move each weight, and how strongly.
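The "slightly larger, slightly smaller" question can be asked literally, by nudging a weight and re-measuring the loss. This sketch does that for a one-weight model with an assumed squared-error loss. Backpropagation arrives at the same kind of number without re-running the model once per weight, which is why it is the efficient route.

```python
def loss_fn(weight, x=2.0, target=3.0):
    """Hypothetical one-weight model: prediction = weight * x."""
    prediction = weight * x
    return (prediction - target) ** 2

def estimate_gradient(weight, nudge=1e-6):
    larger = loss_fn(weight + nudge)    # loss if the weight were slightly larger
    smaller = loss_fn(weight - nudge)   # loss if the weight were slightly smaller
    return (larger - smaller) / (2 * nudge)

grad = estimate_gradient(0.5)
print(grad)
```

Here the gradient is negative, meaning a slightly larger weight would reduce the loss.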

Step 8 — Adjust Weights

Tiny changes are made:

Weight A += 0.00001

Weight B -= 0.00003

Weight C += 0.000001

Not huge changes.

Tiny changes.
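The simplest update rule is to move each weight a small step against its gradient. The sketch below assumes plain gradient descent with made-up weights and gradients, picked so the changes match the three examples above. Production training usually uses more elaborate optimisers, so treat this as the basic idea only.

```python
def sgd_step(weights, gradients, learning_rate=1e-5):
    """Return new weights, each moved a small step against its gradient."""
    return {name: w - learning_rate * gradients[name]
            for name, w in weights.items()}

# Hypothetical current weights and their gradients.
weights = {"A": 0.25, "B": -0.13, "C": 0.07}
gradients = {"A": -1.0, "B": 3.0, "C": -0.1}

updated = sgd_step(weights, gradients)
for name in weights:
    print(name, weights[name], "->", updated[name])
```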

Step 9 — Repeat This Loop Billions of Times

Until convergence:
1. Loss stops decreasing.
2. The model performs well on unseen validation data.
3. Further weight updates no longer meaningfully reduce error on new data.
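A stopping rule along these lines can be sketched as a check on recent validation losses. The function name, the patience window and the improvement threshold are all assumptions for illustration; real training runs use their own criteria and often simply stop at a fixed compute budget.

```python
def has_converged(validation_losses, patience=3, min_improvement=1e-3):
    """True if the last `patience` checks did not beat the earlier best loss."""
    if len(validation_losses) <= patience:
        return False                      # not enough history yet
    best_before = min(validation_losses[:-patience])
    best_recent = min(validation_losses[-patience:])
    return best_before - best_recent < min_improvement

still_improving = has_converged([2.0, 1.5, 1.2, 1.0])
flat = has_converged([2.0, 1.0, 0.9, 0.9, 0.9, 0.9])
print(still_improving, flat)
```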

Step 10 — Next Example

The process repeats.

Again.

And again.

And again.

For months.

Why The Adjustments Are Tiny

Large weight changes would cause instability.

The optimiser therefore takes very small steps.

Think of steering a ship.

Small corrections made continuously are usually more accurate than dramatic turns.

The objective is gradual convergence toward lower error.
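The instability is easy to reproduce on a toy problem. The sketch below minimises an assumed loss of w squared (best value: w = 0) with two different step sizes; the specific learning rates are picked only to show the contrast.

```python
def descend(learning_rate, steps=20, start=1.0):
    """Minimise loss = w**2, whose gradient is 2*w."""
    w = start
    for _ in range(steps):
        w -= learning_rate * 2 * w
    return w

small = descend(0.1)   # each step shrinks w towards 0
large = descend(1.5)   # each step overshoots and lands further away
print(small, large)
```

With the small step the weight settles near the minimum; with the large step every correction overshoots, and the error grows instead of shrinking.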

Overfitting

Overfitting occurs when the model learns the training examples too specifically (i.e., it memorises instead of learning).

Instead of learning:

Quote

The underlying rule

it learns:

Quote

The exact examples

Example:

Memorising:

2 x 2 = 4
3 x 3 = 9
4 x 4 = 16

Generalising:

Quote

Understanding multiplication itself.

The goal of training is not merely memorisation.

The goal is generalisation.
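The contrast can be caricatured in a few lines. Both functions below are hypothetical stand-ins: one can only look up what it has seen, the other has captured the rule.

```python
# The three training examples from above, stored as a lookup table.
TRAINING = {(2, 2): 4, (3, 3): 9, (4, 4): 16}

def memoriser(a, b):
    """Perfect on the training examples, useless on anything else."""
    return TRAINING.get((a, b))   # None for unseen inputs

def generaliser(a, b):
    """Has learned the underlying rule, so unseen inputs still work."""
    return a * b

print(memoriser(3, 3), generaliser(3, 3))   # both correct on seen data
print(memoriser(5, 6), generaliser(5, 6))   # only one handles new data
```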

The Whole Distribution

Researchers often refer to:

Quote

The Distribution

This means:

Quote

The entire population of examples the model may encounter.

Not merely the training data.

For language models, this includes:

Books
Articles
Technical manuals
Emails
Forum posts
Conversations
Documentation

The training set is only a sample of the larger distribution.

The model must perform well across the entire distribution, not merely on examples it has already seen.
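Since the whole distribution can never be collected, a common proxy is to hold back some examples and never train on them. The sketch below assumes a simple random split; the function name, the 20% fraction and the placeholder documents are illustrative.

```python
import random

def split_examples(examples, validation_fraction=0.2, seed=0):
    """Shuffle the examples and hold back a fraction for validation."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, int(len(shuffled) * validation_fraction))
    return shuffled[n_val:], shuffled[:n_val]

documents = [f"document {i}" for i in range(10)]
train_set, validation_set = split_examples(documents)
print(len(train_set), len(validation_set))
```

Performance on the held-back set gives an estimate of how the model behaves on text it has not seen.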

Samuel's Checkers vs Modern LLMs

Arthur Samuel's Checkers Program:

Board Position
      ↓
Evaluation Function
      ↓
Move Selection
      ↓
Game Result
      ↓
Adjust Weights

Modern LLM:

Tokens
   ↓
Prediction
   ↓
Error
   ↓
Adjust Weights

The basic learning loop is remarkably similar.

The major difference is scale.

Samuel's program contained a relatively small number of meaningful parameters.

Modern LLMs contain billions of parameters.

A Systems Programmer's Summary

Training

READ TRAINING RECORD

RUN MODEL

COMPARE OUTPUT TO KNOWN ANSWER

COMPUTE ERROR

UPDATE PARAMETER TABLE

REPEAT

Inference

READ INPUT

RUN MODEL

EMIT OUTPUT

REPEAT UNTIL REPLY COMPLETE

The parameter table is not modified during inference.
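The two loops above translate almost line for line into code. The sketch below assumes a one-parameter "model" (output = weight * input) and a squared-error gradient, so it is a toy, but the shape of the two functions is the point: train returns a changed parameter, infer only reads it.

```python
def run_model(weight, x):
    return weight * x

def train(weight, records, learning_rate=0.01, epochs=200):
    for _ in range(epochs):
        for x, known_answer in records:        # READ TRAINING RECORD
            output = run_model(weight, x)      # RUN MODEL
            error = output - known_answer      # COMPARE and COMPUTE ERROR
            gradient = 2 * error * x
            weight -= learning_rate * gradient # UPDATE PARAMETER TABLE
    return weight

def infer(weight, inputs):
    # The parameter is only read here, never written.
    return [run_model(weight, x) for x in inputs]

# Hypothetical training records following the rule: answer = 3 * input.
records = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]
trained = train(0.0, records)
outputs = infer(trained, [4.0])
print(trained, outputs)
```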

The Single Most Important Difference

Inference asks:

Quote

Given these weights, what is the most likely next token?

Training asks:

Quote

How should these weights be adjusted so future predictions become less wrong?

Everything else in modern deep learning is built around these two processes.

« Last Edit: Yesterday at 05:31:34 PM by Chip »
