Author Topic: The AI Tokenisation Pipeline (Read 415 times)

Chip · « **on:** May 27, 2026, 09:03:07 PM »

https://www.youtube.com/embed/lTU2XAbd1Eo

The AI Tokenisation Pipeline

The AI tokenisation pipeline is the first subsystem you need to understand.

Everything else a large language model does sits on top of it.

AI models like GPT do not directly read text the way humans do.
They operate on tokens — small chunks of text converted into numerical representations.

The pipeline looks roughly like this:

Code: [Select]

Raw Text
   ↓
Normalization
   ↓
Tokenisation
   ↓
Vocabulary Lookup
   ↓
Token IDs
   ↓
Embeddings
   ↓
Transformer Processing
   ↓
Predicted Next Token
   ↓
Detokenisation
   ↓
Output Text

1. Raw Text Input

The process starts with ordinary text:

Code: [Select]

"The cat sat on the mat."

At this stage it is just Unicode characters.

The model does not operate on raw characters directly, so preprocessing begins.

2. Normalization

The text is cleaned and standardized.

Possible operations include:

Unicode normalization
Whitespace cleanup
Lowercasing (sometimes)
Control character removal
Encoding standardization

Example:

Code: [Select]

“Hello   world”
↓
"Hello world"

Modern LLMs usually preserve capitalization and punctuation because they contain semantic information.
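A minimal sketch of the cleanup step, using only the Python standard library. The function name `normalize` and the choice of NFC are assumptions for illustration; real tokenizers differ in which of these operations they apply, and some apply almost none.

```python
import re
import unicodedata

def normalize(text: str) -> str:
    """Illustrative cleanup: Unicode NFC, drop control characters, collapse whitespace."""
    text = unicodedata.normalize("NFC", text)
    # Remove control characters (Unicode category "Cc") but keep newline, tab and space.
    text = "".join(ch for ch in text if ch in "\n\t " or unicodedata.category(ch) != "Cc")
    # Collapse runs of whitespace into a single space and trim both ends.
    return re.sub(r"\s+", " ", text).strip()

print(normalize("Hello   world"))
```

Capitalization and punctuation are deliberately left alone here, in line with the note above.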

3. Tokenisation

This is the critical step.

The tokenizer breaks text into smaller reusable units called tokens.

These are NOT necessarily words.

Depending on the tokenizer:

Whole words may become tokens
Subwords may become tokens
Punctuation may become tokens
Spaces may become tokens
Fragments like "ing" or "pre" may become tokens

Example:

Code: [Select]

"The cat sat on the mat."

Might become:

Code: [Select]

["The", " cat", " sat", " on", " the", " mat", "."]

Notice:

Leading spaces are often embedded into tokens
Frequent patterns become single tokens
Rare words split into smaller pieces

Example of rare word splitting:

Code: [Select]

"tokenisation"
↓
["token", "isation"]

Or even:

Code: [Select]

["tok", "en", "isation"]

depending on the vocabulary.
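To see how the vocabulary decides the split, here is a rough sketch of greedy longest-match segmentation. This is a simplification (closer in spirit to WordPiece-style matching than to BPE merging), and both vocabularies are made up for the example.

```python
def greedy_subword_split(word, vocab):
    """Split `word` by repeatedly taking the longest vocabulary entry that matches."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate until it is found in the vocabulary.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # Nothing matched: fall back to a single character.
            end = start + 1
        pieces.append(word[start:end])
        start = end
    return pieces

larger_vocab = {"token", "isation", "tok", "en"}   # hypothetical
smaller_vocab = {"tok", "en", "isation"}           # hypothetical

print(greedy_subword_split("tokenisation", larger_vocab))
print(greedy_subword_split("tokenisation", smaller_vocab))
```

The same word comes out as two pieces with the first vocabulary and three with the second.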

4. Vocabulary Lookup

Each token maps to a numerical ID stored in the model vocabulary.

Example:

Code: [Select]

"The"     → 464
" cat"    → 3797
" sat"    → 3332
" on"     → 319
" the"    → 262
" mat"    → 2603
"."       → 13

The sentence becomes:

Code: [Select]

[464, 3797, 3332, 319, 262, 2603, 13]

This is now machine-readable.
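In code, the lookup is just a dictionary. The sketch below reuses the IDs quoted above; the entries for " on", " the" and " mat" are read off the ID sequence by position. The `encode` helper is a hypothetical name, and a real vocabulary holds tens of thousands of entries.

```python
# Toy vocabulary built from the IDs in the example above.
vocab = {"The": 464, " cat": 3797, " sat": 3332, " on": 319, " the": 262, " mat": 2603, ".": 13}

def encode(tokens, vocab):
    """Map each token string to its integer ID (raises KeyError for unknown tokens)."""
    return [vocab[t] for t in tokens]

tokens = ["The", " cat", " sat", " on", " the", " mat", "."]
print(encode(tokens, vocab))
```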

5. Embedding Layer

On their own, token IDs are arbitrary integers: ID 3797 is no closer in meaning to 3798 than to any other ID.

The embedding layer converts each token ID into a high-dimensional vector.

Example conceptually:

Code: [Select]

3797
↓
[-0.22, 1.91, 0.44, -0.11, ...]

These vectors encode semantic relationships.

Tokens with related meanings occupy nearby regions in vector space.

Example:

Code: [Select]

cat ≈ dog
Paris ≈ London
run ≈ walk

The embedding dimension may be:

768
1536
4096
12288

depending on model size.
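A small sketch of what "nearby in vector space" means, using cosine similarity. The 4-dimensional vectors are invented for the example; real embeddings are learned during training and have hundreds or thousands of dimensions.

```python
import math

# Hypothetical embedding table (values are made up, not taken from any model).
embedding_table = {
    "cat": [0.9, 0.8, 0.1, 0.0],
    "dog": [0.8, 0.9, 0.2, 0.1],
    "Paris": [0.1, 0.0, 0.9, 0.8],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

cat_dog = cosine_similarity(embedding_table["cat"], embedding_table["dog"])
cat_paris = cosine_similarity(embedding_table["cat"], embedding_table["Paris"])
print(f"cat~dog: {cat_dog:.2f}, cat~Paris: {cat_paris:.2f}")
```

With these toy values "cat" scores much higher against "dog" than against "Paris", which is the kind of structure a trained embedding layer tends to show.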

6. Positional Encoding

Transformers do not inherently understand sequence order.

So positional information is added.

Otherwise:

Code: [Select]

"dog bites man"

and

Code: [Select]

"man bites dog"

would appear identical.

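Positional information injects order awareness. In the original Transformer and in GPT-2 style models this is a position vector added to each token vector; many newer models instead use other schemes, such as rotary position embeddings applied inside attention.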

Conceptually:

Code: [Select]

Token Vector + Position Vector

This creates sequence structure.
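Here is a sketch of the additive version, using the sinusoidal scheme from the original Transformer paper. The token vectors are made up, and this is only one of several positional schemes in use.

```python
import math

def sinusoidal_position(pos, dim):
    """Sinusoidal position vector: sin on even indices, cos on odd indices."""
    return [
        math.sin(pos / 10000 ** (i / dim)) if i % 2 == 0
        else math.cos(pos / 10000 ** ((i - 1) / dim))
        for i in range(dim)
    ]

def add_positions(token_vectors):
    """Return token vector + position vector for every position in the sequence."""
    dim = len(token_vectors[0])
    return [
        [t + p for t, p in zip(vec, sinusoidal_position(pos, dim))]
        for pos, vec in enumerate(token_vectors)
    ]

# Hypothetical 4-dimensional token vectors.
emb = {"dog": [1.0, 0.0, 0.0, 0.0], "bites": [0.0, 1.0, 0.0, 0.0], "man": [0.0, 0.0, 1.0, 0.0]}
a = add_positions([emb[w] for w in ["dog", "bites", "man"]])
b = add_positions([emb[w] for w in ["man", "bites", "dog"]])

print(a[0] != b[2])  # "dog" at position 0 vs "dog" at position 2
print(a[1] == b[1])  # "bites" sits at position 1 in both sentences
```

The same token ends up with a different vector depending on where it sits, so the two sentences no longer look alike to the layers that follow.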

7. Transformer Processing

Now the transformer layers operate on the embeddings.

Core mechanisms include:

Self-attention
Feed-forward networks
Residual connections
Layer normalization

Self-attention lets each token weigh the other tokens in the context when building its own representation (in GPT-style decoders, only the tokens that come before it).

Example:

Code: [Select]

"The animal didn't cross the street because it was tired."

The model learns:

Code: [Select]

"it" → "animal"

through attention weighting.

Each transformer layer progressively refines the internal representation.

Large models may contain:

24 layers
48 layers
96+ layers

with billions of parameters.
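The attention weighting described above can be sketched as scaled dot-product attention in plain Python. This is a bare-bones illustration: a single head, no masking, and no learned projections, with made-up 2-dimensional vectors.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Scaled dot-product attention over a short sequence of vectors (no masking)."""
    d = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Relevance score of this query against every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Each output is a weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical vectors for three tokens. In a real model, queries, keys and values
# come from learned linear projections of the token representations.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(x, x, x)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of weights sums to 1 and says how strongly that token draws on every other token.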

8. Next Token Prediction

LLMs fundamentally predict probabilities for the next token.

Given:

Code: [Select]

"The cat sat on the"

the model estimates:

Code: [Select]

mat      → 62%
floor    → 12%
chair    → 4%
roof     → 2%

A decoding strategy then selects the next token.

Methods include:

Greedy decoding
Temperature sampling
Top-k sampling
Top-p (nucleus) sampling

Temperature controls randomness.

Low temperature:

Code: [Select]

More deterministic

High temperature:

Code: [Select]

More creative/random
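A rough sketch of the decoding options listed above, in plain Python. The logits are made-up scores, and the helper names are just for illustration.

```python
import math
import random

def softmax_with_temperature(logits, temperature=1.0):
    """Divide logits by the temperature before softmax; lower values sharpen the distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalise."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]

def top_p_filter(probs, p):
    """Keep the smallest set of most likely tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]

tokens = ["mat", "floor", "chair", "roof"]
logits = [3.0, 1.4, 0.3, -0.4]  # made-up scores

greedy = tokens[logits.index(max(logits))]           # greedy decoding: always the top token
cold = softmax_with_temperature(logits, temperature=0.5)
hot = softmax_with_temperature(logits, temperature=2.0)
sampled = random.choices(tokens, weights=top_k_filter(hot, k=2))[0]
print(greedy, round(cold[0], 2), round(hot[0], 2), sampled)
```

At temperature 0.5 the top token takes a larger share of the probability than at 2.0, which is the "more deterministic" versus "more random" behaviour described above.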

9. Autoregressive Loop

Once a token is generated:

Code: [Select]

"The cat sat on the mat"

the new token is appended to context and the process repeats.

This creates the rolling generation loop:

Code: [Select]

Input → Predict Token → Append → Predict Again

once per generated token, which can mean hundreds or thousands of iterations for a long response.
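The loop itself is short. In the sketch below the "model" is a hypothetical lookup table keyed on the previous token only, purely to keep the example runnable; a real LLM conditions on the whole context.

```python
# Hypothetical stand-in for a model: previous token -> next-token probabilities.
NEXT_TOKEN_PROBS = {
    "The": {" cat": 0.9, " mat": 0.1},
    " cat": {" sat": 0.8, ".": 0.2},
    " sat": {" on": 0.9, ".": 0.1},
    " on": {" the": 1.0},
    " the": {" mat": 0.62, " floor": 0.38},
    " mat": {".": 1.0},
    " floor": {".": 1.0},
}

def generate(prompt_tokens, max_new_tokens=10, stop_token="."):
    context = list(prompt_tokens)
    for _ in range(max_new_tokens):
        probs = NEXT_TOKEN_PROBS.get(context[-1])
        if probs is None:
            break
        next_token = max(probs, key=probs.get)  # greedy decoding
        context.append(next_token)              # append, then predict again
        if next_token == stop_token:
            break
    return context

print("".join(generate(["The"])))  # prints: The cat sat on the mat.
```

Generation stops at a stop token or a length limit, whichever comes first.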

10. Detokenisation

Finally, tokens are converted back into readable text.

Example:

Code: [Select]

["The", " cat", " sat", "."]
↓
"The cat sat."

The user only sees the reconstructed text.
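A minimal sketch of the reverse lookup, reusing the IDs from the earlier example. With byte-level tokenizers a real decoder concatenates bytes and then decodes them as UTF-8; this toy version works on strings directly.

```python
vocab = {"The": 464, " cat": 3797, " sat": 3332, ".": 13}
id_to_token = {token_id: token for token, token_id in vocab.items()}

def decode(token_ids):
    """Look up each ID and concatenate; the spaces are already part of the tokens."""
    return "".join(id_to_token[i] for i in token_ids)

print(decode([464, 3797, 3332, 13]))  # prints: The cat sat.
```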

Why Tokenisation Matters

Tokenisation heavily affects:

Model efficiency
Context window usage
Reasoning quality
Multilingual performance
Inference cost
Compression efficiency

For example:

Code: [Select]

"antidisestablishmentarianism"

may consume many tokens, whereas common phrases may compress into very few.

This is why token counts vary dramatically between languages and writing styles.
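One reason for the gap between languages can be seen without any tokenizer at all. A byte-level tokenizer starts from UTF-8 bytes, and non-Latin scripts need more bytes per character. The sketch below only counts bytes; how far learned merges close the gap depends on the tokenizer's training data, which is not modelled here.

```python
def utf8_length(text: str) -> int:
    """Number of bytes a byte-level tokenizer would start from before any merges."""
    return len(text.encode("utf-8"))

samples = {"English": "cat", "Russian": "кот", "Chinese": "猫"}
for language, word in samples.items():
    print(f"{language}: {len(word)} characters, {utf8_length(word)} UTF-8 bytes")
```

All three words mean "cat", yet the Russian one starts from twice as many bytes as the English one, and the single Chinese character takes as many bytes as the whole English word.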

Common Tokenisation Algorithms

BPE (Byte Pair Encoding): Merges common symbol pairs iteratively.
WordPiece: Used heavily in BERT-style models.
SentencePiece: Language-independent tokenizer framework.
Unigram Language Model: Probabilistic subword segmentation.
Byte-level BPE: Handles arbitrary Unicode robustly.

Modern GPT-family systems commonly use variants of byte-level BPE.
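To make "merges common symbol pairs iteratively" concrete, here is a toy BPE training loop. The corpus and its word frequencies are made up, and production tokenizers add details (byte-level alphabets, pre-tokenisation rules) that this sketch leaves out.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# Made-up corpus: each word as a tuple of characters, with its frequency.
words = {tuple("low"): 5, tuple("lower"): 2, tuple("newest"): 6, tuple("widest"): 3}
merges = []
for _ in range(3):
    pair = most_frequent_pair(words)
    if pair is None:
        break
    merges.append(pair)
    words = merge_pair(words, pair)
print(merges)
```

The learned merge list, applied in order, is what later turns new text into tokens; frequent sequences such as "est" end up as single symbols.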

Important Insight

LLMs do NOT internally manipulate words or meanings directly.

Internally they process:

Code: [Select]

Vectors
Matrices
Attention weights
Probability distributions

Meaning emerges statistically from enormous learned relationships between token patterns.

That is the core trick behind modern large language models.

smfadmin · « **Reply #1 on:** May 30, 2026, 05:38:55 AM »

TOKEN LIFECYCLE IN MODERN AI SYSTEMS

1. INPUT → TOKENISATION
Raw text is converted into tokens.

- Text is split into tokens, typically subword units (a token may also be a whole word or a single character)
- Each token becomes a numeric ID in a vocabulary

Example:
"mainframe systems work"
→ ["main", "frame", " systems", " work"] (conceptual)

Result:
text → token IDs

2. TOKEN EMBEDDING
Token IDs are mapped into vectors.

- Each token ID → high-dimensional embedding vector
- Encodes learned semantic relationships

Result:
token IDs → embedding space vectors

3. CONTEXT ASSEMBLY
Tokens are assembled into a single sequence:

- User input
- System instructions
- Conversation history
- Tool metadata (if applicable)

This forms the context window.

Constraint:
- Fixed maximum size
- Older tokens may be truncated

Result:
ordered token sequence in context window
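A sketch of one possible truncation policy: keep the system instructions and the newest user input, and drop the oldest history first. The function name, the ordering and the policy are assumptions for illustration; real systems vary.

```python
def assemble_context(system_ids, history_ids, user_ids, max_tokens):
    """Concatenate system + history + user tokens, dropping the oldest history first."""
    budget = max_tokens - len(system_ids) - len(user_ids)
    if budget < 0:
        raise ValueError("system prompt and user input alone exceed the context window")
    # Keep only the most recent `budget` history tokens.
    kept_history = history_ids[-budget:] if budget > 0 else []
    return system_ids + kept_history + user_ids

context = assemble_context(
    system_ids=[1, 2],
    history_ids=[10, 11, 12, 13, 14],
    user_ids=[20, 21],
    max_tokens=7,
)
print(context)  # prints: [1, 2, 12, 13, 14, 20, 21]
```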

4. TRANSFORMER PROCESSING (ATTENTION LIFECYCLE)

Each token interacts with others via attention:

- Tokens “attend” to other tokens
- Compute relevance weights
- Build contextual representation

Transformation stages (a rough generalisation, not a strict division):
- syntax understanding (early layers)
- semantic relationships (middle layers)
- task reasoning (late layers)

Result:
static embeddings → context-aware representations

5. LAYERED INTERNAL TRANSFORMATION

Across transformer layers:

token representation evolves repeatedly:
- Layer N transforms Layer N-1
- information is progressively refined

Result:
tokens become contextual feature vectors

6. OUTPUT TOKEN PREDICTION (DECODING)

The model generates output step-by-step:

- Computes probability distribution over vocabulary
- Selects next token
- Repeats autoregressively

Result:
token-by-token generation loop

7. DETOKENISATION

Generated tokens are converted back to text:

- token IDs → subwords → human-readable text
- formatting restored

Result:
final response text

8. EPHEMERAL NATURE OF TOKENS

By default:

- tokens are not stored permanently
- exist only within context window
- discarded when out of scope

Lifecycle:
input → compute → output → discard

9. TRAINING-TIME TOKEN LIFECYCLE (SEPARATE PATH)

This is important, so take note.

During training:

A. Token ingestion
- large datasets tokenised

B. Forward pass
- model predicts next tokens

C. Loss calculation
- compare prediction vs actual tokens

D. Backpropagation
- update model weights

Result:
tokens influence model parameters permanently
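Steps B to D can be sketched on a deliberately tiny stand-in: one logit per vocabulary entry acts as the "parameters". A real model backpropagates through every layer and updates weight matrices, so treat this only as an illustration of loss and gradient.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    """Loss = -log(probability assigned to the actual next token)."""
    return -math.log(softmax(logits)[target])

def sgd_step(logits, target, lr=0.5):
    """Gradient of cross-entropy w.r.t. the logits is softmax(logits) - one_hot(target)."""
    probs = softmax(logits)
    grads = [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]
    return [x - lr * g for x, g in zip(logits, grads)]

logits = [0.0, 0.0, 0.0, 0.0]  # stand-in parameters for a 4-token vocabulary
target = 2                     # index of the token that actually came next

before = cross_entropy(logits, target)   # B + C: predict, then compare
logits = sgd_step(logits, target)        # D: update
after = cross_entropy(logits, target)
print(f"loss before: {before:.3f}, after: {after:.3f}")
```

The loss drops after the update, which is the sense in which training tokens leave a lasting mark on the parameters.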

10. SYSTEM MAPPING (ENGINEERING ANALOGY)

Runtime system:
- input tokens = job queue
- attention = execution engine
- output tokens = results

Training system:
- tokens = batch input data
- loss = reconciliation step
- weight update = system image update

SUMMARY

Tokens are:
transient computational units that are continuously reinterpreted inside a fixed model to produce output, without persistence unless explicitly trained into weights.
