The AI Tokenisation Pipeline
The AI tokenisation pipeline is the first subsystem you need to understand.
Everything else in a language model sits on top of it.
AI models like GPT do not directly read text the way humans do.
They operate on tokens: small chunks of text converted into numerical representations.
The pipeline looks roughly like this:
Raw Text
↓
Normalization
↓
Tokenisation
↓
Vocabulary Lookup
↓
Token IDs
↓
Embeddings
↓
Transformer Processing
↓
Predicted Next Token
↓
Detokenisation
↓
Output Text
1. Raw Text Input
The process starts with ordinary text:
"The cat sat on the mat."
At this stage it is just Unicode characters.
A model cannot process raw characters efficiently, so preprocessing begins.
2. Normalization
The text is cleaned and standardized.
Possible operations include:
- Unicode normalization
- Whitespace cleanup
- Lowercasing (sometimes)
- Control character removal
- Encoding standardization
Example:
“Hello world”
↓
"Hello world"
Modern LLMs usually preserve capitalization and punctuation because they contain semantic information.
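The operations listed above can be sketched with Python's standard library. This is a minimal illustration, not any particular model's preprocessing: the `normalize_text` name and the quote mapping are assumptions made for the example, and NFKC is only one of several Unicode normalization forms a pipeline might choose. Capitalization and punctuation are left alone, in line with the note above.

```python
import re
import unicodedata

# Hypothetical mapping from typographic quotes to ASCII quotes.
QUOTE_MAP = str.maketrans({
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2018": "'", "\u2019": "'",   # curly single quotes
})


def normalize_text(text: str) -> str:
    """Apply a few common cleanup steps; real pipelines vary."""
    # Unicode normalization (folds ligatures, full-width forms, etc.).
    text = unicodedata.normalize("NFKC", text)
    # Standardize typographic quotes.
    text = text.translate(QUOTE_MAP)
    # Remove control characters (category Cc), keeping newline and tab.
    text = "".join(
        ch for ch in text
        if ch in "\n\t" or unicodedata.category(ch) != "Cc"
    )
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()


print(normalize_text("\u201cHello   world\u201d"))
```

Lowercasing is deliberately omitted here, since it would discard information that modern tokenizers usually keep.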
3. Tokenisation
This is the critical step.
The tokenizer breaks text into smaller reusable units called tokens.
These are NOT necessarily words.
Depending on the tokenizer:
- Whole words may become tokens
- Subwords may become tokens
- Punctuation may become tokens
- Spaces may become tokens
- Fragments like "ing" or "pre" may become tokens
Example:
"The cat sat on the mat."
Might become:
["The", " cat", " sat", " on", " the", " mat", "."]
Notice:
- Leading spaces are often embedded into tokens
- Frequent patterns become single tokens
- Rare words split into smaller pieces
Example of rare word splitting:
"tokenisation"
↓
["token", "isation"]
Or even:
["tok", "en", "isation"]
depending on the vocabulary.
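To make the splitting behaviour concrete, here is a toy tokenizer that greedily takes the longest vocabulary entry at each position and falls back to single characters. It is a simplification, not BPE or any production algorithm, and `TOY_VOCAB` is a hand-written vocabulary invented for this example.

```python
# Hypothetical vocabulary; real vocabularies are learned and far larger.
TOY_VOCAB = {"The", " cat", " sat", " on", " the", " mat", ".",
             "token", "isation"}


def tokenize(text, vocab):
    """Greedy longest-match segmentation with a single-character fallback."""
    max_len = max(len(entry) for entry in vocab)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until something matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


print(tokenize("The cat sat on the mat.", TOY_VOCAB))
print(tokenize("tokenisation", TOY_VOCAB))
```

Note that "cat" without its leading space is not in this vocabulary, so it would fall back to individual characters. That mirrors how a word unknown to a real tokenizer breaks into smaller pieces.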
4. Vocabulary Lookup
Each token maps to a numerical ID stored in the model vocabulary.
Example:
"The" → 464
" cat" → 3797
" sat" → 3332
"." → 13
The sentence becomes:
[464, 3797, 3332, 319, 262, 2603, 13]
This is now machine-readable.
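In code, the lookup is essentially a dictionary access. The sketch below uses the IDs from the example above; the pairings for " on", " the", and " mat" are inferred from their positions in the ID list, and the `encode` helper is a name chosen for illustration.

```python
# Token-to-ID table taken from the example above (a tiny slice of a vocabulary).
TOKEN_TO_ID = {
    "The": 464,
    " cat": 3797,
    " sat": 3332,
    " on": 319,
    " the": 262,
    " mat": 2603,
    ".": 13,
}


def encode(tokens):
    """Map each token string to its integer ID.

    A token missing from the table raises KeyError in this sketch.
    """
    return [TOKEN_TO_ID[token] for token in tokens]


ids = encode(["The", " cat", " sat", " on", " the", " mat", "."])
print(ids)
```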
5. Embedding Layer
Raw token IDs are meaningless integers.
The embedding layer converts each token ID into a high-dimensional vector.
Example conceptually:
3797
↓
[-0.22, 1.91, 0.44, -0.11, ...]
These vectors encode semantic relationships.
Tokens with related meanings occupy nearby regions in vector space.
Example:
cat ≈ dog
Paris ≈ London
run ≈ walk
The embedding dimension varies with model size.
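An embedding layer is a lookup table with one vector per token ID, and "nearby" is usually measured with cosine similarity. The sketch below uses a made-up 4-dimensional table with hypothetical IDs 0 to 3; the numbers are hand-picked to illustrate the idea and are not taken from any trained model.

```python
import math

# Hypothetical embedding table: token ID -> vector (values invented).
EMBEDDING_TABLE = {
    0: [0.9, 0.8, 0.1, 0.0],  # "cat"
    1: [0.8, 0.9, 0.2, 0.1],  # "dog"
    2: [0.1, 0.0, 0.9, 0.8],  # "Paris"
    3: [0.0, 0.2, 0.8, 0.9],  # "London"
}


def embed(token_ids):
    """Look up one vector per token ID."""
    return [EMBEDDING_TABLE[token_id] for token_id in token_ids]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


cat, dog, paris, london = embed([0, 1, 2, 3])
print(cosine_similarity(cat, dog) > cosine_similarity(cat, paris))
```

In a trained model these vectors are learned rather than written by hand, and they have hundreds or thousands of dimensions.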
6. Positional Encoding
Transformers do not inherently understand sequence order.
So positional information is added.
Otherwise:
"dog bites man"
and
"man bites dog"
would appear identical.
Position embeddings inject order awareness.
Conceptually:
Token Vector + Position Vector
This creates sequence structure.
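The addition can be sketched as follows. This example uses the fixed sinusoidal encoding from the original Transformer design, which is one option among several; many models instead learn their position embeddings or use other schemes. The `TOY_EMBEDDINGS` values are invented for illustration.

```python
import math


def positional_encoding(position, dim):
    """Sinusoidal position vector: sine on even indices, cosine on odd."""
    vec = []
    for i in range(dim):
        angle = position / (10000 ** ((2 * (i // 2)) / dim))
        vec.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return vec


def add_positions(token_vectors):
    """Return token vector + position vector for each position."""
    return [
        [t + p for t, p in zip(vec, positional_encoding(pos, len(vec)))]
        for pos, vec in enumerate(token_vectors)
    ]


# Hypothetical token embeddings (invented values).
TOY_EMBEDDINGS = {
    "dog": [1.0, 0.0, 0.0, 0.0],
    "bites": [0.0, 1.0, 0.0, 0.0],
    "man": [0.0, 0.0, 1.0, 0.0],
}

a = add_positions([TOY_EMBEDDINGS[w] for w in ["dog", "bites", "man"]])
b = add_positions([TOY_EMBEDDINGS[w] for w in ["man", "bites", "dog"]])
print(a != b)  # the two word orders now produce different inputs
```

Without the position term, both sentences would contain exactly the same set of vectors.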
7. Transformer Processing
Now the transformer layers operate on the embeddings.
Core mechanisms include:
- Self-attention
- Feed-forward networks
- Residual connections
- Layer normalization
The attention mechanism lets each token dynamically weight and draw on other tokens in the context (in GPT-style models, only the tokens that come before it).
Example:
"The animal didn't cross the street because it was tired."
The model learns:
"it" → "animal"
through attention weighting.
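Attention weighting can be sketched as scaled dot-product attention. The version below is deliberately stripped down: it has a single head, no learned query/key/value projections, and no causal mask, all of which a real transformer layer would include. The function names are chosen for this example.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    dim = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(dim).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
            for k in keys
        ]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        mixed = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights


# Three toy token vectors; the first and third point the same way.
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
outputs, weights = attention(vectors, vectors, vectors)
print(weights[0])
```

Here the first token puts more weight on itself and the third token than on the second, because their vectors are more similar. A trained model learns projections that make "it" score highly against "animal" in the same way.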
Each transformer layer progressively refines the internal representation.
Large models may contain:
- 24 layers
- 48 layers
- 96+ layers
with billions of parameters.
8. Next Token Prediction
LLMs fundamentally predict probabilities for the next token.
Given:
"The cat sat on the"
the model estimates:
mat → 62%
floor → 12%
chair → 4%
roof → 2%
A decoding strategy then selects the next token.
Methods include:
- Greedy decoding
- Temperature sampling
- Top-k sampling
- Top-p (nucleus) sampling
Temperature controls randomness:
- Low temperature: more deterministic
- High temperature: more creative/random
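The decoding methods above can be sketched on the example distribution. As an assumption for this sketch, the four listed candidates are treated as the whole vocabulary and renormalized; a real model produces scores over its entire vocabulary and applies temperature to logits before the softmax, which is mathematically equivalent to the form used here.

```python
import math
import random

# Probabilities from the example above (remaining mass ignored in this sketch).
next_token_probs = {"mat": 0.62, "floor": 0.12, "chair": 0.04, "roof": 0.02}


def apply_temperature(probs, temperature):
    """Rescale a distribution: below 1 sharpens it, above 1 flattens it."""
    scaled = {tok: math.exp(math.log(p) / temperature)
              for tok, p in probs.items()}
    total = sum(scaled.values())
    return {tok: s / total for tok, s in scaled.items()}


def top_k(probs, k):
    """Keep the k most likely tokens and renormalize."""
    kept = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}


def top_p(probs, p_limit):
    """Keep the smallest set of top tokens whose mass reaches p_limit."""
    kept, cumulative = [], 0.0
    for tok, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept.append((tok, p))
        cumulative += p
        if cumulative >= p_limit:
            break
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}


def greedy(probs):
    """Always pick the single most likely token."""
    return max(probs, key=probs.get)


def sample(probs, rng):
    """Draw one token at random according to its probability."""
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]


print(greedy(next_token_probs))  # mat
print(sample(top_k(apply_temperature(next_token_probs, 0.7), 2),
             random.Random(0)))
```

In practice these steps are often chained: temperature first, then a top-k or top-p filter, then sampling.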
9. Autoregressive Loop
Once a token is generated:
"The cat sat on the mat"
the new token is appended to context and the process repeats.
This creates the rolling generation loop:
Input → Predict Token → Append → Predict Again
repeated once per generated token, which can mean hundreds or thousands of iterations per response.
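The loop itself is short. In the sketch below, `toy_predict_next` is a hypothetical stand-in for the model: it looks up the last token in a hand-written table instead of running a transformer. The `"<end>"` stop token is also an assumption made for the example.

```python
def generate(context, predict_next, max_new_tokens, stop_token=None):
    """Repeatedly predict one token and append it to the context."""
    tokens = list(context)
    for _ in range(max_new_tokens):
        next_token = predict_next(tokens)  # the model sees the full context
        tokens.append(next_token)
        if next_token == stop_token:
            break
    return tokens


# Hypothetical continuation table standing in for a trained model.
TOY_CONTINUATIONS = {" the": " mat", " mat": ".", ".": "<end>"}


def toy_predict_next(tokens):
    """Pick the next token from the table based on the last token only."""
    return TOY_CONTINUATIONS.get(tokens[-1], "<end>")


prompt = ["The", " cat", " sat", " on", " the"]
print(generate(prompt, toy_predict_next, max_new_tokens=10, stop_token="<end>"))
```

A real model conditions on the whole context rather than just the last token, but the control flow is the same.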
10. Detokenisation
Finally, tokens are converted back into readable text.
Example:
["The", " cat", " sat", "."]
↓
"The cat sat."
The user only sees the reconstructed text.
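Because leading spaces live inside the tokens, reconstruction is a reverse lookup followed by plain concatenation. The sketch below reuses IDs from the earlier vocabulary example; `detokenize` is a name chosen for illustration.

```python
# Reverse table: ID -> token string (IDs from the vocabulary example).
ID_TO_TOKEN = {464: "The", 3797: " cat", 3332: " sat", 13: "."}


def detokenize(token_ids):
    """Map IDs back to token strings and join them without separators."""
    return "".join(ID_TO_TOKEN[token_id] for token_id in token_ids)


print(detokenize([464, 3797, 3332, 13]))  # The cat sat.
```

Byte-level tokenizers add one more step, decoding the joined bytes as UTF-8, which this sketch skips.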
Why Tokenisation Matters
Tokenisation heavily affects:
- Model efficiency
- Context window usage
- Reasoning quality
- Multilingual performance
- Inference cost
- Compression efficiency
For example:
"antidisestablishmentarianism"
may consume many tokens, whereas common phrases may compress into very few.
This is why token counts vary dramatically between languages and writing styles.
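One rough way to see the language effect: a byte-level tokenizer starts from UTF-8 bytes, so the byte count of a string is an upper bound on its token count before any merges apply. The sketch below only counts bytes. Actual token counts depend on the merges a specific tokenizer has learned, which this example does not model.

```python
def utf8_byte_count(text):
    """Number of UTF-8 bytes, the starting units for a byte-level tokenizer."""
    return len(text.encode("utf-8"))


samples = {
    "English": "hello",
    "Japanese": "こんにちは",
}

for language, text in samples.items():
    print(language, len(text), "characters,", utf8_byte_count(text), "bytes")
```

Both samples have five characters, but the Japanese one starts from three times as many bytes, so it needs more learned merges to reach the same token count.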
Common Tokenisation Algorithms
- BPE (Byte Pair Encoding): merges common symbol pairs iteratively.
- WordPiece: used heavily in BERT-style models.
- SentencePiece: language-independent tokenizer framework.
- Unigram Language Model: probabilistic subword segmentation.
- Byte-level BPE: handles arbitrary Unicode robustly.
Modern GPT-family systems commonly use variants of byte-level BPE.
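The "merge common symbol pairs iteratively" idea behind BPE can be sketched as a small training loop. This is a character-level toy on a made-up word-frequency table, not the byte-level variant or any production implementation; the tie-breaking rule (prefer the lexicographically larger pair) is an arbitrary choice made to keep the output deterministic.

```python
from collections import Counter


def count_pairs(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(words, pair):
    """Replace every occurrence of the pair with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_bpe(word_counts, num_merges):
    """Learn merge rules by repeatedly merging the most frequent pair."""
    words = {tuple(word): freq for word, freq in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(words)
        if not pairs:
            break
        # Most frequent pair; ties go to the lexicographically larger pair.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        words = merge_pair(words, best)
    return merges, words


# Hypothetical corpus statistics: word -> count.
merges, words = train_bpe({"low": 5, "lower": 2, "lowest": 3}, num_merges=2)
print(merges)  # [('o', 'w'), ('l', 'ow')]
```

After two merges the frequent stem "low" has become a single symbol, while the rarer suffixes remain split into characters.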
Important Insight
LLMs do NOT internally manipulate words or meanings directly.
Internally they process:
- Vectors
- Matrices
- Attention weights
- Probability distributions
Meaning emerges statistically from enormous learned relationships between token patterns.
That is the core trick behind modern large language models.