dopetalk does not endorse any advertised product nor does it accept any liability for its use or misuse


Our Discord Notification Server invitation link is https://discord.gg/jB2qmRrxyD

Author Topic: The AI Tokenisation Pipeline  (Read 9 times)

Online Chip (OP)

The AI Tokenisation Pipeline
« on: Today at 09:03:07 PM »


The tokenisation pipeline is the first subsystem you need to understand.

Everything else a large language model (LLM) does sits on top of it.

AI models like GPT do not read text directly the way humans do.
They operate on tokens: small chunks of text, each converted into a numerical ID.

The pipeline looks roughly like this:

Code: [Select]
Raw Text
   ↓
Normalization
   ↓
Tokenisation
   ↓
Vocabulary Lookup
   ↓
Token IDs
   ↓
Embeddings
   ↓
Positional Encoding
   ↓
Transformer Processing
   ↓
Predicted Next Token
   ↓
Detokenisation
   ↓
Output Text

1. Raw Text Input

The process starts with ordinary text:

Code: [Select]
"The cat sat on the mat."

At this stage it is just a sequence of Unicode characters.

A neural network works on numbers, not characters, and feeding it one character at a time would make sequences very long, so preprocessing begins.

2. Normalization

The text is cleaned and standardized.

Possible operations include:

  • Unicode normalization
  • Whitespace cleanup
  • Lowercasing (sometimes)
  • Control character removal
  • Encoding standardization

Example:

Code: [Select]
“Hello   world”
   ↓
"Hello world"

Modern LLMs usually preserve capitalization and punctuation because they contain semantic information.
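
As a rough sketch of the cleanup steps listed above, here is one way to do them with the Python standard library. The function name normalize and the choice of NFKC are my own assumptions for illustration; real tokenizers each pick their own rules, and some do almost no normalization at all.

```python
import re
import unicodedata

def normalize(text: str) -> str:
    """Illustrative cleanup: Unicode NFKC, drop control chars, collapse whitespace."""
    # NFKC folds compatibility forms (ligatures, full-width letters) into plain ones.
    text = unicodedata.normalize("NFKC", text)
    # Remove control characters (Unicode category "Cc") except newline and tab.
    text = "".join(
        ch for ch in text if ch in "\n\t" or unicodedata.category(ch) != "Cc"
    )
    # Collapse runs of spaces/tabs into a single space and trim the ends.
    return re.sub(r"[ \t]+", " ", text).strip()

print(normalize("Hello   world"))  # Hello world
```

Note that case and punctuation are left alone here, in line with the point above about modern LLMs.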

3. Tokenisation

This is the critical step.

The tokenizer breaks text into smaller reusable units called tokens.

These are NOT necessarily words.

Depending on the tokenizer:

  • Whole words may become tokens
  • Subwords may become tokens
  • Punctuation may become tokens
  • Spaces may become tokens
  • Fragments like "ing" or "pre" may become tokens

Example:

Code: [Select]
"The cat sat on the mat."

Might become:

Code: [Select]
["The", " cat", " sat", " on", " the", " mat", "."]

Notice:
  • Leading spaces are often embedded into tokens
  • Frequent patterns become single tokens
  • Rare words split into smaller pieces

Example of rare word splitting:

Code: [Select]
"tokenisation"
   ↓
["token", "isation"]

Or even:

Code: [Select]
["tok", "en", "isation"]

depending on the vocabulary.
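
To make the splitting concrete, here is a minimal sketch of a greedy longest-match tokenizer over a tiny hand-written vocabulary. This is closer in spirit to WordPiece-style matching than to how BPE actually applies its merges, and TOY_VOCAB and tokenize are made-up names for this example only.

```python
def tokenize(text: str, vocab: set[str]) -> list[str]:
    """Greedy longest-match segmentation against a fixed vocabulary."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until one is in the vocab.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # Unknown character: emit it on its own so nothing is lost.
            tokens.append(text[i])
            i += 1
    return tokens

# Hypothetical toy vocabulary; note the leading spaces baked into some tokens.
TOY_VOCAB = {"The", " cat", " sat", " on", " the", " mat", ".",
             "token", "isation", "tok", "en"}

print(tokenize("The cat sat on the mat.", TOY_VOCAB))
print(tokenize("tokenisation", TOY_VOCAB))
# With "token" missing, the same word falls back to smaller pieces.
print(tokenize("tokenisation", TOY_VOCAB - {"token"}))
```

The last call shows why the split depends on the vocabulary: take one entry away and the word breaks into three pieces instead of two.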

4. Vocabulary Lookup

Each token maps to a numerical ID stored in the model vocabulary.

Example:

Code: [Select]
"The"     → 464
" cat"    → 3797
" sat"    → 3332
" on"     → 319
" the"    → 262
" mat"    → 2603
"."       → 13

The sentence becomes:

Code: [Select]
[464, 3797, 3332, 319, 262, 2603, 13]

This is now machine-readable.
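
In code, the lookup is just a dictionary. The sketch below pairs the tokens and IDs from the example above in order (the pairing for " on", " the" and " mat" is inferred from that order). UNK_ID is a hypothetical placeholder I added; byte-level tokenizers normally never need one.

```python
# Token -> ID table built from the illustrative IDs in the example.
VOCAB = {"The": 464, " cat": 3797, " sat": 3332, " on": 319,
         " the": 262, " mat": 2603, ".": 13}
UNK_ID = -1  # hypothetical marker for tokens missing from this toy table

def encode(tokens):
    """Map each token string to its integer ID."""
    return [VOCAB.get(tok, UNK_ID) for tok in tokens]

ids = encode(["The", " cat", " sat", " on", " the", " mat", "."])
print(ids)  # [464, 3797, 3332, 319, 262, 2603, 13]
```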

5. Embedding Layer

On their own, token IDs are arbitrary index numbers. ID 3797 is not "bigger" or "closer" to anything than ID 13 in any meaningful sense.

The embedding layer converts each token ID into a high-dimensional vector.

Example conceptually:

Code: [Select]
3797
   ↓
[-0.22, 1.91, 0.44, -0.11, ...]

These vectors encode semantic relationships.

Tokens with related meanings occupy nearby regions in vector space.

Example:

Code: [Select]
cat ≈ dog
Paris ≈ London
run ≈ walk

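The embedding dimension may be:

  • 768 (e.g. GPT-2 small, BERT-base)
  • 1536 (e.g. OpenAI's text-embedding-ada-002 output)
  • 4096 (e.g. LLaMA 7B)
  • 12288 (e.g. GPT-3 175B)

depending on the model and its size.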
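
Here is a small sketch of the idea, assuming a three-entry embedding table with hand-picked 4-dimensional vectors. The numbers are invented purely to show the mechanics; a real model learns them during training. Cosine similarity is one common way to measure "nearby" in vector space.

```python
import math

# Hand-written 4-dimensional vectors, invented for illustration only.
# A real model learns these values and uses far more dimensions.
EMBEDDINGS = {
    0: [0.9, 0.8, 0.1, 0.0],   # "cat"
    1: [0.8, 0.9, 0.2, 0.1],   # "dog"
    2: [0.1, 0.0, 0.9, 0.8],   # "Paris"
}

def embed(token_id):
    """The embedding layer is a table lookup: token ID in, vector out."""
    return EMBEDDINGS[token_id]

def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(round(cosine(embed(0), embed(1)), 3))  # cat vs dog: close to 1
print(round(cosine(embed(0), embed(2)), 3))  # cat vs Paris: much lower
```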

6. Positional Encoding

Transformers do not inherently understand sequence order.

So positional information is added.

Otherwise:

Code: [Select]
"dog bites man"

and

Code: [Select]
"man bites dog"

would look like the same unordered set of tokens, because self-attention by itself ignores order.

Position embeddings inject order awareness.

Conceptually:

Code: [Select]
Token Vector + Position Vector

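This gives the model a sense of sequence structure. Adding the two vectors is the classic scheme. Some newer models instead encode position inside the attention step (rotary embeddings, or RoPE, are one example).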
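
One well-known way to build the position vector is the sinusoidal encoding from the original Transformer paper. The sketch below assumes that scheme (many models use learned position embeddings or other methods instead) and uses invented 4-dimensional token vectors.

```python
import math

def positional_encoding(pos: int, dim: int) -> list[float]:
    """Sinusoidal position vector: sin on even indices, cos on odd indices."""
    vec = []
    for i in range(dim):
        # Each sin/cos pair shares one frequency, set by the even index.
        angle = pos / (10000 ** ((i - i % 2) / dim))
        vec.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return vec

def add_positions(token_vectors):
    """Token Vector + Position Vector, element by element."""
    dim = len(token_vectors[0])
    return [
        [t + p for t, p in zip(vec, positional_encoding(pos, dim))]
        for pos, vec in enumerate(token_vectors)
    ]

# Invented token vectors for illustration.
dog = [1.0, 0.0, 0.0, 0.0]
bites = [0.0, 1.0, 0.0, 0.0]
man = [0.0, 0.0, 1.0, 0.0]

a = add_positions([dog, bites, man])   # "dog bites man"
b = add_positions([man, bites, dog])   # "man bites dog"
print(a[0] != b[2])  # True: "dog" at position 0 differs from "dog" at position 2
```

The same token ends up with a different input vector depending on where it sits, which is what lets the model tell the two sentences apart.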

7. Transformer Processing

Now the transformer layers operate on the embeddings.

Core mechanisms include:

  • Self-attention
  • Feed-forward networks
  • Residual connections
  • Layer normalization

Self-attention lets each token weigh and pull in information from other tokens in the context. In GPT-style models a token can only attend to itself and the tokens before it.

Example:

Code: [Select]
"The animal didn't cross the street because it was tired."

The model learns:

Code: [Select]
"it" → "animal"

through attention weighting.

Each transformer layer progressively refines the internal representation.

Large models may contain:

  • 24 layers
  • 48 layers
  • 96+ layers

with billions of parameters.
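
To show what "attention weighting" means mechanically, here is a stripped-down sketch of single-head scaled dot-product attention with a causal mask. It assumes Q = K = V = the input vectors, so the learned projection matrices, multiple heads, feed-forward layers and layer normalization are all left out. The input vectors are invented.

```python
import math

def softmax(xs):
    """Turn raw scores into weights that are positive and sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def causal_self_attention(x):
    """Single-head attention with Q = K = V = x (learned projections omitted)."""
    d = len(x[0])
    out, weights = [], []
    for i, q in enumerate(x):
        # Causal mask: position i may only look at positions 0..i.
        scores = [
            sum(a * b for a, b in zip(q, x[j])) / math.sqrt(d)
            for j in range(i + 1)
        ]
        w = softmax(scores)
        weights.append(w)
        # Output is the attention-weighted average of the visible vectors.
        out.append([sum(w[j] * x[j][k] for j in range(i + 1)) for k in range(d)])
    return out, weights

x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three invented token vectors
out, weights = causal_self_attention(x)
for row in weights:
    print([round(w, 2) for w in row])
```

Each printed row is one token's attention weights over the tokens it is allowed to see, so the rows get longer as you move through the sequence.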

8. Next Token Prediction

LLMs fundamentally predict probabilities for the next token.

Given:

Code: [Select]
"The cat sat on the"

the model estimates:

Code: [Select]
mat      → 62%
floor    → 12%
chair    → 4%
roof     → 2%

A decoding strategy then selects the next token.

Methods include:

  • Greedy decoding
  • Temperature sampling
  • Top-k sampling
  • Top-p (nucleus) sampling

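Temperature controls randomness by rescaling the model's scores before they are turned into probabilities.

Low temperature (below 1): sharper distribution, more deterministic output.

High temperature (above 1): flatter distribution, more varied or random output.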
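
Here is a minimal sketch of greedy decoding, temperature scaling and top-k sampling using only the standard library. The logits are invented scores for the four candidate words, and the function names are my own. Top-p is left out to keep it short.

```python
import math
import random

def softmax_with_temperature(logits, temperature=1.0):
    """Divide logits by the temperature, then apply softmax."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def greedy(logits):
    """Always pick the index with the highest score."""
    return max(range(len(logits)), key=lambda i: logits[i])

def sample_top_k(logits, k, temperature=1.0, rng=random):
    """Keep the k best candidates, renormalize, and sample one of them."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    probs = softmax_with_temperature([logits[i] for i in top], temperature)
    return rng.choices(top, weights=probs, k=1)[0]

candidates = ["mat", "floor", "chair", "roof"]
logits = [3.0, 1.4, 0.3, -0.4]  # invented scores, not from a real model

print(candidates[greedy(logits)])  # mat
print([round(p, 2) for p in softmax_with_temperature(logits, 0.5)])  # sharper
print([round(p, 2) for p in softmax_with_temperature(logits, 2.0)])  # flatter
print(candidates[sample_top_k(logits, k=2, rng=random.Random(0))])
```

Comparing the two printed distributions shows the effect directly: at 0.5 nearly all the probability sits on the top candidate, at 2.0 it spreads out.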

9. Autoregressive Loop

Once a token is generated:

Code: [Select]
"The cat sat on the mat"

the new token is appended to context and the process repeats.

This creates the rolling generation loop:

Code: [Select]
Input → Predict Token → Append → Predict Again

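once per output token, which can mean hundreds or thousands of times per response, until the model emits an end-of-sequence token or hits a length limit.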
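
The loop itself is simple. In this sketch the "model" is a made-up lookup table (NEXT) that only looks at the last token, and "<eos>" is an assumed end-of-sequence marker. A real LLM would run the full transformer over the whole context at the predict step.

```python
# Toy "model": maps the last token to its most likely successor.
# A real LLM conditions on the whole context, not just the last token.
NEXT = {"The": " cat", " cat": " sat", " sat": " on", " on": " the",
        " the": " mat", " mat": ".", ".": "<eos>"}

def generate(prompt_tokens, max_new_tokens=20):
    """Predict a token, append it to the context, and repeat."""
    context = list(prompt_tokens)
    for _ in range(max_new_tokens):
        next_token = NEXT.get(context[-1], "<eos>")  # predict
        if next_token == "<eos>":                    # stop condition
            break
        context.append(next_token)                   # append, then go again
    return context

print("".join(generate(["The", " cat"])))  # The cat sat on the mat.
```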

10. Detokenisation

Finally, tokens are converted back into readable text.

Example:

Code: [Select]
["The", " cat", " sat", "."]
   ↓
"The cat sat."

The user only sees the reconstructed text.
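
Because the spaces live inside the tokens, decoding can be as simple as an inverse lookup followed by a join. This sketch reuses the illustrative IDs from the vocabulary example. Byte-level tokenizers do the same thing on bytes and then decode the result as UTF-8.

```python
# Inverse table: ID -> token string (illustrative IDs only).
ID_TO_TOKEN = {464: "The", 3797: " cat", 3332: " sat", 13: "."}

def decode(ids):
    """Look up each ID and concatenate; no separator is needed."""
    return "".join(ID_TO_TOKEN[i] for i in ids)

print(decode([464, 3797, 3332, 13]))  # The cat sat.
```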

Why Tokenisation Matters

Tokenisation heavily affects:

  • Model efficiency
  • Context window usage
  • Reasoning quality
  • Multilingual performance
  • Inference cost
  • Compression efficiency

For example:

Code: [Select]
"antidisestablishmentarianism"

may consume many tokens, whereas common words and phrases may compress into very few.

This is why token counts vary dramatically between languages and writing styles.
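
One rough way to see the language effect without any tokenizer installed: a byte-level tokenizer can never need more tokens than the text has UTF-8 bytes, and it gets close to that worst case when the vocabulary has few merges for a script. The sketch below just counts bytes; it is a proxy under that assumption, not a real token count.

```python
def utf8_byte_count(text: str) -> int:
    """Worst-case token count for a byte-level tokenizer: one token per byte."""
    return len(text.encode("utf-8"))

samples = [("English", "hello world"), ("Japanese", "こんにちは世界")]
for label, text in samples:
    print(f"{label}: {len(text)} characters, {utf8_byte_count(text)} bytes")
```

The Japanese sample has fewer characters but roughly twice the bytes, which hints at why the same message can cost more tokens in some languages than in others.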

Common Tokenisation Algorithms

  • BPE (Byte Pair Encoding)
    Merges common symbol pairs iteratively.

  • WordPiece
    Used heavily in BERT-style models.

  • SentencePiece
    Language-independent tokenizer library that works on raw text and implements both BPE and Unigram.

  • Unigram Language Model
    Probabilistic subword segmentation.

  • Byte-level BPE
    Runs BPE over UTF-8 bytes, so any Unicode text can be encoded without unknown tokens.

Modern GPT-family systems commonly use variants of byte-level BPE.
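
To show what "merges common symbol pairs iteratively" looks like, here is a small sketch of BPE training on a toy word-frequency table. It works on characters rather than bytes, skips pre-tokenization details, and the function names and corpus are invented for the example.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of the pair with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

def train_bpe(corpus_counts, num_merges):
    """Start from single characters and learn num_merges merge rules."""
    words = {tuple(w): f for w, f in corpus_counts.items()}
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges, words

merges, words = train_bpe({"low": 5, "lower": 2, "lowest": 3}, num_merges=2)
print(merges)  # [('l', 'o'), ('lo', 'w')]
print(words)
```

After two merges the frequent stem "low" has become a single symbol, while the rarer endings are still split into characters. That is the same frequent-versus-rare behaviour described in the tokenisation step.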

Important Insight

LLMs do NOT internally manipulate words or meanings directly.

Internally they process:

Code: [Select]
Vectors
Matrices
Attention weights
Probability distributions

Meaning emerges statistically from enormous learned relationships between token patterns.

That is the core trick behind modern large language models.
« Last Edit: Today at 09:36:03 PM by Chip »