Copilot looks at it and offers this rewrite:
How an LLM Generates Language — Back to Basics
When we speak, each word we choose doesn’t just follow the previous one; it has to fit the entire sentence, paragraph, and topic all at once. The whole window of meaning constrains each next word.
A Transformer works the same way.
1. Input becomes tokens, tokens become vectors
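Your text is broken into tokens, and each token is mapped to a high‑dimensional vector, a numerical representation of that token. On its own the vector is the same wherever the token appears; it only takes on meaning relative to the other tokens in the context window once the attention layers in step 2 mix in that context. This is the model’s “mental space”.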
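A minimal sketch of this step, under loud assumptions: the six-word vocabulary, the whitespace "tokenizer" and the random 4-dimensional vectors are all made up for illustration. Real models use learned subword tokenizers and learned embeddings with far more dimensions.

```python
import random

# Hypothetical toy vocabulary; real models use learned subword tokenizers.
VOCAB = ["<|endoftext|>", "the", "cat", "sat", "on", "mat"]
TOKEN_TO_ID = {tok: i for i, tok in enumerate(VOCAB)}
EMBED_DIM = 4  # assumption: tiny on purpose; real models use hundreds or thousands

rng = random.Random(0)
# One vector per vocabulary entry. Random here, learned during training in a real model.
EMBEDDINGS = [[rng.uniform(-1, 1) for _ in range(EMBED_DIM)] for _ in VOCAB]

def tokenize(text):
    """Whitespace split stands in for a real subword tokenizer."""
    return [TOKEN_TO_ID[word] for word in text.lower().split()]

def embed(token_ids):
    """Look up one vector per token id."""
    return [EMBEDDINGS[i] for i in token_ids]

ids = tokenize("the cat sat on the mat")
vectors = embed(ids)
print(ids)  # [1, 2, 3, 4, 1, 5]
```

Note that both occurrences of "the" get the same vector at this stage. The lookup itself knows nothing about context.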
2. The Transformer processes everything in parallel
All vectors are read simultaneously through self‑attention.
Every token “looks at” the other tokens (in GPT‑style models, itself and the tokens before it) to determine what matters, what relates, and what should influence the next step.
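Here is a rough sketch of what “looking at” other tokens means numerically. It assumes a single attention head, a decoder-style causal mask (each position sees only itself and earlier positions), and uses the input vectors directly as queries, keys and values. A real Transformer applies learned projections first and stacks many heads and layers.

```python
import math

def softmax(xs):
    """Turn raw scores into weights that sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def causal_self_attention(vectors):
    """Single-head attention with queries = keys = values = the inputs."""
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for i, query in enumerate(vectors):
        # Causal mask: position i may only look at positions 0..i.
        scores = [
            sum(q * k for q, k in zip(query, vectors[j])) / math.sqrt(dim)
            for j in range(i + 1)
        ]
        weights = softmax(scores)
        # Each output is a weighted blend of the visible vectors.
        mixed = [
            sum(w * vectors[j][d] for j, w in enumerate(weights))
            for d in range(dim)
        ]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = causal_self_attention(vectors)
for position, row in enumerate(weights):
    print(position, [round(w, 3) for w in row])
```

Each printed row shows how much one position draws from each position it is allowed to see. The first token can only attend to itself, so its single weight is 1.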
3. Output is generated sequentially — one token at a time
Even though the model thinks in parallel, it speaks one token at a time.
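Each generated token is appended to the sequence and fed back into the model, which then conditions on the entire sequence to choose the next token. (In practice, implementations cache earlier computations instead of redoing them from scratch, but the result is the same.)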
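The feedback loop can be sketched in a few lines. Here `toy_next_token` is a hypothetical stand-in for a full model forward pass: it is just arithmetic over token ids, chosen so the result visibly depends on the whole sequence rather than only the last token.

```python
def toy_next_token(sequence):
    """Hypothetical stand-in for a model: the next id depends on every token so far."""
    return (sum(sequence) + len(sequence)) % 10

def generate(prompt_ids, steps):
    sequence = list(prompt_ids)
    for _ in range(steps):
        next_id = toy_next_token(sequence)  # reads the full sequence
        sequence.append(next_id)            # the output becomes part of the input
    return sequence

print(generate([3, 1, 4], steps=3))  # [3, 1, 4, 1, 3, 7]
```

This version runs for a fixed number of steps. The only state carried from one step to the next is the growing list itself.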
4. The loop is the memory
There is no separate memory module.
The growing token sequence is the memory.
Every new token forces the model to re‑interpret the entire context window before choosing the next one.
5. The human parallel
Just like you don’t choose a word based only on the previous word, the model doesn’t either.
Each new token must remain coherent with:
- the previous token
- the whole sentence
- the topic
- the global context window
All at once.
Me:
How does it determine that the chain is complete?
AI:
Great question — and this is one of those spots where the architecture is surprisingly simple once you see it clearly.
How does the model know the chain is “complete”?
Short answer:
It doesn’t “know” — it decides to stop by emitting a special token called the end‑of‑text token.
That’s it.
No mystical “sentence detector”.
No grammar engine.
No semantic completeness checker.
Just a token.
---
The long version (the real mechanics)
At every generation step, the model is choosing the next token from its vocabulary.
Among those tokens is a special one:
- end‑of‑text token (often written as <|endoftext|>, though the exact string depends on the model)
If that token is the one selected at a step (the most likely token under greedy decoding, or the one drawn when sampling), generation stops.
Why would it choose that token?
Because the model has learned statistical patterns like:
- “This looks like the end of an answer.”
- “This structure usually stops here.”
- “This is a natural completion point.”
- “The user asked for a short reply.”
- “The probability of continuing is now lower than the probability of ending.”
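To make “it’s all probability” concrete, here is a small sketch with invented numbers. The four-entry vocabulary and both sets of logits (raw scores) are assumptions for illustration, not output from any real model. The point is only that the end‑of‑text token competes with every other token in the same softmax.

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical vocabulary; index 0 is the end-of-text token.
VOCAB = ["<|endoftext|>", " and", " but", "."]

# Invented scores for two situations.
mid_sentence_logits = [-2.0, 3.0, 2.0, 0.5]
after_final_period_logits = [4.0, 0.5, 0.0, -1.0]

def eos_probability(logits):
    return softmax(logits)[0]

p_mid = eos_probability(mid_sentence_logits)
p_end = eos_probability(after_final_period_logits)
print(f"mid-sentence:       P(end-of-text) = {p_mid:.3f}")
print(f"after final period: P(end-of-text) = {p_end:.3f}")
```

With these made-up scores, stopping is very unlikely mid-sentence and becomes the most likely choice once the answer looks finished.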
It’s all probability, not understanding.
---
So the stopping rule is:
> Stop when the model emits the end‑of‑text token.
Everything else — coherence, completeness, topic closure — is emergent behaviour from training.
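A minimal sketch of that stopping rule, with two additions that are my assumptions rather than part of the explanation above: the next token is sampled from the probabilities instead of always taking the top one, and there is a `max_new_tokens` cap, since real systems typically also limit output length. `EOS_ID`, `always_stop` and `never_stop` are hypothetical names for this toy setup.

```python
import random

EOS_ID = 0  # hypothetical id for the end-of-text token

def generate(next_token_probs, prompt_ids, max_new_tokens, seed=0):
    """Sample tokens until end-of-text is drawn or the length cap is hit."""
    rng = random.Random(seed)
    sequence = list(prompt_ids)
    for _ in range(max_new_tokens):
        probs = next_token_probs(sequence)
        next_id = rng.choices(range(len(probs)), weights=probs)[0]
        if next_id == EOS_ID:
            return sequence, "eos"      # the model "decided" to stop
        sequence.append(next_id)
    return sequence, "max_tokens"       # safety cap, not the model's choice

# Two toy "models" over a three-token vocabulary.
def always_stop(sequence):
    return [1.0, 0.0, 0.0]  # all probability on end-of-text

def never_stop(sequence):
    return [0.0, 0.5, 0.5]  # end-of-text can never be drawn

print(generate(always_stop, [1, 2], max_new_tokens=5))
print(generate(never_stop, [1], max_new_tokens=4)[1])
```

The first call ends immediately with reason "eos"; the second can only end by hitting the cap. The end‑of‑text token itself is not appended to the returned sequence.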
---
Why this works so well
Because the model conditions on the entire chain at every step, its next‑token probabilities can reflect cues such as:
- “Does this answer feel done?”
- “Is the structure complete?”
- “Is the user expecting more?”
- “Is continuing likely to produce nonsense?”
It doesn’t understand these things, but the statistical patterns are strong enough that it behaves as if it does.
---
Human analogy
It’s like when you’re speaking and you just… stop.
Not because you ran a formal “sentence completeness algorithm”, but because:
- the thought is finished
- the structure feels closed
- the probability of continuing drops to zero
LLMs do the same thing — but with math instead of vibes.