Whilst not perfect, working through these topics in this order should get you up to speed:
History: The First AI - Samuel’s Checkers Machine Learning System
https://forum.drugs-and-users.org/index.php?topic=7338
Historical grounding.
Shows the origins of machine learning and self-improving systems.
0. Machine Learning Foundations
https://forum.drugs-and-users.org/index.php?topic=7368
Outlines the various Machine Learning systems and has an evolutionary timeline.
See Example Code:
Samuel’s Checkers in Python and My Food Preference Model in Pseudocode
1. Pretraining modern AI / LLMs
https://forum.drugs-and-users.org/index.php?topic=7365
There is a conceptual ancestry to The First AI:
Samuel → early machine learning idea: “systems can improve from data instead of rules”
Later: reinforcement learning (temporal difference learning, policy/value methods)
Much later: deep learning + backprop + Transformers
Then: RLHF (reinforcement learning from human feedback). The training order is: pretraining, which is large-scale next-token prediction on massive text data; then an initial instruction fine-tuning stage; then RLHF, which uses human preference signals to optimise the model’s outputs toward more helpful, safe, and conversational behaviour. RLHF mostly shapes the surface interaction style rather than core knowledge.
If anything, Samuel’s system is closer to modern reinforcement learning agents than to language model pretraining.
2. The basic components of AI and how the data/query flows through to the reply
https://forum.drugs-and-users.org/index.php?topic=7347
High-level architectural overview before deep diving.
Stacks and "subsystems".
3. How Neural Networks Work
https://forum.drugs-and-users.org/index.php?topic=7343
Core foundation.
Everything modern comes from this.
4. The AI Tokenisation Pipeline
https://forum.drugs-and-users.org/index.php?topic=7350
Introducing tokens as massively multidimensional vectors.
Now the reader understands WHY text must become vectors and embeddings.
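To make the text-to-vector step concrete, here is a minimal sketch of the pipeline. The vocabulary, the whitespace tokeniser and the 4-dimensional random vectors are all made up for illustration; real systems use subword tokenisers and learned embeddings with hundreds or thousands of dimensions.

```python
import random

# Hypothetical toy vocabulary: token string -> integer ID.
vocab = {"<unk>": 0, "the": 1, "cat": 2, "sat": 3}

def tokenise(text):
    """Lowercase and split on whitespace (real tokenisers use subword units)."""
    return text.lower().split()

def encode(tokens):
    """Map each token to its ID, falling back to <unk> for unknown words."""
    return [vocab.get(tok, vocab["<unk>"]) for tok in tokens]

# One vector per vocabulary entry. Random here; learned during training in practice.
random.seed(0)
EMBED_DIM = 4
embedding_table = [
    [random.uniform(-1, 1) for _ in range(EMBED_DIM)] for _ in vocab
]

def embed(ids):
    """Look up the vector for each token ID."""
    return [embedding_table[i] for i in ids]

tokens = tokenise("The cat sat down")
ids = encode(tokens)
vectors = embed(ids)

print(tokens)
print(ids)
print(len(vectors), "vectors of dimension", len(vectors[0]))
```

The word "down" is not in the toy vocabulary, so it falls back to the unknown-token ID. Subword tokenisers largely avoid this by breaking unfamiliar words into smaller known pieces.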
5. Transformers
https://forum.drugs-and-users.org/index.php?topic=7344
The real breakthrough architecture behind modern LLMs.
Transformers are built directly on attention mechanisms, which are explained in section 9.
6. LLMs Explained -- A light introduction to LLMs, chatbots, pretraining, and transformers
https://forum.drugs-and-users.org/index.php?topic=7342
Applies the transformer concept to actual LLM systems and chatbot behaviour.
7. RAG - Retrieval Augmented Generation
https://forum.drugs-and-users.org/index.php?topic=7348
Advanced modern extension layer.
Shows how models interface with external knowledge.
Those were the basics. Once you roughly understand them, continue with the following topics:
8. Embeddings and Vector Spaces
https://forum.drugs-and-users.org/index.php?topic=7351
Right now embeddings are probably buried inside tokenisation or neural networks, but embeddings are absolutely central to modern AI.
Topics:
- What embeddings actually are
- High-dimensional vector spaces
- Semantic proximity
- Why "cat" and "dog" cluster together
- Cosine similarity
- Latent space
- Why RAG works
- Why hallucinations happen
This becomes the bridge between:
Token IDs → Meaning Space
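As a rough illustration of semantic proximity, cosine similarity can be computed in a few lines. The three 3-dimensional vectors below are invented for the example; real embeddings are learned and far larger.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings, hand-picked so that "cat" and "dog" point the same way.
cat = [0.9, 0.8, 0.1]
dog = [0.8, 0.9, 0.2]
car = [0.1, 0.2, 0.9]

print("cat vs dog:", round(cosine_similarity(cat, dog), 3))
print("cat vs car:", round(cosine_similarity(cat, car), 3))
```

With these made-up numbers "cat" scores much closer to "dog" than to "car". The same comparison, run between a query embedding and stored document embeddings, is the usual retrieval step in RAG.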
9. Attention Mechanisms and Self-Attention
https://forum.drugs-and-users.org/index.php?topic=7352
The Transformers topic really deserves to be split, because attention is the actual revolutionary mechanism.
Topics:
- Query / Key / Value vectors
- Attention weighting
- Context windows
- Token relationships
- Parallel processing vs recurrence
- Why transformers replaced RNNs (Recurrent Neural Networks) and LSTMs (Long Short-Term Memory networks, a type of RNN)
Without attention, transformers look like magic.
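Here is a minimal sketch of scaled dot-product attention in plain Python, to show what the Query / Key / Value weighting does. As a simplifying assumption the same toy vectors are used as queries, keys and values; a real transformer first passes each token through three separate learned projections.

```python
import math

def softmax(xs):
    """Turn raw scores into weights that are positive and sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Score this query against every key, scaled by sqrt(d_k).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys
        ]
        weights = softmax(scores)
        # Output is the weighted average of the value vectors.
        out = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Three hypothetical token vectors (self-attention: Q = K = V).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attention(tokens, tokens, tokens)

for row in weights:
    print([round(w, 3) for w in row])
```

Each row of weights says how strongly one token attends to every other token. All rows can be computed independently, which is the parallelism that recurrent networks lack.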
10. Training vs Inference
https://forum.drugs-and-users.org/index.php?topic=7353
This is one of the most misunderstood things in AI discussions.
Most people think ChatGPT is "learning while talking."
It usually is not.
Topics:
- Pretraining
- Gradient descent
- Backpropagation
- Weights
- Inference-only operation
- Fine tuning — including LoRA and QLoRA (efficient low-rank adaptation)
- RLHF
- Why models are static snapshots
This clears up enormous confusion.
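The difference can be sketched with a deliberately tiny one-weight model. Everything here (the model, the data, the learning rate) is an assumption chosen for illustration: training changes the weight by gradient descent, inference only reads it.

```python
def predict(model, x):
    """Inference: read the weight, never modify it."""
    return model["w"] * x

def train(model, data, lr=0.05, epochs=200):
    """Training: adjust the weight by gradient descent on mean squared error."""
    for _ in range(epochs):
        grad = sum(2 * (predict(model, x) - y) * x for x, y in data) / len(data)
        model["w"] -= lr * grad

# Toy data following y = 2x.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

model = {"w": 0.0}
train(model, data)           # weights change here
w_before = model["w"]

answer = predict(model, 10.0)  # weights are only read here
print("learned weight:", round(model["w"], 4))
print("prediction for 10:", round(answer, 4))
print("weight changed during inference:", model["w"] != w_before)
```

A deployed chatbot is in the second situation: however many questions it answers, the weights stay as they were when training stopped, unless someone runs a further training or fine-tuning job.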
11. Context Windows and Memory
https://forum.drugs-and-users.org/index.php?topic=7354
Critical for chatbot understanding.
Topics:
- What context windows are
- Token limits
- Sliding attention windows
- Conversation truncation
- Why models "forget"
- Persistent memory systems
- RAG vs memory
This directly explains chatbot behaviour.
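One common way a chat system stays inside its token limit is to drop the oldest messages. The sketch below assumes that strategy and counts tokens by splitting on whitespace, which is only a crude stand-in for a real tokeniser; the function names are hypothetical.

```python
def count_tokens(text):
    """Crude token count (assumption: one token per whitespace-separated word)."""
    return len(text.split())

def fit_to_window(messages, max_tokens):
    """Keep the most recent messages that fit within the token budget."""
    kept = []
    used = 0
    for msg in reversed(messages):      # walk from newest to oldest
        cost = count_tokens(msg)
        if used + cost > max_tokens:
            break                       # everything older is truncated
        kept.append(msg)
        used += cost
    kept.reverse()                      # restore chronological order
    return kept

history = [
    "my name is Sam",
    "nice to meet you Sam",
    "what is a transformer",
    "it is a neural network architecture",
    "what is my name",
]

window = fit_to_window(history, max_tokens=15)
for msg in window:
    print(msg)
```

With a budget of 15 tokens the first two messages fall out of the window, so the model is asked "what is my name" without ever seeing the message that contained the name. That is the sense in which models "forget".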
12. Hallucinations and Failure Modes
https://forum.drugs-and-users.org/index.php?topic=7355
Very important, as AIs will often produce a confident answer rather than say "I don't know".
Topics:
- Probabilistic generation
- Why confidence ≠ correctness
- Distribution gaps
- Mode collapse
- Confabulation
- Context poisoning
- Prompt injection
Most people fundamentally misunderstand hallucinations.
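Probabilistic generation can be sketched as sampling from a softmax over candidate next tokens. The candidates and their scores (logits) below are invented for the example; a real model scores tens of thousands of tokens at every step.

```python
import math
import random

def softmax_with_temperature(logits, temperature=1.0):
    """Convert raw scores into probabilities; lower temperature sharpens them."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical next-token candidates and made-up logits.
candidates = ["Paris", "Lyon", "Berlin"]
logits = [2.0, 1.0, 0.5]

probs = softmax_with_temperature(logits, temperature=1.0)
sharp = softmax_with_temperature(logits, temperature=0.5)

rng = random.Random(0)  # fixed seed so the run is repeatable
samples = rng.choices(candidates, weights=probs, k=10)

print("T=1.0:", [round(p, 3) for p in probs])
print("T=0.5:", [round(p, 3) for p in sharp])
print("samples:", samples)
```

The sampler always returns one of the candidates, and the probability attached to it measures how likely that token is as a continuation, not whether the resulting statement is true. That gap is one reason confidence ≠ correctness.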
13. Multi-Modal AI
https://forum.drugs-and-users.org/index.php?topic=7356
Modern systems are no longer just text.
Topics:
- Vision transformers
- Image tokenisation
- Audio embeddings
- Cross-modal embeddings
- Unified latent spaces
- Image generation diffusion models
This connects LLMs to image/video/audio systems.
14. Agents and Tool Use
https://forum.drugs-and-users.org/index.php?topic=7357
Modern frontier AI architecture.
Topics:
- Tool calling
- External APIs
- Planning loops
- Chain-of-thought orchestration
- Autonomous agents
- Memory stores
- Execution environments
This is where systems are heading now.
15. Scaling Laws
https://forum.drugs-and-users.org/index.php?topic=7358
Very important historically.
Topics:
- Why bigger models suddenly worked
- Emergent behaviour
- Parameter scaling
- Data scaling
- Compute scaling
- Why GPT-3 changed everything
16. Mixture of Experts (MoE)
https://forum.drugs-and-users.org/index.php?topic=7367
How modern frontier models scale without scaling compute proportionally.
Topics:
- What an "expert" actually is (a sub-network)
- The router / gating mechanism
- Sparse activation — only some experts fire per token
- Why MoE allows enormous parameter counts on modest compute
- MoE vs dense models
- Mixtral, GPT-4 (rumoured), Gemini — real-world examples
- The trade-offs: memory vs compute
Explains why parameter counts stopped being a simple proxy for capability or cost.
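A minimal sketch of top-k routing shows the sparse activation idea. The four "experts" here are trivial hand-written functions and the router scores are passed in by hand; both are assumptions for illustration. In a real MoE layer the experts are feed-forward sub-networks and the router is a small learned layer that computes the scores from the token itself.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical stand-ins for expert sub-networks.
def expert_double(x):
    return [2 * v for v in x]

def expert_negate(x):
    return [-v for v in x]

def expert_square(x):
    return [v * v for v in x]

def expert_identity(x):
    return list(x)

EXPERTS = [expert_double, expert_negate, expert_square, expert_identity]

def moe_layer(x, router_logits, top_k=2):
    """Run only the top_k highest-scoring experts and blend their outputs."""
    ranked = sorted(
        range(len(router_logits)), key=lambda i: router_logits[i], reverse=True
    )
    chosen = ranked[:top_k]
    # Gate weights are a softmax over the chosen experts' scores only.
    gates = softmax([router_logits[i] for i in chosen])
    out = [0.0] * len(x)
    for gate, idx in zip(gates, chosen):
        y = EXPERTS[idx](x)  # the other experts are never evaluated
        out = [o + gate * v for o, v in zip(out, y)]
    return out, chosen

token = [1.0, 2.0]
router_logits = [2.0, 0.0, -1.0, 2.0]  # made-up router scores
output, chosen = moe_layer(token, router_logits, top_k=2)

print("experts used:", chosen)
print("output:", output)
```

All four experts have to be held in memory, but only two do any work for this token. That is the memory vs compute trade-off in miniature.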
17. Quantisation and Model Compression
https://forum.drugs-and-users.org/index.php?topic=7360
The practical consequence of Scaling Laws.
Topics:
- What model weights actually are at the bit level
- FP32 vs FP16 vs INT8 vs INT4
- How precision reduction affects output quality
- GGUF and GGML formats
- llama.cpp and local inference
- Pruning and knowledge distillation
- Why a 7B quantised model can run on a laptop
This explains why AI progress looked sudden.
Scaling Laws explains why models got enormous.
This explains how ordinary hardware runs them anyway.
Directly relevant to anyone self-hosting or running local models.
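The arithmetic behind the laptop claim, plus a sketch of the simplest kind of quantisation, looks roughly like this. The scheme shown is plain symmetric INT8 with one scale for the whole list, chosen for clarity; it is not a description of how GGUF or llama.cpp actually pack weights, and the size estimate counts weights only.

```python
def model_size_gb(n_params, bits_per_weight):
    """Approximate weight storage in gigabytes (weights only, no overhead)."""
    return n_params * bits_per_weight / 8 / 1e9

def quantise_int8(weights):
    """Symmetric INT8: map floats onto integers in the range -127..127."""
    scale = max(abs(w) for w in weights) / 127 or 1.0  # avoid zero scale
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantise(q, scale):
    """Recover approximate floats from the stored integers."""
    return [v * scale for v in q]

for bits in (32, 16, 8, 4):
    print(f"7B parameters at {bits}-bit: {model_size_gb(7e9, bits):.1f} GB")

weights = [0.5, -1.27, 0.03, 1.0]  # made-up example weights
q, scale = quantise_int8(weights)
restored = dequantise(q, scale)
errors = [abs(w - r) for w, r in zip(weights, restored)]

print("quantised:", q)
print("max rounding error:", max(errors))
```

Going from 32-bit to 4-bit cuts the weight storage of a 7B model from about 28 GB to about 3.5 GB, at the cost of a small rounding error on every weight.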
18. Diffusion Models and What They Are For
https://forum.drugs-and-users.org/index.php?topic=7359
When discussing image generation.
Topics:
- Noise schedules
- Denoising
- Latent diffusion
- Classifier guidance
- Why Stable Diffusion works
Completely different architecture family from transformers.
19. The Future of AI — What Is Actually Coming
https://forum.drugs-and-users.org/index.php?topic=7361
By Claude, a prediction only ...
That’s plenty for now!