Mixture of Experts (MoE)
MoE is basically a trick to make extremely large models without paying the full compute cost every time you run them.
Instead of one giant network doing everything, you split the model into many smaller subnetworks (“experts”), and only a few are used per input.
What an “expert” actually is
An expert is just a normal neural network block (usually a feedforward layer or MLP) replicated many times, with each copy holding its own weights.
So instead of:
1 big MLP
You get:
8, 16, 64+ separate MLPs
Each expert learns slightly different feature transformations.
The router / gating mechanism
A small network called a router decides which experts to use for each token.
It looks at the token embedding and outputs something like:
Expert 3: 70%
Expert 12: 20%
Expert 7: 10%
In practice, usually only the top-k (often 1 or 2) are actually activated.
So the router is the traffic controller.
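To make the routing step concrete, here is a minimal sketch of top-k gating in plain Python. The function names (`softmax`, `route_top_k`) and the example scores are made up for illustration, and renormalising the kept weights so they sum to 1 is one common choice assumed here, not something every MoE model does.

```python
import math

def softmax(logits):
    """Turn raw router scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, k=2):
    """Return [(expert_index, weight), ...] for the k highest-scoring experts.

    Assumption: the kept weights are renormalised to sum to 1.
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    kept = sum(probs[i] for i in top)
    return [(i, probs[i] / kept) for i in top]

# Hypothetical router scores for one token over 4 experts.
logits = [0.1, 2.0, -1.0, 1.2]
choices = route_top_k(logits, k=2)
print(choices)  # experts 1 and 3 are selected, highest weight first
```

In a real model the scores would come from a learned linear layer applied to the token embedding rather than a hard-coded list.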
Sparse activation (the key trick)
Even if you have 64 experts, you might only activate 2 per token.
That means:
Massive parameter count exists in memory
But compute per token stays relatively low
So you get “big brain capacity” without “big brain cost per thought”.
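The sketch below shows sparse activation end to end for a single token, under simplifying assumptions: each expert is a toy one-layer network, the router is a random linear map, and the sizes (8 experts, top 2, 4 dimensions) are arbitrary. `Expert` and `moe_forward` are hypothetical names, not part of any library. The point is only that every expert exists in memory while just k of them do any work.

```python
import math
import random

class Expert:
    """A tiny stand-in for an expert MLP: one linear layer followed by ReLU."""
    def __init__(self, dim, rng):
        self.weights = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(dim)]
        self.calls = 0  # how many tokens this expert has processed

    def __call__(self, x):
        self.calls += 1
        return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in self.weights]

def moe_forward(x, experts, router_weights, k=2):
    """Run one token through only the k experts the router scores highest."""
    # Router: one linear score per expert, then softmax.
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in router_weights]
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    top = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:k]
    kept = sum(probs[i] for i in top)
    # Weighted sum of the selected experts' outputs; the rest are never called.
    out = [0.0] * len(x)
    for i in top:
        y = experts[i](x)
        out = [o + (probs[i] / kept) * yi for o, yi in zip(out, y)]
    return out

rng = random.Random(0)
dim, num_experts = 4, 8
experts = [Expert(dim, rng) for _ in range(num_experts)]  # all 8 live in memory
router_weights = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(num_experts)]

token = [0.5, -1.0, 0.25, 2.0]
output = moe_forward(token, experts, router_weights, k=2)
active = sum(e.calls for e in experts)
print(f"{active} of {num_experts} experts ran for this token")  # 2 of 8
```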
Why MoE scales so well
Normally:
More parameters = more compute every forward pass
With MoE:
More parameters = more memory only
Compute stays roughly constant per token
So you can scale model capacity very aggressively.
This is why MoE broke the old assumption that:
parameter count ≈ compute cost ≈ capability
With MoE, those three quantities no longer move in lockstep.
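A rough way to see the scaling argument is to count weights for a single MoE feedforward layer as the number of experts grows. This is a back-of-the-envelope sketch: the layer sizes are hypothetical, biases and attention weights are ignored, and `moe_layer_params` is an illustrative helper rather than a real API.

```python
def ffn_params(d_model, d_ff):
    """Weights in one two-matrix feedforward block (biases ignored)."""
    return 2 * d_model * d_ff

def moe_layer_params(d_model, d_ff, num_experts, top_k):
    """Return (stored, active_per_token) weight counts for one MoE layer."""
    per_expert = ffn_params(d_model, d_ff)
    router = d_model * num_experts  # one score per expert
    stored = num_experts * per_expert + router
    active = top_k * per_expert + router
    return stored, active

# Hypothetical sizes, chosen only for illustration.
d_model, d_ff, top_k = 4096, 16384, 2
for num_experts in (8, 16, 64):
    stored, active = moe_layer_params(d_model, d_ff, num_experts, top_k)
    print(f"{num_experts:>2} experts: stored={stored / 1e9:.2f}B  active={active / 1e9:.2f}B")
```

Stored weights grow in proportion to the number of experts, while the active count per token only creeps up by the size of the router.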
MoE vs dense models
Dense model (classic GPT-style):
Every parameter is used every time
Predictable behaviour
Expensive as it grows
MoE model:
Only a subset of parameters is used per token
Much cheaper per inference step
More complex training and routing dynamics
Dense = simple but expensive
MoE = efficient but messy
Real-world examples
Mistral AI → Mixtral models are well-known open-weight MoE implementations (e.g. Mixtral 8x7B and 8x22B)
OpenAI → GPT-4 is widely reported, though not confirmed by OpenAI, to use MoE-style routing internally
Google → Gemini 1.5 has been described by Google as using an MoE architecture
Key point: MoE is now a mainstream technique for frontier-scale efficiency, not an experimental idea.
Trade-offs: memory vs compute
Pros:
Huge parameter counts (trillions become feasible)
Lower inference compute per token
Specialisation (experts can specialise in domains)
Cons:
High memory footprint (you still store everything)
Routing instability (bad gating = bad performance)
Training complexity (load balancing across experts is hard)
Less predictable behaviour than dense models
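One common way to tackle the load-balancing problem is to add an auxiliary loss that penalises uneven routing. The sketch below follows the formulation popularised by the Switch Transformer paper (number of experts times the sum over experts of "fraction of tokens sent there" times "mean router probability"). It assumes top-1 routing and hand-written probabilities, and it is not a claim about what any of the models named above use.

```python
def load_balance_loss(router_probs, assignments, num_experts):
    """Auxiliary loss in the Switch Transformer style: N * sum_i f_i * P_i.

    router_probs: one probability list per token (each sums to 1).
    assignments:  index of the expert each token was sent to (top-1 routing).
    """
    num_tokens = len(assignments)
    # f_i: fraction of tokens dispatched to expert i.
    f = [0.0] * num_experts
    for a in assignments:
        f[a] += 1.0 / num_tokens
    # P_i: mean router probability given to expert i.
    p = [sum(tok[i] for tok in router_probs) / num_tokens for i in range(num_experts)]
    return num_experts * sum(fi * pi for fi, pi in zip(f, p))

# Balanced routing: 4 tokens spread evenly over 4 experts.
uniform = [[0.25] * 4 for _ in range(4)]
balanced = load_balance_loss(uniform, [0, 1, 2, 3], 4)

# Collapsed routing: every token goes to expert 0 with high probability.
skewed = [[0.97, 0.01, 0.01, 0.01] for _ in range(4)]
collapsed = load_balance_loss(skewed, [0, 0, 0, 0], 4)

print(balanced, collapsed)  # the collapsed case scores clearly higher
```

During training, a term like this would be multiplied by a small coefficient and added to the main loss, so the router is nudged towards spreading tokens across experts.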
Why parameter count stopped meaning much
Before MoE:
A 1B-parameter model costs roughly twice as much compute to run as a 500M one
After MoE:
A 1T-parameter model might cost about as much compute per token as a 50B dense model
So:
Parameter count now reflects stored capacity (memory footprint)
Not actual compute cost per token or capability
This is why modern model comparisons based only on “number of parameters” are misleading.
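As a worked version of the comparison above, the sketch below estimates forward-pass compute from active parameters only. It uses the common rule of thumb of roughly 2 FLOPs per active weight per token, which is an approximation, and the model entries are the illustrative figures from this section rather than real systems.

```python
def forward_flops_per_token(active_params):
    """Rule-of-thumb estimate: about 2 FLOPs per active weight per token."""
    return 2 * active_params

models = {
    # name: (stored parameters, parameters active per token)
    "dense 500M": (500e6, 500e6),
    "dense 1B": (1e9, 1e9),
    "dense 50B": (50e9, 50e9),
    "MoE 1T (50B active)": (1e12, 50e9),
}

for name, (stored, active) in models.items():
    gflops = forward_flops_per_token(active) / 1e9
    print(f"{name:<20} stored={stored / 1e9:>6.1f}B  ~{gflops:>6.1f} GFLOPs/token")
```

On this estimate the 1T MoE and the 50B dense model land on the same per-token compute, even though one stores twenty times as many weights.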