Author Topic: MoE - Mixture of Experts (Read 333 times)

Chip · « **on:** May 28, 2026, 11:38:06 AM »

https://www.youtube.com/embed/sYDlVVyJYn4i=tMea7-aHe9_rRroB

Mixture of Experts (MoE)

MoE is basically a trick for building extremely large models without paying the full compute cost on every forward pass.

Instead of one giant network doing everything, you split the model into many smaller subnetworks (“experts”), and only a few are used per input.

What an “expert” actually is

An expert is just a normal neural network block (usually a feedforward layer or MLP). The same block structure is replicated many times, and each copy gets its own weights.

So instead of:

1 big MLP

You get:

8, 16, 64+ separate MLPs

Each expert learns slightly different feature transformations.
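
To make "same structure, separate weights" concrete, here is a minimal sketch in plain Python. The names make_expert and run_expert, the layer sizes, and the ReLU activation are all assumptions picked for illustration, not taken from any particular model.

```python
import random

def make_expert(d_model, d_hidden, rng):
    """Create one expert: a two-layer MLP with its own randomly initialised weights."""
    w_in = [[rng.gauss(0, 0.1) for _ in range(d_hidden)] for _ in range(d_model)]
    w_out = [[rng.gauss(0, 0.1) for _ in range(d_model)] for _ in range(d_hidden)]
    return w_in, w_out

def run_expert(expert, x):
    """Apply the MLP to one token vector: linear -> ReLU -> linear."""
    w_in, w_out = expert
    hidden = [
        max(0.0, sum(x[i] * w_in[i][j] for i in range(len(x))))
        for j in range(len(w_in[0]))
    ]
    return [
        sum(hidden[j] * w_out[j][k] for j in range(len(hidden)))
        for k in range(len(w_out[0]))
    ]

rng = random.Random(0)
# Eight experts with identical shapes but independent weights (sizes are hypothetical).
experts = [make_expert(d_model=4, d_hidden=8, rng=rng) for _ in range(8)]

token = [1.0, -0.5, 0.25, 2.0]
outputs = [run_expert(e, token) for e in experts]
print(len(outputs), "experts, each producing a vector of length", len(outputs[0]))
```

Every expert maps the same token to a different output, which is what leaves room for each one to learn its own transformation during training.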

The router / gating mechanism

A small network called a router decides which experts to use for each token.

It looks at the token's hidden representation at that layer and outputs a weight for each expert, something like:

Expert 3: 70%
Expert 12: 20%
Expert 7: 10%

In practice, only the top-k experts (often k = 1 or 2) are actually run, and their outputs are combined using the router's weights.

So the router is the traffic controller.
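
A rough sketch of that routing step, assuming the router has already produced one raw score (logit) per expert. The function names and the example logits are made up for illustration, and renormalising the kept weights so they sum to 1 is one common choice rather than the only one.

```python
import math

def softmax(logits):
    """Turn raw router scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, k=2):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    # Renormalise so the kept weights sum to 1 (an assumption, see above).
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

# Hypothetical router scores for four experts.
logits = [0.1, 2.0, -1.0, 0.75]
chosen = route_top_k(logits, k=2)
for idx, weight in chosen:
    print(f"Expert {idx}: {weight:.0%}")
```

Experts that miss the top-k cut get a weight of zero, so they are never evaluated for that token.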

Sparse activation (the key trick)

Even if you have 64 experts, you might only activate 2 per token.

That means:

Massive parameter count exists in memory
But compute per token stays relatively low

So you get “big brain capacity” without “big brain cost per thought”.
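
Here is a toy version of a full sparse layer, just to show that only the selected experts do any work. Everything in it is an assumption for illustration: the linear router, the random router weights, and the "experts" that simply scale their input instead of being real MLPs.

```python
import math
import random

def moe_forward(x, router_weights, experts, k=2):
    """Run one token vector through a toy sparse MoE layer."""
    # Router score per expert: dot product of that expert's router weights with x.
    logits = [sum(w_i * x_i for w_i, x_i in zip(w, x)) for w in router_weights]
    # Keep only the k best-scoring experts.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax over the selected logits gives the mixing weights.
    m = max(logits[i] for i in top)
    exps = [math.exp(logits[i] - m) for i in top]
    gates = [e / sum(exps) for e in exps]
    # Only the selected experts are evaluated; the others are skipped entirely.
    out = [0.0] * len(x)
    for i, g in zip(top, gates):
        y = experts[i](x)
        out = [o + g * y_j for o, y_j in zip(out, y)]
    return out, top

calls = []  # records which experts actually ran

def make_toy_expert(idx):
    """Stand-in for an MLP: scales the input by a per-expert factor."""
    def expert(x):
        calls.append(idx)
        return [(idx + 1) * v for v in x]
    return expert

rng = random.Random(0)
n_experts, d_model = 64, 4
experts = [make_toy_expert(i) for i in range(n_experts)]
router_weights = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(n_experts)]

token = [0.5, -1.0, 2.0, 0.1]
output, used = moe_forward(token, router_weights, experts, k=2)
print(f"experts run: {sorted(calls)} out of {n_experts}")
```

All 64 experts exist in memory, but the calls list ends up with just two entries per token.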

Why MoE scales so well

Normally:

More parameters = more compute every forward pass

With MoE:

More experts = more parameters and more memory
Compute per token stays roughly constant, as long as the number of active experts stays fixed

So you can scale model capacity very aggressively.

This is why MoE broke the old assumption that:

parameter count ≈ compute cost ≈ capability

That relationship no longer holds: total parameters and per-token compute can now be scaled separately.
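
Some back-of-envelope arithmetic for the scaling argument in this section. The sketch counts only the expert MLP weights and a linear router, ignoring attention, biases and embeddings, and the layer sizes are hypothetical.

```python
def moe_mlp_stats(d_model, d_ff, n_experts, k):
    """Return (total_params, active_params_per_token) for one MoE MLP layer."""
    per_expert = 2 * d_model * d_ff   # up-projection plus down-projection
    router = d_model * n_experts      # one router weight vector per expert
    total = n_experts * per_expert + router
    active = k * per_expert + router  # only k experts run for each token
    return total, active

# Hypothetical layer sizes; k stays at 2 while the expert count grows.
stats = {n: moe_mlp_stats(d_model=4096, d_ff=14336, n_experts=n, k=2) for n in (8, 16, 64)}
for n, (total, active) in stats.items():
    print(f"{n:>2} experts: total {total / 1e9:.2f}B params, active {active / 1e9:.3f}B per token")
```

Total parameters grow almost in proportion to the expert count, while the active count barely moves. The only thing that grows is the small router.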

MoE vs dense models

Dense model (classic GPT-style):

Every parameter is used every time
Predictable behaviour
Expensive as it grows

MoE model:

Only a subset of parameters is used per token
Much cheaper per token than a dense model with the same total parameter count
More complex training and routing dynamics

Dense = simple but expensive
MoE = efficient but messy

Real-world examples

Mistral AI → Mixtral models (e.g. Mixtral 8x7B and 8x22B) are classic open-weight MoE implementations
OpenAI → GPT-4 is widely believed (but not publicly confirmed) to use MoE-style routing internally
Google → Gemini models are also associated with MoE-style scaling; Google has described Gemini 1.5 as using a Mixture-of-Experts architecture

Key point: MoE is now a mainstream technique for frontier-scale efficiency, not an experimental idea.

Trade-offs: memory vs compute

Pros:

Huge parameter counts (trillions become feasible)
Lower inference compute per token
Specialisation (experts can specialise in domains)

Cons:

High memory footprint (you still store everything)
Routing instability (bad gating = bad performance)
Training complexity (load balancing across experts is hard)
Less predictable behaviour than dense models
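
On the load-balancing point: one common remedy is an auxiliary loss that penalises the router when a few experts hog the traffic. The sketch below follows the general shape of the Switch Transformer-style loss (number of experts times the sum, over experts, of the fraction of tokens sent there times the mean router probability for it). That choice is an assumption for illustration, since nothing above says which method any given model uses, and the function name and example numbers are made up.

```python
def load_balance_loss(router_probs, assignments, n_experts):
    """Auxiliary balance loss for top-1 routing; smaller means more even usage.

    router_probs: one probability list per token (each sums to 1)
    assignments:  the expert index each token was actually sent to
    """
    n_tokens = len(assignments)
    # f[i]: fraction of tokens dispatched to expert i.
    f = [assignments.count(i) / n_tokens for i in range(n_experts)]
    # p[i]: average router probability given to expert i.
    p = [sum(row[i] for row in router_probs) / n_tokens for i in range(n_experts)]
    return n_experts * sum(f_i * p_i for f_i, p_i in zip(f, p))

# Balanced routing: four tokens spread evenly over four experts.
balanced = load_balance_loss(
    router_probs=[[0.25, 0.25, 0.25, 0.25]] * 4,
    assignments=[0, 1, 2, 3],
    n_experts=4,
)

# Collapsed routing: every token goes to expert 0.
collapsed = load_balance_loss(
    router_probs=[[0.97, 0.01, 0.01, 0.01]] * 4,
    assignments=[0, 0, 0, 0],
    n_experts=4,
)
print(f"balanced: {balanced:.2f}, collapsed: {collapsed:.2f}")
```

In training, a term like this would be added to the main loss with a small coefficient, nudging the router towards spreading tokens across experts.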

Why parameter count stopped meaning much

Before MoE:

A 1B-param model costs roughly twice as much to run as a 500M-param one

After MoE:

A 1T-param model might need only about as much compute per token as a 50B dense model

So:

Parameter count now reflects stored capacity (the memory footprint)
Not per-token compute cost or capability at runtime

This is why modern model comparisons based only on “number of parameters” are misleading.
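
To put rough numbers on the 1T vs 50B example: the sketch below assumes 16-bit weights (2 bytes per parameter) and the usual rule of thumb of about 2 FLOPs per active parameter per token for a forward pass. Both figures are assumptions for illustration, not measurements of any real model.

```python
TOTAL_PARAMS = 1_000_000_000_000  # 1T parameters stored
ACTIVE_PARAMS = 50_000_000_000    # 50B parameters used per token
BYTES_PER_PARAM = 2               # assumes 16-bit weights
FLOPS_PER_PARAM = 2               # rule-of-thumb multiply-add per active weight

weight_memory_bytes = TOTAL_PARAMS * BYTES_PER_PARAM
moe_flops_per_token = ACTIVE_PARAMS * FLOPS_PER_PARAM
dense_flops_per_token = TOTAL_PARAMS * FLOPS_PER_PARAM  # if every parameter ran

print(f"weights in memory: {weight_memory_bytes / 1e12:.1f} TB")
print(f"MoE compute per token: {moe_flops_per_token / 1e9:.0f} GFLOPs")
print(f"dense compute per token at the same size: {dense_flops_per_token / 1e9:.0f} GFLOPs")
print(f"compute ratio: {dense_flops_per_token // moe_flops_per_token}x")
```

So the memory bill is set by the total parameter count, and the compute bill is set by the active parameter count.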
