dopetalk does not endorse any advertised product nor does it accept any liability for its use or misuse


Our Discord Notification Server invitation link is https://discord.gg/jB2qmRrxyD

Author Topic: MoE - Mixture of Experts  (Read 26 times)

Online Chip (OP)

« on: Yesterday at 11:38:06 AM »

Mixture of Experts (MoE)

MoE is basically a trick to make extremely large models without paying the full compute cost every time you run them.

Instead of one giant network doing everything, you split the model into many smaller subnetworks (“experts”), and only a few are used per input.

What an “expert” actually is

An expert is just a normal neural network block (usually a feedforward layer or MLP) duplicated many times.

So instead of:

1 big MLP

You get:

8, 16, 64+ separate MLPs

Each expert learns slightly different feature transformations.

The router / gating mechanism

A small network called a router decides which experts to use for each token.

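It looks at the token's current hidden vector (its representation at that layer, not just the raw input embedding) and outputs a weight per expert, something like: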

Expert 3: 70%
Expert 12: 20%
Expert 7: 10%

In practice, usually only the top-k (often 1 or 2) are actually activated.

So the router is the traffic controller.
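
To make the routing step concrete, here is a minimal sketch in plain Python. It assumes the router is a single linear layer followed by a softmax, and that the weights of the chosen top-k experts are renormalised to sum to 1 (one common convention, not the only one). The names route_top_k and router_weights are made up for illustration.

```python
import math

def softmax(scores):
    """Turn raw router scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(token, router_weights, k=2):
    """Score every expert for one token, keep the k best, renormalise.

    token: list of floats (the token's hidden vector)
    router_weights: one weight vector per expert (hypothetical layout)
    Returns a list of (expert_index, weight) pairs, best expert first.
    """
    scores = [sum(w * x for w, x in zip(weights, token))
              for weights in router_weights]
    probs = softmax(scores)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

# Toy example: 4 experts, 2-dimensional token vector.
token = [1.0, 0.0]
router_weights = [[0.0, 1.0], [2.0, 0.0], [1.0, 0.0], [-1.0, 0.0]]
chosen = route_top_k(token, router_weights, k=2)
print(chosen)  # experts 1 and 2 are picked, with weights that sum to 1
```

In a real model the router weights are learned along with everything else; here they are hard-coded just to show the mechanics.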

Sparse activation (the key trick)

Even if you have 64 experts, you might only activate 2 per token.

That means:

Massive parameter count exists in memory
But compute per token stays relatively low

So you get “big brain capacity” without “big brain cost per thought”.
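
A rough sketch of what sparse activation looks like in code, assuming tiny two-layer ReLU experts and a routing decision that has already been made (expert 3 at 70% and expert 12 at 30%, numbers picked for illustration). All 64 experts sit in memory, but only the routed ones are ever called. The helper names make_expert, run_expert and moe_forward are hypothetical.

```python
import random

def make_expert(dim, hidden, rng):
    """One expert = a small two-layer MLP with random weights."""
    w_in = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(hidden)]
    w_out = [[rng.uniform(-0.5, 0.5) for _ in range(hidden)] for _ in range(dim)]
    return w_in, w_out

def run_expert(expert, x):
    """Linear -> ReLU -> linear."""
    w_in, w_out = expert
    h = [max(0.0, sum(w * v for w, v in zip(row, x))) for row in w_in]
    return [sum(w * v for w, v in zip(row, h)) for row in w_out]

def moe_forward(x, experts, routing):
    """Run only the routed experts and mix their outputs by router weight.

    routing: list of (expert_index, weight) pairs from the router.
    Returns the mixed output and how many experts were actually executed.
    """
    out = [0.0] * len(x)
    calls = 0
    for idx, weight in routing:
        y = run_expert(experts[idx], x)
        calls += 1
        out = [o + weight * v for o, v in zip(out, y)]
    return out, calls

rng = random.Random(0)
experts = [make_expert(dim=4, hidden=8, rng=rng) for _ in range(64)]  # all stored
x = [0.1, -0.2, 0.3, 0.4]
out, calls = moe_forward(x, experts, routing=[(3, 0.7), (12, 0.3)])
print(f"{len(experts)} experts stored, {calls} executed for this token")
```

The other 62 experts cost memory but no compute for this token, which is the whole point.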

Why MoE scales so well

Normally:

More parameters = more compute every forward pass

With MoE:

More experts = more parameters = more memory
Compute per token stays roughly constant, as long as the number of active experts (k) stays fixed

So you can scale model capacity very aggressively.

This is why MoE broke the old assumption that:

parameter count ≈ compute cost ≈ capability

That chain no longer holds: total parameter count on its own doesn't tell you the per-token compute cost.
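
Some back-of-envelope arithmetic shows the gap between total and active parameters. This is a sketch that assumes each expert is a plain two-matrix feedforward block (biases ignored) and the router is one linear layer; the sizes below are made-up illustration values, not taken from any real model.

```python
def ffn_params(d_model, d_ff):
    """Parameters in one expert: an up projection and a down projection."""
    return 2 * d_model * d_ff

def moe_layer_params(d_model, d_ff, num_experts, top_k):
    """Return (total, active) parameter counts for one MoE layer."""
    per_expert = ffn_params(d_model, d_ff)
    router = d_model * num_experts           # one score per expert
    total = num_experts * per_expert + router
    active = top_k * per_expert + router     # what a single token touches
    return total, active

total_64, active_64 = moe_layer_params(1024, 4096, num_experts=64, top_k=2)
total_128, active_128 = moe_layer_params(1024, 4096, num_experts=128, top_k=2)

print(f"64 experts:  total={total_64:,}  active={active_64:,}")
print(f"128 experts: total={total_128:,}  active={active_128:,}")
# 64 experts:  total=536,936,448  active=16,842,752
# 128 experts: total=1,073,872,896  active=16,908,288
```

Doubling the experts roughly doubles what you have to store, while the per-token active count only grows by the slightly bigger router.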

MoE vs dense models

Dense model (classic GPT-style):

Every parameter is used every time
Predictable behaviour
Expensive as it grows

MoE model:

Only a subset of parameters is used per token
Much cheaper per inference step than a dense model with the same total parameter count
More complex training and routing dynamics

Dense = simple but expensive
MoE = efficient but messy

Real-world examples

Mistral AI → Mixtral models are classic open MoE implementations (e.g. Mixtral 8x7B and 8x22B, which use 8 experts per layer with 2 active per token)
OpenAI → GPT-4 is widely believed (not fully confirmed publicly) to use MoE-like routing internally in some variants
Google → Gemini 1.5 has been described by Google as built on a Mixture-of-Experts architecture

Key point: MoE is now a mainstream technique for efficiency at frontier scale, not an experimental idea.

Trade-offs: memory vs compute

Pros:

Huge parameter counts (trillions become feasible)
Lower inference compute per token
Specialisation (experts can specialise, though in practice often on token-level patterns rather than neat topic domains)

Cons:

High memory footprint (you still store everything)
Routing instability (bad gating = bad performance)
Training complexity (load balancing across experts is hard)
Less predictable behaviour than dense models
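
To give a feel for the load balancing problem, here is a sketch of one common auxiliary loss, in the style used by Switch Transformer: multiply the fraction of tokens sent to each expert by the average router probability for that expert, sum over experts, and scale by the number of experts. It assumes top-1 routing, and the function name load_balance_loss is hypothetical. The value is 1.0 when routing is perfectly even and grows as the router collapses onto a few experts.

```python
def load_balance_loss(router_probs, assignments, num_experts):
    """Auxiliary balancing term for a batch of tokens (top-1 routing assumed).

    router_probs: per token, a list of probabilities over all experts
    assignments:  per token, the index of the expert it was sent to
    """
    n_tokens = len(assignments)
    # Fraction of tokens dispatched to each expert.
    frac_tokens = [assignments.count(e) / n_tokens for e in range(num_experts)]
    # Average router probability given to each expert.
    mean_prob = [sum(p[e] for p in router_probs) / n_tokens
                 for e in range(num_experts)]
    return num_experts * sum(f * p for f, p in zip(frac_tokens, mean_prob))

# Evenly spread routing across 4 experts.
balanced = load_balance_loss([[0.25] * 4] * 4, [0, 1, 2, 3], num_experts=4)

# Collapsed routing: every token goes to expert 0.
collapsed = load_balance_loss([[0.97, 0.01, 0.01, 0.01]] * 4, [0, 0, 0, 0],
                              num_experts=4)

print(balanced, collapsed)  # the collapsed case scores much higher
```

During training a small multiple of this term is typically added to the main loss, nudging the router to spread tokens around instead of overloading a favourite expert.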

Why parameter count stopped meaning much

Before MoE:

A 1B-parameter model cost roughly twice as much to run as a 500M one

After MoE:

A 1T-parameter model might need about as much compute per token as a 50B dense model

So:

Total parameter count now mostly tells you about stored capacity and memory needed
Not about per-token compute cost or how capable the model actually is at runtime

This is why modern model comparisons based only on “number of parameters” are misleading.
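
The 1T vs 50B comparison can be put into rough numbers. This sketch assumes 2 bytes per parameter (16-bit weights) and the usual rule of thumb of about 2 floating point operations per active parameter per token for a forward pass. Both are approximations, and the two models are hypothetical.

```python
def weight_memory_gb(total_params, bytes_per_param=2):
    """Memory needed just to hold the weights, in gigabytes."""
    return total_params * bytes_per_param / 1e9

def forward_flops_per_token(active_params):
    """Rule-of-thumb forward cost: ~2 FLOPs per active parameter."""
    return 2 * active_params

# Hypothetical models for illustration only.
moe = {"total": 1_000_000_000_000, "active": 50_000_000_000}
dense = {"total": 50_000_000_000, "active": 50_000_000_000}

for name, m in (("MoE 1T", moe), ("Dense 50B", dense)):
    print(name,
          f"memory={weight_memory_gb(m['total']):.0f} GB",
          f"flops/token={forward_flops_per_token(m['active']):.1e}")
```

Same compute per token, but the MoE needs 20 times the memory to hold its weights, which is why a single parameter count says so little.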




