dopetalk does not endorse any advertised product nor does it accept any liability for its use or misuse


Our Discord Notification Server invitation link is https://discord.gg/jB2qmRrxyD

Author Topic: Quantisation and Model Compression  (Read 9 times)

Offline Chip (OP)

Quantisation and Model Compression
« on: Yesterday at 10:44:04 PM »




Quantisation and Model Compression

Why This Matters

GPT-4's parameter count has never been published, but unofficial estimates put it somewhere in the region of a trillion parameters.

Each parameter is a number. A weight. A tiny learned value that, multiplied across billions of connections, produces the behaviour you see when you chat with an AI.

Storing and running those numbers requires memory. Enormous amounts of it. And the precision with which you store each number determines both how much memory you need and how accurately the model behaves.

Quantisation is the art of storing those numbers less precisely — and getting away with it.

What a Weight Actually Is

When a neural network is trained, every connection between neurons gets assigned a weight. This is just a floating point number — a decimal value, positive or negative, usually small.

Something like: [icode]-0.00312847291...[/icode]

During training, a master copy of these weights is typically kept at full precision. That means 32 bits per number, called FP32 (32-bit floating point). This gives you about 7 significant decimal digits of precision.

A model with 7 billion parameters stored in FP32 takes up roughly:

Code: [Select]
7,000,000,000 × 4 bytes = 28 GB of RAM

That exceeds the VRAM of nearly every consumer GPU. At this precision the model simply will not fit on a single consumer card.

This is the problem quantisation solves.

The Precision Ladder

Rather than storing weights at full 32-bit precision, you can round them to coarser representations:

FP32 — 32-bit float. Full training precision. 4 bytes per weight.
FP16 — 16-bit float. Half precision. 2 bytes per weight. Minimal quality loss. Used widely in inference already.
BF16 — Brain Float 16. A variant that preserves FP32's exponent range, sacrificing mantissa precision. Preferred for training stability. Common on modern hardware.
INT8 — 8-bit integer. 1 byte per weight. Noticeable but often acceptable quality reduction. Roughly half the memory of FP16.
INT4 — 4-bit integer. 0.5 bytes per weight. Aggressive compression. Quality degrades more visibly, especially on complex reasoning tasks.
INT2 / 1-bit — Experimental. Weights are essentially binary or ternary values. Microsoft's BitNet research explores this space. Not yet mainstream.

The tradeoff is always the same: less memory, faster inference, but lower fidelity.
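
To make the top of the ladder concrete, here is a small sketch that rounds a few sample weights to FP32, FP16 and BF16 using only Python's standard library. The sample values and helper names are illustrative assumptions. The BF16 helper simply truncates the low 16 bits of the FP32 encoding, whereas real hardware usually rounds to nearest, so treat its error as a slightly pessimistic estimate.

```python
import struct

def to_fp32(x):
    """Round a Python float (64-bit) to the nearest 32-bit float."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision float."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_bf16(x):
    """Approximate BF16 by keeping only the top 16 bits of the FP32 encoding.

    Assumption: truncation instead of round-to-nearest, for simplicity.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

def max_abs_error(values, convert):
    """Largest absolute rounding error that convert() introduces."""
    return max(abs(v - convert(v)) for v in values)

# Illustrative weights; the first is the example value used earlier.
sample_weights = [-0.00312847291, 0.1, 1.2345678, -0.75]

for name, convert in [("FP32", to_fp32), ("FP16", to_fp16), ("BF16", to_bf16)]:
    print(f"{name}: max abs error = {max_abs_error(sample_weights, convert):.3e}")
```

For values of this size FP16 keeps more mantissa bits than BF16, so its rounding error is no larger. BF16's advantage is range, which matters for very large or very small values rather than for typical weights.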

How It Actually Works

The naive approach would be to just round every weight to the nearest integer. That destroys the model: most weights are far smaller than 1 in magnitude, so nearly all of them would round to zero, and the subtle fractional differences between weights carry enormous amounts of learned information.

The practical approaches are more sophisticated:

Absmax Quantisation
Find the maximum absolute value in a tensor. Scale all values relative to that maximum. Map to the integer range. Store the scale factor alongside the quantised weights so you can approximately reconstruct the original values during inference.
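
The steps just described can be sketched in a few lines of plain Python. This is a minimal illustration, not any library's implementation; the function names and the sample weights are made up for the example.

```python
def absmax_quantise(values, bits=8):
    """Symmetric (absmax) quantisation to signed integers.

    Returns the integer codes and the scale needed to reconstruct values.
    """
    qmax = 2 ** (bits - 1) - 1                      # 127 for 8-bit
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0  # real-valued step per integer
    codes = [round(v / scale) for v in values]
    return codes, scale

def absmax_dequantise(codes, scale):
    """Approximately reconstruct the original values."""
    return [c * scale for c in codes]

# Illustrative weights, all far smaller than 1 in magnitude.
weights = [-0.00312847291, 0.0125, -0.0471, 0.0302, 0.0008]

codes, scale = absmax_quantise(weights)
restored = absmax_dequantise(codes, scale)
naive = [round(w) for w in weights]  # rounding straight to integers

print("codes:   ", codes)
print("restored:", restored)
print("naive:   ", naive)
```

The naive list comes out as all zeros, while the absmax codes spread across the integer range and reconstruct each weight to within half a step.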

Zero-Point Quantisation
Handles asymmetric distributions, where the values are not centred on zero. Stores an integer offset (the zero point) alongside the scale, so that the minimum and maximum values map onto the two ends of the integer range and the whole range is used.
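
A minimal sketch of the asymmetric variant follows, assuming the common formulation in which a code is the rounded scaled value plus an integer zero point. The names and the skewed sample data are illustrative, not taken from any particular library.

```python
def zero_point_quantise(values, bits=8):
    """Asymmetric quantisation: map [min, max] onto the full signed range."""
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1   # -128..127 for 8-bit
    lo, hi = min(values), max(values)
    scale = (hi - lo) / (qmax - qmin) if hi > lo else 1.0
    zero_point = round(qmin - lo / scale)                  # integer code for 0.0
    codes = [
        min(qmax, max(qmin, round(v / scale) + zero_point))  # clamp to range
        for v in values
    ]
    return codes, scale, zero_point

def zero_point_dequantise(codes, scale, zero_point):
    """Approximately reconstruct the original values."""
    return [(c - zero_point) * scale for c in codes]

# Illustrative data that sits mostly above zero.
skewed = [-0.2, 0.0, 0.05, 0.31, 0.9, 1.4]

codes, scale, zp = zero_point_quantise(skewed)
restored = zero_point_dequantise(codes, scale, zp)

print("zero point:", zp)
print("codes:     ", codes)
print("restored:  ", restored)
```

With a symmetric scheme most of the negative codes would go unused for data like this. Here the smallest value lands on the lowest code and the largest on the highest.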

Block-wise Quantisation
Rather than computing one scale factor for an entire layer (which loses local detail), divide weights into small blocks and quantise each block independently with its own scale factor. Better quality, small overhead. Used in most modern quantisation schemes.
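
The effect is easiest to see with a single outlier. The sketch below quantises the same weights twice at 4-bit: once with one scale for everything, once in blocks of four. The block size and the values are illustrative assumptions; real schemes typically use larger blocks.

```python
def quantise_block(block, bits=4):
    """Absmax-quantise one block and return its reconstructed values."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for 4-bit
    max_abs = max(abs(v) for v in block)
    scale = max_abs / qmax if max_abs > 0 else 1.0  # one scale per block
    return [round(v / scale) * scale for v in block]

def blockwise_roundtrip(values, block_size, bits=4):
    """Quantise and reconstruct values block by block."""
    out = []
    for start in range(0, len(values), block_size):
        out.extend(quantise_block(values[start:start + block_size], bits))
    return out

def mean_abs_error(original, restored):
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)

# Illustrative weights with one large outlier (0.9) in the second half.
weights = [0.011, -0.024, 0.017, -0.008, 0.031, -0.019, 0.9, -0.027]

per_tensor = blockwise_roundtrip(weights, block_size=len(weights))
per_block = blockwise_roundtrip(weights, block_size=4)

print("one scale for all:", mean_abs_error(weights, per_tensor))
print("blocks of four:   ", mean_abs_error(weights, per_block))
```

With a single scale the outlier forces every small weight to zero. With blocks, only the block that contains the outlier pays that price.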

GPTQ
A post-training quantisation method that uses second-order information (essentially the curvature of each layer's output error, estimated from a small calibration set) to compensate for the errors introduced by quantisation. Works layer by layer. Produces much better quality than naive rounding, especially at 4-bit. As each weight is rounded, the weights not yet quantised are adjusted slightly to absorb the error before they are locked in themselves.
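
The compensation idea can be shown with a deliberately tiny toy: a "layer" with two weights and four calibration inputs. This is not GPTQ itself, only a sketch of the underlying step under simplifying assumptions. One weight is rounded, then the other is nudged by the least-squares amount that best cancels the resulting output error. The sums `h01` and `h11` play the role of second-order (Hessian) entries; all names and numbers are made up for illustration.

```python
def quantise(value, step):
    """Round a value to the nearest multiple of step (a uniform grid)."""
    return round(value / step) * step

def output_error(w_ref, w_new, inputs):
    """Sum of squared differences in the layer output over calibration inputs."""
    total = 0.0
    for x0, x1 in inputs:
        ref = w_ref[0] * x0 + w_ref[1] * x1
        new = w_new[0] * x0 + w_new[1] * x1
        total += (ref - new) ** 2
    return total

def quantise_first_and_compensate(weights, inputs, step):
    """Quantise weight 0, then shift weight 1 to absorb the output error."""
    w0, w1 = weights
    q0 = quantise(w0, step)
    err = w0 - q0                                   # what rounding removed
    h01 = sum(x0 * x1 for x0, x1 in inputs)         # input correlation
    h11 = sum(x1 * x1 for x0, x1 in inputs)         # energy of input 1
    return q0, w1 + err * h01 / h11                 # least-squares correction

# Illustrative calibration inputs with strongly correlated features.
calibration = [(1.0, 0.9), (0.5, 0.6), (-1.0, -1.1), (2.0, 1.8)]
weights = (0.37, -0.52)
step = 0.25

q0, w1_adjusted = quantise_first_and_compensate(weights, calibration, step)
naive_error = output_error(weights, (q0, weights[1]), calibration)
compensated_error = output_error(weights, (q0, w1_adjusted), calibration)

print("error without compensation:", naive_error)
print("error with compensation:   ", compensated_error)
```

Because the two inputs are correlated, most of the damage done by rounding the first weight can be undone by the second. The real method applies the same principle across whole weight matrices, using an inverse Hessian rather than a two-term formula.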

QLoRA
Combines quantisation with LoRA (Low-Rank Adaptation) fine-tuning. Lets you fine-tune a quantised model on consumer hardware by training only a small set of adapter weights rather than the full model. This is how most hobbyist fine-tuning happens.
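
To see why the adapter approach is so much cheaper, it helps to count trainable weights. The sketch below assumes the standard LoRA formulation, in which a frozen matrix W gets a trainable low-rank update made of two small matrices. The 4096 by 4096 layer size and rank 16 are illustrative assumptions, not figures from any specific model.

```python
def lora_parameter_counts(d_out, d_in, rank):
    """Compare trainable parameters: full weight matrix vs LoRA adapters.

    LoRA freezes the d_out x d_in matrix W and trains two small matrices,
    B (d_out x rank) and A (rank x d_in), so the effective weight is W + B A.
    In QLoRA the frozen W is additionally stored in quantised form.
    """
    full = d_out * d_in
    adapter = rank * (d_out + d_in)
    return full, adapter

# Illustrative layer shape and rank.
full, adapter = lora_parameter_counts(d_out=4096, d_in=4096, rank=16)

print(f"full matrix:  {full:,} trainable weights")
print(f"LoRA rank 16: {adapter:,} trainable weights ({adapter / full:.2%} of full)")
```

Gradients and optimiser state are only needed for the adapter weights, which is where most of the training-time memory saving comes from.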

GGUF and GGML

If you have ever downloaded a local model, you have almost certainly encountered files ending in .gguf.

GGML is a tensor library created by Georgi Gerganov, the same developer who wrote llama.cpp. It provides a runtime for running models efficiently on CPU, and early versions came with a simple file format, also called GGML, for storing quantised models.

GGUF (commonly expanded as GPT-Generated Unified Format) replaced that file format. It is a self-contained binary format that stores:

  • The quantised model weights
  • The tokeniser vocabulary and merge rules
  • Model architecture metadata
  • Quantisation parameters needed to reconstruct approximate original values at inference time

Everything needed to run the model in a single file. No external dependencies, no separate tokeniser file, no architecture config JSON. This portability is why the format became dominant for local deployment.
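
As a rough illustration of "self-contained", the sketch below reads the fixed-size start of a GGUF file. The layout is an assumption based on my reading of versions 2 and 3 of the format: a 4-byte magic, a 32-bit version, then 64-bit counts of tensors and metadata entries, all little-endian. Check the current specification before relying on it. The header bytes used here are synthetic and the counts are made up.

```python
import struct

GGUF_MAGIC = b"GGUF"

def read_gguf_header(data):
    """Parse the fixed-size start of a GGUF file.

    Assumed layout (little-endian): 4-byte magic, uint32 version,
    uint64 tensor count, uint64 metadata key/value count.
    """
    if len(data) < 24:
        raise ValueError("need at least 24 bytes")
    if data[:4] != GGUF_MAGIC:
        raise ValueError("not a GGUF file")
    version, tensor_count, kv_count = struct.unpack("<IQQ", data[4:24])
    return {"version": version, "tensors": tensor_count, "metadata_entries": kv_count}

# Synthetic header for demonstration only; the counts are invented.
fake_header = GGUF_MAGIC + struct.pack("<IQQ", 3, 291, 24)
print(read_gguf_header(fake_header))
```

With a real model you would pass the first 24 bytes of the file instead, for example the result of opening it in binary mode and calling `read(24)`. The metadata entries that follow the header hold the tokeniser and architecture details listed above.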

GGUF filenames typically encode the quantisation level:

Code: [Select]
mistral-7b-instruct-v0.2.Q4_K_M.gguf

Breaking that down:
  • mistral-7b-instruct-v0.2 — the model and version
  • Q4 — 4-bit quantisation
  • K — k-quants method (block-wise with mixed precision)
  • M — Medium variant. S = Small (more compressed), L = Large (less compressed, better quality)

K-quants are worth understanding. Rather than quantising every tensor to the same precision, k-quant variants use higher precision for the tensors that are more sensitive to rounding errors (in llama.cpp's presets, chiefly some of the attention and feed-forward tensors) and lower precision elsewhere. The result is a better quality-to-size ratio than uniform quantisation.
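
The filename breakdown above can be turned into a small parser. The naming is a community convention rather than a formal standard, so the regular expression below is a hypothetical pattern that covers tags such as Q4_K_M, Q6_K and Q8_0 and will miss other schemes.

```python
import re

# Hypothetical pattern for the informal "<model>.Q<bits>_<scheme>[_<variant>].gguf" naming.
QUANT_TAG = re.compile(
    r"\.Q(?P<bits>\d+)_(?P<scheme>[0-9K])(?:_(?P<variant>[SML]))?\.gguf$"
)

def parse_gguf_name(filename):
    """Split a GGUF filename into model name and quantisation details."""
    match = QUANT_TAG.search(filename)
    if match is None:
        return None
    return {
        "model": filename[:match.start()],
        "bits": int(match.group("bits")),
        "k_quant": match.group("scheme") == "K",
        "variant": match.group("variant"),   # "S", "M", "L" or None
    }

print(parse_gguf_name("mistral-7b-instruct-v0.2.Q4_K_M.gguf"))
```

A filename that does not follow the convention returns `None` rather than a guess.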

llama.cpp

llama.cpp is a C++ inference engine, also by Georgi Gerganov, initially written to run Meta's LLaMA model on a MacBook. It has since expanded to support most major model architectures.

What makes it significant:

  • Runs on CPU with no GPU required
  • Supports Apple Silicon Metal acceleration
  • Supports CUDA for NVIDIA GPUs
  • Supports Vulkan for broader GPU compatibility
  • Loads GGUF files directly
  • Minimal dependencies — compiles to a standalone binary

It effectively democratised local inference. A 7B parameter model quantised to Q4_K_M sits at roughly 4 GB. That fits comfortably in a machine with 8 GB of RAM and runs at usable speeds on CPU alone: slower than a cloud API, but functional and completely private.

Front-ends like Ollama, LM Studio, and Jan are essentially polished wrappers around llama.cpp (or similar engines) that abstract the command-line interface into something more accessible.

Pruning

A separate but related approach. Rather than reducing the precision of weights, pruning removes weights entirely — setting them to zero on the basis that they contribute minimally to the model's outputs.

Unstructured pruning removes individual weights scattered throughout the network. Can achieve high sparsity with modest quality loss, but sparse matrices are difficult to accelerate efficiently on conventional hardware. The theoretical memory saving does not always translate to real-world speed gains.

Structured pruning removes entire neurons, attention heads, or layers. The resulting model is a genuinely smaller dense network rather than a sparse one, which hardware handles efficiently. Quality loss is more significant but inference benefits are real.

Pruning is often combined with fine-tuning afterward to recover some of the lost quality. Repeating that prune-then-fine-tune cycle over several rounds, removing a little more each time, is called iterative pruning.
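
The difference between the two styles shows up clearly on a tiny weight matrix. The sketch below assumes magnitude as the measure of how little a weight contributes, which is a common proxy but only one of several criteria. Each row stands for one neuron, and the values are illustrative.

```python
def prune_unstructured(matrix, sparsity):
    """Zero the smallest-magnitude weights until about `sparsity` of them are zero.

    Ties at the threshold are all pruned, so the result can slightly overshoot.
    """
    flat = sorted(abs(w) for row in matrix for w in row)
    n_prune = int(len(flat) * sparsity)
    if n_prune == 0:
        return [row[:] for row in matrix]
    threshold = flat[n_prune - 1]
    return [[0.0 if abs(w) <= threshold else w for w in row] for row in matrix]

def prune_structured(matrix, n_remove):
    """Drop the n_remove rows (neurons) with the smallest L2 norm."""
    norms = [sum(w * w for w in row) ** 0.5 for row in matrix]
    ranked = sorted(range(len(matrix)), key=lambda i: norms[i], reverse=True)
    keep = sorted(ranked[:len(matrix) - n_remove])   # preserve original order
    return [matrix[i][:] for i in keep]

# Illustrative 3-neuron layer; the middle neuron has only tiny weights.
layer = [
    [0.80, -0.02, 0.41],
    [0.01, 0.03, -0.02],
    [-0.55, 0.07, 0.90],
]

sparse = prune_unstructured(layer, sparsity=0.5)
smaller = prune_structured(layer, n_remove=1)

print("unstructured:", sparse)    # same shape, zeros scattered through it
print("structured:  ", smaller)   # one fewer row, still dense
```

The unstructured result keeps its 3 by 3 shape, so a dense matrix multiply does the same amount of work as before. The structured result is a genuinely smaller matrix.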

Knowledge Distillation

A fundamentally different approach. Rather than compressing an existing model, you train a smaller model — the student — to mimic the behaviour of a larger one — the teacher.

The student is not trained on raw data labels alone. It is trained on the output distributions of the teacher — the full probability vectors across the vocabulary, not just the single predicted token. These distributions carry richer information than hard labels. The teacher's uncertainty about secondary tokens encodes relationships the student can learn from.

The result is a smaller model that punches above its weight because it has been trained to replicate the reasoning patterns of a much larger one.
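
The soft-target idea can be sketched as a loss function. This is a minimal illustration using the KL divergence between teacher and student distributions. The temperature parameter is an assumption borrowed from common distillation practice rather than something described above, and the logits are invented.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; higher temperature flattens them."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)                       # subtract max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL divergence from the teacher's soft distribution to the student's."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Made-up logits over a four-token vocabulary.
teacher = [4.0, 2.5, 0.5, -1.0]
close_student = [3.8, 2.4, 0.6, -0.9]          # matches the whole distribution
top_only_student = [4.0, -3.0, -3.0, -3.0]     # right top token, ignores the rest

print("close student:   ", distillation_loss(teacher, close_student))
print("top-only student:", distillation_loss(teacher, top_only_student))
```

Both students pick the same top token, so a hard label cannot tell them apart. The soft-target loss penalises the second one for discarding what the teacher knows about the runner-up tokens.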

Most of the smaller "distilled" models you see, including various 1B and 3B parameter models, have been produced or influenced by this method. Microsoft's Phi series is a well-known related example: it is trained on carefully curated and synthetic data, much of it generated by larger models, which is distillation through the training data rather than through matched output distributions.

What Degrades and What Survives

Not all capabilities degrade equally under quantisation. Understanding the failure pattern matters:

  • Factual recall — relatively robust even at INT4. The associations are distributed widely enough that coarse weights preserve them.
  • Fluency and grammar — survives well. Surface language generation is robust to quantisation noise.
  • Complex multi-step reasoning — degrades earlier. The precise weight values in attention layers matter more for long chains of inference.
  • Instruction following — moderate degradation. Edge cases in instruction-following are more sensitive than general response quality.
  • Rare knowledge — degrades faster. Information that was weakly represented in training is the first to be lost to quantisation noise.

This is one reason k-quants spend extra bits on sensitive tensors such as parts of the attention layers: rounding error there tends to hurt reasoning the most.

The Practical Upshot

A 7B model at Q4_K_M is generally the sweet spot for local deployment:
  • ~4 GB on disk and in RAM
  • Runs on CPU, faster with any GPU offloading
  • Quality loss is real but modest for most use cases
  • Usable for summarisation, Q&A, coding assistance, and general chat

A 13B model at Q4_K_M sits around 8 GB — within reach of a mid-range GPU.
A 70B model at Q4_K_M requires roughly 40 GB — needing either a high-end GPU, multiple GPUs, or CPU+RAM inference at slow speeds.

The scaling relationship between parameters, quantisation level, and required memory is what drives most decisions when selecting a local model.
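
That relationship is simple enough to put in a helper. The sketch below estimates the size of the weights alone. The FP32, FP16 and INT8 rates follow directly from the format widths. The Q4_K_M rate of about 4.85 bits per weight is an assumed average that includes block scales, and it varies from model to model. The KV cache and runtime overhead are ignored.

```python
# Approximate bits stored per weight.
BITS_PER_WEIGHT = {
    "FP32": 32.0,
    "FP16": 16.0,
    "INT8": 8.0,
    "Q4_K_M": 4.85,   # assumption: rough effective rate, varies by model
}

def weights_size_gb(n_params, fmt):
    """Estimated size of the weights alone, in decimal gigabytes."""
    return n_params * BITS_PER_WEIGHT[fmt] / 8 / 1e9

def formats_that_fit(n_params, budget_gb):
    """Formats whose weights fit the budget (ignores KV cache and overhead)."""
    return [f for f in BITS_PER_WEIGHT if weights_size_gb(n_params, f) <= budget_gb]

for label, n_params in [("7B", 7e9), ("13B", 13e9), ("70B", 70e9)]:
    print(f"{label}: {weights_size_gb(n_params, 'Q4_K_M'):.1f} GB at Q4_K_M")

print("7B within 8 GB:", formats_that_fit(7e9, budget_gb=8.0))
```

Leave headroom beyond these figures in practice, since the context window and the runtime itself also need memory.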

Why This Connects Back

Scaling Laws (the previous topic) explained why making models bigger produced dramatically better results.

Quantisation explains how the rest of us get access to those models anyway — by accepting a calculated, measurable loss of precision in exchange for the ability to run them on hardware that actually exists outside a datacenter.

The two topics together explain the current moment in AI: capabilities concentrated in enormous models, access distributed through aggressive compression.

Generated by Claude.






