Qwen-3-Coder-Next is an 80 billion parameter model, 159.4GB in size. That's roughly how much RAM you would need to run it, and that's before thinking about long context windows. This is not considered a big model. Rumors have it that frontier models have over 1 trillion parameters, which would require at least 2TB of RAM. The last time I saw that much RAM in one machine was never.
But what if I told you we can make LLMs 4x smaller and 2x faster, enough to run very capable models on your laptop, all while losing only 5-10% accuracy?
That's the magic of quantization.
In this post, you are going to learn:
- How a model's parameters make it so big
- How floating point precision works and how models sacrifice it
- How to compress floats using quantization
- How to measure model quality loss after quantization
If you already know what parameters are and how floats are stored, feel free to skip straight to quantization.
What makes large language models so large?
Parameters, also called "weights," are the majority of what an LLM is when it's in memory or on disk. In my prompt caching post I wrote that LLMs are an "enormous graph of billions of carefully arranged operations." What do those graphs look like? Let's start with the simplest example: 1 input, 1 parameter, 1 output.
It doesn't look like much, but this is the fundamental building block of modern AI.
It takes the input of 2.0, multiplies it by the parameter 0.5, and gets the output 1.0.
LLMs, though, are much bigger. They have billions of these parameters in practice. One of the ways they get so big is that they have "layers." Here's how that looks.
These two nodes in the middle are a layer. They both show 1.0 because both connections have a parameter of 0.5, and so they are the result of 2.0 * 0.5. Every connection between two nodes gets a parameter, so we have 4 in total above. When 2 connections end at the same node, the values are added together. So to get the output of 1.5 we add together 1.0 * 1.0 and 0.5 * 1.0.
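If you prefer code to diagrams, here's a minimal sketch of that forward pass in JavaScript, using the same parameter values as the example above:

// 1 input -> 2 hidden nodes -> 1 output, one parameter per connection.
const input = 2.0;
const inputToHidden = [0.5, 0.5]; // input -> hidden node 1, input -> hidden node 2
const hiddenToOutput = [1.0, 0.5]; // hidden node 1 -> output, hidden node 2 -> output

const hidden = inputToHidden.map((p) => input * p); // [1.0, 1.0]

// Connections ending at the same node are summed together.
const output = hidden[0] * hiddenToOutput[0] + hidden[1] * hiddenToOutput[1]; // 1.5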
Play with the slider next to the input below to get a feel for how it affects the output.
This is still only 6 parameters; we're a long way from the billions seen in modern LLMs. The example below has 2 inputs, 3 layers, and 2 outputs. In total it has 64 parameters. Hover or tap a node to see its parameters.
Modern LLMs have hundreds of thousands of inputs and outputs. They have many dozens of layers, each with thousands of nodes, all densely connected together. This all multiplies out to result in billions, sometimes trillions of parameters.
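To get a feel for how those numbers multiply out, here's a rough, hypothetical back-of-the-envelope calculation. The layer count and width below are made up for illustration, not taken from any real model:

// Every pair of adjacent, densely connected layers contributes
// (nodes in) x (nodes out) parameters.
const layerWidth = 8192; // hypothetical number of nodes per layer
const layers = 60;       // hypothetical number of layers

const parameters = layerWidth * layerWidth * layers;
console.log(parameters.toLocaleString()); // 4,026,531,840 -> roughly 4 billion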
How do computers store numbers?
Computers work in 1s and 0s, called "bits." Here's what a whole number, or
"integer," looks like when stored as bits. Drag the slider to change the value.
You can also tap on individual bits to flip them.
unsigned int8
Each bit represents a power of 2; summing the powers for the bits that are set gives the final value.
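Here's the same idea as a small JavaScript sketch, summing powers of 2 for the bits that are set:

// Bits for an unsigned 8-bit integer, most significant bit first.
const bits = [1, 0, 1, 0, 0, 1, 0, 1];

const value = bits.reduce(
  (sum, bit, i) => sum + bit * 2 ** (bits.length - 1 - i),
  0
);
console.log(value); // 128 + 32 + 4 + 1 = 165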
Integers are nice to work with because they are discrete. Between 1 and 3 there is exactly one number: 2. This is great for computers, they can represent discrete values no problem.
It gets trickier when you start thinking about decimal places. How many decimal place numbers are there between 1 and 3? There are an infinite number of them. This is not good for computers, because computers can't represent an infinite number of things.
What computers do is compromise. They promise to be accurate up to a certain number of significant figures, and anything after that is best-effort.
For example, 32-bit floating point numbers span the range ±3.40×10^38 with 7 significant figures of accuracy. They do this by dividing the 32 bits up into 3 parts: 1 sign bit, 8 exponent bits, and 23 significand bits. More exponent bits results in a larger range, while more significand bits results in more significant figures of accuracy.
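If you're curious, you can pull those three parts out of a float32 yourself by reinterpreting its bits as an integer. A minimal sketch (float32Parts is just a hypothetical helper name):

function float32Parts(x) {
  const view = new DataView(new ArrayBuffer(4));
  view.setFloat32(0, x);
  const bits = view.getUint32(0);
  return {
    sign: bits >>> 31,              // 1 bit
    exponent: (bits >>> 23) & 0xff, // 8 bits
    significand: bits & 0x7fffff,   // 23 bits
  };
}

console.log(float32Parts(1.5)); // { sign: 0, exponent: 127, significand: 4194304 }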
Play with the example below. Sliding left and right explores the full range of values. The plus and minus buttons at the top jump to the next higher or lower representable number. The 7 significant figures that are promised to be accurate are underlined. Press reset to return to the default value of 1.5.
float32
Whenever you press plus, it takes you to the next highest representable value. Pay attention to the digit after the underlined digits; it will sometimes skip over a number. This is the precision compromise in action.
Pressing plus when you're on a very small value moves you forward a small amount. Pressing it when you're on a very large value moves you forward a larger amount. The size of the jump changes depending on where you are in the range: the values are not evenly distributed.
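One way to see this for yourself is to nudge a float32's bit pattern up by one and look at how far the value jumps. A minimal sketch, assuming positive, finite inputs (nextFloat32Up is a hypothetical helper name):

function nextFloat32Up(x) {
  const view = new DataView(new ArrayBuffer(4));
  view.setFloat32(0, x);
  view.setUint32(0, view.getUint32(0) + 1); // next bit pattern
  return view.getFloat32(0);
}

nextFloat32Up(1);       // 1.0000001192092896 -> a gap of ~0.00000012
nextFloat32Up(1000000); // 1000000.0625       -> a gap of 0.0625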
To illustrate this point further, below is a histogram. Each bar shows you a slice of the 32-bit floating point range. There are over 2 billion unique values that can be represented between -0.5 and 0.5.
Distribution of 32-bit float values
Histogram of representable float32 values between -10 and 10. Most finite values cluster close to zero.
A lot of the representable 32-bit floats are small values. This is fantastic for LLMs, because parameters also tend to be small. Small parameters have been found to result in models that generalise better to problems they haven't seen before, so models are rewarded during training for making parameters small.
Below is another histogram, created by downloading 6 popular open source models and counting their parameter values. Almost all parameters are very close to 0.
Distribution of model parameter values
Line chart comparing parameter distributions for six models. Most parameter values cluster near zero.
So most model parameters sit in the range of floats that can be most precisely represented. There are a very small number of outliers, though. I'll come back to that later.
Can we use smaller floats?
Do language models actually need 32-bit floats? They don't need a wide range, as you can see from the parameter distribution histogram above, and do they really need 7 significant figures of accuracy?
The answer is no: LLMs work just fine with smaller, less accurate floats. Below is an example of a 16-bit float. It works just like the 32-bit float, except it only has 5 exponent bits and 10 significand bits. It can represent 3 significant figures of precision, and has a range of ±65504. It also takes up half as much RAM and disk as a 32-bit float.
float16
We can mix and match the number of exponent and significand bits to get different precision/range tradeoffs.
For example, the Google Brain team created the bfloat16 format, which has 8 exponent bits but only 7 significand bits. This gives it a very wide range, but only 2 significant figures of precision.
bfloat16
Google found that 2 significant figures is sufficient for creating LLMs, and having this extremely wide range means they don't have to worry about any calculations overflowing, which can happen in larger LLMs using smaller floats.
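A nice consequence of bfloat16 sharing float32's 8 exponent bits is that conversion is, to a first approximation, just dropping the bottom 16 bits of a float32. Here's a sketch of that idea (real converters typically round to nearest rather than truncate the way this does; toBfloat16 is a hypothetical helper name):

function toBfloat16(x) {
  const view = new DataView(new ArrayBuffer(4));
  view.setFloat32(0, x);
  // Keep the sign bit, the 8 exponent bits, and the top 7 significand bits.
  view.setUint32(0, view.getUint32(0) & 0xffff0000);
  return view.getFloat32(0);
}

toBfloat16(3.14159); // 3.140625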
Some more extreme examples that are seen less often are float8 and float4. Below are just example configurations of these floats, in the wild people will mix and match the number of exponent and significand bits to suit their needs.
float8
float4
Another way to visualise the accuracy of these formats is to see how well they approximate a sine wave. Use the zoom buttons underneath the graph below to see how they differ.
Different floats approximating sine
This chart compares an ideal sine wave with quantized approximations for float32, float16, bfloat16, float8, and float4. Lower-precision formats appear more step-like and drift further from the smooth curve.
Let's talk about how we can use this knowledge to make models smaller.
What is quantization?
Quantization is the process of taking values from a large range, and packing them into a smaller range. It is a form of lossy compression.
When we convert between, e.g., a float16 and a float8, we tend to round to
the closest representable value. We're taking values in the float16 range
and mapping them to the nearest float8 value. This is called
"round-to-nearest" and is one of many types of quantization.
Slide the vertical bar along the number lines below. As you move the bar, the closest representable value for each float size is shown.
This is a very simple way to take a value in one range and represent it inside a smaller range. However, it's not a good idea to do this for LLMs.
Take a look at the small model below. See how the parameters and output change as you round from bfloat16 to float8 and float4.
Quantization by rounding
Rounding to float8 isn't too bad, but rounding to float4 completely breaks
the model. This is because some of the parameters are now 0.
Because there is no path from input to output that
doesn't multiply by 0, the output is always 0.
This happens because we're being inefficient about how we squish values into the 4 bits we have available. A float4 goes from -3 to 3, but our parameters go from -0.89 to 0.16.
On top of that, float4 can also represent Infinity and NaN. These aren't
useful to us when quantizing.
Let's look at how we can more efficiently use the 16 values a 4-bit number gives us.
Symmetric quantization
Instead of going from -3 to 3 like float4 does, what if we used a tighter
range that better fits our data?
One of the ways we can do this is by scaling our data into a new range. For example, if I had data in the range of -14 to 14, and I wanted to fit it into -7 to 7, I could divide by 2.
Then to get back to my original value from the scaled value, all I need to do is
multiply it by 2. Any odd value would become unrepresentable in this scheme, as
it would have to be rounded to the nearest integer. For example, 5 / 2 = 2.5
would round up to 3 and then dequantize to 3 * 2 = 6. This is what makes
quantization lossy.
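Here's that divide-by-2 example as code, with hypothetical quantizeByTwo and dequantizeByTwo helpers:

const scale = 2; // squeezes -14..14 into -7..7

const quantizeByTwo = (v) => Math.round(v / scale);
const dequantizeByTwo = (q) => q * scale;

dequantizeByTwo(quantizeByTwo(6)); // 6 -- even values survive the round trip
dequantizeByTwo(quantizeByTwo(5)); // 6 -- odd values get rounded, and this is the loss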
We can apply this process to any set of values; we just need to find the right
scaling factor. We do this by finding the largest absolute value in our dataset and
dividing it by the largest value in our quantized range. For our parameters that
would be 0.89 / 7. Here's what the code would look like in JavaScript.
function quantize({ values, bits }) {
  // Assuming:
  // values = [-0.89, 0.16, 0.08, -0.13, 0.16, -0.54]
  // bits = 4
  const vmax = Math.max(...values.map(Math.abs)); // 0.89
  const qmax = 2 ** (bits - 1) - 1; // 7
  const scale = vmax / qmax; // 0.12714285714285714
  return {
    values: values.map((v) => Math.round(v / scale)),
    scale,
  };
}

function dequantize({ values, scale }) {
  return values.map((v) => v * scale);
}

The code below applies this quantize function to our parameter values from the small model we saw above.
const values = [-0.89, 0.16, 0.08, -0.13, 0.16, -0.54];
const quantized = quantize({ values, bits: 4 });
// { values: [-7, 1, 1, -1, 1, -4], scale: 0.12714285714285714 }

Our small floats have been mapped into the quantized integer range. Here's how that mapping looks visually:
And then we can dequantize the values, multiplying them by the scale we got
back from quantize, to get back to the original floating point range:
const dequantized = dequantize(quantized);
// [-0.89, 0.1271, 0.1271, -0.1271, 0.1271, -0.5086]

And again, how that looks visually:
Comparing those values to the originals we can see that they're mostly pretty close!
We've made our model 4x smaller, 16-bit to 4-bit, and on average the dequantized values are off by about 18%. Not bad!
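If you want to reproduce that 18% figure, here's one way to compute it, assuming the error is measured as the average relative difference from the original values:

const original = [-0.89, 0.16, 0.08, -0.13, 0.16, -0.54];
const dequantized = [-0.89, 0.1271, 0.1271, -0.1271, 0.1271, -0.5086];

const relativeErrors = original.map(
  (v, i) => Math.abs(dequantized[i] - v) / Math.abs(v)
);
const averageError = relativeErrors.reduce((a, b) => a + b) / original.length;
console.log(averageError); // ~0.18, i.e. about 18%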
Let's have a look at how our small model performs using symmetric quantization.
Play with the slider and flip between bfloat16 and quantized 4-bit and note
the differences.
Symmetrically quantized model
The quantized 4-bit model has the final output off by about 30%. This is a
huge improvement over the always-zero behaviour of rounded float4, especially
when you remember that this quantized model will require 4x less RAM than the
original.
But... we can still do better.
Asymmetric quantization
Symmetric quantization is called that because 0 is always in the middle. We pick a scale and squish our values around 0. Let's take a closer look at our symmetrically quantized values.
There's a lot of wasted space on the positive side because our maximum value is
0.16, whereas our minimum value is -0.89. Because of symmetry, the positive
range goes up to 0.89, leaving a large gap between 0.16 and 0.89. We
aren't efficiently using the space.
The idea behind asymmetric quantization is to fix this by allowing uneven ranges. Instead of squishing around 0, asymmetric quantization squishes around the midpoint of your data, and works out a "zero point" to offset by during dequantization.
function quantize({ values, bits }) {
  // Assuming:
  // values = [-0.89, 0.16, 0.08, -0.13, 0.16, -0.54]
  // bits = 4
  const vmax = Math.max(...values); // 0.16
  const vmin = Math.min(...values); // -0.89
  const qmax = 2 ** (bits - 1) - 1; // 7
  const qmin = -(2 ** (bits - 1)); // -8
  const scale = (vmax - vmin) / (qmax - qmin); // 0.07
  const zero = qmin - Math.round(vmin / scale); // 5
  return {
    values: values.map((x) => Math.round(x / scale + zero)),
    scale,
    zero,
  };
}

Let's run our familiar parameters through this and see what we get.
const values = [-0.89, 0.16, 0.08, -0.13, 0.16, -0.54];
const quantized = quantize({ values, bits: 4 });
// { values: [-8, 7, 6, 3, 7, -3], scale: 0.07, zero: 5 }

Here's how the mapping looks visually. Our new asymmetric quantize function
has decided that 5 should be the zero point.
The dequantize function is only slightly more complicated than the symmetric
version, this time requiring a subtraction as well as a multiplication.
function dequantize({ values, scale, zero }) {
  return values.map((x) => scale * (x - zero));
}
dequantize(quantized);
// [-0.91, 0.14, 0.07, -0.14, 0.14, -0.56]

How does this compare to the average error with symmetric quantization?
Much better! 10% less error for the same number of bits. Let's see how it does on our model.
Asymmetrically quantized model
The final output is still off by around 10%, but that's a really nice improvement over symmetric quantization.
This is what's happening to the parameters of models when they're quantized down to sizes that are possible to run on your laptop. Instead of floats, small integers are what get stored and loaded into memory. When the time comes to use the quantized values, to generate an answer to a question for example, the values are dequantized on the fly. You might think this sounds slower, but we'll see later on that this actually ends up being faster as well as smaller.
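To make "dequantized on the fly" concrete, here's an illustrative sketch of a single dot product against quantized weights, reusing the symmetric scheme from earlier. This is just to show the idea, not how any particular inference engine lays out its kernels:

// Quantized weights are stored as small integers plus one scale.
const weights = { values: [-7, 1, 1, -1, 1, -4], scale: 0.12714285714285714 };
const inputs = [0.2, -1.3, 0.7, 0.05, 0.9, -0.4];

let sum = 0;
for (let i = 0; i < inputs.length; i++) {
  const w = weights.values[i] * weights.scale; // dequantize just-in-time
  sum += w * inputs[i];
}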
How is quantization applied in practice?
Are people taking their LLMs with hundreds of billions of parameters, finding the largest and smallest of all of them, and quantizing the model in one go?
No.
I hinted at the reason for this earlier on. Let's take another look at that graph of parameters from 6 different open weight models, except this time let's look at the long tail of outlier values.
Outlier parameters
Line chart comparing the outlier tails of model parameter distributions. Outliers are rare and only appear in a small number of bins.
All of the models have a small number of outlier parameters: ones that are much larger or smaller than most others. Outliers are really bad for quantization. Look what happens when we try to quantize our parameters from earlier with a single outlier of 10 added to them:
Everything gets squished into a small number of buckets and average error goes through the roof. If we quantized the entire model in one go, we'd destroy it. What's done in practice is quantization in blocks, usually around 32-256 parameters at a time. This way, the impact of outliers is contained.
To dequantize, we need to save the scale value for symmetric, and the scale + zero for asymmetric. These get stored alongside each block, and are
considered overhead. Choosing a larger block size reduces this overhead, but
larger blocks have a wider range of values on average, increasing error.
It's a trade-off.
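Here's a sketch of what block-wise quantization could look like, reusing the symmetric quantize function from earlier. The block size of 4 is only to keep the example readable; real schemes use blocks of around 32-256, and quantizeInBlocks is a hypothetical helper name:

function quantizeInBlocks({ values, bits, blockSize }) {
  const blocks = [];
  for (let i = 0; i < values.length; i += blockSize) {
    // Each block gets its own scale, so one outlier only
    // ruins the precision of its own block.
    blocks.push(quantize({ values: values.slice(i, i + blockSize), bits }));
  }
  return blocks; // overhead: one scale stored per block
}

const params = [-0.89, 0.16, 0.08, -0.13, 0.16, -0.54, 10.0, 0.11];
quantizeInBlocks({ values: params, bits: 4, blockSize: 4 });
// The outlier 10.0 only affects the second block's scale.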
How much does quantization affect model accuracy?
In this section I'm going to show you a number of ways quality loss in LLMs can be measured. All of these measures have pros and cons. If you're evaluating quantized models for a critical use-case, nothing beats creating your own benchmark for the specific task you're asking the model to perform.
Disclaimers aside, all of the following tests were performed against the Qwen3.5 9B model, and I've put details about all of the commands I ran in the appendix at the end of this post.
Perplexity
What LLMs are doing under the hood is creating probability distributions of what
the likely next "token" is for a given prompt. For example, if I prompt Qwen3.5
9B with The answer to 2 + 2 is, it gives me these probabilities for what it
thinks the next token should be:
It's given a high probability for the token 4, which makes sense given that's
the correct answer. It's a little concerning that it'll say 5 about 3% of the time, but I digress.
The idea behind "perplexity" as a measurement is to collapse these probability
distributions down into a single number that's easy to reason about. Calculating
perplexity involves a little bit of math but I promise it's not too bad. For the
single prediction above, The answer to 2 + 2 is, we take the probability of
the correct token, 4, and we do this:
const pCorrect = 0.9229; // 92.29%, probability for `4`
const perplexity = Math.exp(-Math.log(pCorrect)); // 1.08

Lower scores are better, and the way you're supposed to read this is "the model considers there to be ~1.08 plausible tokens that complete this prompt." The lower the perplexity, the higher the probability the model gave to the correct token.
When I give Qwen3.5 9B the prompt And then I, the possibilities are more spread
out.
If we assume the correct next token is was, the perplexity calculation for
this probability distribution becomes:
const pCorrect = 0.03; // 3%, probability for `was`
const perplexity = Math.exp(-Math.log(pCorrect)); // 33.33

Much higher, but then again: what would you have predicted comes after And then I? It's a far more ambiguous prompt, and it makes sense for models to be
less confident predicting it.
I used llama.cpp's llama-perplexity tool to measure the
perplexity of Qwen 3.5 9B at different quantization levels. The way it works is
you give it a reference text, and it takes a sliding window of tokens over that
reference text as the prompt, and uses the next token in the reference text to
know what the correct token should be.
To illustrate this sliding window idea, below you can see the prompt tokens and the correct token. Use the forward and backward buttons to move the sliding window.
Sliding prompt window
Correct token: used
At each step, the -Math.log(pCorrect) is accumulated and then at the end we
Math.exp the average. If we moved this sliding window across the whole
reference text and collected the correct token probabilities at each step into
probs, the end calculation would look like this:
let total = 0;
for (const prob of probs) {
  total += -Math.log(prob);
}
const perplexity = Math.exp(total / probs.length);

The llama.cpp project likes to use wikitext-2's test dataset as
the reference text, so I did the same. It opens with the contents of the Wikipedia
page on Robert Boulter; I have no idea why. I wonder if he
knows he's used to benchmark quantized LLM quality...
Anyway, let's take a look at the results.
We see almost no change with 8-bit symmetric, small degradation in the 4-bit variants, and then almost complete collapse in the 2-bit variant. Quantization has caused the model to become less confident, considering a wider selection of tokens on average.
Crucially, perplexity only considered the correct token in its calculations. The probability of all of the other tokens is not used. As such, perplexity doesn't capture the full picture of how quantization has affected a model.
KL divergence
Short for "Kullback-Leibler divergence," this is a measurement that tells us how well 2 probability distributions overlap.
Play with the slider below. Try to get the KL divergence number to 0.
Two bell-shaped distributions with the same width are overlaid. The original distribution stays centered, while the quantized distribution has been shifted 1.5 standard deviations to the right, giving a KL divergence of 1.125.
The only time that KL divergence is 0 is when the 2 distributions exactly overlap. The further they are apart, the higher the KL divergence.
It's not just horizontal skew that increases KL divergence, any sort of non-overlapping will do it.
Here the two distributions share the same center, but the quantized distribution has been squashed down to 0.75 times the original height, giving a KL divergence of 0.069.
This measurement can be applied to the token probability distributions output by
an LLM. The distribution below shows the probability of each digit from 0 to 9 to
follow the prompt The answer to 2 + 2 is. Toggle between different levels of
quantization to see how it changes, along with the KL divergence score at the
top.
Line chart comparing next-token probabilities for the digits 0 through 9, original bfloat16 vs. 8-bit quantized. The two distributions are nearly identical (for example, the token 4 gets 92.3% originally and 92% at 8-bit), and the KL divergence between them is 0.000.
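For discrete distributions like the one above, the KL divergence calculation itself is short. Here's a sketch using the digit probabilities from the chart, restricted to just these ten tokens, so the number is approximate:

// P = original model, Q = 8-bit quantized model, probabilities for tokens 0-9.
const P = [0.0025, 0.009, 0.0085, 0.0115, 0.923, 0.0323, 0.0057, 0.0021, 0.0039, 0.0009];
const Q = [0.0025, 0.0091, 0.0087, 0.0119, 0.92, 0.0343, 0.0057, 0.0022, 0.004, 0.0009];

let kl = 0;
for (let i = 0; i < P.length; i++) {
  kl += P[i] * Math.log(P[i] / Q[i]);
}
console.log(kl); // ~0.0002, effectively zero: the distributions almost exactly overlap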
There's sadly no intuitive way to think about the KL divergence score other than "higher is worse." There's not even a natural maximum, e.g. we can't say "KL divergence is always between 0 and 1." It differs based on properties of the model. For that reason, it's only valid to compare the score between quantizations of the same model.
The llama-perplexity tool can also be used to measure KL divergence by passing
a --kl-divergence flag, details in the appendix. I used
wikitext-2 as the reference text again.
While KL divergence has the downsides of being difficult to reason about intuitively, and only comparable between quantizations of the same model, one of the benefits it has over perplexity is that it considers the entire probability distribution of each prediction.
Perplexity only cares about the probability of the correct token. The probability of every other token could change, but if the correct one stayed the same the perplexity wouldn't change. With KL divergence, if the entire distribution changed, the score would be higher. KL divergence captures a fuller picture of how quantization has changed the model's behavior.
Benchmarking
I wrote a whole post on LLM benchmarking, and one of the ways people measure the impact of quantization is to compare a model's score on some benchmarks before and after quantization.
For this post I decided to run the GPQA Diamond benchmark. I wrote about this benchmark if you're interested in learning the full details, but for this post it's enough to know that it's a set of 198 very hard multiple choice questions in biology, chemistry, and physics. Each question has 4 answers to choose from, so random guessing should score 25% on average. Details on how I ran this benchmark are in the appendix.
If you're confused by the results, don't worry. So was I. The 8-bit quantization
scores higher than the unquantized model in its original bfloat16 form. It's
hard to say exactly why this happens, it's possible that it is simply luck. With
multiple choice there's always that 25% chance of being right by accident.
What I'm taking away from these scores is that the 8- and 4-bit quantizations perform well, but the 2-bit quantization has fallen off a cliff. No answer was found for 97% of the questions, which could mean the model got stuck in a loop or didn't understand what it was being asked to do.
Just talk to it
The last test is the easiest but least rigorous: just talk to it! I asked the same question to all of the different quantization levels to see how they would respond. The question was: "What is the capital city of England, UK?"
All correct except for the 2-bit quantization, which refused to answer almost every time I asked it. The "reasoning trace" got a little bit unhinged:
The capital city of England is London.
The capital city of England is London.
The capital city of England is London.
The capital city of England is London.
The capital city of England is London.
The capital city of England is London.
The capital city of England is London.
The capital city of England is London.
The capital city of London
The capital city of London
The capital city of London
The capital city of London
The capital city of London
The capital city of London
The capital city of London
The capital city of London

But then it would go on to produce an empty response. I think it's fair to say that 2-bit quantization is too much information loss for Qwen3.5 9B, and it's not a useful model at that level of compression.
Clearly this is not a thorough or scientific test, but it can be useful to help you put all of the other scores into perspective.
How much does quantization affect model speed?
The last thing I want to talk about is model speed. Smaller sizes of quantization tend to also result in faster models. This is mostly because there's less data to move around inside your GPU, and LLM inference tends to be limited by memory bandwidth rather than raw compute.
The llama.cpp project comes to the rescue again here with its
llama-bench tool, which I ran both on a MacBook Pro M1 Max and a rented H100
SXM GPU from Runpod. Performance figures are given in "tokens per second,"
that is, how fast the model generates responses.
There's a big difference in performance between the original bfloat16 and the
8- and 4-bit quantizations. It's not obvious to me why the 2-bit is slower than
4-bit, though. I would have expected this to be faster. If you know why this is,
please reach out and let me know!
Conclusion
The main thing I want you to take away from this post is that quantized models are pretty good, actually.
When I first pitched the idea of writing this post, I didn't know anything about
how quantization worked. I had this assumption that model quality would degrade
linearly as you compress it. So if you start at bfloat16, an 8-bit
quantization of that would be half as good. Then a 4-bit quantization would be
half as good as the 8-bit version, and so on.
That doesn't appear to be true.
It looks like 16-bit to 8-bit carries almost no quality penalty. 16-bit to 4-bit is more noticeable, but it's certainly not a quarter as good as the original. Closer to 90%, depending on how you want to measure it.
So don't be afraid to run local models that are quantized. There's a quality cliff at some point, but I've given you the tools to know how to identify where it is. Provided you haven't fallen off that cliff, quantized models work well and you shouldn't shy away from trying them just because they're compressed!
As you do experiment with quantized local models, consider giving ngrok's AI gateway a shot. Use it to route LLM requests to these local models, whether they're running on your laptop or a rented GPU in the cloud.
Further reading
This post has focused entirely on what's called "post-training quantization" or PTQ. Some models these days go through what's called "quantization aware training" or QAT, which introduces quantization during pre-training and helps the model learn parameters that will quantize well.
I also only covered a couple of quantization methods; there are many more complex alternatives with different trade-offs. If you're interested, I recommend looking up AWQ and GPTQ.
Quantization is also only one method of reducing the size of LLMs. There's also parameter pruning and distillation. Efficient Large Language Models: A Survey is a little old now (May 2024) but it gives a good overview of the methods being pursued to make LLMs more efficient.
Appendix
Installing llama.cpp
brew install llama.cpp

Downloading Qwen 3.5 9B
llama-server -hf unsloth/Qwen3.5-9B-GGUF:BF16 --port 8000

This downloads the bfloat16 version of Qwen 3.5 9B, then runs an OpenAI-compatible server on port 8000 backed by the model.
Running the GPQA benchmark
You can clone the official GPQA repository and unzip the question set like this:
git clone [email protected]:idavidrein/gpqa.git
cd gpqa
uv venv --python 3.9
uv pip install -r requirements.txt
unzip -P deserted-untie-orchid dataset.zip

But I found that it doesn't let you run the benchmark against an arbitrary local model. I made some modifications (sorry they're a bit noisy, my editor changed a bunch of quotes and stuff) to allow me to run:
OPENAI_BASE_URL=http://localhost:8000/v1 uv run python baselines/run_baseline.py main \
--model_name qwen3.5-9b-bf16 \
--data_filename dataset/gpqa_diamond.csv \
--prompt_type zero_shot \
--max-tokens 30000 \
--verbose

I ended up running the benchmark against a llama-server that allowed for 4096 reasoning tokens:
llama-server -hf unsloth/Qwen3.5-9B-GGUF:BF16 --reasoning-budget 4096

Quantizing the model
cd ~/Library/Caches/llama.cpp
llama-quantize unsloth_Qwen3.5-9B-GGUF_Qwen3.5-9B-BF16.gguf unsloth_Qwen3.5-9B-GGUF_Qwen3.5-9B-Q8_0.gguf Q8_0

There's a lot going on there, I'll break it down:
- ~/Library/Caches/llama.cpp: This is the directory where llama.cpp stores models it's downloaded.
- unsloth_Qwen3.5-9B-GGUF_Qwen3.5-9B-BF16.gguf: This is the filename that was given to the model we downloaded when we ran llama-server -hf unsloth/Qwen3.5-9B-GGUF:BF16 earlier.
- unsloth_Qwen3.5-9B-GGUF_Qwen3.5-9B-Q8_0.gguf: This is the filename I want to give to my newly quantized model.
- Q8_0: This is the quantization format. llama.cpp has a special format for these strings; the "Q8" here means quantize to 8-bit, and the "_0" means symmetric quantization.
That command only took a minute on my laptop, and the resulting file is just over half the size of the original.
Here are the format strings to use for each of the quantization levels I used throughout the post:
llama.cpp doesn't seem to offer an 8-bit asymmetric or 2-bit symmetric option
as far as I can tell.
Also, I know that the 2-bit quantization I'm using is different to the rest. I
watched the fantastic reverse-engineering
GGUF video from Julia Turc, but the choice
was between K-quant and I-quant; there's no legacy quant available for 2-bit. So
I compromised. I also suspect this has something to do with why my llama-bench
results for 2-bit were weird.
Measuring perplexity and KL divergence
I used the get-wikitext-2.sh script from the llama.cpp repo to download the wikitext-2 test dataset then ran:
llama-perplexity -hf unsloth/Qwen3.5-9B-GGUF:BF16 -f wikitext-2-raw/wiki.test.raw -c 512 --kl-divergence-base ~/reference.kld

This saves KL divergence data you'll need when measuring the quantized versions of the model. Use it like this:
llama-perplexity -m path/to/quantized/model.gguf -f wikitext-2-raw/wiki.test.raw -c 512 --kl-divergence-base ~/reference.kld --kl-divergence

Talking to models locally
llama-cli -hf unsloth/Qwen3.5-9B-GGUF:BF16

Measuring performance
llama-bench -hf unsloth/Qwen3.5-9B-GGUF:BF16