Advanced Quantization Algorithm for LLMs

Original link: https://github.com/intel/auto-round

## AutoRound: Ultra-Low-Bit Quantization for LLMs and VLMs

AutoRound is a powerful toolkit that quantizes large language models (LLMs) and vision-language models (VLMs) down to 2–4 bits while minimizing accuracy loss. It uses sign-gradient descent to optimize rounding and offers broad hardware compatibility (CPU, CUDA, Intel GPU, Habana Gaudi).

Key features include superior accuracy, with leading results even at 4 bits, and seamless integration with popular frameworks such as Transformers, vLLM, and SGLang. AutoRound supports multiple export formats (AutoRound, AutoAWQ, AutoGPTQ, GGUF) and can generate mixed-precision schemes in minutes.

Recent updates add support for FP8 quantization, MTP layers, and enhanced algorithms for GGUF and INT2 quantization. It supports 10+ VLMs, offers customizable recipes optimized for either speed or accuracy, and is actively expanding support for new data types such as MXFP and NVFP. Installation via pip is straightforward, with nightly builds and hardware-specific options available.

For detailed usage and advanced configuration, see the user guide and the project repository.

Hacker News discussion: Advanced Quantization Algorithm for LLMs (github.com/intel), posted by lastdong 3 hours ago | 1 comment

netdur (8 minutes ago): Hmm... at Q4_K_M, stock quantization retains roughly 99–99.8% of BF16 accuracy, and AutoRound pushes that to roughly 99.4–100.n% (??). The gap is about 0.1–0.7 percentage points. https://github.com/intel/auto-round/blob/main/docs/gguf_alg_...
Original Article

AutoRound is an advanced quantization toolkit designed for Large Language Models (LLMs) and Vision-Language Models (VLMs). It achieves high accuracy at ultra-low bit widths (2–4 bits) with minimal tuning by leveraging sign-gradient descent and providing broad hardware compatibility. See our papers SignRoundV1 and SignRoundV2 for more details. For usage instructions, please refer to the User Guide.
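To make the core idea concrete, here is a minimal, self-contained sketch of optimizing weight rounding with signed gradient descent. This is a toy illustration in NumPy, not AutoRound's actual implementation: it learns a per-weight rounding offset `v` (clipped to [-0.5, 0.5]) so that the 4-bit quantized layer better reconstructs its full-precision output on calibration data, updating `v` by the sign of its gradient. All variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))            # weights of a toy linear layer
X = rng.normal(size=(32, 16))           # calibration inputs
scale = np.abs(W).max() / 7.0           # symmetric 4-bit scale (grid -8..7)
ref = X @ W.T                           # full-precision layer output

def quantize(W, v):
    """Round with a learnable offset v, clip to the 4-bit grid, dequantize."""
    return np.clip(np.floor(W / scale + 0.5 + v), -8, 7) * scale

def loss(v):
    d = X @ quantize(W, v).T - ref
    return float(np.mean(d * d))

v = np.zeros_like(W)                    # v = 0 is plain round-to-nearest (RTN)
best_v, best = v.copy(), loss(v)
lr = 0.05
for _ in range(100):
    err = X @ quantize(W, v).T - ref
    g = err.T @ X                       # proportional to dL/dv (straight-through estimator)
    v = np.clip(v - lr * np.sign(g), -0.5, 0.5)
    if loss(v) < best:                  # keep the best rounding offsets seen
        best_v, best = v.copy(), loss(v)
```

Because the search starts from `v = 0`, the learned rounding is never worse than plain round-to-nearest on the calibration batch; the real algorithm tunes this jointly with other parameters over many blocks.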

AutoRound Overview

  • [2026/03] Block-wise FP8 quantization is available via --scheme FP8_BLOCK --iters 0 --disable_opt_rtn.

  • [2026/03] MTP layer quantization is now supported in this PR.

  • [2025/12] The SignRoundV2 paper is available. Turn on enable_alg_ext and use the AutoScheme API for mixed-precision quantization to reproduce the results: Paper, Notes for evaluating LLaMA models.

  • [2025/11] AutoRound has landed in LLM-Compressor: Usage, vLLM blog, RedHat blog, X post, Intel blog, Linkedin, 微信, 知乎.

  • [2025/11] An enhanced GGUF quantization algorithm is available via --enable_alg_ext: Accuracy.

  • [2025/10] AutoRound has been integrated into SGLang: Usage, LMSYS Blog, X post, Intel blog, Linkedin.

  • [2025/10] A mixed precision algorithm is available to generate schemes in minutes: Usage, Accuracy.

  • [2025/09] MXFP4 and NVFP4 dtypes are available: Accuracy.

  • [2025/08] An improved INT2 algorithm is available via --enable_alg_ext: Accuracy.

  • [2025/07] GGUF format is supported: Usage.

  • [2025/05] AutoRound has been integrated into vLLM: Usage, Medium blog, 小红书.

  • [2025/05] AutoRound has been integrated into Transformers: Blog.

  • [2025/03] The INT2-mixed DeepSeek-R1 model (~200GB) retains 97.9% accuracy: Model.

✅ Superior Accuracy: Delivers strong performance even at 2–3 bits (example models), with leading results at 4 bits (benchmark).

✅ Ecosystem Integration: Works seamlessly with Transformers, vLLM, SGLang, and more.

✅ Multiple Export Formats: Supports AutoRound, AutoAWQ, AutoGPTQ, and GGUF for maximum compatibility. Details are shown in export formats.

✅ Fast Mixed Bits/Dtypes Scheme Generation: Automatically configures in minutes, with about 1.1–1.5× the model's BF16 RAM size as overhead. Accuracy results and user guide.

✅ Optimized Round-to-Nearest Mode: Use --iters 0 for fast quantization with some accuracy drop at 4 bits. Details are shown in opt_rtn mode.

✅ Affordable Quantization Cost: Quantize 7B models in about 10 minutes on a single GPU. Details are shown in quantization costs.

✅ 10+ VLMs Support: Out-of-the-box quantization for 10+ vision-language models (example models, support matrix).

✅ Multiple Recipes: Choose from auto-round-best, auto-round, and auto-round-light to suit your needs. Details are shown in quantization recipes.

✅ Advanced Utilities: Includes multi-GPU quantization, multiple calibration datasets, and support for 10+ runtime backends.

✅ Beyond Weight-Only Quantization: Actively expanding support for additional data types such as MXFP, NVFP, W8A8, and more.
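The round-to-nearest (RTN) baseline that --iters 0 falls back to can be sketched in a few lines. This is an illustrative toy in NumPy, not AutoRound's code: each group of `group_size` weights gets its own scale and zero-point, which is why smaller groups (e.g. the group_size=32 recommended above for tricky models) track the weights more closely at the cost of more metadata.

```python
import numpy as np

def rtn_w4(w, group_size=32):
    """Asymmetric 4-bit RTN: per-group scale/zero-point, returns dequantized weights."""
    w = w.reshape(-1, group_size)
    lo = w.min(axis=1, keepdims=True)
    hi = w.max(axis=1, keepdims=True)
    scale = np.maximum(hi - lo, 1e-9) / 15.0       # 16 levels: 0..15
    zp = np.round(-lo / scale)                     # zero-point per group
    q = np.clip(np.round(w / scale + zp), 0, 15)   # integer codes
    return ((q - zp) * scale).reshape(-1)          # dequantize back to float

rng = np.random.default_rng(0)
w = rng.normal(size=4096).astype(np.float32)
err32 = np.mean((w - rtn_w4(w, 32)) ** 2)
err128 = np.mean((w - rtn_w4(w, 128)) ** 2)        # coarser groups, larger error
```

On random Gaussian weights, the group-of-32 reconstruction error comes out measurably below the group-of-128 error, since each subgroup's min–max range (and hence its quantization step) is no larger than its parent group's.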

# CPU(Xeon)/GPU(CUDA)
pip install auto-round

# CPU(Xeon)/GPU(CUDA) nightly
pip install auto-round-nightly

# HPU(Gaudi)
# install inside the hpu docker container, e.g. vault.habana.ai/gaudi-docker/1.23.0/ubuntu24.04/habanalabs/pytorch-installer-2.9.0:latest  
pip install auto-round-hpu

# XPU(Intel GPU)
pip install torch --index-url https://download.pytorch.org/whl/xpu
pip install auto-round
Build from Source
# CPU(Xeon)/GPU(CUDA)
pip install .

# HPU(Gaudi)
python setup.py install hpu

# XPU(Intel GPU)
pip install torch --index-url https://download.pytorch.org/whl/xpu
pip install .

Model Quantization (CPU/Intel GPU/Gaudi/CUDA)

If you encounter issues during quantization, try using pure RTN mode with iters=0, disable_opt_rtn=True. Additionally, using group_size=32 or mixed bits is recommended for better results.

The full list of supported arguments is provided by calling auto-round -h on the terminal.

ModelScope is supported for model downloads; simply set AR_USE_MODELSCOPE=1.

auto-round \
    --model Qwen/Qwen3-0.6B \
    --scheme "W4A16" \
    --format "auto_round" \
    --output_dir ./tmp_autoround

We offer two additional recipes, auto-round-best and auto-round-light, designed for optimal accuracy and improved speed, respectively. Details are as follows.

Other Recipes
# Best accuracy, 3X slower, low_gpu_mem_usage could save ~20G but ~30% slower
auto-round-best \
  --model Qwen/Qwen3-0.6B \
  --scheme "W4A16" \
  --low_gpu_mem_usage 
# 2-3X speedup, slight accuracy drop at W4 and larger accuracy drop at W2
auto-round-light \
  --model Qwen/Qwen3-0.6B \
  --scheme "W4A16" 

In conclusion, we recommend using auto-round for W4A16 and auto-round-best with enable_alg_ext for W2A16. However, you may adjust the configuration to suit your specific requirements and available resources.

from auto_round import AutoRound

# Load a model (supports FP8/BF16/FP16/FP32)
model_name_or_path = "Qwen/Qwen3-0.6B"

# Available schemes: "W2A16", "W3A16", "W4A16", "W8A16", "NVFP4", "MXFP4" (no real kernels), "GGUF:Q4_K_M", etc.
ar = AutoRound(model_name_or_path, scheme="W4A16")

# Highest accuracy (4–5× slower).
# `low_gpu_mem_usage=True` saves ~20GB VRAM but runs ~30% slower.
# ar = AutoRound(model_name_or_path, nsamples=512, iters=1000, low_gpu_mem_usage=True)

# Faster quantization (2–3× speedup) with slight accuracy drop at W4G128.
# ar = AutoRound(model_name_or_path, nsamples=128, iters=50, lr=5e-3)

# Supported formats: "auto_round" (default), "auto_gptq", "auto_awq", "llm_compressor", "gguf:q4_k_m", etc.
ar.quantize_and_save(output_dir="./qmodel", format="auto_round")
Important Hyperparameters
Quantization Scheme & Configuration
  • scheme (str|dict|AutoScheme): The predefined quantization keys, e.g. W4A16, MXFP4, NVFP4, GGUF:Q4_K_M. For MXFP4/NVFP4, we recommend exporting to LLM-Compressor format.
  • bits (int): Number of bits for quantization (default is None). If not None, it will override the scheme setting.
  • group_size (int): Size of the quantization group (default is None). If not None, it will override the scheme setting.
  • sym (bool): Whether to use symmetric quantization (default is None). If not None, it will override the scheme setting.
  • layer_config (dict): Configuration for layer_wise scheme (default is None), mainly for customized mixed schemes.
  • enable_alg_ext (bool): [Experimental Feature] Only for iters>0. Enable algorithm variants for specific schemes (e.g., MXFP4/W2A16) that could bring notable improvements. Default is False.

  • disable_opt_rtn (bool|None): Use pure RTN mode for specific schemes (e.g., GGUF and WOQ). Default is None. If None, it defaults to False in most cases to improve accuracy, but may be set to True due to known issues.

Tuning Process Parameters
  • iters (int): Number of tuning iterations (default is 200). Common values: 0 (RTN mode), 50 (with lr=5e-3 recommended), 1000. Higher values increase accuracy but slow down tuning.
  • lr (float): The learning rate for the rounding values (default is None). When None, it will be set to 1.0/iters automatically.
  • batch_size (int): Batch size for training (default is 8). 4 is also commonly used.
  • enable_deterministic_algorithms (bool): Whether to enable deterministic algorithms for reproducibility (default is False).
  • dataset (str|list|tuple|torch.utils.data.DataLoader): The dataset for tuning (default is "NeelNanda/pile-10k"). Supports local JSON files and dataset combinations, e.g. "./tmp.json,NeelNanda/pile-10k:train,mbpp:train+validation+test".
  • nsamples (int): Number of samples for tuning (default is 128).
  • seqlen (int): Data length of the sequence for tuning (default is 2048).
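The dataset combination syntax above (comma-separated sources, optional colon-delimited splits joined by +) composes as sketched below. This hypothetical parser is for illustration only; the function name is not part of the AutoRound API.

```python
def parse_dataset_spec(spec: str):
    """'a,b:train,c:train+test' -> [('a', None), ('b', ['train']), ('c', ['train', 'test'])]."""
    out = []
    for part in spec.split(","):
        name, _, splits = part.partition(":")
        out.append((name, splits.split("+") if splits else None))
    return out

spec = "./tmp.json,NeelNanda/pile-10k:train,mbpp:train+validation+test"
parsed = parse_dataset_spec(spec)
# -> [('./tmp.json', None), ('NeelNanda/pile-10k', ['train']),
#     ('mbpp', ['train', 'validation', 'test'])]
```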
Device/Speed Configuration
  • enable_torch_compile (bool): If no exception is raised, we typically recommend setting it to True for faster quantization with lower resource usage.
  • low_gpu_mem_usage (bool): Whether to offload intermediate features to CPU at the cost of ~20% more tuning time (default is False).
  • low_cpu_mem_usage (bool): [Experimental Feature] Whether to save immediately to reduce RAM usage (default is True).
  • device_map (str|dict|int): The device to be used for tuning, e.g., auto, cpu, cuda, 0,1,2 (default is 0). When using auto, it will try to use all available GPUs.
Note: gray indicates the absence of a kernel, or the presence of only an inefficient/reference kernel. BF16 is mainly for AutoScheme.
| Format | Supported Schemes |
|---|---|
| auto_round | W4A16 (recommended), W2A16, W3A16, W8A16, W2A16G64, W2A16G32, MXFP4, MXFP8, MXFP4_RCEIL, MXFP8_RCEIL, NVFP4, FPW8A16, FP8_STATIC, BF16 |
| auto_awq | W4A16 (recommended), BF16 |
| auto_gptq | W4A16 (recommended), W2A16, W3A16, W8A16, W2A16G64, W2A16G32, BF16 |
| llm_compressor | NVFP4 (recommended), MXFP4, MXFP8, FPW8A16, FP8_STATIC, FP8_BLOCK, INT8, W4A16, W8A16 |
| gguf | GGUF:Q4_K_M (recommended), GGUF:Q2_K_S, GGUF:Q3_K_S, GGUF:Q3_K_M, GGUF:Q3_K_L, GGUF:Q4_K_S, GGUF:Q5_K_S, GGUF:Q5_K_M, GGUF:Q6_K, GGUF:Q4_0, GGUF:Q4_1, GGUF:Q5_0, GGUF:Q5_1, GGUF:Q8_0 |
| fake | all schemes (only for research) |

Adaptive Schemes (Experimental Feature)

AutoScheme provides an automatic algorithm to generate adaptive mixed bits/data-type quantization recipes. Please refer to the user guide for more details on AutoScheme.

from auto_round import AutoRound, AutoScheme

model_name = "Qwen/Qwen3-8B"
avg_bits = 3.0
scheme = AutoScheme(avg_bits=avg_bits, options=("GGUF:Q2_K_S", "GGUF:Q4_K_S"), ignore_scale_zp_bits=True)
layer_config = {"lm_head": "GGUF:Q6_K"}

# Change iters to 200 for non-GGUF schemes
ar = AutoRound(model=model_name, scheme=scheme, layer_config=layer_config, iters=0)
ar.quantize_and_save()
Important Hyperparameters of AutoScheme
  • avg_bits (float): Target average bit-width for the entire model. Only quantized layers are included in the average bit calculation.
  • options (str | list[str] | list[QuantizationScheme]): Candidate quantization schemes to choose from. It can be a single comma-separated string (e.g., "W4A16,W2A16"), a list of strings (e.g., ["W4A16", "W2A16"]), or a list of QuantizationScheme objects.
  • ignore_scale_zp_bits (bool): Only supported in API usage. Determines whether to exclude the bits of scale and zero-point from the average bit-width calculation (default: False).
  • shared_layers (Iterable[Iterable[str]], optional): Only supported in API usage. Defines groups of layers that share quantization settings.
  • batch_size (int, optional): Only supported in API usage. Can be set to 1 to reduce VRAM usage at the expense of longer tuning time.
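The avg_bits constraint can be illustrated with a toy greedy search: demote the layers that matter least to a lower-bit option until the parameter-weighted average bit-width meets the target. This is only a sketch of the constraint, not AutoScheme's actual algorithm; the layer names, parameter counts, and sensitivity scores are made up.

```python
layers = {  # name: (num_params, sensitivity) -- hypothetical numbers
    "q_proj": (4.0e6, 0.9), "k_proj": (4.0e6, 0.8),
    "up_proj": (11.0e6, 0.3), "down_proj": (11.0e6, 0.2),
}
HIGH, LOW, target = 4, 2, 3.0           # candidate bit-widths and avg_bits target

bits = {name: HIGH for name in layers}  # start with everything at 4 bits

def avg(bits):
    """Parameter-weighted average bit-width over quantized layers."""
    total = sum(p for p, _ in layers.values())
    return sum(layers[n][0] * b for n, b in bits.items()) / total

# Demote the least sensitive layers first until the target average is met.
for name in sorted(layers, key=lambda n: layers[n][1]):
    if avg(bits) <= target:
        break
    bits[name] = LOW
```

With these numbers, the two large MLP projections drop to 2 bits while the attention projections stay at 4 bits, bringing the average under 3.0; the real AutoScheme additionally accounts for scale/zero-point bits unless ignore_scale_zp_bits is set.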

This feature is experimental and may be subject to changes.

By default, AutoRound only quantizes the text module of VLMs and uses NeelNanda/pile-10k for calibration. To quantize the entire model, you can set quant_nontext_module to True, though support for this feature is limited. For more information, please refer to the AutoRound readme.

from auto_round import AutoRound

# Load the model
model_name_or_path = "Qwen/Qwen2.5-VL-7B-Instruct"
# Quantize the model
ar = AutoRound(model_name_or_path, scheme="W4A16")
output_dir = "./qmodel"
ar.quantize_and_save(output_dir)

vLLM (CPU/Intel GPU/CUDA)

from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
]
sampling_params = SamplingParams(temperature=0.6, top_p=0.95)
model_name = "Intel/DeepSeek-R1-0528-Qwen3-8B-int4-AutoRound"
llm = LLM(model=model_name)

outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

SGLang

Please note that support for MoE models and vision-language models is currently limited.

import sglang as sgl

llm = sgl.Engine(model_path="Intel/DeepSeek-R1-0528-Qwen3-8B-int4-AutoRound")
prompts = [
    "Hello, my name is",
]
sampling_params = {"temperature": 0.6, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Transformers (CPU/Intel GPU/Gaudi/CUDA)

AutoRound supports 10+ backends. It automatically selects the best available backend based on the installed libraries, and prompts the user to install additional libraries when a better backend is available.

Please avoid manually moving the quantized model to a different device (e.g., model.to('cpu')) during inference, as this may cause unexpected exceptions.

The support for Gaudi device is limited.

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Intel/DeepSeek-R1-0528-Qwen3-8B-int4-AutoRound"
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)
text = "There is a girl who likes adventure,"
inputs = tokenizer(text, return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=50)[0]))

SignRoundV2: Closing the Performance Gap in Extremely Low-Bit Post-Training Quantization for LLMs (2025.12 paper)

Optimize Weight Rounding via Signed Gradient Descent for the Quantization of LLM (2023.09 paper)

TEQ: Trainable Equivalent Transformation for Quantization of LLMs (2023.10 paper)

Effective Post-Training Quantization for Large Language Models (2023.04 blog)

Check out Full Publication List.

Special thanks to open-source low precision libraries such as AutoGPTQ, AutoAWQ, GPTQModel, Triton, Marlin, and ExLLaMAV2 for providing low-precision CUDA kernels, which are leveraged in AutoRound.

If you find AutoRound helpful, please ⭐ star the repo and share it with your community!
