How to run Qwen 3.5 locally

Original link: https://unsloth.ai/docs/models/qwen3.5

## Qwen3.5 LLM Summary

Alibaba's Qwen3.5 is a powerful new family of multimodal LLMs, available in a range of sizes, from a small 0.8B model up to a massive 397B-parameter model. The models excel at coding, vision, chat, and long-context tasks, and support a 256K context window across 201 languages.

Recent updates include improved quantization algorithms and data for better performance in chat, coding, and tool calling. The 35B and 27B models can run on a 22GB device, while the 397B model needs up to 256GB of RAM with optimized quantization.

Key features include "thinking" and "non-thinking" modes toggled via a parameter, plus compatibility with llama.cpp and LM Studio. Unsloth Dynamic quantization delivers state-of-the-art performance, and tooling is available for deploying Qwen3.5 in production with llama-server. Benchmarks show excellent results, with quantized versions retaining high accuracy even at greatly reduced memory.

## Qwen 3.5: A Powerful LLM You Can Now Run Locally

A recent Hacker News discussion highlighted the strong performance of the Qwen 3.5 large language models, particularly their ability to run efficiently on consumer hardware.

Users report around 100 tokens/second on an ASUS 5070 Ti 16GB, beating many hosted LLM services on both speed and output quality. Even the 35B model runs well on an 8GB RTX 3050, demonstrating real capability on coding tasks.

The model performs well in areas such as OCR and text formatting, and a dedicated "coder" variant is successfully automating HTML and CSS tasks. While some users hit memory-allocation issues on older GPUs, the overall consensus is that Qwen 3.5 represents a significant step forward for accessible, high-quality local LLMs.

Run the new Qwen3.5 LLMs on your local device, including the Medium models (Qwen3.5-35B-A3B, 27B, 122B-A10B), the Small models (Qwen3.5-0.8B, 2B, 4B, 9B), and the flagship 397B-A17B!

Qwen3.5 is Alibaba's new model family, including Qwen3.5-35B-A3B, 27B, 122B-A10B, and 397B-A17B, plus the new Small series: Qwen3.5-0.8B, 2B, 4B, and 9B. These multimodal hybrid reasoning LLMs deliver the strongest performance for their size. They support 256K context across 201 languages, offer thinking and non-thinking modes, and excel at agentic coding, vision, chat, and long-context tasks. The 35B and 27B models work on a 22GB Mac / RAM device. See all GGUFs here.


All uploads use Unsloth Dynamic 2.0 for SOTA quantization performance, so in the 4-bit quants, important layers are upcast to 8- or 16-bit. Thank you Qwen for providing Unsloth with day zero access. You can also fine-tune Qwen3.5 with Unsloth.


⚙️ Usage Guide

Table: Inference hardware requirements (units = total memory: RAM + VRAM, or unified memory)

| Qwen3.5 | 3-bit | 4-bit | 6-bit | 8-bit | BF16 |
| --- | --- | --- | --- | --- | --- |
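The table's headers survived but its cell values did not. As a rough guide, the total memory a quantized model needs can be estimated from parameter count and bits per weight. The formula and overhead factor below are illustrative assumptions, not official Unsloth figures:

```python
# Rough total-memory estimate (RAM + VRAM, or unified memory) for a
# quantized model: params * bits_per_weight / 8 bytes, plus an assumed
# ~10% overhead for KV cache and activations.

def gguf_size_gb(params_billion: float, bits: float, overhead: float = 1.10) -> float:
    """Approximate on-disk/in-memory size of a quantized model in GB."""
    bytes_total = params_billion * 1e9 * bits / 8
    return round(bytes_total * overhead / 1e9, 1)

for bits in (3, 4, 6, 8, 16):
    print(f"Qwen3.5-35B @ {bits}-bit ~= {gguf_size_gb(35, bits)} GB")
```

At 4-bit this lands near 19 GB for the 35B model, consistent with the claim above that the 35B and 27B models fit in 22GB of total memory.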


Between 27B and 35B-A3B: use 27B if you want slightly more accurate results or if 35B-A3B doesn't fit on your device. Go for 35B-A3B if you want much faster inference.

  • Maximum context window: 262,144 tokens (can be extended to 1M via YaRN)

  • presence_penalty = 0.0 to 2.0. The default is 0.0 (off); raise it to reduce repetition, but higher values may cause a slight decrease in output quality.

  • Adequate output length: 32,768 tokens for most queries
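To make the presence_penalty setting concrete, here is a minimal sketch of the OpenAI-style mechanism: a flat penalty subtracted from the logit of every token that has already appeared in the output. The toy logits and token strings are illustrative, not the model's real vocabulary:

```python
# Presence penalty sketch: each token that has appeared at least once
# in the generated text gets a fixed amount subtracted from its logit,
# regardless of how many times it appeared.

def apply_presence_penalty(logits: dict[str, float],
                           generated: list[str],
                           presence_penalty: float) -> dict[str, float]:
    seen = set(generated)
    return {tok: (lg - presence_penalty if tok in seen else lg)
            for tok, lg in logits.items()}

logits = {"the": 2.0, "cat": 1.5, "dog": 1.4}
out = apply_presence_penalty(logits, ["the", "the", "cat"], 1.5)
# "the" and "cat" are each penalized once; "dog" is untouched.
```

A presence_penalty of 0.0 leaves the logits unchanged, which is why the default amounts to the feature being off.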


If you're getting gibberish, your context length may be set too low. Alternatively, try --cache-type-k bf16 --cache-type-v bf16, which may help.
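Putting those flags together, a llama-server invocation might look like the sketch below. The GGUF filename is an assumption for illustration; substitute the actual file you downloaded from the Unsloth collection:

```shell
# Hypothetical llama-server launch; the model filename is an assumption.
# --ctx-size raises the context window (low values can cause gibberish),
# and the cache-type flags store the KV cache in bf16.
llama-server \
  -m Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf \
  --ctx-size 32768 \
  --cache-type-k bf16 --cache-type-v bf16 \
  --port 8080
```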

Because Qwen3.5 is a hybrid reasoning model, thinking and non-thinking modes use different settings:

Thinking mode:

| Setting | General tasks | Precise coding tasks (e.g. WebDev) |
| --- | --- | --- |
| repeat penalty | disabled or 1.0 | disabled or 1.0 |
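For context on why 1.0 means "disabled": repeat penalty (in the llama.cpp style) rescales the logits of recently generated tokens, dividing positive logits and multiplying negative ones by the penalty, so a factor of 1.0 is a no-op. A minimal sketch, with toy logits for illustration:

```python
# Repeat penalty sketch (llama.cpp-style): recently generated tokens
# have positive logits divided by the penalty and negative logits
# multiplied by it, making repeats less likely. penalty = 1.0 disables it.

def apply_repeat_penalty(logits: dict[str, float],
                         recent: list[str],
                         penalty: float) -> dict[str, float]:
    if penalty == 1.0:  # 1.0 is mathematically a no-op, i.e. "disabled"
        return dict(logits)
    out = dict(logits)
    for tok in set(recent):
        lg = out.get(tok)
        if lg is None:
            continue
        out[tok] = lg / penalty if lg > 0 else lg * penalty
    return out

scores = apply_repeat_penalty({"a": 2.0, "b": -1.0, "c": 0.5},
                              recent=["a", "b"], penalty=2.0)
# "a" drops from 2.0 to 1.0, "b" drops from -1.0 to -2.0, "c" is untouched.
```

This is why the table above recommends "disabled or 1.0" for thinking mode: both settings leave the logits exactly as the model produced them.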

Thinking mode for general tasks: