How to run Qwen 3.5 locally

Original link: https://unsloth.ai/docs/models/qwen3.5

## Qwen3.5 LLM Summary

Alibaba's Qwen3.5 is a powerful new family of multimodal LLMs, available in a range of sizes, from a small 0.8B model up to a massive 397B-parameter model. The models excel at coding, vision, chat, and long-context tasks, and support a 256K context window across 201 languages.

Recent updates include improved quantization algorithms and data for better performance in chat, coding, and tool calling. The 35B and 27B models can run on a 22GB device, while the 397B model needs up to 256GB of RAM with optimized quantization.

Key features include "thinking" and "non-thinking" modes toggled via a parameter, plus compatibility with llama.cpp and LM Studio. Unsloth Dynamic quantization delivers state-of-the-art performance, and tooling is available for deploying Qwen3.5 in production with llama-server. Benchmarks show excellent results, with quantized versions retaining high accuracy even at greatly reduced memory.

## Qwen 3.5: A Powerful LLM You Can Now Run Locally

A recent Hacker News discussion highlighted the strong performance of the Qwen 3.5 large language models, particularly their ability to run efficiently on consumer hardware.

Users report around 100 tokens/second on an ASUS 5070 Ti 16GB, beating many hosted LLM services on both speed and output quality. Even the 35B model runs well on an 8GB RTX 3050, demonstrating real capability on coding tasks.

The model performs well in areas such as OCR and text formatting, and a dedicated "coder" variant is successfully automating HTML and CSS tasks. While some users hit memory-allocation issues on older GPUs, the overall consensus is that Qwen 3.5 represents a significant step forward for accessible, high-quality local LLMs.

Run the new Qwen3.5 LLMs on your local device, including the Medium models (Qwen3.5-35B-A3B, 27B, 122B-A10B), the Small models (Qwen3.5-0.8B, 2B, 4B, 9B), and the flagship 397B-A17B!

Qwen3.5 is Alibaba's new model family, including Qwen3.5-35B-A3B, 27B, 122B-A10B, and 397B-A17B, plus the new Small series: Qwen3.5-0.8B, 2B, 4B, and 9B. These multimodal hybrid reasoning LLMs deliver the strongest performance for their size. They support 256K context across 201 languages, offer thinking and non-thinking modes, and excel at agentic coding, vision, chat, and long-context tasks. The 35B and 27B models work on a 22GB Mac / RAM device. See all GGUFs here.


All uploads use Unsloth Dynamic 2.0 for SOTA quantization performance, so in the 4-bit quants, important layers are upcast to 8- or 16-bit. Thank you Qwen for providing Unsloth with day zero access. You can also fine-tune Qwen3.5 with Unsloth.


⚙️ Usage Guide

Table: Inference hardware requirements (units = total memory: RAM + VRAM, or unified memory)

| Qwen3.5 | 3-bit | 4-bit | 6-bit | 8-bit | BF16 |
| --- | --- | --- | --- | --- | --- |
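The table's headers survived but its cell values did not. As a rough guide, the total memory a quantized model needs can be estimated from parameter count and bits per weight. The formula and overhead factor below are illustrative assumptions, not official Unsloth figures:

```python
# Rough total-memory estimate (RAM + VRAM, or unified memory) for a
# quantized model: params * bits_per_weight / 8 bytes, plus an assumed
# ~10% overhead for KV cache and activations.

def gguf_size_gb(params_billion: float, bits: float, overhead: float = 1.10) -> float:
    """Approximate on-disk/in-memory size of a quantized model in GB."""
    bytes_total = params_billion * 1e9 * bits / 8
    return round(bytes_total * overhead / 1e9, 1)

for bits in (3, 4, 6, 8, 16):
    print(f"Qwen3.5-35B @ {bits}-bit ~= {gguf_size_gb(35, bits)} GB")
```

At 4-bit this lands near 19 GB for the 35B model, consistent with the claim above that the 35B and 27B models fit in 22GB of total memory.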


Between 27B and 35B-A3B: use 27B if you want slightly more accurate results or if 35B-A3B doesn't fit on your device. Go for 35B-A3B if you want much faster inference.

  • Maximum context window: 262,144 tokens (can be extended to 1M via YaRN)

  • presence_penalty = 0.0 to 2.0. The default is 0.0 (off); raise it to reduce repetition, but higher values may cause a slight decrease in output quality.

  • Adequate output length: 32,768 tokens for most queries
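To make the presence_penalty setting concrete, here is a minimal sketch of the OpenAI-style mechanism: a flat penalty subtracted from the logit of every token that has already appeared in the output. The toy logits and token strings are illustrative, not the model's real vocabulary:

```python
# Presence penalty sketch: each token that has appeared at least once
# in the generated text gets a fixed amount subtracted from its logit,
# regardless of how many times it appeared.

def apply_presence_penalty(logits: dict[str, float],
                           generated: list[str],
                           presence_penalty: float) -> dict[str, float]:
    seen = set(generated)
    return {tok: (lg - presence_penalty if tok in seen else lg)
            for tok, lg in logits.items()}

logits = {"the": 2.0, "cat": 1.5, "dog": 1.4}
out = apply_presence_penalty(logits, ["the", "the", "cat"], 1.5)
# "the" and "cat" are each penalized once; "dog" is untouched.
```

A presence_penalty of 0.0 leaves the logits unchanged, which is why the default amounts to the feature being off.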


If you're getting gibberish, your context length may be set too low. Alternatively, try --cache-type-k bf16 --cache-type-v bf16, which may help.
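Putting those flags together, a llama-server invocation might look like the sketch below. The GGUF filename is an assumption for illustration; substitute the actual file you downloaded from the Unsloth collection:

```shell
# Hypothetical llama-server launch; the model filename is an assumption.
# --ctx-size raises the context window (low values can cause gibberish),
# and the cache-type flags store the KV cache in bf16.
llama-server \
  -m Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf \
  --ctx-size 32768 \
  --cache-type-k bf16 --cache-type-v bf16 \
  --port 8080
```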

Because Qwen3.5 is a hybrid reasoning model, thinking and non-thinking modes use different settings:

Thinking mode:

| Setting | General tasks | Precise coding tasks (e.g. WebDev) |
| --- | --- | --- |
| repeat penalty | disabled or 1.0 | disabled or 1.0 |
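For context on why 1.0 means "disabled": repeat penalty (in the llama.cpp style) rescales the logits of recently generated tokens, dividing positive logits and multiplying negative ones by the penalty, so a factor of 1.0 is a no-op. A minimal sketch, with toy logits for illustration:

```python
# Repeat penalty sketch (llama.cpp-style): recently generated tokens
# have positive logits divided by the penalty and negative logits
# multiplied by it, making repeats less likely. penalty = 1.0 disables it.

def apply_repeat_penalty(logits: dict[str, float],
                         recent: list[str],
                         penalty: float) -> dict[str, float]:
    if penalty == 1.0:  # 1.0 is mathematically a no-op, i.e. "disabled"
        return dict(logits)
    out = dict(logits)
    for tok in set(recent):
        lg = out.get(tok)
        if lg is None:
            continue
        out[tok] = lg / penalty if lg > 0 else lg * penalty
    return out

scores = apply_repeat_penalty({"a": 2.0, "b": -1.0, "c": 0.5},
                              recent=["a", "b"], penalty=2.0)
# "a" drops from 2.0 to 1.0, "b" drops from -1.0 to -2.0, "c" is untouched.
```

This is why the table above recommends "disabled or 1.0" for thinking mode: both settings leave the logits exactly as the model produced them.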

Thinking mode for general tasks: