Qwen 3.6 27B 是本地开发的最佳平衡点。

Qwen 3.6 27B 是本地开发的最佳平衡点。
Qwen 3.6 27B is the sweet spot for local development

原始链接: https://quesma.com/blog/qwen-36-is-awesome/

Qwen 3.6 是本地 AI 领域的一个重要里程碑，其性能足以媲美顶尖模型。作者重点介绍了两个版本：混合专家模型 35B A3B（速度更快）和稠密模型 27B（质量更高，是推荐选择）。尽管硬件要求较高，但这些模型在复杂推理、创意写作和实际编码任务中表现出色。使用 `llama.cpp` 和 Hugging Face 上的 GGUF 量化权重，在本地运行这些模型非常简单。作者演示了如何设置 `llama-server` 以实现工具调用及与编码智能体的集成，并指出即使在消费级硬件上，其性能依然令人印象深刻。作者认为，本地模型具有显著优势：数据隐私、摆脱对服务商的依赖，以及能够针对特定需求进行微调。随着技术进步——例如近期出现的 GLM 5.2——在本地运行“顶尖级别”智能已成为现实，这标志着 AI 正在向更私密、更具韧性的方向发展。尽管运行这些模型对硬件负载较高，但作者总结认为，所获得的智能水平和自主掌控权完全值得这一投入。

最近的一场 Hacker News 讨论指出，Qwen 27B 模型是本地 AI 辅助开发的一个“最佳平衡点”。用户反馈称 Qwen 在编码任务中表现出极高的胜任力，特别是在与 Llama 3 等模型的对比中——一些用户认为 Llama 3 更倾向于“模拟通用人工智能（AGI）”而非关注实际用途。技术讨论集中在硬件要求上，用户确认 Qwen 27B 可以在消费级硬件（如配备 64GB 内存的 M1 Max）上运行，尽管性能会因配置而异。虽然一些贡献者认为较小的模型（如 12B 或 16B 变体）对于大多数本地开发需求来说更为理想，但大家一致认为，中等规模的模型在性能和易用性之间达成了最佳平衡。讨论帖还提及了用户对昂贵的芯片成本以及模型迭代速度过快的无奈，同时也表现出对未来版本发布的热切期待。总而言之，社区认为 Qwen 对于希望将本地 AI 集成到工作流中的开发者来说，是一个高效且能力出众的替代方案。

原文

I’ve been disappointed by local models in the past. But then I checked Qwen 3.6, and I was in awe. For me it’s the first local model that actually makes sense as a general intelligence.

It comes in two variants, a mixture-of-experts model Qwen 3.6 35B A3B, and a dense Qwen 3.6 27B - slower, but more powerful. The one I recommend!

Let me share my impressions, and show that you can run it too.

It’s hot, literally. When my knees started to melt, I grabbed a phone-attached thermal camera and took a photo.

Qwen 3.6, rightfully, got a lot of coverage on Hacker News. The most common statement about Qwen 3.6 27B is that it punches above its weight - see Will it Mythos?. And I think it is a well-deserved sentiment. It will make your computer hot, but it’s worth it!

Testing the waters

Simon Willison uses “penguins on a bicycle” as a smoke test (see for Qwen 3.6 35B A3B and then Qwen 3.6 27B). I usually go with constrained writing.

A year ago these kinds of things were state of the art, needing a unique, and insanely expensive GPT-4.5, see vibe translating Quantum Flytrap.

I also asked it to write an 8 line poem about Zouk dance and quantum physics, see the transcript. The thought process made sense, both in terms of deliberation on quantum terms, and rhymes.

Then I asked in OpenCode to create a hexagonal minesweeper using pnpm. It worked:

Hexagonal minesweeper in with Qwen 3.6 27B in OpenCode

It worked on the first go, from a single prompt, with a proper Node package. The mixture-of-experts Qwen 3.6 35B A3B was faster… but ignored my instruction to create a package, and did it in a single index.html.

Real work

Sure, creative writing about quantum mechanics, or yet another clone of a minesweeper, is rarely a day job. But Qwen 3.6 27B is decent at regular tasks as well.

Prompt by a friend, Maciej Cielecki, at AI Tinkerers Warsaw.

It worked for a few minutes and created this:

A landing page by Qwen 3.6

By standards of current frontier models, it’s unremarkable. But it is already a practical job. It worked, was reactive, defaults were nice - all from a single, short prompt.

Running Qwen 3.6 locally with llama.cpp

Running local models is easier than ever. A few CLI lines and you’re off.

I recommend llama.cpp - a direct, open source tool that allows running models on various devices. You don’t need Ollama, and frankly - I would recommend against using that on ethical grounds.

First, we go to Hugging Face, to get proper quantization, i.e. a model with reduced size - popular ones are by unsloth or bartowski, among others. Default models usually come with BF16 precision. A common 8-bit quantization saves half the space at almost no cost to quality. Going further down the road, models are smaller (and potentially - faster), but at the cost of quality, see this comparison for 27B and another one for 35B A3B.

We grab unsloth/Qwen3.6-27B-MTP-GGUF:Q8_0, an 8-bit quantization with support for multi-token prediction (MTP).

llama-server -hf unsloth/Qwen3.6-27B-MTP-GGUF:Q8_0 \
    --spec-type draft-mtp -ngl 999 -fa on -c 65536 --jinja --port 8080

What it does:

-hf unsloth/Qwen3.6-27B-MTP-GGUF:Q8_0 grabs from Hugging Face, on the next runs will reuse that
-m ~/models/Qwen3.6-27B-Q8_0.gguf use instead if you already have it
draft-mtp we use a fast model to predict subsequent tokens, speeds up things
-ngl 999 for putting all layers to GPU
-fa on flash attention is on
-c 65536 context size set to 64k tokens (this we can tweak, as Qwen 3.6 27B native context is 256k)
--jinja for tool calling support
--port 8080 better to pin port, as it will be used by other configs

If you open http://127.0.0.1:8080, you can directly chat with it.

Precisely the same server can be used for vibe coding. Choice of agent depends both on one’s goal and subjective taste - for an all-around OpenCode, minimalistic Pi, and self-improving Hermes.

For OpenCode, it is as simple as adding to ~/.config/opencode/opencode.jsonc:

{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "llama": {
      "name": "llama.cpp (local)",
      "npm": "@ai-sdk/openai-compatible",
      "options": {
        "baseURL": "http://127.0.0.1:8080/v1",
        "apiKey": "local"
      },
      "models": {
        "qwen3.6-27b": { "name": "Qwen3.6-27B Q8 +MTP" }
      }
    }
  },
  "model": "llama/qwen3.6-27b"
}

If you just want to chat and are a big fan of Terminal, instead of llama-server use llama-cli:

 llama-cli -hf unsloth/Qwen3.6-27B-MTP-GGUF:Q8_0 \
                -ngl 999 -fa on -c 65536 --jinja

Measuring performance

Is it fast enough?

I ran a few tests (source is here) on my Macbook Max M5 128 GB, running it with and without multi-token prediction, and comparing both with the 35B A3B model, and also a quantized DeepSeek V4 Flash version DwarfStar4.

DeepSeek-V4-Flash · Q2–Q4

30 tokens per second is not bad, well within typical frontier model API range. While mlx-lm is precisely targeted at Apple Silicon devices, and AI agents heavily recommend it, llama.cpp turned out to be faster. It was using 95% of GPU, which means it is efficiently using available resources.

Macbook Max M5 is a beast (at least for a laptop), but on other devices it should also work decently. For consumer Nvidia RTX cards, on one hand models need to be quantized, on the other, it is even faster.

I set this up today on my 5090 at Q6_K quantization and Q4_0 KV, got 50 tokens/s consistently at 123k context, using ~28/32gb vram through LM Studio. - gfosco on the Hacker News

While 35B A3B is 3x faster, I prefer 27B. I’d rather generate a third as much code, but of higher quality.

How do they relate to previous state of the art models?

Manual inspection is great, but benchmarks help with grounding intuitions. Here is the score from Artificial Analysis, comparing it with frontier models:

Gemma 4 31B

≈ late 2024

o1 / Claude 3.5 Sonnet

Qwen3.6-35B-A3B

≈ early 2025

o3 / Claude 4 Sonnet

Qwen3.6-27B

≈ mid 2025

GPT-5 / Claude Sonnet 4.5

DeepSeek-V4-Flash

≈ late 2025

GPT-5.2 / Claude Opus 4.5

A few more benchmarks are in these notes, but the spirit is similar. Added here Gemma 4 31B, as a lot of people use this as the default for local coding. But both benchmarks and general sentiment online favour Qwen 3.6 27B by a large margin.

Here there is a caveat - 8-bit quantization likely does not affect results much, but DwarfStar4 uses much more aggressive ones for DeepSeek V4 Flash, 2-4 bit. For sure it is worse than the full model. My personal impression is that within these quantizations Qwen 3.6 27B is as good as (or maybe slightly better than) DwarfStar4. Though, I won’t be surprised if for longer context projects DS4 has an edge.

What’s next

I think we are entering a fascinating era, when it becomes feasible to run one’s own models.

The change will be propelled further by the state of proprietary frontier models. Claude Fable 5 was taken down. Other frontier models run at a massive subsidy, where paying $100 a month gives us thousands worth in tokens. Let’s use the discount while it lasts!

A locally set model can be fine-tuned to our needs, and cannot be taken away. Businesses can use them for proprietary and sensitive data. We can use them personally for offline projects, or when we don’t feel comfortable sharing our deepest secrets, or medical data, with the US or China.

With the release of frontier-level open-weight GLM 5.2, there is a new era. While Qwen 3.6 was the stepping stone, even frontier GLM 5.2 can be run locally. It won’t run on your Macbook or a single RTX 5090. But still, it is manageable with a company budget.

Moreover, I strongly believe that we will have models smarter than current state of the art, while runnable on local devices, maybe even smartphones. Current models combine both raw intelligence and factual knowledge in the same weights. Future models will likely separate that, offloading a lot of knowledge to tool calling.