We got 207 tok/s with Qwen3.5-27B on an RTX 3090

Original link: https://github.com/Luce-Org/lucebox-hub

Lucebox is a project focused on optimizing large language model (LLM) inference by hand-tuning software for specific hardware rather than relying on general-purpose frameworks. Its goal is to make powerful AI more accessible locally, prioritizing privacy, cost efficiency, and freedom from vendor lock-in.

So far, Lucebox has released two projects: **Megakernel Qwen3.5 0.8B** for the RTX 3090 GPU, which matches Apple's latest silicon at twice the throughput (1.87 tok/J), and **DFlash DDtree Qwen3.5 27B**, also for the RTX 3090, which uses speculative decoding and a custom CUDA engine to run up to 5.46× faster than autoregressive decoding (207 tok/s).

Both projects ship with detailed benchmarks and documentation and are available on GitHub under the MIT license. They rely on techniques such as kernel fusion, speculative decoding, and quantization, tuned to the target hardware to extract maximum performance. Future work includes optimizations for Ryzen AI MAX+ processors and heterogeneous CPU/GPU systems.

## Qwen3.5-27B at 207 tok/s on an RTX 3090

A new implementation achieves impressive performance with the Qwen3.5-27B language model: **207 tokens per second** on a single RTX 3090 GPU. It does this with a custom C++/ggml speculative decoder and a "DFlash" block-diffusion draft, significantly outperforming standard autoregressive decoding (5.46× faster) and the existing SGLang AWQ implementation (2.8× faster).

The key to the speedup is tuning for 24 GB cards: techniques such as compressing the KV cache to Q4_0 and using a rolling feature buffer make **128K-context processing** possible. The developers built on ggml and avoided depending on CUDA-only solutions such as vLLM, aiming for broader accessibility.

Improvements include optimized kernels and bug fixes that raised performance further. Future work includes a daemon mode for faster initial responses, support for sampling methods beyond greedy decoding, and exploring higher quantization levels. The project is open source (MIT license), available on GitHub, and will be adapted once Qwen3.6-27B is released. The developers have stated explicitly that they will not add Metal/Vulkan support, leaving that to potential forks.

Lucebox

lucebox.com · Discord · Blog

MIT · CUDA 12+ · C++17

Open LLM inference, rewritten by hand for one specific chip at a time.
Kernels, speculative decoding, and quantization, tailored per target.
We don't wait for better silicon. We rewrite the software.


Two projects today, more coming. Each one is a self-contained release with its own benchmarks and paper-style writeup.



01 · Megakernel Qwen3.5 0.8B on RTX 3090

The first megakernel for hybrid DeltaNet/Attention LLMs. All 24 layers of Qwen 3.5-0.8B in a single CUDA dispatch, 1.87 tok/J on a 2020 GPU, matching Apple's latest silicon at 2× the throughput.

```shell
# 1. clone + enter
git clone https://github.com/Luce-Org/lucebox-hub && cd lucebox-hub/megakernel

# 2. install (Python 3.10+, CUDA 12+, PyTorch 2.0+). Weights stream from HF on first run.
pip install -e .

# 3. run the benchmark (prefill pp520 + decode tg128 vs llama.cpp BF16 + PyTorch HF)
python final_bench.py
```

| Method | Prefill pp520 (tok/s) | Decode tg128 (tok/s) | tok/J |
| --- | --- | --- | --- |
| Megakernel @ 220 W | 37,800 | 413 | 1.87 |
| llama.cpp BF16 @ 350 W | 11,247 | 267 | 0.76 |
| PyTorch HF | 7,578 | 108 | n/a |
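The tok/J column is just decode throughput divided by the power ceiling, since watts are joules per second. A quick sanity check (illustrative helper, not part of the repo) lands within rounding of the reported figures:

```python
def tokens_per_joule(decode_tok_s: float, watts: float) -> float:
    # tok/s divided by J/s (watts) gives tokens per joule.
    return decode_tok_s / watts

print(round(tokens_per_joule(413, 220), 2))  # megakernel at its 220 W cap
print(round(tokens_per_joule(267, 350), 2))  # llama.cpp BF16 at 350 W
```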

What makes it work: 82 blocks, 512 threads, one persistent kernel. No CPU round-trips between layers. Weights streamed straight from HuggingFace. Cooperative grid sync instead of ~100 kernel launches per token. Power ceiling hit before compute ceiling, so DVFS converts tight execution straight into saved watts.

Full writeup → · Benchmarks → · Blog post →


02 · DFlash DDtree Qwen3.5 27B GGUF on RTX 3090

First GGUF port of DFlash speculative decoding. Qwen3.5-27B on a single RTX 3090, Q4_K_M target + BF16 draft, DDTree budget=22.

  • Up to 207 tok/s in the demo (207.6 tok/s DFlash vs 38.0 tok/s AR, 5.46×)
  • 129.5 tok/s mean on the HumanEval 10-prompt bench
  • 3.43× faster than autoregressive (+15% over chain speculative decoding)
  • 2.8× faster than SGLang AWQ on the same hardware
  • 128K context in 24 GB (134.78 tok/s at ctx=131072)

```shell
# 1. clone with submodules (pulls the pinned Luce-Org/llama.cpp@luce-dflash fork)
git clone --recurse-submodules https://github.com/Luce-Org/lucebox-hub && cd lucebox-hub/dflash

# 2. build the C++/CUDA decoder (~3 min on sm_86, CUDA 12+, CMake 3.18+)
cmake -B build -S . -DCMAKE_CUDA_ARCHITECTURES=86 -DCMAKE_BUILD_TYPE=Release
cmake --build build --target test_dflash -j

# 3. fetch weights: ~16 GB Q4_K_M target + 3.46 GB bf16 draft
huggingface-cli download unsloth/Qwen3.5-27B-GGUF Qwen3.5-27B-Q4_K_M.gguf --local-dir models/
huggingface-cli download z-lab/Qwen3.5-27B-DFlash model.safetensors --local-dir models/draft/

# 4a. one-shot streaming generate
python3 scripts/run.py --prompt "def fibonacci(n):"

# 4b. or reproduce the paper-style bench (HumanEval + GSM8K + Math500, ~15 min)
python3 scripts/bench_llm.py
```

| Benchmark | AR (tok/s) | DFlash+DDTree (tok/s) | Speedup |
| --- | --- | --- | --- |
| HumanEval | 37.8 | 129.5 | 3.43× |
| Math500 | 37.7 | 110.5 | 2.93× |
| GSM8K | 37.7 | 96.2 | 2.55× |
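The speedup column is simply the ratio of the two throughput columns; a quick check (illustrative helper, not part of the repo) reproduces it, along with the demo's headline ratio:

```python
def speedup(dflash_tok_s: float, ar_tok_s: float) -> str:
    # Ratio of DFlash+DDTree throughput to plain autoregressive decoding.
    return f"{dflash_tok_s / ar_tok_s:.2f}x"

print(speedup(129.5, 37.8))  # HumanEval
print(speedup(110.5, 37.7))  # Math500
print(speedup(96.2, 37.7))   # GSM8K
print(speedup(207.6, 38.0))  # demo peak
```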

The constraint that shaped the project. AWQ INT4 of Qwen3.5-27B plus the BF16 draft doesn't leave room for the DDTree verify state on a 24 GB card. Q4_K_M GGUF (~16 GB target) is the largest format that fits target + 3.46 GB draft + budget=22 tree state + KV cache in 24 GB on the RTX 3090. Picking it forced a new port on top of ggml, since no public DFlash runtime supports a GGUF target.
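The arithmetic behind that constraint is straightforward. A rough budget (sizes taken from the text above; what exactly fills the remainder is an assumption for the sketch):

```python
# Rough VRAM budget on a 24 GB RTX 3090, using the sizes quoted above.
total_vram  = 24.0   # GB
target_q4km = 16.0   # GB, Qwen3.5-27B Q4_K_M target
draft_bf16  = 3.46   # GB, BF16 draft model

headroom = total_vram - target_q4km - draft_bf16
print(f"{headroom:.2f} GB left for tree verify state, KV cache, and activations")
```

An AWQ INT4 target is larger than the Q4_K_M GGUF, which is what erases this headroom and forced the GGUF port.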

What we built vs what we didn't. The algorithms are not ours:

  • DFlash (z-lab, 2026): block-diffusion draft conditioned on target hidden states.
  • DDTree (Ringel et al., 2026): tree-structured verify that beats chain verify at the same compute budget.
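For intuition, the chain-verify baseline that DDTree improves on can be sketched in a few lines (an illustrative greedy-verification sketch; none of these names come from the repo):

```python
def chain_verify(draft_tokens, target_greedy):
    """Greedy chain verification: accept the longest draft prefix that
    matches the target model's own greedy choices.

    target_greedy[i] is the target's argmax token given the context plus
    draft_tokens[:i]; in a real engine all positions come from one batched
    target forward pass, which is where the speedup comes from.
    """
    accepted = []
    for i, tok in enumerate(draft_tokens):
        if tok == target_greedy[i]:
            accepted.append(tok)               # draft guessed right: free token
        else:
            accepted.append(target_greedy[i])  # first miss: take the fix, stop
            break
    else:
        accepted.append(target_greedy[len(draft_tokens)])  # all hit: bonus token
    return accepted
```

DDTree generalizes this by verifying a tree of candidate continuations instead of a single chain, so more draft tokens survive per target pass at the same compute budget.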

What we ported and tuned:

  • C++/CUDA decode engine on top of ggml (no libllama, no Python runtime, Q4_K_M target path).
  • Three custom CUDA kernels for tree-aware SSM state rollback: ggml_ssm_conv_tree, ggml_gated_delta_net_tree, ggml_gated_delta_net_tree_persist.
  • DDTree budget swept for RTX 3090 + Q4_K_M target: budget=22 is the sweet spot.
  • Q4_0 KV cache + sliding target_feat ring to fit 128K context in 24 GB with ~3% AL hit.
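As a rough picture of why a Q4_0 KV cache saves so much memory: 32 values share one scale and each value becomes a 4-bit index. A simplified sketch in the spirit of ggml's Q4_0 (the real format packs index pairs into bytes and stores the scale as fp16; this is not the repo's code):

```python
def q4_0_quantize(block):
    """Quantize 32 floats to 4-bit indices in [0, 15] sharing one scale."""
    assert len(block) == 32
    amax = max(block, key=abs)         # signed value of largest magnitude
    d = amax / -8 if amax else 1.0     # scale chosen so amax lands on the grid edge
    return d, [max(0, min(15, int(x / d + 8.5))) for x in block]

def q4_0_dequantize(d, qs):
    # Index 8 is the zero point; reconstruction error is at most d/2.
    return [(q - 8) * d for q in qs]
```

Per 32 values that is 16 bytes of indices plus one scale, versus 64 bytes at fp16: roughly a 3.5× KV-cache saving, which is what lets 128K context fit in 24 GB.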

Full writeup → · Benchmarks → · Blog post →


Local AI should be a default, not a privilege: private data, no per-token bill, no vendor lock-in. The hardware to run capable models already sits on desks. The software to run those chips well doesn't.

General-purpose frameworks dominated the last decade because hand-tuning kernels per chip was too expensive to justify. One stack, decent on everything, great on nothing. Most of the silicon's capability stays on the floor.

AI-assisted development flips that calculus. Rewrites that took a quarter now fit in a release cycle. Lucebox is where we publish them, one chip and one model family at a time. MIT source, full writeup, reproducible benchmarks.


NVIDIA GPU (Ampere+, sm_86+), CUDA 12+, PyTorch 2.0+. Tested on RTX 3090 (2020). dflash needs CMake 3.18+ and --recurse-submodules for the pinned Luce-Org/llama.cpp@luce-dflash fork (three tree-mode ggml ops).

Optional, find your GPU's sweet spot: sudo nvidia-smi -pl 220 (megakernel hits best tok/J at 220 W).


lucebox-hub/
├── megakernel/    · fused forward pass for Qwen 3.5-0.8B
├── dflash/        · DFlash speculative decoding port for Qwen 3.5-27B on RTX 3090
└── assets/        · banners, cards, diagrams

  Q1 2026    ▮▮▮▮▮▮▮▮▮▮    RTX 3090 kernels & optimizations
  Q2 2026    ▮▮▮▮▮▯▯▯▯▯    Ryzen AI MAX+ 395 optimizations
  Q2 2026    ▮▮▯▯▯▯▯▯▯▯    Heterogeneous CPU + GPU latency optimizations

@software{lucebox_2026,
  title  = {Lucebox: Open LLM Inference, Rewritten by Hand for One Specific Chip at a Time},
  author = {Lucebox},
  url    = {https://github.com/Luce-Org/lucebox-hub},
  year   = {2026}
}

Per-project citations live in each subproject's README.




MIT · Lucebox.com
