Ollama is now powered by MLX on Apple Silicon in preview

Original link: https://ollama.com/blog/mlx

## Ollama 0.19: Faster LLMs on Apple Silicon

Ollama's latest release (0.19) delivers a significant performance boost for running large language models on Apple Silicon by leveraging Apple's MLX framework. The speedup is substantial: prefill reaches up to 1810 tokens/s on the M5 chip thanks to the GPU Neural Accelerators. Key updates include support for NVIDIA's NVFP4 format, which delivers higher-quality responses with lower memory use, and improved caching. The caching enhancements (intelligent checkpoints, smarter eviction, and cache reuse) make responses faster, especially in coding and agentic tasks. Ollama 0.19 initially accelerates the Qwen3.5-35B-A3B model (tuned for coding) and requires a Mac with 32GB+ of unified memory. Future work focuses on expanding model support and simplifying custom model imports. This release marks a major step toward efficient, powerful local LLM inference on Apple devices.


Original article

Illustration: Ollama standing beside a fast car that can serve as a daily driver and still win races, demonstrating high performance on Apple silicon.

Today, we’re previewing the fastest way to run Ollama on Apple silicon, powered by MLX, Apple’s machine learning framework.

This unlocks new performance to accelerate your most demanding work on macOS:

Fastest performance on Apple silicon, powered by MLX

Ollama on Apple silicon is now built on top of Apple’s machine learning framework, MLX, to take advantage of its unified memory architecture.

This results in a large speedup of Ollama on all Apple Silicon devices. On Apple’s M5, M5 Pro and M5 Max chips, Ollama leverages the new GPU Neural Accelerators to accelerate both time to first token (TTFT) and generation speed (tokens per second).

Prefill performance

Ollama 0.19: 1810 tokens/s; Ollama 0.18: 1154 tokens/s

Decode performance

Ollama 0.19: 112 tokens/s; Ollama 0.18: 58 tokens/s

Testing was conducted on March 29, 2026, comparing Alibaba’s Qwen3.5-35B-A3B model quantized to `NVFP4` on Ollama 0.19 against Ollama’s previous implementation quantized to `Q4_K_M` on Ollama 0.18. Ollama 0.19 reaches even higher performance when running with `int4` (1851 tokens/s prefill and 134 tokens/s decode).
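To put the throughput numbers above in perspective, a back-of-the-envelope sketch: prefill speed governs time to first token for a long prompt, while decode speed governs how fast the response streams. The workload sizes below are illustrative, not from the post.

```python
def latency_s(prompt_tokens, output_tokens, prefill_tps, decode_tps):
    """Approximate end-to-end latency, ignoring fixed overheads:
    prompt processing at prefill speed, generation at decode speed."""
    return prompt_tokens / prefill_tps + output_tokens / decode_tps

# Hypothetical coding request: 4,000-token prompt, 500-token answer.
t_019 = latency_s(4000, 500, 1810, 112)  # Ollama 0.19 on M5, ~6.7 s
t_018 = latency_s(4000, 500, 1154, 58)   # Ollama 0.18, ~12.1 s
```

At these speeds the same request finishes in roughly half the time on 0.19.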

NVFP4 support: higher quality responses and production parity

Ollama now leverages NVIDIA’s NVFP4 format to maintain model accuracy while reducing memory bandwidth and storage requirements for inference workloads.

As more inference providers scale inference with the NVFP4 format, Ollama users can expect the same results they would see in a production environment.

This also enables Ollama to run models optimized with NVIDIA’s model optimizer. Other precisions will be made available based on the design and usage intent of Ollama’s research and hardware partners.
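To give a feel for the format, here is a hedged sketch of block-scaled 4-bit quantization in the spirit of NVFP4, which stores 4-bit FP4 (E2M1) values with a per-block scale. This is an illustration of the idea, not the actual NVFP4 codec or Ollama's implementation; the block size and scale encoding here are simplified.

```python
# FP4 (E2M1) can represent these magnitudes; a signed grid covers both signs.
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
GRID = sorted({s * g for s in (1, -1) for g in FP4_GRID})

def quantize_block(block):
    """Pick a scale so the block's max magnitude maps to 6.0 (FP4's max),
    then snap each scaled value to the nearest grid point."""
    amax = max(abs(x) for x in block) or 1.0
    scale = amax / 6.0
    codes = [min(GRID, key=lambda g: abs(x / scale - g)) for x in block]
    return scale, codes

def dequantize_block(scale, codes):
    """Reconstruct approximate values: one multiply per element."""
    return [scale * c for c in codes]

weights = [0.1, -0.3, 0.25, 0.6]
scale, codes = quantize_block(weights)
recon = dequantize_block(scale, codes)
```

Each element costs 4 bits plus a shared per-block scale, which is where the memory-bandwidth and storage savings come from.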

Improved caching for more responsiveness

Ollama’s cache has been upgraded to make coding and agentic tasks more efficient.

  • Lower memory utilization: Ollama now reuses its cache across conversations, reducing memory use and increasing cache hits when conversations branch from a shared system prompt, as with tools like Claude Code.

  • Intelligent checkpoints: Ollama will now store snapshots of its cache at intelligent locations in the prompt, resulting in less prompt processing and faster responses.

  • Smarter eviction: shared prefixes survive longer even when older branches are dropped.
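The reuse described above can be sketched with a toy prefix cache. This is an illustration of the general technique (not Ollama's actual implementation): conversations that share a system prompt reuse the longest cached token prefix, so only the new suffix needs prompt processing.

```python
class PrefixCache:
    """Toy prefix cache: checkpoints are cached token prefixes."""

    def __init__(self):
        self.checkpoints = []

    def longest_prefix(self, tokens):
        """Find the longest checkpoint that is a prefix of `tokens`."""
        best = []
        for cp in self.checkpoints:
            if len(cp) > len(best) and tokens[:len(cp)] == cp:
                best = cp
        return best

    def process(self, tokens):
        """Return how many tokens still need prefill, then checkpoint."""
        reused = len(self.longest_prefix(tokens))
        self.checkpoints.append(list(tokens))
        return len(tokens) - reused

cache = PrefixCache()
system = ["<sys>", "coding", "assistant"]    # shared system prompt
cache.process(system)                        # checkpoint the shared prefix
c1 = cache.process(system + ["fix", "bug"])  # only 2 new tokens prefilled
c2 = cache.process(system + ["write", "tests"])  # branch also reuses prefix
```

Both branches pay only for their own suffix; the "smarter eviction" point means the shared `system` checkpoint is kept alive even when older branches are dropped.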

Get started

Download Ollama 0.19

This preview release of Ollama accelerates the new Qwen3.5-35B-A3B model, with sampling parameters tuned for coding tasks.

Please make sure you have a Mac with more than 32GB of unified memory.

Claude Code:

ollama launch claude --model qwen3.5:35b-a3b-coding-nvfp4

OpenClaw:

ollama launch openclaw --model qwen3.5:35b-a3b-coding-nvfp4

Chat with the model:

ollama run qwen3.5:35b-a3b-coding-nvfp4
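Beyond the CLI, the model can also be driven programmatically through Ollama's local HTTP API (`POST /api/chat` on port 11434). The sketch below only builds the request body; actually sending it requires a running `ollama serve` and an HTTP client of your choice.

```python
import json

def chat_payload(model, prompt, system=None):
    """Assemble a non-streaming request body for Ollama's /api/chat."""
    messages = []
    if system:
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": prompt})
    return {"model": model, "messages": messages, "stream": False}

body = json.dumps(chat_payload(
    "qwen3.5:35b-a3b-coding-nvfp4",
    "Write a binary search in Go.",
    system="You are a coding assistant.",
))
# POST `body` to http://localhost:11434/api/chat to get the model's reply.
```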

Future models

We are actively working to support future models. For users with custom models fine-tuned on supported architectures, we will introduce an easier way to import models into Ollama. In the meantime, we will expand the list of supported architectures.

Acknowledgments

Thank you to:

  • The MLX contributor team who built an incredible acceleration framework
  • NVIDIA contributors to NVFP4 quantization, NVFP4 model optimizer, MLX CUDA support, Ollama optimizations and testing
  • The GGML & llama.cpp team who built a thriving local framework and community
  • The Alibaba Qwen team for open-sourcing excellent models and their collaboration