如何在 macOS 上设置本地编程代理

如何在 macOS 上设置本地编程代理
How to setup a local coding agent on macOS

原始链接: https://ikyle.me/blog/2026/how-to-setup-a-local-coding-agent-on-macos

为了提升 macOS 上的编程智能体性能，作者通过 `llama.cpp` 结合 Metal 加速，对本地环境中的 **Gemma 4 26B**（GGUF 格式）模型进行了优化。通过集成**多 Token 预测（MTP）草稿模型**，生成速度从 58.2 token/s 提升至 72.2 token/s（提升 24%），表现优于原生 MLX 实现。 **关键组成：** * **引擎：** `llama.cpp`（构建时支持 Metal/Accelerate）。 * **模型：** Gemma 4 26B-A4B（Q4 量化）搭配 Q8 MTP 草稿头。 * **优化：** 在 M1 Max 上，使用 `--spec-draft-n-max 3` 可达到最佳速度。 * **功能：** 集成的多模态投影仪支持截图分析，兼容 OpenAI 的 `llama-server` 可实现与“Pi”编程智能体的无缝衔接。作者指出，虽然 Qwen 3.6 35B 等替代模型在编程逻辑上更出色，但 Gemma 4 + MTP 的配置在日常智能体工作流中仍是更快速、更具响应性的选择。文中还提供了详细说明，包括用于自动化的 Shell 脚本包装器和 Pi 的配置方案，以实现开箱即用的本地开发体验。

Hacker News 新内容 | 往期 | 评论 | 提问 | 展示 | 招聘 | 提交登录如何在 macOS 上设置本地编程代理 (ikyle.me) 16 点，由 kkm 发布于 56 分钟前 | 隐藏 | 往期 | 收藏 | 2 条评论 c-hendricks 2 分钟前 | 下一条 [-] 如果你只是使用 llama.cpp，其实没必要专门用 huggingface-cli 来下载任何东西。你可以直接传入 `-hf ...` 参数，它会自动为你下载模型。设置 `LLAMA_CACHE` 可以更改下载路径： LLAMA_CACHE="models" ./llama-server \ -hf unsloth/gemma-4-31B-it-GGUF:UD-Q4_K_XL \ ... 回复 cdolan 17 分钟前 | 上一条 [-] 有视频链接吗？我访问页面时没有显示。对其实时反馈的感受很感兴趣。回复指南 | 常见问题 | 列表 | API | 安全 | 法律 | 申请 YC | 联系搜索：

原文

I'd had my internet fail a few times recently leaving me stranded without a coding agent, and so when I saw the "Gemma 4 now runs 2x faster with MTP" Multi-Token Prediction update for Gemma 4 I decided to have a go at getting it running.

I wanted a local coding agent setup that:

was fast enough to actually use on my Mac
worked through an OpenAI compatible API (so I could use it in other tools)
and preferably could handle screenshots/images when needed, so I can feed it screenshots of what it has made.

And I did! This video is realtime. And shows the agent responding at a perfectly usable speed.

After a bit of testing the final setup I ended up with is:

llama.cpp built with Metal on macOS
Gemma 4 26B-A4B in GGUF format
A Q8 MTP draft model for speculative decoding
The Gemma 4 multimodal projector
Pi as the terminal coding agent

This was tested on an Apple M1 Max with 64 GB unified memory, running macOS 15.7.7.

The main model is: gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf.

Link on Huggingface: models/unsloth-gemma-4-26B-A4B-it-GGUF/gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf

That file is about 16 GB. With the MTP draft head and multimodal projector the model folder is about 17 GB.

The benchmark prompt was:

Write a compact Python function that parses a unified diff and returns the changed file paths. Then explain two edge cases.

Each benchmark generated about 128 tokens.

First I ran the main model directly through llama.cpp with Metal acceleration:

repos/llama.cpp/build/bin/llama-cli \
  -m models/unsloth-gemma-4-26B-A4B-it-GGUF/gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf \
  -ngl 999 \
  -fa on \
  -c 4096 \
  -n 128

Result:

Setup	Prompt tok/s	Generation tok/s
Gemma 4 26B-A4B Q4, llama.cpp Metal	298.0	58.2

58 tokens/second is not fast, but is usable, but for coding-agent work you want it to be as fast as possible, especially when the agent is making many tool calls.

Gemma 4 now has the MTP draft model available:

MTP/gemma-4-26B-A4B-it-Q8_0-MTP.gguf

This can be loaded by llama.cpp as a speculative draft model:

repos/llama.cpp/build/bin/llama-cli \
  -m models/unsloth-gemma-4-26B-A4B-it-GGUF/gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf \
  --model-draft models/unsloth-gemma-4-26B-A4B-it-GGUF/MTP/gemma-4-26B-A4B-it-Q8_0-MTP.gguf \
  --spec-type draft-mtp \
  --spec-draft-n-max 3 \
  -ngl 999 \
  -fa on \
  -c 4096 \
  -n 128

The first run with MTP came in at 69.2 tokens/second using 4 draft tokens. However, Unsloth's guide on How to Run MTP Models includes this note:

"We found --spec-draft-n-max 2 is the best starting point however, do not assume 2 is optimal, as performance is hardware-dependent. Try any value from 1 through 6 and use whichever is fastest for your system."

After sweeping --spec-draft-n-max, the best result was 72.2 tokens/second with 3 draft tokens.

Setup	Prompt tok/s	Generation tok/s	Speedup
Main model only	298.0	58.2	1.00x
Main model + Q8 MTP draft	295.6	72.2	1.24x

The useful part is that prompt processing stayed basically the same, while generation improved by about 24%.

I tested --spec-draft-n-max values from 1 to 6.

`--spec-draft-n-max`	Prompt tok/s	Generation tok/s
1	295.5	68.4
2	299.1	72.0
3	295.6	72.2
4	297.3	70.7
5	297.9	63.7
6	296.3	61.2

On my M1 Max machine, 3 was the fastest, with 2 close enough that either would be fine. Values above that got slower.

I also tested MLX models through mlx-lm, to find out which is the faster way to run the model on a Mac, llama.cpp or mlx.

Runtime	Model	Generation tok/s
llama.cpp Metal + MTP	Unsloth GGUF Q4 + Q8 MTP	72.2
llama.cpp Metal	Unsloth GGUF Q4	58.2
MLX-LM	Unsloth UD MLX 4-bit	45.8
MLX-LM	mlx-community 4-bit	43.9
MLX-LM	mlx-community OptiQ 4-bit	38.1

I thought MLX (being optimised for the Mac) would be fastest.
However, for this specific setup, llama.cpp was faster than MLX, and llama.cpp with MTP was clearly the best option.

I guess all the effort and tweaking which has gone into llama.cpp over time means it quite well optimised fr macOS despite being cross platform.

I also tried Gemma 4 MTP through gemma-4-swift-mlx, but the tested 26B 4-bit MLX checkpoints did not match the loader's expected weight keys, and I already had the previous MLX tests, so moved on rather than redownload new models and try to tweak things to match.

For Pi, I also wanted to be able to attach screenshots. The local model entry I setup for it originally declared the model as text-only:

That meant Pi did not send image tool output through to the model properly.

The llama.cpp server also needs the Gemma 4 multimodal projector in order for the multi-modal part to work (only the 12B is natively multi-modal):

When loaded with --mmproj, llama.cpp advertises multimodal support, and Pi can send images.

I re-ran the text benchmark with the projector loaded, just to check it didn't change the speed:

Setup	Projector	Prompt tok/s	Generation tok/s
llama.cpp Metal + MTP	none	120.3	71.4
llama.cpp Metal + MTP	`mmproj-BF16.gguf`	297.4	72.2

The final run with the projector did not show a text-generation slowdown.

Now for setup instructions:

Install dependencies:

brew install cmake git tmux [email protected]

Clone and build llama.cpp:

mkdir -p ~/Developer/ML-Models/Gemma4/repos
cd ~/Developer/ML-Models/Gemma4

git clone https://github.com/ggml-org/llama.cpp repos/llama.cpp

cd repos/llama.cpp
cmake -B build \
  -DCMAKE_BUILD_TYPE=Release \
  -DGGML_METAL=ON \
  -DGGML_ACCELERATE=ON

cmake --build build --config Release -j

The build I tested had:

GGML_METAL=ON
GGML_ACCELERATE=ON
GGML_BLAS=ON
GGML_BLAS_VENDOR=Apple

Create a Python environment:

cd ~/Developer/ML-Models/Gemma4
python3.11 -m venv .venv
source .venv/bin/activate
pip install -U huggingface_hub hf_xet

Download the files:

mkdir -p models/unsloth-gemma-4-26B-A4B-it-GGUF

huggingface-cli download unsloth/gemma-4-26B-A4B-it-GGUF \
  gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf \
  mmproj-BF16.gguf \
  MTP/gemma-4-26B-A4B-it-Q8_0-MTP.gguf \
  --local-dir models/unsloth-gemma-4-26B-A4B-it-GGUF

You should end up with:

models/unsloth-gemma-4-26B-A4B-it-GGUF/
  gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf
  mmproj-BF16.gguf
  MTP/gemma-4-26B-A4B-it-Q8_0-MTP.gguf

This is the final server command:

repos/llama.cpp/build/bin/llama-server \
  -m models/unsloth-gemma-4-26B-A4B-it-GGUF/gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf \
  --model-draft models/unsloth-gemma-4-26B-A4B-it-GGUF/MTP/gemma-4-26B-A4B-it-Q8_0-MTP.gguf \
  --mmproj models/unsloth-gemma-4-26B-A4B-it-GGUF/mmproj-BF16.gguf \
  --spec-type draft-mtp \
  --spec-draft-n-max 3 \
  -ngl 999 \
  -fa on \
  -c 65536 \
  --parallel 1 \
  --host 127.0.0.1 \
  --port 8080

The OpenAI-compatible endpoint is:

I used a small start_server.sh wrapper so it runs inside tmux:

#!/usr/bin/env bash
set -euo pipefail

ROOT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
SESSION_NAME="${SESSION_NAME:-gemma4-server}"
HOST="${HOST:-127.0.0.1}"
PORT="${PORT:-8080}"
CTX_SIZE="${CTX_SIZE:-65536}"
PARALLEL="${PARALLEL:-1}"

LLAMA_SERVER="$ROOT_DIR/repos/llama.cpp/build/bin/llama-server"
MODEL="$ROOT_DIR/models/unsloth-gemma-4-26B-A4B-it-GGUF/gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf"
DRAFT_MODEL="$ROOT_DIR/models/unsloth-gemma-4-26B-A4B-it-GGUF/MTP/gemma-4-26B-A4B-it-Q8_0-MTP.gguf"
MMPROJ="$ROOT_DIR/models/unsloth-gemma-4-26B-A4B-it-GGUF/mmproj-BF16.gguf"
LOG_FILE="$ROOT_DIR/logs/llama-server-mtp.log"

mkdir -p "$ROOT_DIR/logs"

tmux new-session -d -s "$SESSION_NAME" -c "$ROOT_DIR" \
  "$LLAMA_SERVER \
    -m '$MODEL' \
    --model-draft '$DRAFT_MODEL' \
    --mmproj '$MMPROJ' \
    --spec-type draft-mtp \
    --spec-draft-n-max 3 \
    -ngl 999 \
    -fa on \
    -c '$CTX_SIZE' \
    --parallel '$PARALLEL' \
    --host '$HOST' \
    --port '$PORT' \
    2>&1 | tee -a '$LOG_FILE'"

Start it:

chmod +x start_server.sh
./start_server.sh

Check that the server is running:

curl http://127.0.0.1:8080/v1/models

Pi reads model providers from:

Add a local provider:

{
  "providers": {
    "gemma4-local": {
      "name": "Gemma 4 Local",
      "baseUrl": "http://127.0.0.1:8080/v1",
      "api": "openai-completions",
      "apiKey": "local",
      "authHeader": false,
      "compat": {
        "supportsDeveloperRole": false,
        "supportsReasoningEffort": false
      },
      "models": [
        {
          "id": "gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf",
          "name": "Gemma 4 26B-A4B Q4 + MTP",
          "reasoning": false,
          "input": ["text", "image"],
          "contextWindow": 65536,
          "maxTokens": 8192,
          "cost": {
            "input": 0,
            "output": 0,
            "cacheRead": 0,
            "cacheWrite": 0
          }
        }
      ]
    }
  }
}

The important pieces are:

baseUrl points to the llama.cpp OpenAI-compatible server.
api is openai-completions.
authHeader is false, because this is a local server.
input includes both text and image, otherwise Pi treats it as text-only.

Optionally make it the default in:

~/.pi/agent/settings.json

{
  "defaultProvider": "gemma4-local",
  "defaultModel": "gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf",
  "defaultThinkingLevel": "minimal"
}

Then check Pi can see it:

pi --offline --list-models gemma

Expected:

provider      model                               context  max-out  thinking  images
gemma4-local  gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf  65.5K    8.2K     no        yes

Run Pi using the local model:

pi --provider gemma4-local --model gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf

Or use non-interactive mode:

pi -p --provider gemma4-local --model gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf \
  "Explain what this repository does"

For screenshots:

pi -p @"/path/to/screenshot.png" "Describe this image and point out anything relevant to the UI"

The final local coding-agent stack was:

Layer	Choice
Inference runtime	llama.cpp
macOS acceleration	Metal + Accelerate
Main model	`gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf`
Draft model	`gemma-4-26B-A4B-it-Q8_0-MTP.gguf`
MTP setting	`--spec-draft-n-max 3`
Multimodal projector	`mmproj-BF16.gguf`
Server	`llama-server` on `127.0.0.1:8080`
API	OpenAI-compatible `/v1`
Coding agent	Pi
Pi model input	`["text", "image"]`

The main conclusion was that the MTP draft model is worth using. On this machine it took Gemma 4 from 58.2 tokens/second to 72.2 tokens/second, while keeping the setup simple enough to run as a local OpenAI-compatible server.

P.S: Some suggested using Qwen3.6 35B-A3B instead of Gemma 4 26B-A4B. According to the benchmarks I can find, Qwen is a much better coding agent than Gemma 4.
However, it is also slower. Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf + unsloth-Qwen3.6-35B-A3B-MTP-GGUF + mmproj-BF16.gguf results in 55 tk/s, instead of 72 tk/s. Which is quite significant when you are sitting waiting for it.

Download the models:

mkdir -p models/unsloth-Qwen3.6-35B-A3B-MTP-GGUF

huggingface-cli download unsloth/Qwen3.6-35B-A3B-MTP-GGUF \
  Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf \
  mmproj-BF16.gguf \
  --local-dir models/unsloth-Qwen3.6-35B-A3B-MTP-GGUF

Start the server:

LLAMA_SERVER=/Users/kylehowells/Developer/ML-Models/Gemma4/repos/llama.cpp/build/bin/llama-server

$LLAMA_SERVER \
  -m models/unsloth-Qwen3.6-35B-A3B-MTP-GGUF/Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf \
  --mmproj models/unsloth-Qwen3.6-35B-A3B-MTP-GGUF/mmproj-BF16.gguf \
  --spec-type draft-mtp \
  --spec-draft-n-max 3 \
  -ngl 999 \
  -fa on \
  -c 65536 \
  --parallel 1 \
  --host 127.0.0.1 \
  --port 8081

Pi Config:

{
  "providers": {
    "qwen36-local": {
      "name": "Qwen3.6 Local",
      "baseUrl": "http://127.0.0.1:8081/v1",
      "api": "openai-completions",
      "apiKey": "local",
      "authHeader": false,
      "compat": {
        "supportsDeveloperRole": false,
        "supportsReasoningEffort": false
      },
      "models": [
        {
          "id": "Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf",
          "name": "Qwen3.6 35B-A3B Q4 + MTP",
          "reasoning": true,
          "input": ["text", "image"],
          "contextWindow": 65536,
          "maxTokens": 8192,
          "cost": {
            "input": 0,
            "output": 0,
            "cacheRead": 0,
            "cacheWrite": 0
          }
        }
      ]
    }
  }
}

如何在 macOS 上设置本地编程代理 How to setup a local coding agent on macOS

如何在 macOS 上设置本地编程代理
How to setup a local coding agent on macOS