# Running Gemma 4 locally with LM Studio's new headless CLI and Claude Code

Original link: https://ai.georgeliu.com/p/running-google-gemma-4-locally-with

## Gemma 4: powerful local LLM inference

Cloud AI APIs offer convenience, but they come with costs, privacy concerns, and limits. Running large language models (LLMs) locally is a compelling alternative for tasks like code review and testing, with zero API costs, data privacy, and constant availability. Google's Gemma 4, particularly the 26B-A4B model, is well suited to this thanks to its efficient mixture-of-experts (MoE) architecture. Because only a small fraction of the parameters activate during inference, it runs well on consumer hardware, reaching 51 tokens/second on a MacBook Pro with 48 GB of unified memory. Gemma 4 is a family of models, and the 26B-A4B strikes a balance between performance (close to the 31B dense model on benchmarks) and resource usage. The recent LM Studio update (v0.4.0) enables headless operation through a command-line interface, making local LLM serving far more flexible. Users can even alias Claude Code to run against Gemma 4 locally for offline coding assistance. While Gemma 4 is not a full replacement for cloud APIs, it offers a powerful and private solution for local inference, especially for focused tasks, and demonstrates the potential of MoE models for accessible AI.

## Local LLM inference gains ground: Gemma 4 and new tools

Recent advances have made running large language models (LLMs) locally far more practical and attractive. The release of Gemma 4, together with tools like LM Studio's headless CLI and Claude Code, is reshaping the landscape. Users report that local models are finally "good enough": no longer limited to simple demos, they can be integrated into real tools. A key development is the decoupling of coding agents (such as Claude Code and OpenCode) from the underlying models, which lets users switch easily between local and cloud models, combining the cost efficiency and privacy of local options with the power of cloud services. Discussions highlight the importance of low latency for effective tool use within these agents, with sub-300 ms response times considered critical; caching and efficient data handling are the key optimizations. While Claude Code remains popular, alternatives such as OpenCode and Pi are gaining traction, offering greater flexibility and compatibility with a range of backends. The ability to run models like Qwen 3.5 on consumer hardware, even with techniques like MoE and offloading to RAM, is broadening access to powerful AI capabilities. Performance can suffer significantly, however, from I/O bottlenecks caused by heavy reliance on RAM or disk swapping.

## Original article

Cloud AI APIs are great until they are not. Rate limits, usage costs, privacy concerns, and network latency all add up. For quick tasks like code review, drafting, or testing prompts, a local model that runs entirely on your hardware has real advantages: zero API costs, no data leaving your machine, and consistent availability.

Google’s Gemma 4 is interesting for local use because of its mixture-of-experts architecture. The 26B parameter model only activates 4B parameters per forward pass, which means it runs well on hardware that could never handle a dense 26B model. On my 14” MacBook Pro M4 Pro with 48 GB of unified memory, it fits comfortably and generates at 51 tokens per second, though in my experience there are significant slowdowns when it is used within Claude Code.

Google released Gemma 4 as a family of four models, not just one. The lineup spans a wide range of hardware targets:

The “E” models (E2B, E4B) use Per-Layer Embeddings to optimize for on-device deployment and are the only variants that support audio input (speech recognition and translation). The 31B dense model is the most capable, scoring 85.2% on MMLU Pro and 89.2% on AIME 2026.

Why I picked the 26B-A4B. The mixture-of-experts architecture is the key. It has 128 experts plus 1 shared expert, but only activates 8 experts (3.8B parameters) per token. A common rule of thumb estimates an MoE model's dense-equivalent quality as roughly sqrt(total × active) parameters, which puts this model around 10B effective. In practice, it delivers inference cost comparable to a 4B dense model with quality that punches well above that weight class. On benchmarks, it scores 82.6% on MMLU Pro and 88.3% on AIME 2026, close to the dense 31B (85.2% and 89.2%) while running dramatically faster.
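As a quick sanity check, the rule of thumb above can be computed directly (a sketch, using the headline 26B total / 4B active figures rather than the exact parameter counts):

```shell
# Rule-of-thumb dense-equivalent size for an MoE model:
#   effective ≈ sqrt(total_params × active_params)
# Back-of-the-envelope only; not an official formula.
moe_effective() {
  awk -v t="$1" -v a="$2" 'BEGIN { printf "%.1f\n", sqrt(t * a) }'
}
moe_effective 26 4   # Gemma 4 26B-A4B: prints 10.2 (billions of parameters)
```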

The chart below tells the story. It plots Elo score against total model size on a log scale for recent open-weight models with thinking enabled. The blue-highlighted region in the upper left is where you want to be: high performance, small footprint.

Gemma 4 26B-A4B (Elo ~1441) sits firmly in that zone, punching well above its 25.2B parameter weight. The 31B dense variant scores slightly higher (~1451) but is still remarkably compact. For context, models like Qwen 3.5 397B-A17B (~1450 Elo) and GLM-5 (~1457 Elo) need 100-600B total parameters to reach similar scores. Kimi-K2.5 (~1457 Elo) requires over 1,000B. The 26B-A4B achieves competitive Elo with a fraction of the parameters, which translates directly into lower memory requirements and faster local inference.

This is what makes MoE models transformative for local use. You do not need a cluster or a high-end GPU rig to run a model that competes with 400B+ parameter behemoths. A laptop with 48 GB of unified memory is enough.

For local inference on a 48 GB Mac, this is the sweet spot. The dense 31B would consume more memory and generate tokens slower because every parameter participates in every forward pass. The E4B is lighter but noticeably less capable. The 26B-A4B gives you 256K max context, vision support (useful for analyzing screenshots and diagrams), native function/tool calling, and reasoning with configurable thinking modes, all at 51 tokens/second on my hardware.

LM Studio has been a popular desktop app for running local models for a while. Version 0.4.0 changed the architecture fundamentally by introducing llmster, the core inference engine extracted from the desktop app and packaged as a standalone server.

The practical result: you can now run LM Studio entirely from the command line using the lms CLI. No GUI required. This makes it usable on headless servers, in CI/CD pipelines, SSH sessions, or just for developers who prefer staying in the terminal.

Key additions in 0.4.0:

Install the lms CLI with a single command:

# Linux/Mac
curl -fsSL https://lmstudio.ai/install.sh | bash

# Windows
irm https://lmstudio.ai/install.ps1 | iex

Then start the headless daemon:

lms daemon up

On macOS, update both inference runtimes:

lms runtime update llama.cpp
lms runtime update mlx

With the daemon running, download Google’s Gemma 4 26B model:

lms get google/gemma-4-26b-a4b

The CLI shows you the variant it will download (Q4_K_M quantization by default, 17.99 GB) and asks for confirmation:

   ↓ To download: model google/gemma-4-26b-a4b - 64.75 KB
   └─ ↓ To download: Gemma 4 26B A4B Instruct Q4_K_M [GGUF] - 17.99 GB

About to download 17.99 GB.

? Start download?
❯ Yes
  No
  Change variant selection

If you already have the model, the CLI tells you and shows the load command:

✔ Start download? yes
Model already downloaded. To use, run: lms load google/gemma-4-26b-a4b

List all downloaded models:

lms ls
You have 10 models, taking up 118.17 GB of disk space.

LLM                                   PARAMS     ARCH             SIZE         DEVICE
gemma-3-270m-it-mlx                   270m       gemma3_text      497.80 MB    Local
google/gemma-4-26b-a4b (1 variant)    26B-A4B    gemma4           17.99 GB     Local
gpt-oss-20b-mlx                       20B        gpt_oss          22.26 GB     Local
llama-3.2-1b-instruct                 1B         Llama            712.58 MB    Local
nvidia/nemotron-3-nano (1 variant)    30B        nemotron_h       17.79 GB     Local
openai/gpt-oss-20b (1 variant)        20B        gpt-oss          12.11 GB     Local
qwen/qwen3.5-35b-a3b (1 variant)      35B-A3B    qwen35moe        22.07 GB     Local
qwen2.5-0.5b-instruct-mlx             0.5B       Qwen2            293.99 MB    Local
zai-org/glm-4.7-flash (1 variant)     30B        glm4_moe_lite    24.36 GB     Local

EMBEDDING                               PARAMS    ARCH          SIZE        DEVICE
text-embedding-nomic-embed-text-v1.5              Nomic BERT    84.11 MB    Local

Worth noting: several of these models use mixture-of-experts architectures (Gemma 4, Qwen 3.5, GLM 4.7 Flash). MoE models punch above their weight for local inference because only a fraction of parameters activate per token.

Start a chat session with stats enabled to see performance numbers:

lms chat google/gemma-4-26b-a4b --stats
 ╭─────────────────────────────────────────────────╮
 │ 👾 lms chat                                     │
 │ Type exit or Ctrl+C to quit                     │
 │                                                 │
 │ Chatting with google/gemma-4-26b-a4b            │
 │                                                 │
 │ Try one of the following commands:              │
 │ /model - Load a model (type /model to see list) │
 │ /download - Download a model                    │
 │ /clear - Clear the chat history                 │
 │ /help - Show help information                   │
 ╰─────────────────────────────────────────────────╯

With --stats, you get prediction metrics after each response:

Prediction Stats:
  Stop Reason: eosFound
  Tokens/Second: 51.35
  Time to First Token: 1.551s
  Prompt Tokens: 39
  Predicted Tokens: 176
  Total Tokens: 215

51 tokens/second on a 14” MacBook Pro M4 Pro (48 GB) with a 26B model is solid. Time to first token at 1.5 seconds is responsive enough for interactive use.
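Those stats also imply the end-to-end latency of the response. A quick check (my own arithmetic, not from the LM Studio output): decode time is predicted tokens divided by tokens/second, plus the time to first token.

```shell
# End-to-end latency implied by the stats above:
# 176 predicted tokens at 51.35 tok/s, plus 1.551 s time to first token.
awk 'BEGIN { printf "%.1f s\n", 176 / 51.35 + 1.551 }'   # prints 5.0 s
```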

See what is currently loaded:

lms ps
IDENTIFIER                MODEL                     STATUS    SIZE        CONTEXT    PARALLEL    DEVICE    TTL
google/gemma-4-26b-a4b    google/gemma-4-26b-a4b    IDLE      17.99 GB    48000      2           Local     60m / 1h

The model occupies 17.99 GB in memory with a 48K context window and supports 2 parallel requests. The TTL (time-to-live) auto-unloads the model after 1 hour of idle time, freeing memory without manual intervention.

For detailed model metadata, pipe through jq:

lms ps --json | jq

Key fields from the JSON output:

Before loading a model, you can estimate memory requirements at different context lengths using --estimate-only. I wrote a small script (included later in this section) to test across the full range.

The base model takes about 17.6 GiB regardless of context. Each doubling of context length adds roughly 3-4 GiB. At the default 48K context, you need about 21 GiB. On my 48 GB MacBook Pro, I can push to the full 256K context at 37.48 GiB and still have about 10 GB free for the OS and other apps. A 36 GB Mac could comfortably run 200K context with headroom.

The estimation command is straightforward:

lms load google/gemma-4-26b-a4b --estimate-only --context-length 48000
Model: google/gemma-4-26b-a4b
Context Length: 48,000
Estimated GPU Memory:   21.05 GiB
Estimated Total Memory: 21.05 GiB

Estimate: This model may be loaded based on your resource guardrails settings.

This is useful for capacity planning. If you want to run Gemma 4 alongside other applications, check the estimate at your target context length first.

Here is the full script I used to generate the numbers above. You can swap in any model name and context length list to profile a different model:

#!/usr/bin/env bash
# Profile lms memory estimates across a range of context lengths
# and print the results as a markdown table.

model="google/gemma-4-26b-a4b"
contexts=(4096 8000 16000 24000 32000 48000 64000 96000 128000 200000 256000)

table_contexts=()
table_gpu=()
table_total=()

for ctx in "${contexts[@]}"; do
  # --estimate-only reports memory needs without actually loading the model
  output="$(lms load "$model" --estimate-only --context-length "$ctx" 2>&1)"

  # Pull the reported figures out of the human-readable output
  parsed_context="$(printf '%s\n' "$output" | awk -F': ' '/^Context Length:/ {print $2; exit}')"
  parsed_gpu="$(printf '%s\n' "$output" | awk -F': +' '/^Estimated GPU Memory:/ {print $2; exit}')"
  parsed_total="$(printf '%s\n' "$output" | awk -F': +' '/^Estimated Total Memory:/ {print $2; exit}')"

  table_contexts+=("${parsed_context:-$ctx}")
  table_gpu+=("${parsed_gpu:-N/A}")
  table_total+=("${parsed_total:-N/A}")
done

printf '| Model | Context Length | GPU Memory | Total Memory |\n'
printf '|---|---:|---:|---:|\n'
for i in "${!table_contexts[@]}"; do
  printf '| %s | %s | %s | %s |\n' \
    "$model" "${table_contexts[$i]}" "${table_gpu[$i]}" "${table_total[$i]}"
done

The default lms load or lms chat commands pick reasonable defaults, but you can tune several parameters to match your specific hardware and use case. Here is a practical decision framework.

The memory table above is your starting point. Subtract the OS overhead (macOS typically uses 4-6 GB) from your total memory, then find the largest context length that fits.

Load with a specific context length:

lms load google/gemma-4-26b-a4b --context-length 128000

If you are unsure, always run --estimate-only first. It accounts for flash attention and vision model overhead in its calculation.

On Apple Silicon, the unified memory architecture means CPU and GPU share the same memory pool, so --gpu mostly controls how much computation runs on the GPU versus CPU cores. The default auto setting works well, but you can force full GPU offloading:

lms load google/gemma-4-26b-a4b --gpu=1.0

Use --gpu=max to offload everything possible. On discrete GPU systems (Linux/Windows with NVIDIA cards), this becomes more important because GPU VRAM and system RAM are separate. If your model does not fit entirely in VRAM, partial offloading (--gpu=0.5) splits layers between GPU and CPU, trading some speed for the ability to run larger models.

LM Studio supports concurrent inference through continuous batching, where multiple requests are dynamically combined into a single computation batch. This is useful when serving the model to multiple clients or running parallel tool calls. The feature requires the llama.cpp runtime (v2.0.0+) and is not yet available for the MLX backend.

Configure it through the GUI: open the model loader, toggle Manually choose model load parameters, select a model, then toggle Show advanced settings to set Max Concurrent Predictions (defaults to 4). There is no CLI flag for this setting; it is configured through the desktop app or per-model defaults.

Each parallel slot consumes additional memory proportional to the context length, so on memory-constrained systems, reduce the parallel count or lower the context length to compensate. With Gemma 4 on 48 GB, 2 parallel slots at 48K context is a good balance.

The time-to-live setting automatically unloads models after a period of inactivity, freeing memory:

lms load google/gemma-4-26b-a4b --ttl 1800

That sets a 30-minute idle timeout (value is in seconds). The default is 3600 seconds (1 hour). For shared server setups where multiple models might be needed, shorter TTLs help cycle between models without manual lms unload commands. Set TTL to 0 or -1 to disable auto-unloading.

If you always load Gemma 4 with the same settings, save them as per-model defaults through the desktop app. Navigate to My Models, click the gear icon next to the model, and configure your preferred GPU offloading, context size, and flash attention settings. These defaults apply everywhere, including when loading via lms load from the CLI.

LM Studio supports speculative decoding for dense models, which pairs your main model with a smaller “draft” model to speed up generation. The draft model proposes tokens quickly, and the main model verifies them in batch, which is faster than generating each token independently.

However, speculative decoding is problematic for MoE models like Gemma 4 26B-A4B. During verification, the main model must load the union of all experts activated across all speculative tokens. Since different tokens route to different experts, this blows up memory bandwidth usage and can actually slow things down. Benchmarks on Mixtral showed a 39% speedup on code but a 54% slowdown on math with the same settings, meaning no single configuration works reliably. This is an active research area with approaches like MoE-Spec (expert budgeting) and SP-MoE (expert prefetching) working to solve it, and some newer MoE architectures like Qwen 3.5’s hybrid design are more amenable to speculative approaches. For now, skip speculative decoding with Gemma 4 26B-A4B and rely on its already-fast MoE inference instead.

Flash attention is an optimization that reduces memory usage for the KV cache during inference, letting you fit longer context lengths in the same memory. It is available per-model in LM Studio’s settings. For Gemma 4 on Apple Silicon, enabling flash attention can reduce memory usage at higher context lengths by a meaningful margin. The --estimate-only flag accounts for flash attention in its calculations, so check estimates with and without to see the difference.

Everything above used the headless CLI, but LM Studio also ships a full macOS desktop app. The GUI is useful for visual monitoring and quick experiments before committing to a CLI workflow.

The screenshot below shows the desktop app’s server view with Gemma 4 loaded. A few things worth noting:

http://192.168.1.121:1234

The desktop app also supports Gemma 4’s vision capabilities. In the screenshot below, you can see the model analyzing an image of the Timezone Scheduler promotional graphic. It correctly identifies the title, world map with timezone color bars, the schedule grid comparing Brisbane/New York/London, feature badges, and the tech stack icons at the bottom. It generated 504 tokens at 54.51 tok/sec with a 3.15s time to first token.

Claude Code alias claude-lm with Google Gemma 4 analysing my Timezones Scheduler benchmark comparison GitHub repository.

The system monitor overlay in the screenshots tells the real story of what local inference looks like on hardware. On my M4 Pro (4 E-Cores + 10 P-Cores, 20 GPU-Cores):

This is what makes Apple Silicon compelling for local LLM work. The unified memory architecture means the CPU and GPU share the same memory pool, so there is no copying data between separate CPU RAM and GPU VRAM like on discrete GPU setups. The model loads once into unified memory and both the CPU and GPU access it directly.

Once a model is loaded, start the local server:

lms server start

This exposes an OpenAI-compatible API at http://localhost:1234/v1. Any tool that works with OpenAI’s API format (Continue, Cursor, custom scripts) can point at your local server instead. LM Studio 0.4.0 also added an Anthropic-compatible endpoint at POST /v1/messages, which means tools that speak the Anthropic protocol can connect directly without an adapter. You can change the port with lms server start --port 8080 if 1234 conflicts with something else.
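To see the endpoint in action, here is a minimal curl request in the standard OpenAI chat completions format (a sketch: it assumes the server is running on the default port 1234 with Gemma 4 loaded, and falls through to a notice otherwise):

```shell
# Minimal request against the local OpenAI-compatible endpoint.
payload='{
  "model": "google/gemma-4-26b-a4b",
  "messages": [{"role": "user", "content": "Explain MoE routing in two sentences."}],
  "max_tokens": 200
}'
curl -s http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "$payload" || echo "LM Studio server not reachable on port 1234"
```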

The server also supports JIT (Just-In-Time) model loading: if a client requests a model that is not currently loaded, LM Studio can auto-load it on demand and auto-unload it after the TTL expires. This is useful for serving multiple models without keeping them all in memory.

To monitor what the server is doing in real time, stream the logs:

lms log stream --source model --stats

This shows each request’s input/output along with tokens/second and latency. For a machine-readable feed, add --json. You can also filter to just server-level events (startup, endpoint hits) with --source server.

Combined with the headless daemon, you can run this on a dedicated machine and serve models across your network. The server is reachable at your machine’s local IP (e.g., http://192.168.1.121:1234), so other devices on the same network can use it as a shared inference endpoint. If you need access control, enable Require Authentication in server settings and generate API tokens with per-token permissions, accessed via the standard Authorization: Bearer $LM_API_TOKEN header.

The Anthropic-compatible endpoint opens up an interesting use case: running Claude Code against a local model instead of the Anthropic API. This means fully offline, zero-cost coding assistance with no data leaving your machine.

I set up a shell function in ~/.zshrc called claude-lm that configures all the necessary environment variables and launches Claude Code pointed at the local LM Studio server:

claude-lm() {
    export ANTHROPIC_BASE_URL=http://localhost:1234
    export ANTHROPIC_AUTH_TOKEN=lmstudio
    export CLAUDE_CODE_MAX_TOOL_USE_CONCURRENCY="2"
    export CLAUDE_CODE_NO_FLICKER="0"
    export ANTHROPIC_MODEL="gemma-4-26b-a4b"
    export CLAUDE_CODE_AUTO_COMPACT_WINDOW="48000"
    export CLAUDE_AUTOCOMPACT_PCT_OVERRIDE="90"
    export ANTHROPIC_DEFAULT_OPUS_MODEL="google/gemma-4-26b-a4b"
    export ANTHROPIC_DEFAULT_SONNET_MODEL="google/gemma-4-26b-a4b"
    export ANTHROPIC_DEFAULT_HAIKU_MODEL="google/gemma-4-26b-a4b"
    export CLAUDE_CODE_SUBAGENT_MODEL="google/gemma-4-26b-a4b"
    export API_TIMEOUT_MS="30000000"
    export BASH_DEFAULT_TIMEOUT_MS="2400000"
    export BASH_MAX_TIMEOUT_MS="2500000"
    export CLAUDE_CODE_MAX_OUTPUT_TOKENS="8000"
    export CLAUDE_CODE_FILE_READ_MAX_OUTPUT_TOKENS="8000"
    export CLAUDE_CODE_ATTRIBUTION_HEADER="0"
    export CLAUDE_CODE_DISABLE_1M_CONTEXT="1"
    export CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING="1"
    claude "$@"
}

What the key variables do:

After adding this to ~/.zshrc and running source ~/.zshrc, you can start a fully local Claude Code session with:

claude-lm

It works like normal Claude Code but every request stays on your machine. The trade-off is speed: Gemma 4 at 51 tok/sec is noticeably slower than the Anthropic API for large code generation tasks, but for code review, small edits, and exploration it is perfectly usable.
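To make switching back painless, a companion helper can clear the overrides so the next session talks to the Anthropic cloud again. This is my own addition, not from the post; it simply unsets the routing variables that claude-lm exports:

```shell
# Hypothetical companion to claude-lm: clear the local-server overrides
# so the next `claude` session uses the Anthropic cloud API again.
claude-cloud() {
  unset ANTHROPIC_BASE_URL ANTHROPIC_AUTH_TOKEN ANTHROPIC_MODEL \
        ANTHROPIC_DEFAULT_OPUS_MODEL ANTHROPIC_DEFAULT_SONNET_MODEL \
        ANTHROPIC_DEFAULT_HAIKU_MODEL CLAUDE_CODE_SUBAGENT_MODEL
  claude "$@"
}
```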

MoE models are the sweet spot for local inference. Gemma 4's 26B-A4B architecture (26B total, 4B active) delivers roughly 10B-dense-equivalent quality at 4B inference cost. Look for similar MoE models when choosing what to run locally.

The headless daemon changes the workflow. Before 0.4.0, LM Studio required the desktop app open. Now lms daemon up runs in the background and you interact entirely through the CLI or API. This makes it practical for server deployments and SSH sessions.

Context length is the main memory variable. The model itself takes a fixed ~17.6 GiB. Context scaling is roughly linear, so you can pick exactly the trade-off you want between context window and available memory.
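As an illustration of that linearity, here is a back-of-the-envelope interpolation (my own sketch, fit to the two figures reported earlier: 21.05 GiB at 48K context and 37.48 GiB at 256K):

```shell
# Rough memory estimate for Gemma 4 26B-A4B at a given context length,
# linearly interpolated between two measured points (48K -> 21.05 GiB,
# 256K -> 37.48 GiB). Illustration only; use `lms load --estimate-only`
# for real numbers.
estimate_gib() {
  awk -v ctx_k="$1" 'BEGIN {
    slope = (37.48 - 21.05) / (256 - 48)  # GiB per 1K tokens of context
    base  = 21.05 - slope * 48            # weights plus fixed overhead
    printf "%.1f\n", base + slope * ctx_k
  }'
}
estimate_gib 128   # estimate at 128K context: prints 27.4
```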

--estimate-only prevents surprises. Always check memory estimates before loading a large model at an aggressive context length. It takes a second and saves you from OOM situations.

The Anthropic-compatible endpoint is a game changer. Being able to point Claude Code at a local model with a shell alias means you can switch between cloud and local inference depending on the task. Privacy-sensitive code review, offline work, or just saving API costs on exploratory sessions all benefit.

Gemma 4 does not identify itself by name in lms chat. When asked “what model are you?”, it responds generically as “an AI assistant.” This is a minor limitation of how LM Studio handles system prompts, not a Gemma issue. You can override this with a custom system prompt.

The default 48K context is conservative for a model that supports 256K. If you have the memory, it is worth loading with a higher context length for tasks like long document analysis or multi-file code review.

Running Claude Code with a local model is not a drop-in replacement for the Anthropic API. Complex multi-step tasks that rely on Claude’s extended thinking or very large context windows will hit limitations. The local setup works best for focused, single-file tasks where the 48K context window is sufficient.

Memory pressure on a 48 GB machine with Gemma 4 loaded is real. The system used 46.69 GB out of 48 GB with 27.49 GB of swap during the test. If you run memory-hungry applications alongside the model, expect some swap thrashing. A 64 GB or higher configuration would be more comfortable for sustained use.

I am testing other local models alongside Gemma 4 for different use cases: Qwen 3.5 35B for coding tasks, GLM 4.7 Flash for quick drafting, and Nemotron 3 Nano for structured extraction. A comparison post covering where each model performs best is in the pipeline.

If you want to try this setup:

If you’re interested in practical AI building for web apps, developer workflows, and infrastructure, subscribe for future posts. You can also follow my shorter updates on Threads (@george_sl_liu) and Bluesky (@georgesl.bsky.social).

Buy Me A Coffee
