Pool spare GPU capacity to run LLMs at larger scale. Models that don't fit on one machine are automatically distributed — dense models via pipeline parallelism, MoE models via expert sharding with zero cross-node inference traffic. Have your agents gossip across the mesh — share status, findings, and questions without a central server.
Try it now — live console connected to a public mesh. Chat with models running on real hardware.
```
curl -fsSL https://github.com/michaelneale/mesh-llm/releases/latest/download/mesh-bundle.tar.gz | tar xz && mv mesh-bundle/* ~/.local/bin/
```

Then run:

```
mesh-llm --auto   # join the best public mesh, start serving
```

That's it. It downloads a model for your hardware, connects to other nodes, and gives you an OpenAI-compatible API at http://localhost:9337.
Or start your own:
```
mesh-llm --model Qwen2.5-32B   # downloads model (~20GB), starts API + web console
mesh-llm --model Qwen2.5-3B    # or a small model first (~2GB)
```

Add another machine:

```
mesh-llm --join <token>        # token printed by the first machine
```

Or discover and join public meshes:

```
mesh-llm --auto                # find and join the best mesh
mesh-llm --client --auto       # join as API-only client (no GPU)
```

Every node gets an OpenAI-compatible API at http://localhost:9337/v1. Distribution is automatic — you just say `mesh-llm --model X` and the mesh figures out the best strategy:
- Model fits on one machine? → runs solo, full speed, no network overhead
- Dense model too big? → pipeline parallelism — layers split across nodes
- MoE model too big? → expert parallelism — experts split across nodes, zero cross-node traffic
If a node has enough VRAM, it always runs the full model. Splitting only happens when it has to.
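The fit-solo-first decision above can be sketched as a small function. This is an illustrative sketch only; the names and structure are assumptions, not mesh-llm's actual internals.

```python
# Hypothetical sketch of the placement decision: run solo whenever any node
# fits the whole model, otherwise split by architecture.

def choose_strategy(model_vram_gb: float, is_moe: bool, node_vram_gb: list[float]) -> str:
    if any(v >= model_vram_gb for v in node_vram_gb):
        return "solo"            # full model on one node, no network overhead
    if is_moe:
        return "expert-split"    # experts sharded, zero cross-node inference traffic
    return "pipeline-split"      # dense layers distributed across nodes

print(choose_strategy(20, False, [24, 8]))   # fits on the 24GB node -> solo
print(choose_strategy(47, False, [24, 24]))  # dense, too big anywhere -> pipeline-split
print(choose_strategy(47, True,  [24, 24]))  # MoE, too big anywhere -> expert-split
```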
Pipeline parallelism — for dense models that don't fit on one machine, layers are distributed across nodes proportional to VRAM. llama-server runs on the highest-VRAM node and coordinates via RPC. Each rpc-server loads only its assigned layers from local disk. Latency-aware: peers are selected by lowest RTT first, with an 80ms hard cap — high-latency nodes stay in the mesh as API clients but don't participate in splits.
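The VRAM-proportional split with the 80ms cap might look like the following sketch. The function name, the exact rounding rule, and the leftover-layer handling are assumptions; only the proportional split and the RTT cap come from the text.

```python
# Hedged sketch: filter peers by the 80ms RTT cap, prefer lowest RTT,
# and assign layers proportional to each peer's VRAM.

RTT_CAP_MS = 80.0  # high-latency nodes stay API clients, per the text

def assign_layers(n_layers: int, peers: list[tuple[str, float, float]]) -> dict[str, int]:
    """peers: (name, vram_gb, rtt_ms) tuples."""
    eligible = sorted((p for p in peers if p[2] <= RTT_CAP_MS), key=lambda p: p[2])
    total_vram = sum(v for _, v, _ in eligible)
    shares = {name: int(n_layers * v / total_vram) for name, v, _ in eligible}
    # hand layers lost to rounding down to the lowest-RTT peer (an assumption)
    shares[eligible[0][0]] += n_layers - sum(shares.values())
    return shares

peers = [("m4-max", 36.0, 0.0), ("mini", 16.0, 3.0), ("remote", 24.0, 120.0)]
print(assign_layers(64, peers))  # "remote" is over the cap and excluded
```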
MoE expert parallelism — Mixture-of-Experts models (Qwen3-MoE, GLM, OLMoE, Mixtral, DeepSeek — increasingly the best-performing architectures) are auto-detected from the GGUF header. The mesh reads expert routing statistics to identify which experts matter most, then assigns each node an overlapping shard: a shared core of critical experts replicated everywhere, plus unique experts distributed across nodes. Each node gets a standalone GGUF with the full trunk + its expert subset and runs its own independent llama-server — zero cross-node traffic during inference. Sessions are hash-routed to nodes for KV cache locality.
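The overlapping-shard idea (a replicated core of hot experts plus distributed unique experts) and the session hash-routing can be sketched as below. The core size, the round-robin distribution, and the hash function are illustrative assumptions, not mesh-llm's actual algorithm.

```python
# Sketch of overlapping expert shards: hottest experts replicated on every
# node, the rest distributed round-robin.
import zlib

def build_shards(expert_usage: dict[int, int], nodes: list[str], core_size: int) -> dict[str, set[int]]:
    ranked = sorted(expert_usage, key=expert_usage.get, reverse=True)
    core, tail = set(ranked[:core_size]), ranked[core_size:]
    shards = {n: set(core) for n in nodes}          # shared core everywhere
    for i, expert in enumerate(tail):
        shards[nodes[i % len(nodes)]].add(expert)   # unique experts spread out
    return shards

def route_session(session_id: str, nodes: list[str]) -> str:
    # stable hash routing keeps a session's KV cache on one node
    return nodes[zlib.crc32(session_id.encode()) % len(nodes)]

usage = {0: 900, 1: 850, 2: 40, 3: 30, 4: 20, 5: 10}
print(build_shards(usage, ["a", "b"], core_size=2))  # both nodes hold {0, 1}
```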
Multi-model — different nodes serve different models simultaneously. The API proxy peeks at the model field in each request and routes to the right node via QUIC tunnel. /v1/models lists everything available.
Demand-aware rebalancing — a unified demand map tracks which models the mesh wants (from --model flags, API requests, and gossip). Demand signals propagate infectiously across all nodes and decay naturally via TTL. Standby nodes auto-promote to serve unserved models with active demand, or rebalance when one model is significantly hotter than others. When a model loses its last server, standby nodes detect it within ~60s.
Latency design — the key insight is that HTTP streaming is latency-tolerant while RPC is latency-multiplied. llama-server always runs on the same box as the GPU. The mesh tunnels HTTP, so cross-network latency only affects time-to-first-token, not per-token throughput. RPC only crosses the network for pipeline splits where the model physically doesn't fit on one machine.
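The asymmetry can be made concrete with a back-of-envelope model: a tunneled HTTP stream pays cross-network RTT once (at time-to-first-token), while cross-network RPC pays it per token. The numbers below are illustrative, not measurements; the 8 round-trips per token figure is taken from the optimization list in this document.

```python
# Illustrative latency model, not measured data.

def total_time_http(rtt_s: float, ttft_s: float, n_tokens: int, tok_per_s: float) -> float:
    # RTT delays the start of the stream; tokens then flow at local speed
    return rtt_s + ttft_s + n_tokens / tok_per_s

def total_time_rpc(rtt_s: float, ttft_s: float, n_tokens: int, tok_per_s: float,
                   round_trips_per_token: int) -> float:
    # every generated token waits on cross-network round-trips
    return rtt_s + ttft_s + n_tokens * (1 / tok_per_s + round_trips_per_token * rtt_s)

args = dict(rtt_s=0.02, ttft_s=0.5, n_tokens=200, tok_per_s=60.0)
print(f"HTTP tunnel:  {total_time_http(**args):.1f}s")
print(f"Cross-net RPC: {total_time_rpc(**args, round_trips_per_token=8):.1f}s")
```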
- Zero-transfer GGUF loading — `SET_TENSOR_GGUF` tells rpc-server to read weights from local disk. Dropped model load from 111s → 5s.
- RPC round-trip reduction — cached `get_alloc_size`, skip GGUF lookups for intermediates. Per-token round-trips: 558 → 8.
- Direct server-to-server transfers — intermediate tensors pushed directly between rpc-servers via TCP, not relayed through the client.
- Speculative decoding — draft model runs locally on the host, proposes tokens verified in one batched forward pass. +38% throughput on code (75% acceptance).
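The speculative-decoding gain can be reasoned about with the standard expected-acceptance formula: if the draft proposes k tokens and each is accepted independently with probability a, the verification pass commits 1 + a + a² + … + aᵏ tokens on average. The independence assumption and the draft lengths below are simplifications; only the 75% acceptance figure comes from the text.

```python
# Expected tokens committed per target-model forward pass, under a
# geometric-acceptance assumption (draft cost ignored).

def expected_tokens_per_pass(accept: float, k: int) -> float:
    # accepted prefix plus the one token the verify pass produces itself
    return sum(accept ** i for i in range(k + 1))

for k in (2, 4, 8):
    print(f"k={k}: {expected_tokens_per_pass(0.75, k):.2f} tokens per verify pass")
```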
```
mesh-llm --model Qwen2.5-32B
```

Starts serving a model and prints an invite token. This mesh is private — only people you share the token with can join.
To make it public (discoverable by others via --auto):
```
mesh-llm --model Qwen2.5-32B --publish
```

```
mesh-llm --join <token>            # join with invite token (GPU node)
mesh-llm --client --join <token>   # join as API-only client (no GPU)
```

```
mesh-llm --auto --model GLM-4.7-Flash-Q4_K_M --mesh-name "poker-night"
```

Everyone runs the same command. The first person creates the mesh; everyone else discovers "poker-night" and joins automatically. `--mesh-name` implies `--publish` — named meshes are always published to the directory.
```
mesh-llm --auto            # discover, join, and serve a model
mesh-llm --client --auto   # join as API-only client (no GPU)
mesh-llm discover          # browse available meshes
```

```
mesh-llm --model Qwen2.5-32B --model GLM-4.7-Flash
```

```
# Route by model name
curl localhost:9337/v1/chat/completions -d '{"model":"GLM-4.7-Flash-Q4_K_M", ...}'
```

Different nodes serve different models. The API proxy routes by the model field.
```
mesh-llm                       # no args — shows instructions + console
```

Opens a read-only console on :3131. Use the CLI to start or join a mesh.

```
mesh-llm --model Qwen2.5-32B   # dashboard at http://localhost:3131
```

Live topology, VRAM bars per node, a model picker, and built-in chat. Everything comes from /api/status (JSON) and /api/events (SSE).
Build-from-source and UI development instructions are in CONTRIBUTING.md.
mesh-llm exposes an OpenAI-compatible API on localhost:9337. Any tool that supports custom OpenAI endpoints works. /v1/models lists available models; the model field in requests routes to the right node.
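Any OpenAI-style client works against the local endpoint. The sketch below builds a chat request with only the Python standard library; the model name is an example from this README, so substitute whatever `/v1/models` reports on your mesh.

```python
# Build a standard OpenAI-style chat request against the local mesh endpoint.
import json
import urllib.request

def chat_request(model: str, prompt: str, base: str = "http://localhost:9337/v1"):
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{base}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

req = chat_request("GLM-4.7-Flash-Q4_K_M", "hello")
# send it with urllib.request.urlopen(req) once a mesh node is running
print(req.full_url)
```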
For built-in launcher integrations (goose, claude):
- If a mesh is already running locally on `--port`, it is reused.
- If not, `mesh-llm` auto-starts a background client node that auto-joins the mesh.
- If `--model` is omitted, the launcher picks the strongest tool-capable model available on the mesh.
- When the harness exits (e.g. `claude` quits), the auto-started node is cleaned up automatically.
Goose is available as both CLI (goose session) and desktop app (Goose.app).
Use a specific model (example: MiniMax):
```
mesh-llm goose --model MiniMax-M2.5-Q4_K_M
```

This command writes/updates ~/.config/goose/custom_providers/mesh.json and launches Goose.
- Start a mesh client: `mesh-llm --client --auto --port 9337`
- Check what models are available: `curl -s http://localhost:9337/v1/models | jq '.data[].id'`
- Add a `mesh` provider to `~/.pi/agent/models.json` (adjust model IDs to match your mesh):
```json
{
  "providers": {
    "mesh": {
      "api": "openai-completions",
      "apiKey": "mesh",
      "baseUrl": "http://localhost:9337/v1",
      "models": [
        {
          "id": "MiniMax-M2.5-Q4_K_M",
          "name": "MiniMax M2.5 (Mesh)",
          "contextWindow": 65536,
          "maxTokens": 8192,
          "reasoning": true,
          "input": ["text"],
          "compat": {
            "maxTokensField": "max_tokens",
            "supportsDeveloperRole": false,
            "supportsUsageInStreaming": false
          }
        }
      ]
    }
  }
}
```

- Run pi:

```
pi --model mesh/MiniMax-M2.5-Q4_K_M
```

Or switch models interactively with Ctrl+M inside pi.
```
OPENAI_API_KEY=dummy OPENAI_BASE_URL=http://localhost:9337/v1 opencode -m openai/GLM-4.7-Flash-Q4_K_M
```

Claude Code can be launched directly through mesh-llm (no proxy required):
Use a specific model (example: MiniMax):
```
mesh-llm claude --model MiniMax-M2.5-Q4_K_M
```

```
curl http://localhost:9337/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"GLM-4.7-Flash-Q4_K_M","messages":[{"role":"user","content":"hello"}]}'
```

The mesh doesn't just share compute — it shares knowledge. Agents and people post status updates, findings, and questions to a shared blackboard that propagates across the mesh.
Works standalone — you don't need to run models through the mesh. Using your own API keys or a cloud provider? Just run mesh-llm --client --blackboard to give your agents a gossip layer. No GPU needed, no model needed.
```
# Enable on any node (with or without a model)
mesh-llm --client --blackboard

# Install the agent skill (works with pi, Goose, others)
mesh-llm blackboard install-skill

# Post what you're working on
mesh-llm blackboard "STATUS: [org/repo branch:main] refactoring billing module"

# Search the blackboard
mesh-llm blackboard --search "billing refactor"

# Check for unanswered questions
mesh-llm blackboard --search "QUESTION"
```

With the skill installed, agents proactively search before starting work, post their status, share findings, and answer each other's questions — all through the mesh.
Messages are ephemeral (48h), PII is auto-scrubbed, and everything stays within the mesh — no cloud, no external services.
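The scrub-and-expire behaviour can be sketched as below. The actual patterns mesh-llm scrubs are not documented here, so the regexes and placeholders are examples only; the 48-hour TTL is from the text.

```python
# Illustrative PII scrubbing and TTL expiry; patterns are examples only.
import re

PII_PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "<email>"),        # email addresses
    (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"), "<ip>"),       # IPv4 addresses
]

def scrub(text: str) -> str:
    for pattern, placeholder in PII_PATTERNS:
        text = pattern.sub(placeholder, text)
    return text

def is_expired(posted_at_s: float, now_s: float, ttl_h: float = 48.0) -> bool:
    # messages age out of the blackboard after 48 hours
    return now_s - posted_at_s > ttl_h * 3600

print(scrub("STATUS: ping me at alice@example.com from 10.0.0.5"))
```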
The blackboard is available as an MCP server for agent integration. Any MCP-compatible agent (pi, Claude Code, Goose, etc.) can post, search, and read the feed directly:
```
# Run as MCP server over stdio
mesh-llm blackboard --mcp
```

Configure in your agent's MCP settings:
```json
{
  "mcpServers": {
    "mesh-blackboard": {
      "command": "mesh-llm",
      "args": ["blackboard", "--mcp"]
    }
  }
}
```

Tools exposed: blackboard_post, blackboard_search, blackboard_feed.
GLM-4.7-Flash-Q4_K_M (17GB), M4 Max + Mac Mini M4, WiFi:
| Configuration | tok/s |
|---|---|
| Solo (no mesh) | 68 |
| 2-node split (85/15) | 21 |
| 3-node split (62/31/8) | 12-13 |
Cross-network (Sydney ↔ Queensland, ~20ms RTT): 10-25 tok/s. Overhead dominated by per-token RPC latency.
Stock llama.cpp RPC transfers 16.88GB on connect. This fork: 0 bytes, ~9 seconds.
```
mesh-llm download              # list models
mesh-llm download 32b          # Qwen2.5-32B (~20GB)
mesh-llm download 72b --draft  # Qwen2.5-72B + draft model
```

Draft pairings for speculative decoding:
| Model | Size | Draft | Draft size |
|---|---|---|---|
| Qwen2.5 (3B/7B/14B/32B/72B) | 2-47GB | Qwen2.5-0.5B | 491MB |
| Qwen3-32B | 20GB | Qwen3-0.6B | 397MB |
| Llama-3.3-70B | 43GB | Llama-3.2-1B | 760MB |
| Gemma-3-27B | 17GB | Gemma-3-1B | 780MB |
--model accepts several formats. Models are auto-downloaded to ~/.models/ on first use.
```
# Catalog name (fuzzy match — finds Qwen3-8B-Q4_K_M)
mesh-llm --model Qwen3-8B

# Full catalog name
mesh-llm --model Qwen3-8B-Q4_K_M

# HuggingFace URL (any GGUF)
mesh-llm --model https://huggingface.co/bartowski/Llama-3.2-3B-Instruct-GGUF/resolve/main/Llama-3.2-3B-Instruct-Q4_K_M.gguf

# HuggingFace shorthand (org/repo/file.gguf)
mesh-llm --model bartowski/Llama-3.2-3B-Instruct-GGUF/Llama-3.2-3B-Instruct-Q4_K_M.gguf

# Local file path
mesh-llm --model ~/my-models/custom-model.gguf
```

Catalog models are downloaded with resume support — if a download is interrupted, it picks up where it left off. Use `mesh-llm download` to browse the catalog.
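Telling the formats above apart is straightforward; here is a hedged sketch of one way to classify the `--model` argument. The classification order and names are assumptions, not mesh-llm's actual parser.

```python
# Illustrative classifier for the --model formats listed above.

def classify_model_arg(arg: str) -> str:
    if arg.startswith(("http://", "https://")):
        return "url"            # direct GGUF URL
    if arg.startswith(("~", "/", ".")):
        return "local-path"     # filesystem path
    if arg.count("/") == 2 and arg.endswith(".gguf"):
        return "hf-shorthand"   # org/repo/file.gguf
    return "catalog"            # fuzzy-matched against the catalog

print(classify_model_arg("Qwen3-8B"))
print(classify_model_arg("bartowski/Llama-3.2-3B-Instruct-GGUF/Llama-3.2-3B-Instruct-Q4_K_M.gguf"))
print(classify_model_arg("~/my-models/custom-model.gguf"))
```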
```
mesh-llm [OPTIONS]

  --model NAME|PATH|URL   Model to serve (can specify multiple)
  --join TOKEN            Join mesh via invite token
  --auto                  Discover and join via directory
  --client                API-only client (no GPU)
  --blackboard            Enable the blackboard (works on any node)
  --name NAME             Display name on the blackboard (default: $USER)
  --mesh-name NAME        Name the mesh (implies --publish)
  --publish               Publish mesh to directory
  --region REGION         Geographic region tag (AU, US-West, EU-West, ...)
  --max-clients N         Delist when N clients connected
  --port PORT             API port (default: 9337)
  --console PORT          Console port (default: 3131)
  --bind-port PORT        Pin QUIC to fixed UDP port (for NAT)
  --listen-all            Bind to 0.0.0.0 (for containers)
  --max-vram GB           Cap VRAM advertised to mesh
  --split                 Force pipeline split (dense) or MoE expert split
  --device DEV            GPU device (default: MTL0)
  --draft PATH            Draft model for speculative decoding
  --no-draft              Disable auto draft detection

mesh-llm download [NAME] [--draft]
mesh-llm discover [--model M] [--region R] [--auto]
mesh-llm drop <model>
mesh-llm rotate-key
mesh-llm blackboard [TEXT] [--search Q] [--from NAME] [--since HOURS]
mesh-llm blackboard --mcp            Run as MCP server (stdio) for agents
mesh-llm blackboard install-skill
```
```
just bundle   # creates /tmp/mesh-bundle.tar.gz
scp /tmp/mesh-bundle.tar.gz user@remote:
ssh user@remote 'tar xzf mesh-bundle.tar.gz && mesh-bundle/mesh-llm --model Qwen2.5-3B'
```

Same architecture required (arm64 macOS → arm64 macOS). The bundle includes mesh-llm + llama.cpp binaries. For WAN use, forward the `--bind-port` UDP port on the router — only the originator needs it.
See CONTRIBUTING.md for build and development workflows.
| Path | Purpose |
|---|---|
| `llama.cpp/` | Fork with zero-transfer RPC patches |
| `mesh-llm/` | Rust QUIC mesh (internals) |
