Show HN: ZSE – Open-source LLM inference engine with 3.9s cold starts

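## ZSE: Ultra-efficient LLM inference engine

ZSE is a high-performance, memory-efficient inference engine for large language models (LLMs). It significantly reduces memory footprint, enough to run a 70B-parameter model on a 24GB GPU, while maintaining speed. Its main innovations are **zAttention** (optimized CUDA kernels for attention), **zQuantize** (INT2-8 mixed-precision quantization), **zKV** (a quantized KV cache with sliding precision that saves 4x memory), and **zStream** (layer streaming with async prefetch). A central **Intelligence Orchestrator** recommends configurations based on *available* memory.

ZSE offers several efficiency modes (speed, balanced, memory, ultra) and supports a range of models (Qwen, Llama, Mistral, and others) in HuggingFace, safetensors, GGUF, and its own optimized `.zse` format. Converting to `.zse` is fast (about 20 seconds) and yields a large cold-start speedup, up to 11.6× for Qwen 7B. It also provides an OpenAI-compatible API for easy integration.

ZSE can be installed via pip, built from the GitHub source, or run as a Docker image for CPU and GPU deployments.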

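## ZSE: A fast, efficient LLM inference engine

Zyora Labs has released ZSE (Z Server Engine), an open-source LLM inference engine designed for memory efficiency and fast cold starts. ZSE significantly reduces VRAM usage: a 32B model fits in 19.3GB (it would normally need about 64GB) and a 7B model fits in 5.2GB. The key innovation is the native `.zse` format, which uses memory-mapped, pre-quantized weights and so eliminates quantization latency at load time. The result is fast cold starts: **3.9 seconds for a 7B model and 21.4 seconds for a 32B model**, well ahead of alternatives such as bitsandbytes.

ZSE provides an OpenAI-compatible API, a CLI, a web dashboard, continuous batching for higher throughput, and a CPU fallback. It can be installed with `pip install zllm-zse` and is licensed under Apache 2.0. The developers welcome questions about its design and implementation.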
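The idea of memory-mapping pre-quantized weights can be illustrated with a minimal sketch that uses only the Python standard library. It does not reflect the real `.zse` layout, which is not described here: the toy file format (raw int8 bytes) and the function names `save_quantized`, `read_weight`, and `demo` are assumptions made purely for illustration.

```python
import mmap
import os
import tempfile
from array import array


def save_quantized(path, values):
    """One-time step: write already-quantized int8 weights as raw bytes."""
    with open(path, "wb") as f:
        array("b", values).tofile(f)


def read_weight(path, index):
    """Map the file and read one weight without parsing or re-quantizing it.

    The OS pages data in lazily on first access, so creating the mapping
    is typically cheap even for large files.
    """
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # Reinterpret the mapped bytes as signed 8-bit integers.
            with memoryview(mm) as raw, raw.cast("b") as weights:
                return weights[index]


def demo():
    """Round-trip a handful of toy weights through a temporary file."""
    path = os.path.join(tempfile.mkdtemp(), "toy_weights.bin")
    save_quantized(path, [-128, -1, 0, 1, 127])
    return [read_weight(path, i) for i in range(5)]


print(demo())  # [-128, -1, 0, 1, 127]
```

The point of the sketch is that the expensive work (quantization) happens once at conversion time, while every later start only maps bytes that are already in their final form.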

Original text

Ultra memory-efficient LLM inference engine.

ZSE is designed to run large language models with a minimal memory footprint while maintaining high performance. Our key innovation is the Intelligence Orchestrator, which provides smart recommendations based on your available (not total) memory.

🧠 zAttention: Custom CUDA kernels for paged, flash, and sparse attention
🗜️ zQuantize: Per-tensor INT2-8 mixed precision quantization
💾 zKV: Quantized KV cache with sliding precision (4x memory savings)
🌊 zStream: Layer streaming with async prefetch (run 70B on 24GB GPU)
🎯 zOrchestrator: Smart recommendations based on FREE memory
📊 Efficiency Modes: speed / balanced / memory / ultra
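To see where a figure like "4x memory savings" for a quantized KV cache can come from, here is a back-of-the-envelope sketch. It assumes a plain 16-bit cache compared against a uniform 4-bit cache and ignores quantization metadata such as scales; how zKV's sliding precision actually assigns bit widths is not stated above. The model shape in the example (32 layers, 8 KV heads, head size 128) is hypothetical.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bits, batch_size=1):
    """Bytes needed to cache keys and values for every layer and token."""
    # Factor of 2: one tensor for keys and one for values.
    elements = 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size
    return elements * bits // 8


# Hypothetical model shape with a 4096-token context.
fp16 = kv_cache_bytes(32, 8, 128, 4096, bits=16)
int4 = kv_cache_bytes(32, 8, 128, 4096, bits=4)

print(fp16 // 2**20, "MiB at 16-bit")  # 512 MiB at 16-bit
print(int4 // 2**20, "MiB at 4-bit")   # 128 MiB at 4-bit
print(fp16 // int4, "x smaller")       # 4 x smaller
```

Because cache size grows linearly with context length and batch size, the saving matters most for long prompts and many concurrent requests.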

3.9s (7B) and 21.4s (32B) to first token with .zse format — verified on A100-80GB.

| Model | bitsandbytes | ZSE (.zse) | Speedup |
|---|---|---|---|
| Qwen 7B | 45.4s | 3.9s | 11.6× |
| Qwen 32B | 120.0s | 21.4s | 5.6× |
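The speedup column is simply the ratio of the two load times. A short check, using only the numbers reported in the table (the helper name `speedup` is ours):

```python
# (bitsandbytes seconds, ZSE .zse seconds), copied from the table above.
cold_start_seconds = {
    "Qwen 7B": (45.4, 3.9),
    "Qwen 32B": (120.0, 21.4),
}


def speedup(baseline_s, zse_s):
    """Ratio of baseline load time to .zse load time, to one decimal place."""
    return round(baseline_s / zse_s, 1)


for name, (baseline_s, zse_s) in cold_start_seconds.items():
    print(f"{name}: {speedup(baseline_s, zse_s)}x")
# Qwen 7B: 11.6x
# Qwen 32B: 5.6x
```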

# One-time conversion (~20s)
zse convert Qwen/Qwen2.5-Coder-7B-Instruct -o qwen-7b.zse

# Every subsequent start: 3.9s
zse serve qwen-7b.zse

Note: Results measured on A100-80GB with NVMe storage (Feb 2026). On consumer SSDs expect 5-10s; HDDs may be slower. Any modern SSD achieves sub-10s cold starts.

Memory Benchmarks (Verified, A100-80GB)

| Model | FP16 | INT4/NF4 | Reduction | Throughput |
|---|---|---|---|---|
| Qwen 7B | 14.2 GB | 5.2 GB | 63% ✅ | 12-15 tok/s |
| Qwen 32B | ~64 GB | 19.3 GB (NF4) / ~35 GB (.zse) | 70% ✅ | 7.9 tok/s |
| 14B | ~28 GB | ~7 GB | ⏳ est | - |
| 70B | ~140 GB | ~24 GB | ⏳ est | - |

32B note: Use NF4 (19.3 GB) on GPUs with <36 GB VRAM. Use .zse (35 GB, 5.6× faster start) on 40 GB+ GPUs.
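The FP16 column and the reduction percentages can be reproduced with simple arithmetic. The sketch below is a rough estimate only: it counts weight storage (parameters times bits per weight) and ignores activations, the KV cache, and quantization metadata, so measured numbers will be somewhat higher. The function names are ours.

```python
def weight_gb(params_billion, bits_per_weight):
    """Approximate weight storage in GB (10^9 bytes)."""
    return params_billion * bits_per_weight / 8


def reduction_percent(before_gb, after_gb):
    """Memory saved, as a whole-number percentage of the original size."""
    return round(100 * (1 - after_gb / before_gb))


print(weight_gb(32, 16))  # 64.0, matches the ~64 GB FP16 figure for 32B
print(weight_gb(70, 16))  # 140.0, matches the ~140 GB FP16 figure for 70B
print(weight_gb(14, 4))   # 7.0, matches the ~7 GB estimate for 14B
print(weight_gb(70, 4))   # 35.0, above the ~24 GB estimate for 70B

print(reduction_percent(14.2, 5.2))  # 63
print(reduction_percent(64, 19.3))   # 70
```

Note that a uniform 4-bit estimate for 70B comes out at 35 GB, so the ~24 GB estimate presumably relies on other features listed earlier, such as sub-4-bit precision or layer streaming; the table does not say which.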

With CUDA support (recommended):

pip install zllm-zse[cuda]

From source:

git clone https://github.com/Zyora-Dev/zse.git
cd zse
pip install -e ".[dev]"

# Any HuggingFace model works!
zse serve Qwen/Qwen2.5-7B-Instruct
zse serve meta-llama/Llama-3.1-8B-Instruct
zse serve mistralai/Mistral-7B-Instruct-v0.3
zse serve microsoft/Phi-3-mini-4k-instruct
zse serve google/gemma-2-9b-it

# With memory optimization
zse serve Qwen/Qwen2.5-32B-Instruct --max-memory 24GB

# With recommendations
zse serve meta-llama/Llama-3.1-70B-Instruct --recommend

# Ultra memory efficiency
zse serve deepseek-ai/DeepSeek-V2-Lite --efficiency ultra

# GGUF models (via llama.cpp)
zse serve ./model-Q4_K_M.gguf

💡 Supported Models: Any HuggingFace transformers model, safetensors, GGUF, or .zse format. Popular choices: Qwen, Llama, Mistral, Phi, Gemma, DeepSeek, Yi, and more.

zse chat Qwen/Qwen2.5-7B-Instruct

zse convert Qwen/Qwen2.5-32B-Instruct -o qwen-32b.zse --target-memory 24GB

ZSE provides an OpenAI-compatible API:

zse serve Qwen/Qwen2.5-7B-Instruct --port 8000

import openai

client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="zse")

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)
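If you would rather not depend on the `openai` package, the same call can be sketched with the standard library. This assumes that ZSE follows the OpenAI wire format closely enough that the chat endpoint lives at `/chat/completions` under the base URL and that the response has the usual `choices[0].message.content` shape; both are implied by "OpenAI-compatible" but not spelled out here. The helper names `build_chat_request` and `send` are ours.

```python
import json
import urllib.request


def build_chat_request(base_url, model, prompt, api_key="zse"):
    """Build (but do not send) a POST to the chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def send(request, timeout=60):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


request = build_chat_request(
    "http://localhost:8000/v1", "Qwen/Qwen2.5-7B-Instruct", "Hello!"
)
print(request.full_url)  # http://localhost:8000/v1/chat/completions
# With a server running on port 8000: print(send(request))
```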

| Mode | Description | Use Case |
|---|---|---|
| `speed` | Maximum throughput | Production with ample GPU memory |
| `balanced` | Good throughput, moderate memory | Standard deployment (default) |
| `memory` | Low memory, reduced throughput | Consumer GPUs |
| `ultra` | Extreme memory savings | 4GB GPUs, laptops |

zse serve model --efficiency memory
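When launching the server from a script, it can help to validate the mode name before spawning the process. The sketch below only assembles an argument list from flags that appear in this README (`--efficiency`, `--port`, `--max-memory`); that these flags can be freely combined is an assumption, and `build_serve_command` is a hypothetical helper, not part of ZSE.

```python
import shlex

# The four modes listed in the table above.
EFFICIENCY_MODES = ("speed", "balanced", "memory", "ultra")


def build_serve_command(model, efficiency=None, port=None, max_memory=None):
    """Return the argv list for `zse serve`, rejecting unknown modes early."""
    if efficiency is not None and efficiency not in EFFICIENCY_MODES:
        raise ValueError(
            f"unknown efficiency mode {efficiency!r}; expected one of {EFFICIENCY_MODES}"
        )
    cmd = ["zse", "serve", model]
    if efficiency is not None:
        cmd += ["--efficiency", efficiency]
    if port is not None:
        cmd += ["--port", str(port)]
    if max_memory is not None:
        cmd += ["--max-memory", max_memory]
    return cmd


cmd = build_serve_command("Qwen/Qwen2.5-7B-Instruct", efficiency="memory", port=8000)
print(shlex.join(cmd))
# zse serve Qwen/Qwen2.5-7B-Instruct --efficiency memory --port 8000
```

The returned list can be passed directly to `subprocess.run`, which avoids shell quoting issues with model paths.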

zse serve model --mode dev

No authentication required
SQLite database
Hot reload enabled
Debug logging

zse serve model --config configs/enterprise.yaml

API key authentication
PostgreSQL + Redis
Prometheus metrics
Rate limiting
Multi-tenancy

zse/
├── core/                   # ZSE Native Engine (100% custom)
│   ├── zattention/         # Custom attention kernels
│   ├── zquantize/          # Quantization (GPTQ, HQQ, INT2-8)
│   ├── zkv/                # Paged + quantized KV cache
│   ├── zstream/            # Layer streaming + prefetch
│   ├── zscheduler/         # Continuous batching
│   └── zdistributed/       # Tensor/pipeline parallelism
├── models/                 # Model loaders + architectures
├── engine/                 # Executor + Orchestrator
├── api/                    # CLI, FastAPI server, Web UI
└── enterprise/             # Auth, monitoring, scaling

GGUF models are supported via the llama.cpp backend:

pip install zllm-zse[gguf]
zse serve ./model.gguf

Note: GGUF uses llama.cpp for inference. Native ZSE engine handles HuggingFace, safetensors, and .zse formats.

# CPU
docker run -p 8000:8000 ghcr.io/zyora-dev/zse:latest

# GPU (NVIDIA)
docker run --gpus all -p 8000:8000 ghcr.io/zyora-dev/zse:gpu

# With model pre-loaded
docker run -p 8000:8000 -e ZSE_MODEL=Qwen/Qwen2.5-0.5B-Instruct ghcr.io/zyora-dev/zse:latest

Docker Compose:

docker-compose up -d                    # CPU
docker-compose --profile gpu up -d      # GPU

See deploy/DEPLOY.md for the full deployment guide, including Runpod, Vast.ai, Railway, Render, and Kubernetes.

# Install dev dependencies
pip install -e ".[dev]"

# Run tests
pytest

# Run tests with coverage
pytest --cov=zse

# Type checking
mypy zse

# Linting
ruff check zse

Apache 2.0

PagedAttention concept from vLLM (UC Berkeley)
Flash Attention from Tri Dao
GPTQ, HQQ, and other quantization research
