Nvidia Dynamo: A Datacenter Scale Distributed Inference Serving Framework

Original link: https://github.com/ai-dynamo/dynamo

NVIDIA Dynamo is an open-source, high-throughput, low-latency inference framework optimized for serving generative AI and reasoning models in distributed environments. It is engine-agnostic, supporting backends such as TRT-LLM, vLLM, and SGLang, and is designed for performance (Rust) and extensibility (Python). Dynamo focuses on maximizing GPU throughput while managing the trade-off between throughput and latency through features such as disaggregated prefill and decode, dynamic GPU scheduling, LLM-aware request routing, accelerated data transfer, and KV cache offloading. Its main components include an OpenAI-compatible frontend (Rust), basic and KV-aware routers, and pre-configured LLM serving engines (workers). You can run a model locally with `dynamo run`, or use Docker Compose to deploy a minimal distributed setup that serves a model over HTTP with a round-robin router.

Hacker News discussion (25 points, 1 comment), submitted by ashvardanian. Carrok: "As someone who has spent most of my time trying to get various Nvidia inference products running, even with direct contact to their developers, all I can say is 'beware.'"

Original text

License GitHub Release Discord

| Guides | Architecture and Features | APIs | SDK |

NVIDIA Dynamo is a high-throughput, low-latency inference framework designed for serving generative AI and reasoning models in multi-node distributed environments. Dynamo is designed to be inference-engine agnostic (supporting TRT-LLM, vLLM, SGLang, and others) and captures LLM-specific capabilities such as:

  • Disaggregated prefill & decode inference – Maximizes GPU throughput and facilitates trading off throughput and latency.
  • Dynamic GPU scheduling – Optimizes performance based on fluctuating demand.
  • LLM-aware request routing – Eliminates unnecessary KV cache re-computation.
  • Accelerated data transfer – Reduces inference response time using NIXL.
  • KV cache offloading – Leverages multiple memory hierarchies for higher system throughput.

Built in Rust for performance and in Python for extensibility, Dynamo is fully open-source and driven by a transparent, OSS (Open Source Software) first development approach.

The following examples require a few system-level packages.

apt-get update
DEBIAN_FRONTEND=noninteractive apt-get install -yq python3-dev libucx0

pip install ai-dynamo[all]
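
As a quick sanity check after installation, you can confirm that the package and the CLI are present (a sketch: pip show only prints package metadata, and the --help flag is assumed to follow the usual CLI convention):

# Confirm the package was installed
pip show ai-dynamo

# Confirm the dynamo CLI entry point is on your PATH
dynamo --help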

Note

TensorRT-LLM Support is currently available on a branch

Running and Interacting with an LLM Locally

To run a model and interact with it locally, you can call dynamo run with a Hugging Face model. dynamo run supports several backends, including mistralrs, sglang, vllm, and tensorrtllm.

dynamo run out=vllm deepseek-ai/DeepSeek-R1-Distill-Llama-8B
? User › Hello, how are you?
✔ User · Hello, how are you?
Okay, so I'm trying to figure out how to respond to the user's greeting. They said, "Hello, how are you?" and then followed it with "Hello! I'm just a program, but thanks for asking." Hmm, I need to come up with a suitable reply. ...
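
The backend is chosen with the out= argument, so the same model can be pointed at a different engine. For example, assuming the sglang backend and its dependencies are installed, the equivalent command would be:

dynamo run out=sglang deepseek-ai/DeepSeek-R1-Distill-Llama-8B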

Dynamo provides a simple way to spin up a local set of inference components including:

  • OpenAI-Compatible Frontend – A high-performance, OpenAI-compatible HTTP API server written in Rust.
  • Basic and KV-Aware Router – Routes and load-balances traffic across a set of workers.
  • Workers – A set of pre-configured LLM serving engines.

To run a minimal configuration, you can use a pre-configured example.

Start Dynamo Distributed Runtime Services

First start the Dynamo Distributed Runtime services:

docker compose -f deploy/docker-compose.yml up -d
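
To confirm the runtime services came up before starting the serving components, you can inspect them with standard Docker Compose commands (the exact service names depend on the contents of deploy/docker-compose.yml):

# List the services started by the compose file
docker compose -f deploy/docker-compose.yml ps

# Optionally follow their logs in a separate terminal
docker compose -f deploy/docker-compose.yml logs -f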

Start Dynamo LLM Serving Components

Next, serve a minimal configuration with an HTTP server, a basic round-robin router, and a single worker.

cd examples/llm
dynamo serve graphs.agg:Frontend -f configs/agg.yaml
curl localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
    "messages": [
      {
        "role": "user",
        "content": "Hello, how are you?"
      }
    ],
    "stream": false,
    "max_tokens": 300
  }' | jq
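
Because the frontend is OpenAI-compatible, the same endpoint also serves streamed responses. The request below is a sketch that reuses the payload above and only flips the stream flag; the output arrives as server-sent events, so jq is omitted:

curl localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
    "messages": [
      {
        "role": "user",
        "content": "Hello, how are you?"
      }
    ],
    "stream": true,
    "max_tokens": 300
  }'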