Executorch: On-device AI across mobile, embedded and edge for PyTorch

原始链接: https://github.com/pytorch/executorch

## ExecuTorch: On-Device AI with PyTorch

ExecuTorch is PyTorch's solution for deploying AI models directly to devices, from smartphones to microcontrollers, with privacy, performance, and portability as priorities. It is used widely inside Meta (Instagram, WhatsApp, Quest, Ray-Ban smart glasses) and lets you deploy LLM, vision, speech, and multimodal models with the familiar PyTorch APIs.

Key features include direct export from PyTorch *without* intermediate format conversion, a tiny runtime with a 50KB footprint, and support for 12+ hardware backends (Apple, Qualcomm, ARM, and more) from a single export. It uses ahead-of-time compilation to optimize models for edge deployment, with a standardized operator set and CPU fallback.

Deployment consists of exporting, compiling (with optional quantization), and executing the resulting `.pte` file. ExecuTorch provides SDKs for C++, Swift (iOS), and Kotlin (Android), plus tooling for LLM and multimodal models (Llama 3, Llava, Voxtral). Advanced capabilities include quantization, memory planning, and developer tools for debugging and optimization.

ExecuTorch is BSD licensed and welcomes community contributions.


On-device AI inference powered by PyTorch

ExecuTorch is PyTorch's unified solution for deploying AI models on-device—from smartphones to microcontrollers—built for privacy, performance, and portability. It powers Meta's on-device AI across Instagram, WhatsApp, Quest 3, Ray-Ban Meta Smart Glasses, and more.

Deploy LLMs, vision, speech, and multimodal models with the same PyTorch APIs you already know—accelerating research to production with seamless model export, optimization, and deployment. No manual C++ rewrites. No format conversions. No vendor lock-in.

Why ExecuTorch?
  • 🔒 Native PyTorch Export — Direct export from PyTorch. No .onnx, .tflite, or intermediate format conversions. Preserve model semantics.
  • ⚡ Production-Proven — Powers billions of users at Meta with real-time on-device inference.
  • 💾 Tiny Runtime — 50KB base footprint. Runs on microcontrollers to high-end smartphones.
  • 🚀 12+ Hardware Backends — Open-source acceleration for Apple, Qualcomm, ARM, MediaTek, Vulkan, and more.
  • 🎯 One Export, Multiple Backends — Switch hardware targets with a single line change. Deploy the same model everywhere.

ExecuTorch uses ahead-of-time (AOT) compilation to prepare PyTorch models for edge deployment:

  1. 🧩 Export — Capture your PyTorch model graph with torch.export()
  2. ⚙️ Compile — Quantize, optimize, and partition to hardware backends → .pte
  3. 🚀 Execute — Load .pte on-device via lightweight C++ runtime

Models use a standardized Core ATen operator set. Partitioners delegate subgraphs to specialized hardware (NPU/GPU) with CPU fallback.
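
To make the delegation model concrete, here is a minimal sketch of lowering one exported program with an ordered partitioner list (the full 3-step example follows below). TinyModel and the Core ML partitioner import path are illustrative assumptions; the Core ML backend must be installed for the first partitioner to take effect.

import torch
from executorch.exir import to_edge_transform_and_lower
from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner
# Import path is an assumption; check the Core ML backend docs for your release.
from executorch.backends.apple.coreml.partition.coreml_partitioner import CoreMLPartitioner

class TinyModel(torch.nn.Module):
    def forward(self, x):
        return torch.nn.functional.relu(x)

ep = torch.export.export(TinyModel().eval(), (torch.randn(1, 8),))

# Partitioners are consulted in order: subgraphs claimed by the Core ML
# partitioner run through Core ML (ANE/GPU); anything left over is offered to
# XNNPACK; whatever remains runs on the portable CPU kernels.
program = to_edge_transform_and_lower(
    ep,
    partitioner=[CoreMLPartitioner(), XnnpackPartitioner()],
).to_executorch()

# Retargeting hardware is just a different partitioner list, e.g.
# partitioner=[XnnpackPartitioner()] for a CPU-only build.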

Learn more: How ExecuTorch Works · Architecture Guide

For platform-specific setup (Android, iOS, embedded systems), see the Quick Start documentation.

Export and Deploy in 3 Steps

import torch
from executorch.exir import to_edge_transform_and_lower
from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner

# 1. Export your PyTorch model (MyModel is any torch.nn.Module you define)
model = MyModel().eval()
example_inputs = (torch.randn(1, 3, 224, 224),)
exported_program = torch.export.export(model, example_inputs)

# 2. Optimize for target hardware (switch backends with one line)
program = to_edge_transform_and_lower(
    exported_program,
    partitioner=[XnnpackPartitioner()]  # CPU | CoreMLPartitioner() for iOS | QnnPartitioner() for Qualcomm
).to_executorch()

# 3. Save for deployment
with open("model.pte", "wb") as f:
    f.write(program.buffer)

# Test locally via ExecuTorch runtime's pybind API (optional)
from executorch.runtime import Runtime
runtime = Runtime.get()
method = runtime.load_program("model.pte").load_method("forward")
outputs = method.execute([torch.randn(1, 3, 224, 224)])

C++

#include <executorch/extension/module/module.h>
#include <executorch/extension/tensor/tensor.h>

using namespace executorch::extension;

// Load the .pte, build a 2x2 float tensor, and run the model's "forward" method.
Module module("model.pte");
auto tensor = make_tensor_ptr({2, 2}, {1.0f, 2.0f, 3.0f, 4.0f});
auto outputs = module.forward(tensor);

Swift (iOS)

import ExecuTorch

let module = Module(filePath: "model.pte")
let input = Tensor<Float>([1.0, 2.0, 3.0, 4.0], shape: [2, 2])
let outputs = try module.forward(input)

Kotlin (Android)

val module = Module.load("model.pte")
val inputTensor = Tensor.fromBlob(floatArrayOf(1.0f, 2.0f, 3.0f, 4.0f), longArrayOf(2, 2))
val outputs = module.forward(EValue.from(inputTensor))

Export Llama models using the export_llm script or Optimum-ExecuTorch:

# Using export_llm
python -m executorch.extension.llm.export.export_llm --model llama3_2 --output llama.pte

# Using Optimum-ExecuTorch
optimum-cli export executorch \
  --model meta-llama/Llama-3.2-1B \
  --task text-generation \
  --recipe xnnpack \
  --output_dir llama_model

Run on-device with the LLM runner API:

C++

#include <executorch/extension/llm/runner/text_llm_runner.h>

auto runner = create_llama_runner("llama.pte", "tiktoken.bin");
executorch::extension::llm::GenerationConfig config{
    .seq_len = 128, .temperature = 0.8f};
runner->generate("Hello, how are you?", config);

Swift (iOS)

import ExecuTorchLLM

let runner = TextRunner(modelPath: "llama.pte", tokenizerPath: "tiktoken.bin")
try runner.generate("Hello, how are you?", Config {
    $0.sequenceLength = 128
}) { token in
    print(token, terminator: "")
}

Kotlin (Android) · API Docs · Demo App

val llmModule = LlmModule("llama.pte", "tiktoken.bin", 0.8f)
llmModule.load()
llmModule.generate("Hello, how are you?", 128, object : LlmCallback {
    override fun onResult(result: String) { print(result) }
    override fun onStats(stats: String) { }
})

For multimodal models (vision, audio), use the MultiModal runner API which extends the LLM runner to handle image and audio inputs alongside text. See Llava and Voxtral examples.

See examples/models/llama for complete workflow including quantization, mobile deployment, and advanced options.


Platform & Hardware Support

| Platform | Supported Backends |
| --- | --- |
| Android | XNNPACK, Vulkan, Qualcomm, MediaTek, Samsung Exynos |
| iOS | XNNPACK, MPS, CoreML (Neural Engine) |
| Linux / Windows | XNNPACK, OpenVINO, CUDA (experimental) |
| macOS | XNNPACK, MPS, Metal (experimental) |
| Embedded / MCU | XNNPACK, ARM Ethos-U, NXP, Cadence DSP |

See Backend Documentation for detailed hardware requirements and optimization guides. For desktop/laptop GPU inference with CUDA and Metal, see the Desktop Guide. For Zephyr RTOS integration, see the Zephyr Guide.

ExecuTorch powers on-device AI at scale across Meta's family of apps, VR/AR devices, and partner deployments. View success stories →

LLMs: Llama 3.2/3.1/3, Qwen 3, Phi-4-mini, LiquidAI LFM2

Multimodal: Llava (vision-language), Voxtral (audio-language), Gemma (vision-language)

Vision/Speech: MobileNetV2, DeepLabV3, Whisper

Resources: examples/ directory • executorch-examples out-of-tree demos • Optimum-ExecuTorch for HuggingFace models • Unsloth for fine-tuned LLM deployment

ExecuTorch provides advanced capabilities for production deployment:

  • Quantization — Built-in support via torchao for 8-bit, 4-bit, and dynamic quantization
  • Memory Planning — Optimize memory usage with ahead-of-time allocation strategies
  • Developer Tools — ETDump profiler, ETRecord inspector, and model debugger
  • Selective Build — Strip unused operators to minimize binary size
  • Custom Operators — Extend with domain-specific kernels
  • Dynamic Shapes — Support variable input sizes with bounded ranges
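
The Dynamic Shapes item above maps onto torch.export's Dim API. A minimal sketch, assuming a toy model where only the batch dimension should vary within a bounded range:

import torch
from torch.export import Dim, export

class Classifier(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(16, 4)

    def forward(self, x):
        return self.linear(x)

# Declare the batch dimension as dynamic with an explicit upper bound so the
# runtime can plan memory for the worst case ahead of time.
batch = Dim("batch", min=1, max=8)
example_inputs = (torch.randn(2, 16),)
exported = export(Classifier().eval(), example_inputs, dynamic_shapes={"x": {0: batch}})
# `exported` then goes through to_edge_transform_and_lower() exactly as in the
# 3-step example above.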

See Advanced Topics for quantization techniques, custom backends, and compiler passes.
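
As a concrete starting point for the quantization techniques mentioned above, here is a minimal post-training quantization sketch using the PT2E flow with the XNNPACK quantizer. The quantizer import path has moved between releases, and the toy model and single calibration pass are placeholders, so treat this as an assumption to check against the current quantization docs rather than the canonical recipe.

import torch
from torch.ao.quantization.quantize_pt2e import prepare_pt2e, convert_pt2e
# Import path is an assumption; older releases shipped this quantizer under
# torch.ao.quantization.quantizer.xnnpack_quantizer instead.
from executorch.backends.xnnpack.quantizer.xnnpack_quantizer import (
    XNNPACKQuantizer,
    get_symmetric_quantization_config,
)

model = torch.nn.Sequential(torch.nn.Linear(16, 16), torch.nn.ReLU()).eval()
example_inputs = (torch.randn(1, 16),)

# Capture the model, insert observers, calibrate on representative data, then
# convert to a quantized graph before the normal export/lower steps.
captured = torch.export.export_for_training(model, example_inputs).module()
quantizer = XNNPACKQuantizer()
quantizer.set_global(get_symmetric_quantization_config(is_per_channel=True))
prepared = prepare_pt2e(captured, quantizer)
prepared(*example_inputs)          # calibration pass(es) on representative data
quantized = convert_pt2e(prepared)

exported = torch.export.export(quantized, example_inputs)
# `exported` then flows into to_edge_transform_and_lower() with the
# XnnpackPartitioner, as in the 3-step example.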

We welcome contributions from the community!

ExecuTorch is BSD licensed, as found in the LICENSE file.

