Run LLMs on Apple Neural Engine (ANE)

Original link: https://github.com/Anemll/Anemll

ANEMLL is an open-source project aimed at simplifying the deployment of Large Language Models (LLMs) on the Apple Neural Engine (ANE). The project provides tools for converting models from Hugging Face, optimized Swift inference code, and iOS/macOS sample applications for on-device LLM inference. The current 0.3.0 Alpha release focuses on LLaMA 3.1 models, including the DeepSeek and DeepHermes distilled variants. It offers LLM conversion tools, a Swift reference implementation, sample applications (available on TestFlight), a CLI application, and a benchmarking tool (ANEMLL-BENCH). Pre-converted models can be downloaded from the ANEMLL Hugging Face repository. The library requires macOS Sequoia with an Apple Neural Engine, at least 16GB of RAM, Python 3.9, and the Xcode command line tools. Installation involves creating a virtual environment, installing dependencies from `requirements.txt`, and verifying the CoreML compiler installation. ANEMLL welcomes contributions and is released under the MIT license.

A Hacker News discussion centers on running Large Language Models (LLMs) on Apple's Neural Engine (ANE). The original poster asks why Apple's MLX and llama.cpp do not fully support the ANE despite its potential speed and memory-efficiency gains. Commenters discuss the ANE's limitations for modern quantized LLMs, noting its focus on FP16/INT8 operations and possible memory-bandwidth bottlenecks. While the ANE can speed up prompt pre-processing and reduce power consumption, its generation throughput may trail the GPU because of its limited width. The thread covers practical ANE use cases, such as image classification running alongside GPU-heavy workloads. Some users claim the M3 Ultra outperforms high-end Nvidia GPUs for LLM inference, while others dispute this. Unified memory on Macs allows running models larger than the VRAM limits of Nvidia consumer cards, though Apple's tight control over ANE access draws criticism. Alternatives such as AMD's Strix Halo, which also offers unified memory, are mentioned but face their own constraints. ANEMLL's practical benefit shows on an M4 Max: token generation is slower, but memory usage is significantly lower.
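The memory-footprint argument in the discussion can be made concrete with a back-of-the-envelope calculation (a sketch; the parameter counts are illustrative and ignore KV cache and activations):

```python
def model_memory_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate weight memory for a dense LLM (decimal GB)."""
    bytes_total = params_billions * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

# An 8B-parameter model at different precisions:
print(model_memory_gb(8, 16))  # FP16 -> 16.0 GB
print(model_memory_gb(8, 8))   # INT8 ->  8.0 GB
print(model_memory_gb(8, 4))   # 4-bit -> 4.0 GB
```

This is why a quantized 70B-class model can fit in a Mac's unified memory while exceeding the 24GB ceiling of Nvidia consumer cards.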

    ANEMLL (pronounced like "animal") is an open-source project focused on accelerating the porting of Large Language Models (LLMs) to tensor processors, starting with the Apple Neural Engine (ANE).

    The goal is to provide a fully open-source pipeline from model conversion to inference for common LLM architectures running on ANE. This enables seamless integration and on-device inference for low-power applications on edge devices, ensuring maximum privacy and security. This is critical for autonomous applications, where models run directly on the device without requiring an internet connection.

    We aim to:

    • Provide a flexible, easy-to-use library/framework to port LLMs to ANE directly from Hugging Face models
    • Provide on-device examples for iOS and macOS Swift or C/C++ applications

    See the updated Roadmap.md for more details

    Main Components in 0.3.0 Alpha Release

    ANEMLL provides five main components for Apple Neural Engine inference development:

    1. LLM Conversion Tools - Scripts and code to convert models directly from Hugging Face weights

    2. Swift Reference Implementation - Optimized inference code for Swift applications

      • Sample CLI application in anemll-swift-cli
      • Core inference engine implementation
    3. Python Sample Code - Reference implementation and testing tools

      • Basic chat interface (chat.py)
      • Advanced conversation management (chat_full.py)
    4. iOS/macOS Sample Applications - Ready-to-use example applications (Alpha, now on TestFlight)

      • SwiftUI Chat interface
      • Model Downloads and integration example
      • Conversation management
    5. ANEMLL-BENCH - Apple Neural Engine Benchmarking

      • Performance testing and comparison
      • Model optimization metrics
      • Hardware-specific benchmarks
      • GitHub Repository
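At its core, a throughput benchmark like ANEMLL-BENCH comes down to timing token generation. A minimal sketch (`generate_token` is a hypothetical stand-in for the real inference engine, not ANEMLL-BENCH's actual API):

```python
import time

def benchmark_tokens_per_sec(generate_token, n_tokens: int = 128) -> float:
    """Time n_tokens calls to generate_token and return tokens/second."""
    start = time.perf_counter()
    for _ in range(n_tokens):
        generate_token()
    elapsed = time.perf_counter() - start
    return n_tokens / elapsed

# Example with a dummy generator standing in for ANE inference:
tps = benchmark_tokens_per_sec(lambda: None, n_tokens=1000)
print(f"{tps:.0f} tokens/s")
```

A real benchmark would also report time-to-first-token separately, since prompt pre-processing and generation stress the hardware differently.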

    We provide sample converted models ready for use:

    • LLAMA 3.1 (1B and 8B variants) including iOS "friendly builds"
    • DeepSeek distilled models
    • DeepHermes distilled models

    Note

    Please note that quantization still needs improvement. LUT4 quality is fairly low because the Apple Neural Engine lacks block quantization; applying GPTQ- or SpinQuant-style techniques should greatly improve LUT4 models.
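LUT4 means weights are mapped to 4-bit indices into a 16-entry lookup table. A per-tensor sketch in pure Python (uniform levels; the real converter's table construction may differ — the note's point is that a single table per tensor, without block-wise scales, limits quality):

```python
def lut4_quantize(weights):
    """Quantize a flat list of floats to 4-bit indices into a 16-entry uniform LUT."""
    lo, hi = min(weights), max(weights)
    step = (hi - lo) / 15 or 1.0          # 16 levels span 15 intervals
    table = [lo + i * step for i in range(16)]
    indices = [min(15, round((w - lo) / step)) for w in weights]
    return table, indices

def lut4_dequantize(table, indices):
    return [table[i] for i in indices]

table, idx = lut4_quantize([-1.0, -0.5, 0.0, 0.5, 1.0])
print(lut4_dequantize(table, idx))  # each value snapped to the nearest of 16 levels
```

With one table covering the whole tensor, a few outlier weights stretch the range and coarsen the step for everything else; block quantization would fit a separate scale per small group of weights.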

    Visit our Hugging Face repository for the latest converted models.

    Important

    This is Alpha Release 0.3.0 for the library. It is designed to process model weights directly from Hugging Face models and convert them to the CoreML format for the Apple Neural Engine (ANE for short).

    • This release supports only LLaMA models, including DeepSeek and DeepHermes distilled models on the LLaMA 3.1 architecture
    • Future releases will add support for more models and architectures
    • Please visit https://huggingface.co/anemll where we upload the latest models and X: @anemll for updates
    • Please star this repo to support the project!

    Swift UI Sample Code

    • Sample iOS/macOS inference Chat-Bot App (Alpha)
    • Updates to Model conversion and upload scripts
    • Updates to Swift Package and CLI App

    Sample iOS/macOS Applications

    • Downloads reference or custom models from Hugging Face
    • Inference / chat implementation uses the Swift library
    • Sample TestFlight App for a quick test
    • See iOS/macOS Sample Applications Guide for details

    Swift CLI Reference Implementation

    The Swift CLI provides a reference implementation for running models on Apple Neural Engine. For detailed documentation, see Swift CLI Guide.

    1. Download a model from Hugging Face
    2. Convert the model using our single-shot conversion script:
    ./anemll/utils/convert_model.sh --model <path_to_model> --output <output_directory>
    3. Run the model using our sample code:
    python ./tests/chat.py --meta <output_directory>/meta.yaml

    For detailed conversion steps and advanced options, see:

    We provide two chat interfaces:

    • chat.py - Basic chat interface for quick testing
    • chat_full.py - Advanced chat with conversation history management

    Features of chat_full.py:

    • Maintains full conversation history within context window
    • Automatically truncates older messages when needed
    • Shifts context window dynamically during long responses
    • Shows generation speed and token statistics
    • Handles multi-turn conversations better
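The history-management behaviors listed above (keep the conversation within the context window, truncate the oldest messages first) can be sketched as a token-budget trim. The function below is a hypothetical illustration, not chat_full.py's actual code, and uses a crude whitespace tokenizer in place of the real one:

```python
def trim_history(messages, max_tokens, count_tokens=lambda m: len(m.split())):
    """Drop the oldest messages until the history fits the token budget.

    messages: list of strings, oldest first; the newest message is always kept.
    """
    kept = list(messages)
    while len(kept) > 1 and sum(count_tokens(m) for m in kept) > max_tokens:
        kept.pop(0)  # discard the oldest turn first
    return kept

history = ["hello there", "hi, how can I help?", "tell me about the apple neural engine"]
print(trim_history(history, max_tokens=10))
```

Shifting the context window mid-generation works on the same principle, applied to the token sequence rather than whole messages.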

    Example running Chats:

    # Basic chat
    python ./tests/chat.py --meta ./converted_models/meta.yaml
    
    # Full conversation mode
    python ./tests/chat_full.py --meta ./converted_models/meta.yaml

    See chat.md for more details

    [Note] The first time the model loads, macOS will take some time to place it on the device. Subsequent loads will be instantaneous. Use Ctrl-D to exit, Ctrl-C to interrupt inference.

    • macOS Sequoia with Apple Neural Engine
    • Minimum 16GB RAM
    • Python 3.9
    1. Install ANEMLL: We recommend creating a new virtual environment for this project.
    python -m venv anemll-env
    source anemll-env/bin/activate
    pip install -r requirements.txt
    # pip install anemll
    # due to Alpha Release, we do not recommend installing ANEMLL as a package yet
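The prerequisites (Python ≥ 3.9, `coremlcompiler` reachable via `xcrun`) can be sanity-checked with a short script; this is a sketch, and `xcrun` only exists on macOS with the Xcode tools installed:

```python
import shutil
import sys

def meets_min_python(version_info=sys.version_info, minimum=(3, 9)) -> bool:
    """Return True if the interpreter satisfies the minimum version."""
    return tuple(version_info[:2]) >= minimum

def has_xcrun() -> bool:
    """xcrun (and with it coremlcompiler) ships with the Xcode command line tools."""
    return shutil.which("xcrun") is not None

print("Python OK:", meets_min_python())
print("xcrun found:", has_xcrun())
```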

    CoreML compiler is required to compile the model. It is part of the Xcode command line tools.

    • Ensure that Xcode Command Line Tools are installed, as they include coremlcompiler.
    • You can install them by running xcode-select --install.
    • Verify that the xcrun command is available and correctly configured in your PATH.
    • Use xcrun --find coremlcompiler to verify the installation.
    • If the above fails, try the following steps:
    • Download Xcode from the App Store.
    • Run sudo xcode-select --switch /Applications/Xcode.app/Contents/Developer/ to set the path.
    • Use xcrun --find coremlcompiler to verify the installation.
    • Run sudo xcodebuild -license and agree to the license.

    Currently optimized for:

    • Meta's LLaMA 3.2 1B and 8B (1024 context) models, including the DeepSeek R1 8B distilled model and the DeepHermes 3B and 8B models
    • More models are coming soon

    Inspirations, feedback and other resources

    Note

    We welcome contributions! Please read our contributing guidelines before submitting PRs.

    Feel free to submit issues and pull requests to improve ANEMLL!

    Note

    If you're using ANEMLL in your project, please submit a PR to add it to this list. We love to showcase how the community is using ANEMLL!

    Third-Party Applications Using ANEMLL

    Note

    If you're using ANEMLL in your project, please submit a PR to add it to this list. We love to showcase how the community is using ANEMLL!

    For examples of how to integrate ANEMLL into your projects, see:

    For any questions or support, reach out to us at [email protected]

    Star History Chart

    ANEMLL is licensed under the MIT License. https://opensource.org/license/mit
