Inside the M4 Apple Neural Engine, Part 1: Reverse Engineering

Original link: https://maderix.substack.com/p/inside-the-m4-apple-neural-engine

## Reverse Engineering the Apple Neural Engine: Summary (Part 1)

This series details the reverse engineering of the Apple Neural Engine (ANE), a specialized machine learning accelerator on the M4 chip, by a human researcher (maderix) working with the AI Claude Opus 4.6. Apple deliberately obscures access to the ANE, forcing developers through the abstraction of the CoreML framework.

The team bypassed CoreML, mapped the software stack down to the kernel driver, cracked the binary format, and achieved direct access to the ANE. They found that the ANE is neither a conventional GPU nor a CPU but a graph execution engine, optimized to run a compiled neural network graph as a single operation.

Key findings include a 16-core design, a deep queue (127 requests), independent power management, and a distinctive machine learning intermediate language (MIL) from which programs are compiled into compact E5 binaries. These binaries parameterize fixed compute primitives such as convolution and matrix multiplication rather than encoding the algorithms themselves.

The team's code is available on GitHub. It unlocks direct ANE programming and lays the groundwork for benchmarking (Part 2) and even training (Part 3) on hardware previously dedicated to inference.

## Apple's M4 Neural Engine: A Deep Dive

A recent reverse engineering effort, detailed at [maderix.substack.com](maderix.substack.com), digs into the inner workings of Apple's M4 Neural Engine (ANE). The project, a collaboration between a human and the AI Claude Opus, set out to understand and optimize the performance of this specialized hardware.

The discussion highlights that Apple deliberately makes the chip hard to reverse engineer, going well beyond simple symbol stripping; notably, a developer with experience on the Xcode team confirmed Apple's efforts to hide ANE functionality.

One user successfully offloaded part of NanoGPT training onto the ANE, achieving significant speedups (10× for the classifier, 34× for softmax) and resolving memory issues. Questions remain, however, about Apple's claimed 38 TOPS figure and whether such numbers reflect a misleading industry convention.

The ANE is not just for future "Apple Intelligence" features; it already powers existing functionality such as on-device OCR, image processing in Pro apps, and even Face ID. While open-source support is currently limited, broader uses, particularly alongside frameworks like MLX, are being explored. The ANE's closed nature and the lack of readily available source code, even for Apple's own teams, remain a key challenge. The article also sparked debate about AI's role in technical analysis and the possibility of "LLMisms" (turns of phrase characteristic of large language models) creeping into the writing.

## Original Article

A note on “we”:

Throughout this series, “we” refers to maderix (human) and Claude Opus 4.6 (by Anthropic) working as a pair. The reverse engineering, benchmarking, and training code were developed collaboratively — human intuition driving the exploration, AI reasoning through the data and writing the analysis. We think this kind of human–AI collaboration is a new and natural way to do systems research: one partner as the architect with intuition, the other as the engineer writing the code and crafting experiments.

This whole thing started with a simple question: can you train a model on Apple’s Neural Engine?

Apple doesn’t want you to know the answer. They don’t publish the ANE’s ISA. They don’t document its internal architecture. They don’t even give you a way to program it directly — everything goes through CoreML, which adds layers of abstraction, optimization passes, and overhead that make it nearly impossible to understand what the hardware is actually doing.

So we reverse-engineered it.

Over several days, we mapped the entire software stack from CoreML down to the IOKit kernel driver, discovered how to compile and execute programs on the ANE without CoreML, cracked the binary format, measured the true peak performance (spoiler: Apple’s “38 TOPS” number is misleading), and ultimately got a neural network training on a chip designed exclusively for inference.

This is Part 1 of a three-part series. Here we cover the reverse engineering — how we peeled back the layers to understand what the M4 Neural Engine actually is and how to talk to it directly.

The ANE is not a GPU. It’s not a CPU. It’s a graph execution engine — a fixed-function accelerator that takes a compiled neural network graph and executes the entire thing as one atomic operation. You don’t issue individual multiply-accumulate instructions. You submit a compiled program describing an entire computation graph, and the hardware executes it end-to-end.
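To make the “one atomic operation” model concrete, here is a toy sketch in plain Python (nothing ANE-specific): the host hands over a whole graph and only receives the final values, with no per-operation dispatch in between.

```python
# Illustrative sketch of a graph execution engine (not the real ANE API):
# the host submits one compiled graph and the device runs it end-to-end.

def run_graph(graph, inputs):
    """Execute a whole computation graph as one atomic submission."""
    values = dict(inputs)
    for op in graph:                      # ops in topological order
        args = [values[name] for name in op["inputs"]]
        values[op["output"]] = op["fn"](*args)
    return values

# A tiny two-op graph: y = (x * 2) + 3, submitted in one call.
graph = [
    {"inputs": ["x"], "output": "t", "fn": lambda x: x * 2},
    {"inputs": ["t"], "output": "y", "fn": lambda t: t + 3},
]
result = run_graph(graph, {"x": 5})
print(result["y"])  # -> 13
```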

Apple introduced the Neural Engine in the A11 (2017) as a 2-core design. Each generation has scaled it up:

The M4’s ANE (codename H16G) is what we’re working with. 16 cores, a queue depth of 127 evaluation requests, independent DVFS (dynamic voltage/frequency scaling), and hard power gating that drops it to exactly 0 milliwatts when idle.

We weren’t the first to poke at ANE internals.

  • hollance/neural-engine — Matthijs Hollemans’ comprehensive community documentation of ANE behavior, performance characteristics, and supported operations. The single best existing resource on ANE.

  • mdaiter/ane — Early reverse engineering with working Python and Objective-C samples, documenting the ANECompiler framework and IOKit dispatch.

  • eiln/ane — A reverse-engineered Linux driver for ANE (Asahi Linux project), providing insight into the kernel-level interface.

  • apple/ml-ane-transformers — Apple’s own reference implementation of transformers optimized for ANE, confirming design patterns like channel-first layout and 1×1 conv preference.

But to our knowledge, nobody had previously: (a) achieved direct _ANEClient API access without CoreML on M4, (b) cracked the in-memory MIL compilation path, (c) measured true peak throughput bypassing CoreML overhead, or (d) trained a model on ANE.

Our approach combined several techniques:

  1. Class discovery via dyld_info -objc on AppleNeuralEngine.framework — this dumps every Objective-C class and method

  2. Method swizzling to intercept CoreML’s calls to the private ANE frameworks

  3. Binary analysis of compiled E5 bundles to understand the neural program format

  4. Scaling analysis — varying matrix sizes, graph depths, and channel counts to infer hardware topology
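As a concrete illustration of step 1, here is a minimal sketch of filtering a `dyld_info -objc` dump for private ANE classes. The sample text below is hypothetical and much smaller than a real dump, whose exact formatting may also differ.

```python
# Sketch: filter an Objective-C class dump for the private _ANE prefix.
# The sample text is hypothetical; a real `dyld_info -objc` dump is far larger.

sample_output = """\
    class _ANEClient
    class _ANEModel
    class _ANERequest
    class NSObject
"""

def ane_classes(dump: str) -> list[str]:
    """Return Objective-C class names carrying the private _ANE prefix."""
    names = []
    for line in dump.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] == "class" and parts[1].startswith("_ANE"):
            names.append(parts[1])
    return names

print(ane_classes(sample_output))  # -> ['_ANEClient', '_ANEModel', '_ANERequest']
```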

We discovered 40+ private classes in AppleNeuralEngine.framework, including _ANEClient, _ANEModel, _ANERequest, _ANEIOSurfaceObject, _ANEInMemoryModel, and many more.

Here’s what the full ANE software stack looks like, from the public CoreML API down to hardware:
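In outline, assembled from the components named in this article (our reconstruction, not an official diagram):

```
CoreML (public API)
  └─ AppleNeuralEngine.framework   (_ANEClient, _ANEModel, _ANERequest)
       └─ ANECompiler.framework    (MIL → E5 binary)
            └─ IOKit kernel driver (request dispatch, IOSurface I/O)
                 └─ ANE hardware   (H16G: 16 cores, queue depth 127)
```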

The key insight: CoreML is not the only way in. The _ANEClient class in AppleNeuralEngine.framework provides direct access to the compile → load → evaluate pipeline. CoreML is just a convenience layer on top.

Here’s the complete sequence to compile and run a program on ANE without CoreML:
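In outline, and with exact selector names omitted since the private API surface is undocumented and may change, the sequence looks like this:

```
1. Compile the MIL program to an E5 bundle via ANECompiler
   (file-based path, or the in-memory descriptor covered below).
2. Create an _ANEModel referencing the compiled bundle.
3. Open a connection with _ANEClient and load the model.
4. Allocate IOSurfaces for inputs and outputs and wrap them
   in _ANEIOSurfaceObject instances.
5. Build an _ANERequest binding those surfaces and submit it
   for evaluation; completion is reported asynchronously.
```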

The I/O uses IOSurfaces — the same shared memory mechanism used for GPU textures. This means zero-copy transfers between GPU and ANE are theoretically possible if you share the same IOSurfaceRef.

Key finding: The ANE supports a queue depth of 127 — you can have up to 127 evaluation requests in-flight simultaneously. This is far deeper than most accelerator queues and suggests the hardware is designed for high-throughput streaming inference.
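A small sketch of what a 127-deep queue means for a client: a counting semaphore models the hardware queue, and submission only blocks once 127 requests are outstanding. All names here are our own, not Apple's.

```python
# Sketch: keep up to 127 evaluation requests in flight (the reported ANE
# queue depth). A semaphore stands in for the hardware queue.

import threading

ANE_QUEUE_DEPTH = 127

class RequestQueue:
    def __init__(self, depth: int = ANE_QUEUE_DEPTH):
        self._slots = threading.Semaphore(depth)
        self.in_flight = 0
        self._lock = threading.Lock()

    def submit(self, request) -> None:
        self._slots.acquire()          # blocks only when 127 requests are pending
        with self._lock:
            self.in_flight += 1

    def complete(self) -> None:
        with self._lock:
            self.in_flight -= 1
        self._slots.release()

q = RequestQueue()
for i in range(100):                   # 100 < 127, so none of these block
    q.submit(f"eval-{i}")
print(q.in_flight)  # -> 100
```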

CoreML doesn’t send neural networks to ANE in ONNX or protobuf format. It uses MIL — Machine Learning Intermediate Language — a typed SSA (Static Single Assignment) representation that looks like this:
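The listing below is an approximation in the style of coremltools' textual MIL printer, not the exact program from our runs; operation attributes are elided with `...`:

```
main(%x: (1, 1024, 1, 1024, fp16)(Tensor)) {
  block0() {
    %w: (1, 1024, 1, 1024, fp16)(Tensor) = const(...)
    %y: (1, 1024, 1, 1024, fp16)(Tensor) = matmul(x=%x, y=%w, ...)
  } -> (%y)
}
```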

MIL is surprisingly readable. Every value is typed with both precision and shape. Operations are named and take keyword arguments. The function signature declares input tensors with explicit dimensions.

The tensor layout follows ANE’s native NCDHW + Interleave format: [Batch, Channels, Depth, Height, Width]. For a 1024×1024 matrix, this becomes [1, 1024, 1, 1024] in 4D.
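A trivial sketch of the shape mapping quoted above. The row-to-channel and column-to-width assignment is our assumption, since Apple does not document the layout:

```python
# Sketch: map a plain 2-D matrix into the 4-D shape quoted above.
# Assumption (ours): rows -> channels, columns -> width, batch and height = 1.

def matrix_to_ane_shape(rows: int, cols: int) -> tuple[int, int, int, int]:
    """[Batch, Channels, Height, Width] with batch and height fixed at 1."""
    return (1, rows, 1, cols)

print(matrix_to_ane_shape(1024, 1024))  # -> (1, 1024, 1, 1024)
```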

When ANECompiler processes a MIL program, it produces an E5 binary — a FlatBuffer-structured file with these sections:

Here’s the fascinating part: a 1024×1024 matmul compiles to 2,688 bytes. A 128×128 matmul compiles to 2,680 bytes. Nearly identical. The E5 binary isn’t encoding the matrix multiplication algorithm — it’s encoding a parameterized program whose behavior is controlled by tensor descriptors at runtime. The “microcode” is more like a configuration than traditional machine code.

Implication: The ANE hardware likely has a small set of fixed compute primitives (convolution, matrix multiply, elementwise) that are parameterized by tensor shape descriptors. The E5 binary describes which primitives to chain and how to connect them, not the compute itself.
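A toy model of that implication (all names are ours; the real E5 FlatBuffer schema is not public): the program is a chain of primitive descriptors, so its size tracks the number of primitives, not the tensor dimensions.

```python
# Sketch of "configuration, not code": an E5-style program as a chain of
# fixed primitives parameterized by shape descriptors.

from dataclasses import dataclass

@dataclass
class PrimitiveOp:
    kind: str                 # e.g. "matmul", "conv", "elementwise"
    in_shape: tuple
    out_shape: tuple

def program_size(ops: list) -> int:
    """Descriptor count, not FLOPs: independent of the matrix dimensions."""
    return len(ops)

small = [PrimitiveOp("matmul", (1, 128, 1, 128), (1, 128, 1, 128))]
large = [PrimitiveOp("matmul", (1, 1024, 1, 1024), (1, 1024, 1, 1024))]
print(program_size(small) == program_size(large))  # -> True
```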

The file-based compilation path works but has a problem: it requires writing MIL text to disk, creating a directory structure, and pointing the compiler at it. For training — where we need to recompile with updated weights every few steps — this filesystem round-trip is unacceptable.

We discovered _ANEInMemoryModelDescriptor, which accepts MIL text directly in memory:

Getting this to work required solving three gotchas that cost us days of debugging:

  1. NSData, not NSString: The milText parameter wants an NSData* containing UTF-8 bytes, not an NSString*. Passing a string fails silently.

  2. NSDictionary, not NSData: The weights parameter is a dictionary mapping weight names to NSData blobs, not a single data buffer.

  3. Temp directory workaround: Even the “in-memory” path internally writes to a temp directory. If you don’t have write access to the default location, compilation fails with an opaque error. We had to ensure a writable temp path was available.
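The first two gotchas reduce to getting the payload types right. A pure-Python stand-in (no Objective-C involved; `build_payload` is our own helper, not an Apple API):

```python
# Sketch of gotchas 1-3: UTF-8 bytes for the MIL text (NSData-like), a
# name -> blob dict for weights (NSDictionary-like), and a writable temp dir.

import os
import struct
import tempfile

def build_payload(mil_text: str, weights: dict):
    mil_bytes = mil_text.encode("utf-8")            # gotcha 1: bytes, not str
    weight_blobs = {                                # gotcha 2: dict of blobs
        name: struct.pack(f"{len(vals)}f", *vals)
        for name, vals in weights.items()
    }
    tmp = tempfile.mkdtemp()                        # gotcha 3: ensure a writable
    os.environ.setdefault("TMPDIR", tmp)            # temp path exists
    return mil_bytes, weight_blobs

mil, blobs = build_payload("main(...) { ... }", {"w0": [1.0, 2.0]})
print(type(mil).__name__, len(blobs["w0"]))  # -> bytes 8
```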

And one delightful discovery: Apple’s internal code references a Desctiptor (sic) in one of the class names. Even Apple engineers make typos in private APIs. :)

Through IOKit probing, scaling analysis, and power measurement, we’ve built this profile of the M4 ANE:

IOKit’s IOReportLegend reveals the ANE has its own independent power management with adaptive clocking, dithering, and multiple hardware/software triggers:

This level of DVFS sophistication suggests the ANE can independently scale its frequency and voltage based on workload characteristics, separate from the CPU and GPU power domains.

From ANECompiler.framework exports, the ANE natively supports:

Notably, Conv appears to be the ANE’s primary compute primitive. As we’ll show in Part 2, expressing matmul as 1×1 convolution unlocks significantly higher throughput.
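The equivalence is easy to check in plain Python: with the input laid out as channels × width and a 1×1 kernel, each output element is a dot product over input channels, which is exactly a matrix multiply. This demonstrates only the math, not anything about ANE internals.

```python
# Check that a 1x1 convolution over channels computes a matrix multiply.

def matmul(a, b):                       # a: MxK, b: KxN
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def conv1x1(x, w):
    # x: [C_in][W] input "image", w: [C_out][C_in] 1x1 kernels.
    # Each output pixel is a dot product over input channels.
    return [[sum(w[co][ci] * x[ci][p] for ci in range(len(x)))
             for p in range(len(x[0]))] for co in range(len(w))]

x = [[1, 2], [3, 4]]                    # C_in=2, W=2
w = [[5, 6], [7, 8]]                    # C_out=2, C_in=2
print(conv1x1(x, w) == matmul(w, x))    # -> True
```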

All data transfer to and from the ANE uses IOSurfaces. The protocol is straightforward:

Since IOSurfaces are the same mechanism used for GPU texture sharing, this opens up the possibility of zero-copy GPU↔ANE pipelines where both accelerators operate on the same memory.
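The zero-copy idea in miniature (pure Python; two `memoryview`s stand in for two device mappings of one IOSurface): a write through one view is visible through the other without any copy.

```python
# Sketch: two "devices" share one buffer, so writes need no copy between them.

buf = bytearray(16)                  # stand-in for a shared IOSurface
gpu_view = memoryview(buf)           # "GPU" mapping
ane_view = memoryview(buf)           # "ANE" mapping of the same memory

gpu_view[0:4] = b"\x01\x02\x03\x04"  # GPU writes...
print(bytes(ane_view[0:4]))          # ...and the ANE view sees it immediately
```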

The ANE compiler caches E5 binaries on disk to avoid recompilation:

First compile takes ~20-40ms. Cache hits are effectively free. This matters for inference (compile once, run forever) but creates challenges for training, where weights change every step.
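A sketch of how such a cache could be keyed; the hashing scheme is our guess, since Apple's actual cache key is undocumented. Identical programs pay the compile cost once.

```python
# Sketch: an E5-style compile cache keyed by a hash of the MIL source.

import hashlib

class CompileCache:
    def __init__(self):
        self._cache = {}
        self.compiles = 0

    def compile(self, mil_text: str) -> bytes:
        key = hashlib.sha256(mil_text.encode("utf-8")).hexdigest()
        if key not in self._cache:
            self.compiles += 1                           # slow path (~20-40 ms on device)
            self._cache[key] = b"E5" + key[:8].encode()  # placeholder "binary"
        return self._cache[key]                          # cache hit: effectively free

cc = CompileCache()
cc.compile("main(...)")
cc.compile("main(...)")   # identical source: cache hit, no recompile
print(cc.compiles)  # -> 1
```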

Several discovered classes remain unexplored and hint at capabilities we haven’t tested:

  • _ANEChainingRequest — may enable chaining multiple compiled models in a single dispatch

  • _ANESharedEvents / _ANESharedSignalEvent / _ANESharedWaitEvent — Metal-style fence/signal primitives for GPU↔ANE synchronization

  • _ANEPerformanceStats — possibly hardware performance counters

  • _ANEVirtualClient — virtualized ANE access, potentially for multi-process sharing

And some things we genuinely don’t know:

  • The exact ANE core microarchitecture and ISA

  • How cores are assigned to operations within a graph

  • The ANE clock frequency (DVFS makes this dynamic)

  • Whether hardware perf counters are accessible

  • The exact SRAM topology (banked? unified? per-core?)

Now that we have direct access to the ANE, we can actually measure what it can do. In Part 2, we’ll benchmark everything: matmul scaling, the SRAM performance cliff, why convolution is 3× faster than matmul, why Apple’s “38 TOPS” claim is misleading, and how bypassing CoreML gives you 2-4× more throughput.

In Part 3, we’ll do the thing Apple says you can’t: train a neural network on the Neural Engine.

All code is available at github.com/maderix/ANE in the ane/ directory. Tested on M4 Mac Mini, macOS 15.x.
