ONNX Runtime and CoreML May Silently Convert Your Model to FP16

Original link: https://ym2132.github.io/ONNX_MLProgram_NN_exploration

Running an ONNX model on a Mac with ONNX Runtime's (ORT) CoreMLExecutionProvider can produce predictions that differ unexpectedly from PyTorch or from ORT on CPU, because the model is implicitly converted to FP16 precision. When converting to CoreML, ORT casts the model to FP16 by default, which can change results, especially near decision thresholds. The fix is to explicitly set `ModelFormat` to "MLProgram" when creating the `InferenceSession`: `ort.InferenceSession(onnx_model_path, providers=[("CoreMLExecutionProvider", {"ModelFormat": "MLProgram"})])`. This keeps the model in FP32. The discrepancy stems from CoreML's two model formats: the older NeuralNetwork format (ORT's default), which lacks explicit types and defaults to FP16 on Apple GPUs, and the newer MLProgram format, which preserves FP32 precision through a stronger, typed intermediate representation. MLProgram uses CoreML's Model Intermediate Language (MIL) for optimisation while keeping the specified data types. Although FP16 can be faster on Apple hardware, the implicit conversion can significantly affect model accuracy, so thorough testing on every deployment platform is essential to avoid issues like this.


Original article

Running an ONNX model in ONNX RunTime (ORT) with the CoreMLExecutionProvider may change the predictions your model makes implicitly and you may observe differences when running with PyTorch on MPS or ONNX on CPU. This is because the default arguments ORT uses when converting your model to CoreML will cast the model to FP16.

The fix is to use the following setup when creating the InferenceSession:
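A minimal sketch of that setup (the model path is a placeholder; the key part is the ModelFormat provider option):

```python
import onnxruntime as ort

# Ask the CoreML execution provider to use the MLProgram model format,
# which keeps the model in FP32 instead of implicitly casting it to FP16.
session = ort.InferenceSession(
    "model.onnx",  # placeholder path to your exported ONNX model
    providers=[("CoreMLExecutionProvider", {"ModelFormat": "MLProgram"})],
)
```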

While working on my EyesOff model, I began evaluating the model and its runtime. I was looking into the ONNX format and using it to run the model efficiently. I set up a little test bench in which I ran the model using PyTorch and ONNX with ONNX Runtime (ORT), both on MPS and CPU. While checking the outputs, I noticed that the metrics from the model run with ONNX on MPS differed from those on ONNX CPU and PyTorch CPU and MPS. Note that the metrics from PyTorch on CPU and MPS were the same.

When I say ORT on MPS, I mean running through ORT's execution providers. To run an ONNX model on the Mac GPU you have to use the CoreMLExecutionProvider (more on this to come).

Now, in Figures 1 and 2, observe the metric values: the PyTorch ones (Figure 1) are the same across CPU and MPS, but that isn't the same story for ONNX (Figure 2):

Figure 1 - PyTorch CPU & MPS metrics output

Figure 2 - ORT CPU & MPS metrics output

Wow, look at the difference in Figure 2! When I saw this it was quite concerning. Floating point math can lead to small discrepancies between calculations carried out on the GPU and the CPU, but the differences here are too large to be explained by that alone.

Given the difference in metrics, I was worried that running the model with ORT was changing the output of the model and hence its behaviour. The metrics change because some predictions near the decision threshold (0.5) flipped to the opposite side of it, which can be seen in the confusion matrices for the ONNX CPU run and MPS run:

FP32 Confusion Matrix (ONNX on CPU)

                    Predicted Negative    Predicted Positive
Actual Negative     207 (TN)              24 (FP)
Actual Positive     69 (FN)               164 (TP)

FP16 Confusion Matrix (ONNX on MPS)

                    Predicted Negative    Predicted Positive
Actual Negative     206 (TN)              25 (FP)
Actual Positive     68 (FN)               165 (TP)

So two predictions flipped from negative to positive.

Having said that, the first thing I did was make my life easier by simplifying the scenario: swapping the large EyesOff model for a simple one-layer MLP and using that to run the experiments.

Why am I Using ONNX and ONNX RunTime?

Before going on it’s worth discussing what ONNX and ORT are, and why I’m even using them in the first place.

ONNX [1]

ONNX stands for Open Neural Network Exchange. It can be thought of as a common programming language in which to describe ML models. Under the hood, ONNX models are represented as graphs; these graphs outline the computation path of a model and show the operators and transformations required to get from input to prediction. These graphs are called ONNX graphs.

The use of a common language to describe models makes deployment easier and in some cases improves efficiency, in terms of resource usage or inference speed. Firstly, the ONNX graph itself can be optimised. Take PyTorch for example: you train the model in it, and while PyTorch is very mature and extremely optimised, it's such a large package that some things can be overlooked or difficult to change. By converting the model to ONNX, you take advantage of the fact that ONNX was built specifically with inference in mind, and with that come optimisations which the PyTorch team could implement but have not yet.

Furthermore, ONNX models can be run cross-platform in specialised runtimes. These runtimes are optimised for different architectures and add another layer of efficiency gains.

ONNX RunTime (ORT) [2]

ORT is one of these runtimes. ORT actually runs the model: it can be thought of as an interpreter that takes the ONNX graph, implements the operators and runs them on the specified hardware. ORT has a lot of magic built into it; the operators are extremely optimised and, through the use of execution providers, they target a wide range of hardware. Each execution provider is optimised for the specific hardware it targets, which enables the ORT team to implement extremely efficient operators, giving us another efficiency gain.
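As a quick illustration, this is roughly how execution providers are selected when creating a session (a sketch; the model path is a placeholder):

```python
import onnxruntime as ort

# See which execution providers this ORT build supports.
print(ort.get_available_providers())

# Create a session with a preferred provider; ORT falls back to the
# CPU provider for any part of the graph the first provider cannot handle.
session = ort.InferenceSession(
    "model.onnx",  # placeholder path
    providers=["CoreMLExecutionProvider", "CPUExecutionProvider"],
)
```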

CoreML [3]

As mentioned before, I used the CoreMLExecutionProvider to run the model on the Mac GPU. This execution provider tells ORT to make use of CoreML, an Apple-developed framework which lets models (neural networks and classical ML models) run on Apple hardware: CPU, GPU and ANE. ORT's job in this phase is to take the ONNX graph and convert it to a CoreML model. CoreML is Apple's answer to running efficient on-device models on Apple hardware.

Note that all of this doesn't always mean the model will run faster. Some models may run faster in PyTorch, TensorRT or another framework, which is why it is important to benchmark and test as many approaches as is sensible.

Finding the Source of the CPU vs MPS Difference - With an MLP

The MLP used is very simple: it has a single layer with 4 inputs, 3 outputs and the bias turned off. So I pretty much created a fancy matrix multiplication.
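In PyTorch terms, the model is essentially the following (a minimal sketch; the export path and input names are placeholders):

```python
import torch
import torch.nn as nn

# A single linear layer: 4 inputs -> 3 outputs, no bias,
# which is effectively just a matrix multiplication.
mlp = nn.Linear(in_features=4, out_features=3, bias=False)
mlp.eval()

# Export it to ONNX so the same weights can be run through ORT / CoreML.
example_input = torch.randn(1, 4)
torch.onnx.export(
    mlp,
    example_input,
    "mlp.onnx",  # placeholder path
    input_names=["input"],
    output_names=["output"],
)
```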

To understand where the issue was coming from I ran this MLP through some different setups:

- PyTorch CPU
- PyTorch MPS
- ORT CPU
- ORT MPS
- CoreML FP32
- CoreML FP16

The goal of this exercise is (1) to find out whether the difference in outputs also appears in a simple model, and (2) to figure out where exactly the issue arises.

Before showing the full results, I want to explain why I included the CoreML FP16 and FP32 runs, specifically the FP16 one. When I initially ran the MLP experiment I only ran PyTorch, ORT and CoreML FP32, but the output numbers from ORT on MPS looked like FP16 numbers. So I tested whether they were, and also whether the outputs from the other runs were true FP32 numbers. You can do this with a "round trip" test: convert a number to FP16 and back to FP32. If the number is unchanged after this process then it was already representable in FP16, but if it changes then it was a true FP32 value. The change happens because FP16 can represent far fewer floating point numbers than FP32. It's a very simple check to carry out:
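Something along these lines (a sketch using NumPy; the test value is just an example):

```python
import numpy as np

def is_fp16_representable(x: float) -> bool:
    """Round-trip a value through FP16 and report whether it survives unchanged."""
    x32 = np.float32(x)
    round_tripped = np.float32(np.float16(x32))
    return bool(round_tripped == x32)

print(is_fp16_representable(0.5))        # True: 0.5 exists exactly in FP16
print(is_fp16_representable(0.1234567))  # False: this value only exists in FP32
```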

FP16 and FP32, specifically, have the following specifications:

Single precision (FP32): 32 bits in total (23 significand bits + 1 sign bit, 8 exponent bits), representable range roughly \(1.2 \times 10^{-38}\) to \(3.4 \times 10^{38}\)
Half precision (FP16): 16 bits in total (10 significand bits + 1 sign bit, 5 exponent bits), representable range roughly \(5.96 \times 10^{-8}\) to \(6.55 \times 10^{4}\)

So, as FP16 is half the size, it affords a couple of benefits: it requires half the memory to store, and computations with it can be quicker too. However, this comes at a cost of precision; FP16 cannot represent very small numbers, or the distances between nearby numbers, as accurately as FP32.

An example of FP16 vs FP32 - The Largest Number Below 1

  • FP32 - 0.99999994039535522461
  • FP16 - 0.99951171875

As you see FP32 can represent a value much closer to 1.

If you need a quick refresher on FP values, see the box above. If you've already read that, or you know enough about FP already, let's look at why some predictions flip.

In my model the base threshold for a 0 or 1 class is 0.5. Both FP16 and FP32 can represent 0.5 exactly:
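The threshold itself isn't the issue, since 0.5 survives an FP16 round trip unchanged. The problem is that model outputs sitting very close to 0.5 in FP32 may not, so once the computation happens in FP16 they can end up on the other side of the threshold. A small illustrative sketch (the probability value here is made up, not taken from the EyesOff model):

```python
import numpy as np

threshold = np.float32(0.5)

# 0.5 itself is exactly representable in FP16: it survives a round trip unchanged.
assert np.float32(np.float16(0.5)) == threshold

# A probability sitting just above the threshold in FP32...
p_fp32 = np.float32(0.50001)

# ...collapses onto 0.5 in FP16: the next FP16 value above 0.5 is
# 0.5 + 2**-11 (about 0.500488), so 0.50001 rounds down to exactly 0.5.
p_fp16 = np.float16(p_fp32)

print(p_fp32 > threshold)              # True  -> class 1
print(np.float32(p_fp16) > threshold)  # False -> class 0: the prediction flips
```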

Claude came in useful here. I gave it the structure of the repo and it told me where I ought to place breakpoints to debug the ORT package (it turns out you can debug a C++ package from Python: clone the repo, build it from source and then attach a debugger to your running Python process using its PID). However, upon running the code the breakpoints weren't being hit. I was puzzled for a bit, but then I realised why. It turns out CoreML has two model formats, "NeuralNetwork" and "MLProgram"; I will call them the NN and MLP formats respectively. The behaviour of the ORT repo changes depending on which you want, as does the behaviour of CoreML, with the default being the NN format. So the breakpoints weren't hit because the code I was stepping into handled the MLP format, whereas I had not set that option, so execution flowed through the default NN code path. Knowing this, I took a step back and began experimenting with the NN vs MLP format.

The Fix - NeuralNetwork vs MLProgram CoreML Format

So, CoreML has two model formats, which determine how the model is stored and run with CoreML. The NeuralNetwork (NN) format is older and the MLProgram (MLP) format is newer. ORT specifies the NN format by default, but it does allow you to pass a flag to use the MLP format.

Testing the MLP format revealed it as the solution! See below in Figure 6 the final output, which includes ORT with both the MLP and NN formats run on the GPU.

Figure 6 - Running ORT with MLProgram format

So ORT on MPS with the NN format shows the same difference from the PyTorch CPU baseline as CoreML FP16, whereas ORT with the MLP format matches the baseline, which is exactly what I wanted. Mystery solved! By setting the model format to the newer MLProgram format, no implicit cast to FP16 takes place.

Why Did the MLProgram Format Work When the NeuralNetwork Format Didn't?

To understand the difference in behaviour between these two model formats, we need to take a deep dive into the internals of CoreML, its goals and the two formats themselves. Let's begin with CoreML.

CoreML

ORT implements methods to convert the ONNX graph into CoreML model formats. CoreML has two types of model format, which define how the model is represented in the CoreML framework, how it's stored and how it's run. The older is the NeuralNetwork format and the newer one, which solved our issue, is the MLProgram format. The reason MLProgram keeps the model at FP32 when running on MPS comes down to the differences in model representation between these two formats. Let's take a look at both of them.

Neural Network Format

As stated, the NN format is the older one; it came out in 2017. It stores models as a Directed Acyclic Graph (DAG). Each layer in the model is a node in the DAG, and the nodes encode information on the layer type, a list of input names, output names and a collection of parameters specific to the layer type [7]. We can inspect the model which is created by ORT's InferenceSession call with the following code:
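Roughly along these lines, assuming you can locate the CoreML model that ORT produces on disk (the path below is a placeholder; coremltools is used for the inspection):

```python
import coremltools as ct

# Load the CoreML model that ORT generated from the ONNX graph.
# The path is a placeholder: point it at the converted .mlmodel file.
spec = ct.utils.load_spec("converted_model.mlmodel")

# Which of the two CoreML formats is this? The default reports
# "neuralNetwork"; the MLProgram format reports "mlProgram".
print(spec.WhichOneof("Type"))

# The input/output descriptions claim FP32, even though the NN format
# will still run intermediate layers in FP16 on the GPU.
print(spec.description)

# The layer nodes themselves; note there is no dtype information here.
print(spec.neuralNetwork.layers)
```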

The CoreML runtime also decides which parts of the model can run on which hardware [9], and with the NN format each hardware unit has different abilities in terms of what FP types it can handle. Take a look at Figure 7 for Apple's guide on the hardware and the FP types each uses:

Figure 7 - The FP types Apple hardware will use when running the NN format

When running on CPU, the NN format model runs in FP32, as we observed. However, on the GPU it is implicitly cast to FP16, even though the input and output are specified as FP32, as you can see in the inspection output above. This is an inherent limitation of the NN format: the DAG structure of the model does not store any information on the types of the intermediate layers. You can see this in the inspection output; the part beginning neuralNetwork stores the info on the actual layer node, in our case a single linear layer. Observe that there is no information on the FP precision of the node itself, hence CoreML implicitly sets it to FP16 on the GPU.

Why Does CoreML Implicitly Use FP16?

From the typed execution docs for coremltools [8], the goal of CoreML is to run ML models in the most performant way, and FP16 happens to be more performant than FP32 on Apple GPUs (which makes sense, as it's half the precision). They also state that most of the time the reduced precision doesn't matter for inference. This whole blog post shows why that is false and a pitfall of the NN format: the user should choose which precision the model is run in; it should never be implicit.

MatMul Test - Is FP16 Faster on Apple Hardware?

To test Apple's claim that FP16 is more performant on Apple hardware, I carried out a large matmul: multiplying a 16384x16384 matrix by another 16384x16384 matrix should show us whether FP16 is faster. The size is arbitrary; I just wanted something large.

The matmul was run 10 times, in both FP32 and FP16 on the MPS hardware, and we take the average:

FP32 Average Time: 8.6521 seconds
FP16 Average Time: 6.7691 seconds

Speedup Factor: 1.28x faster
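A minimal sketch of this kind of benchmark, assuming PyTorch with the MPS backend (timings will of course vary by machine):

```python
import time
import torch

def bench_matmul(dtype: torch.dtype, size: int = 16384, runs: int = 10) -> float:
    """Average wall-clock time of a size x size matmul on MPS in the given dtype."""
    a = torch.randn(size, size, device="mps", dtype=dtype)
    b = torch.randn(size, size, device="mps", dtype=dtype)
    torch.mps.synchronize()  # make sure setup work has finished

    start = time.perf_counter()
    for _ in range(runs):
        _ = a @ b
    torch.mps.synchronize()  # wait for GPU work to finish before stopping the clock
    return (time.perf_counter() - start) / runs

fp32_time = bench_matmul(torch.float32)
fp16_time = bench_matmul(torch.float16)
print(f"FP32 Average Time: {fp32_time:.4f} seconds")
print(f"FP16 Average Time: {fp16_time:.4f} seconds")
print(f"Speedup Factor: {fp32_time / fp16_time:.2f}x faster")
```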

So FP16 is quicker, which sheds a bit of light on why the NN format implicitly casts to FP16: on paper, if you only care about speed, it's the better option.

A final point on the NeuralNetwork format: it's surprising that the weights themselves are stored as FP32 values (a round-trip test verifies this) yet the layer is still executed in FP16, once again showing that the NN format doesn't respect the FP precision of the layer but simply casts it to FP16.

All that is to say: this was not a bug but an explicit design choice, which funnily enough involves implicitly going against what the user wants. The NN format has its downsides, which is why Apple introduced the MLProgram format. Let's look into that.

The MLProgram (MLP) Format

The MLP format is the newer and better model format in CoreML, released in 2021. The core thing we care about is that the intermediate tensors are typed, i.e. there is no implicit casting when using the MLP format; the user controls whether the model is run in FP16 or FP32.

The MLP format allows for this because it uses a different representation of ML models: instead of a DAG, it uses a programmatic representation of the model. By representing the model as code, it allows for greater control over the operations.

Let’s see what this looks like in the stored model format and how it differs to the NN format inspection.

The code to do so is pretty similar:
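Again roughly as follows, assuming you can locate the generated model on disk (MLProgram models are stored as an .mlpackage; the path is a placeholder):

```python
import coremltools as ct

# Load the MLProgram-format model that ORT generated (placeholder path).
spec = ct.utils.load_spec("converted_model.mlpackage")

# This time the oneof reports "mlProgram" rather than "neuralNetwork".
print(spec.WhichOneof("Type"))

# The serialised MIL program: functions, ops and, crucially,
# explicit dtypes on every intermediate tensor.
print(spec.mlProgram)
```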

Much like the compiler community consolidated on tools like LLVM, the ML community has too [10]. Intermediate Representations (IRs) began to be used in ML; an IR is a hardware-agnostic specification of a program which a compiler can optimise. CoreML introduced its own IR, called MIL (Model Intermediate Language), and it is what produces the output we see in the stored MLProgram.

The MIL Approach

MIL, and IRs in general, afford a lot of benefits. They are inherently designed for optimisation, and by providing a general framework they let optimisation engineers work towards a common goal and extract maximal value. In MIL specifically, some of the differences we've discussed between the NN and MLProgram formats are implemented at this level; namely, each variable within the model has an explicit dtype.

Note that the MLProgram serialises and stores the output of the MIL phase; we've already observed how it differs from the NeuralNetwork model, with the biggest difference being the explicit types.

Further Reading on ML Compilers

https://huyenchip.com/2021/09/07/a-friendly-introduction-to-machine-learning-compilers-and-optimizers.html

Takeaways

The Fix

The solution to all the issues we discussed today: if you are using the CoreMLExecutionProvider in ORT, be sure to set ModelFormat to MLProgram. This ensures that whatever precision your model was trained in is the precision it is run with, which in my case was FP32 (whereas the default ModelFormat, NeuralNetwork, casts the model to FP16).

You can implement this as such:
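That is (reconstructed here; paths and input names are placeholders, the ModelFormat option is the important part):

```python
import onnxruntime as ort

# Explicitly request the MLProgram format so the model stays in FP32
# when running through CoreML on the GPU.
session = ort.InferenceSession(
    "model.onnx",  # placeholder path to your ONNX model
    providers=[("CoreMLExecutionProvider", {"ModelFormat": "MLProgram"})],
)

# Run inference as usual; outputs should now match the CPU / PyTorch results.
# (Input name and data below are placeholders.)
# outputs = session.run(None, {"input": my_input_array})
```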

[1] https://onnx.ai/onnx/intro/concepts.html
[2] https://onnxruntime.ai
[3] https://developer.apple.com/documentation/coreml
[4] https://en.wikipedia.org/wiki/Single-precision_floating-point_format
[5] https://en.wikipedia.org/wiki/Half-precision_floating-point_format
[6] https://floating-point-gui.de/formats/fp/
[7] https://apple.github.io/coremltools/mlmodel/Format/NeuralNetwork.html
[8] https://apple.github.io/coremltools/docs-guides/source/typed-execution.html
[9] https://github.com/microsoft/onnxruntime/issues/21271#issuecomment-3637845056
[10] https://www.modular.com/blog/democratizing-ai-compute-part-8-what-about-the-mlir-compiler-infrastructure
