(Comments)

Original link: https://news.ycombinator.com/item?id=40067677

This project introduces a new technique for managing compute in real time during large language model (LLM) inference. Instead of performing all weight multiplications, the method executes only around 20-25% of them, saving substantial compute without significantly affecting model quality. It is implemented for M1/M2/M3 GPUs and achieves inference speeds slightly better than Llama.cpp's, with room for further optimization. The approach could be described as "ad-hoc model distillation". It lets users dynamically trade off a model's speed against its accuracy, and selectively load only a chosen portion of the model's weights into memory. The method currently works with FP16, with Q8 support in progress. Further research and refinement may bring efficiency gains beyond what quantization alone offers. A detailed description and the open-source code are available here: https://kolinko.github.io/effort/.


Original text
Here's a project I've been working on for the last few months.

It's a new (I think) algorithm that lets you adjust smoothly - and in real time - how many calculations you'd like to do during inference of an LLM.

It seems that it's possible to do just 20-25% of weight multiplications instead of all of them, and still get good inference results.
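
To make this concrete, here is a minimal NumPy sketch of doing only a fraction of the multiplications in a single matrix-vector product. It illustrates the general principle, not the actual Effort algorithm: the approx_matvec helper, its scoring rule, and the effort parameter are assumptions made up for this example.

    import numpy as np

    def approx_matvec(W, x, effort=0.25):
        # Score each weight*input product by its magnitude and keep only the
        # top `effort` fraction; everything below the threshold is treated as zero.
        scores = np.abs(W) * np.abs(x)[None, :]
        k = max(1, int(effort * W.size))
        threshold = np.partition(scores.ravel(), -k)[-k]
        mask = scores >= threshold
        # A real kernel would simply skip the masked-out products; NumPy still
        # computes them here, so this models the arithmetic, not the speedup.
        return (W * mask) @ x

    rng = np.random.default_rng(0)
    W = rng.standard_normal((1024, 1024)).astype(np.float32)
    x = rng.standard_normal(1024).astype(np.float32)
    exact = W @ x
    approx = approx_matvec(W, x, effort=0.25)
    print(np.corrcoef(exact, approx)[0, 1])  # how closely the approximation tracks the exact result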

I implemented it to run on M1/M2/M3 GPUs. The mmul approximation itself can be pushed to run 2x faster before the quality of the output collapses.

The overall inference speed is just a bit faster than Llama.cpp's, because the rest of the implementation could be better, but with further development I think it can become a new method to speed up inference - in addition to quantization.

You could call it ad-hoc model distillation :)

You can change the speed / accuracy of a model at will, in real time.

Oh, and as a side effect, the data format also lets you choose how much of the model you want to load into memory. You can decide to skip, say, 10-20-40% of the least important weights.
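
For illustration, here is one way such a format could be laid out: weights stored from most to least important, so that reading only a prefix gives you the most important fraction of the model. This is a sketch under that assumption, not a description of the actual Effort file format; pack and load_fraction are hypothetical helpers, and "importance" is crudely approximated by weight magnitude.

    import numpy as np

    def pack(W):
        # Store weights from most to least important (here, by magnitude),
        # together with their flat positions, so a prefix of the data is meaningful.
        order = np.argsort(np.abs(W), axis=None)[::-1]
        return order.astype(np.int64), W.ravel()[order]

    def load_fraction(order, values, shape, fraction=0.8):
        # Materialize only the first `fraction` of the stored weights;
        # the skipped (least important) weights stay at zero.
        n = int(fraction * len(values))
        W = np.zeros(shape, dtype=values.dtype)
        W.flat[order[:n]] = values[:n]
        return W

    W = np.random.default_rng(1).standard_normal((512, 512)).astype(np.float32)
    order, values = pack(W)
    W80 = load_fraction(order, values, W.shape, fraction=0.8)  # skip the 20% least important weights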

It's implemented for Mistral, and it has also been lightly tested on Mixtral and Llama. It's FP16-only for now, but Q8 is in the works.

The algorithm is described here, and the implementation is open source.

https://kolinko.github.io/effort/

I know these are bold claims, but I hope they survive the scrutiny :)
