Local LLM inference – impressive but too hard to work with

Original link: https://medium.com/@aazo11/local-llm-inference-897a06cc17a2

Local LLM inference with llama.cpp and Ollama is fast in both time to first token and tokens per second, with the two roughly on par, but it still trails OpenAI's gpt-4o-mini. WebLLM is the slowest, likely due to inefficient WebGPU acceleration. Local inference is feasible, but the biggest obstacle is model selection and deployment. Finding a suitable small, task-specific model (e.g. for "text-to-SQL") that runs on a MacBook M2 is extremely challenging: the sheer number of options on Hugging Face and the large size of even quantized models make downloads slow and hurt the user experience. Developer tooling is still immature, which keeps local inference from seeing practical use beyond niche applications. A future solution will need to make it simple to train and deploy small models, integrate tightly with cloud LLMs, and handle downloads, caching, and execution seamlessly, so that the model's location is transparent to the user.

This Hacker News thread focuses on the challenges and potential of local large language model (LLM) inference. The original article highlights how capable local LLMs already are on a laptop, but also points to the difficulty of finding, deploying, and downloading task-specific models given their size, which hurts the user experience.

Commenters proposed strategies for overcoming these obstacles. One suggestion is to start with an LLM vendor's API while downloading a local model in the background, reducing long-term inference costs. Another is to focus on small models trained on relevant, task-specific data, which can beat a pure API approach in constrained applications. Gaming is seen as a promising early adopter, since API calls are expensive and the field already has GPU infrastructure in place.

Specific tools such as Llamafile (for portability) and optillm (for improving inference performance) were mentioned as potential solutions. The overall consensus seems to be that while local LLM inference is not yet production-ready for every use case, ongoing development and optimization are rapidly improving its viability, especially for targeted applications.
Related articles
  • (Comments) 2024-09-23
  • (Comments) 2024-02-10
  • (Comments) 2024-07-17
  • (Comments) 2024-02-19
  • (Comments) 2024-04-05

  • Original article

    As the chart above shows, llama.cpp and Ollama are both blazing fast in time to first token (TTFT). OpenAI is slightly slower, likely due to network overhead and authentication. WebLLM was the slowest.

    In terms of tokens per second (TPS), llama.cpp and Ollama are comparable, which makes sense as they are the same under the hood. WebLLM topped out at only half the TPS of the other frameworks. I can only assume this is because WebGPU acceleration is not as efficient at utilizing the local GPU as the llama.cpp implementation, which accesses the GPU directly.

    All the local inference solutions were slower than OpenAI running gpt-4o-mini, a considerably larger model.
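
    As a rough illustration of how such numbers can be reproduced, the minimal sketch below times a streaming request against a local Ollama server: the delay before the first streamed chunk approximates TTFT, and chunks per second approximates TPS, since Ollama usually emits about one token per chunk. The model name "llama3" and the prompt are assumptions, not the exact setup from the benchmark.

        # Minimal TTFT/TPS probe against a local Ollama server. Assumes Ollama is
        # running on its default port and the model has already been pulled.
        import json
        import time

        import requests

        def benchmark_ollama(model: str, prompt: str) -> None:
            start = time.perf_counter()
            first_token_at = None
            n_chunks = 0

            # /api/generate streams newline-delimited JSON chunks, roughly one token each.
            with requests.post(
                "http://localhost:11434/api/generate",
                json={"model": model, "prompt": prompt},
                stream=True,
                timeout=300,
            ) as resp:
                resp.raise_for_status()
                for line in resp.iter_lines():
                    if not line:
                        continue
                    chunk = json.loads(line)
                    if first_token_at is None and chunk.get("response"):
                        first_token_at = time.perf_counter()
                    n_chunks += 1
                    if chunk.get("done"):
                        break

            total = time.perf_counter() - start
            ttft = (first_token_at or start) - start
            print(f"TTFT: {ttft:.3f}s, ~{n_chunks / total:.1f} chunks/s over {total:.1f}s")

        if __name__ == "__main__":
            benchmark_ollama("llama3", "Explain what a B-tree is in two sentences.")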

    While I did not track memory usage or CPU/GPU utilization, I did not notice any side effects while using other apps on my laptop as the benchmarks ran.

    While the performance of local inference lags cloud solutions, it is already good enough for many tasks. This brings us to the main problem I encountered: finding and deploying the correct model for a given task.

    Given the resource constraints, models that run locally must be much smaller than models running in the cloud. For a developer, there is currently no way to find (or easily tune) a model that can do “text-to-SQL” and work on a MacBook with an M2 chip. Even when I had shelved the prototype idea and was just aiming to benchmark these tools with deepseek-qwen-7B, I had to decide which of the 663 different models matching this name on HuggingFace I should download for llama.cpp.
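
    One partial workaround is to filter Hugging Face programmatically for repositories that actually ship GGUF weights, since only those are directly loadable by llama.cpp. The sketch below illustrates that idea; the search string and the choice of sorting by downloads are assumptions, not guidance from the original post.

        # Narrow the Hugging Face results down to repos that ship GGUF weights,
        # which are the ones llama.cpp can load directly. Search string is assumed.
        from huggingface_hub import HfApi

        api = HfApi()
        candidates = api.list_models(
            search="DeepSeek-R1-Distill-Qwen-7B",
            library="gguf",        # only GGUF repos are usable by llama.cpp
            sort="downloads",
            direction=-1,
            limit=10,
        )

        for model in candidates:
            print(model.id, model.downloads)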

    Furthermore, even a quantized version of a distilled 7B model is over 5GB. Downloading and loading these models is very slow even on fiber internet. For an application developer, this leads to a degraded initial user experience. For example, if your webapp uses WebLLM, the user will need to sit for a few minutes while the model is downloaded to their machine.

    Local LLM inference is possible. It works today, but the developer tooling will need to mature before real world applications leverage local inference beyond niche use cases.

    Any real solution will need to make it dead simple to train and deploy small, task-specific models — and integrate tightly with cloud LLMs. It will have to handle downloads, caching, and local execution behind the scenes, so the user never notices where the model is running or how it got there.
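
    A minimal sketch of that routing idea, under the assumption that the app can answer from a cloud API while a small quantized model is fetched in the background (hf_hub_download caches the file locally), then switch to llama.cpp once the weights are on disk. The repository, file name, and cloud model below are placeholders, not choices from the original post.

        # Serve from the cloud immediately, switch to a local GGUF model once it
        # has finished downloading. Repo, file, and cloud model are assumptions.
        import threading

        from huggingface_hub import hf_hub_download
        from llama_cpp import Llama
        from openai import OpenAI

        REPO_ID = "Qwen/Qwen2.5-1.5B-Instruct-GGUF"      # assumed repo
        FILENAME = "qwen2.5-1.5b-instruct-q4_k_m.gguf"   # assumed quantized file

        class HybridLLM:
            def __init__(self) -> None:
                self.cloud = OpenAI()          # uses OPENAI_API_KEY from the env
                self.local: Llama | None = None
                threading.Thread(target=self._fetch_local, daemon=True).start()

            def _fetch_local(self) -> None:
                # Downloads once, then hits the local Hugging Face cache on restart.
                path = hf_hub_download(repo_id=REPO_ID, filename=FILENAME)
                self.local = Llama(model_path=path, n_ctx=4096, verbose=False)

            def ask(self, prompt: str) -> str:
                messages = [{"role": "user", "content": prompt}]
                if self.local is not None:
                    out = self.local.create_chat_completion(messages=messages)
                    return out["choices"][0]["message"]["content"]
                # Fall back to the cloud while the local model is still downloading.
                out = self.cloud.chat.completions.create(
                    model="gpt-4o-mini", messages=messages
                )
                return out.choices[0].message.content

        if __name__ == "__main__":
            llm = HybridLLM()
            print(llm.ask("Write a SQL query that counts orders per customer."))

    A production version would also need to compare output quality before switching, since a 1-2B parameter local model will not match the cloud model on every task.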
