(comments)

Original link: https://news.ycombinator.com/item?id=40977103

In recent years, large language models (LLMs) have made major strides. After ChatGPT's release there was great excitement in the field, but the open source landscape lagged behind: the best available open foundation LLM was GPT-2, two generations out of date. Then Meta released LLaMA, a well-trained base foundation model, which ignited the growth of open models. LLaMA was integrated into the Hugging Face Transformers library, and its weight files were made accessible on the Hugging Face website for everyone to use. Initially, running LLaMA locally was challenging due to high RAM requirements and limited compute resources. Developers created methods to shrink the model via quantization, and projects like Llama.cpp rose to prominence; Hugging Face Transformers also added quantization support through bitsandbytes. Over time, improved quantization techniques allowed LLaMA to run on a wide range of systems with minimal RAM usage while maintaining high accuracy. Tools for finetuning LLaMA appeared, leading to a large number of LLaMA finetunes with improved accuracy; Stanford, LMSYS, Microsoft, 01AI, Mistral, and several other groups developed excellent LLaMA finetunes that raised accuracy further. New inference engines focused on running LLMs efficiently, such as vLLM, emerged, offering better AMD GPU support and allowing open LLMs to be used locally with reasonable efficiency on machines with AMD GPUs. Next, Meta released LLaMA 2, considered the best open source LLM of its time; RLHF instruction finetunes for chat and human evaluation data confirmed its superiority over existing open source LLMs, and support for LLaMA 2 was quickly added to Llama.cpp and Hugging Face Transformers. Even with this progress, operating LLMs remained complex, requiring substantial technical knowledge and time; simplifying tools such as GPT4All, Ollama, and Exllama streamlined the process and lowered the barrier to entry for anyone interested in using LLMs. In addition, Cohere released its own LLM, Command R+, designed specifically for RAG-related tasks with a 128k context length. Most recently, LLaMA 3 debuted.


Original text

Here's a summary of what's happened the past couple of years and what tools are out there.

After ChatGPT was released, there was a lot of hype in the space, but open source was far behind. IIRC the best open foundation LLM that existed was GPT-2, and it was two generations behind.

A while later, Meta released LLaMA[1], a well trained base foundation model, which brought an explosion to open source. It was soon implemented in the Hugging Face Transformers library[2] and the weights were spread across the Hugging Face website for anyone to use.

At first, it was difficult to run locally. Few developers had the hardware or money to run it: it required too much RAM, and IIRC Meta's original implementation didn't support running on the CPU. But developers soon came up with methods to make it smaller via quantization. The biggest project for this was Llama.cpp[3], which is probably still the biggest open source project today for running LLMs locally. Hugging Face Transformers also added quantization support through bitsandbytes[4].
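As a rough sketch of what the Transformers + bitsandbytes path looks like (the model id is just an example, and the config options shown here arrived in later versions):

    # pip install transformers accelerate bitsandbytes
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    # Quantize the weights to 4-bit NF4 on the fly while loading.
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype="bfloat16",
    )

    model_id = "meta-llama/Llama-2-7b-hf"  # example model id
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=bnb_config,
        device_map="auto",  # spread layers across available GPUs/CPU
    )
    tokenizer = AutoTokenizer.from_pretrained(model_id)

    inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
    print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))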

Over the next months there was rapid development in open source. Quantization techniques improved, which meant LLaMA was able to run with less and less RAM, with greater and greater accuracy, on more and more systems. Tools came out that were capable of finetuning LLaMA, and hundreds of finetunes followed, trained on instruction-following, RLHF, and chat datasets, which drastically increased accuracy even further. During this time, Stanford's Alpaca, LMSYS's Vicuna, Microsoft's Wizard, 01ai's Yi, Mistral, and a few others made their way onto the open LLM scene with some very good LLaMA finetunes.
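Many of those community finetunes used parameter-efficient methods like LoRA; a minimal sketch with Hugging Face's peft library (the model id and hyperparameters are illustrative, not taken from any specific project above):

    # pip install peft transformers
    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, get_peft_model

    base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

    # Train small low-rank adapter matrices instead of the full weights.
    lora = LoraConfig(
        r=8,                                  # adapter rank
        lora_alpha=16,
        target_modules=["q_proj", "v_proj"],  # attention projections to adapt
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(base, lora)
    model.print_trainable_parameters()  # typically well under 1% of the total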

A new inference engine (software for running LLMs, like Llama.cpp, Transformers, etc.) called vLLM[5] came out, which was capable of running LLMs more efficiently than was previously possible in open source. Soon it would even get good AMD support, making it possible for those with AMD GPUs to run open LLMs locally and with relative efficiency.
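A minimal sketch of vLLM's offline Python API (the model id is an example):

    # pip install vllm
    from vllm import LLM, SamplingParams

    llm = LLM(model="meta-llama/Llama-2-7b-chat-hf")  # example model id
    params = SamplingParams(temperature=0.8, max_tokens=64)

    # vLLM batches prompts and schedules them with PagedAttention,
    # which is where much of its efficiency comes from.
    outputs = llm.generate(["What is quantization?", "Explain KV caching."], params)
    for out in outputs:
        print(out.outputs[0].text)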

Then Meta released Llama 2[6]. Llama 2 was by far the best open LLM of its time, released with RLHF instruction finetunes for chat and with human evaluation data that put its open LLM leadership beyond doubt. Existing tools like Llama.cpp and Hugging Face Transformers quickly added support, and users had access to the best LLM open source had to offer.

At this point in time, despite all the advancements, it was still difficult to run LLMs. Llama.cpp and Transformers were great engines for running LLMs, but the setup process was difficult and required a lot of time. You had to find the best LLM, quantize it in the best way for your computer (or figure out how to identify and download a pre-quantized one from Hugging Face), set up whatever engine you wanted, figure out how to use your quantized LLM with that engine, fix any mistakes you made along the way, and finally figure out how to prompt your specific LLM in a chat-like format.
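To give a flavor of that workflow, here's a sketch using the llama-cpp-python bindings with an already-quantized GGUF file (GGUF is llama.cpp's current format; the file path is a placeholder, and which file you need depends on the model you pick):

    # pip install llama-cpp-python
    from llama_cpp import Llama

    # Load a pre-quantized GGUF file downloaded from Hugging Face.
    llm = Llama(
        model_path="./llama-2-7b-chat.Q4_K_M.gguf",  # placeholder path
        n_ctx=4096,        # context window
        n_gpu_layers=-1,   # offload all layers to the GPU if available
    )

    # The bindings apply the model's chat template for you.
    result = llm.create_chat_completion(
        messages=[{"role": "user", "content": "Why is the sky blue?"}],
        max_tokens=128,
    )
    print(result["choices"][0]["message"]["content"])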

However, tools started coming out to make this process significantly easier. The first one of these that I remember was GPT4All[7]. GPT4All was a wrapper around Llama.cpp which made it easy to install, made it easy to select the LLM you want (pre-quantized options for easy download from a download manager), and added a chat UI which made LLMs easy to use. This significantly reduced the barrier to entry for those who were interested in using LLMs.
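GPT4All also ships Python bindings; a minimal sketch (the model filename is an example from their catalog and may have changed):

    # pip install gpt4all
    from gpt4all import GPT4All

    # Downloads the pre-quantized model on first use.
    model = GPT4All("Meta-Llama-3-8B-Instruct.Q4_0.gguf")  # example filename

    with model.chat_session():
        print(model.generate("Name three uses for a local LLM.", max_tokens=128))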

The second project that I remember was Ollama[8]. Also a wrapper around Llama.cpp, Ollama gave most of what GPT4All had to offer but in an even simpler way. Today, I believe Ollama is bigger than GPT4All although I think it's missing some of the higher-level features of GPT4All.
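Ollama runs as a local server with an HTTP API, and there's an official Python client; a minimal sketch (assumes the server is running and you've already run "ollama pull llama3"):

    # pip install ollama
    import ollama

    response = ollama.chat(
        model="llama3",  # any model you've pulled locally
        messages=[{"role": "user", "content": "Summarize what quantization does."}],
    )
    print(response["message"]["content"])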

Another important tool that came out during this time is called Exllama[9]. Exllama is an inference engine with a focus on modern consumer Nvidia GPUs and advanced quantization support based on GPTQ. It is probably the best inference engine for squeezing performance out of consumer Nvidia GPUs.

Months later, Nvidia came out with another new inference engine called TensorRT-LLM[10]. TensorRT-LLM is capable of running most LLMs and does so with extreme efficiency. It is the most efficient open source inference engine that exists for Nvidia GPUs. However, it also has the most difficult setup process of any inference engine and is made primarily for production use cases and Nvidia AI GPUs, so don't expect it to work on your personal computer.

With the rumors of GPT-4 being a Mixture of Experts (MoE) LLM, research breakthroughs in MoE, and some small MoE LLMs coming out, interest in MoE LLMs was at an all-time high. Mistral, which had already proven itself with very impressive LLaMA finetunes, capitalized on this interest by releasing Mixtral 8x7b[11], the best accuracy-for-its-size LLM the local LLM community had seen to date. Eventually MoE support was added to all the inference engines, and it became a very popular mid-to-large sized LLM.
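For intuition, the core of an MoE layer is a router that sends each token through only a few expert FFNs instead of one big one; a toy top-2 routing sketch in PyTorch (sizes and expert count are illustrative, and real implementations batch this instead of looping):

    import torch

    tokens, dim, n_experts, top_k = 8, 16, 4, 2
    x = torch.randn(tokens, dim)

    router = torch.nn.Linear(dim, n_experts)     # learned gating network
    experts = [torch.nn.Linear(dim, dim) for _ in range(n_experts)]

    gates = router(x).softmax(dim=-1)            # (tokens, n_experts)
    weights, chosen = gates.topk(top_k, dim=-1)  # best 2 experts per token
    weights = weights / weights.sum(dim=-1, keepdim=True)

    out = torch.zeros_like(x)
    for t in range(tokens):
        for slot in range(top_k):
            e = chosen[t, slot].item()
            # Only top_k of n_experts run per token, so compute cost
            # scales with k while parameter count scales with n_experts.
            out[t] += weights[t, slot] * experts[e](x[t])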

Cohere released their own LLM as well called Command R+[12] built specifically for RAG-related tasks with a context length of 128k. It's quite large and doesn't have notable performance on many metrics, but it has some interesting RAG features no other LLM has.

More recently, Llama 3[13] was released, which, like previous Llama releases, blew every other open LLM out of the water. The smallest version of Llama 3 (Llama 3 8b) has the greatest accuracy for its size of any open LLM, and the largest version released so far (Llama 3 70b) beats every other open LLM on almost every metric.

Less than a month ago, Google released Gemma 2[14], the largest version of which performs very well under human evaluation despite being less than half the size of Llama 3 70b, but only decently on automated benchmarks.

If you're looking for a tool to get started running LLMs locally, I'd go with either Ollama or GPT4All. They make the process about as painless as possible. I believe GPT4All has more features like using your local documents for RAG, but you can also use something like Open WebUI[15] with Ollama to get the same functionality.

If you want to get into the weeds a bit and extract some more performance out of your machine, I'd go with using Llama.cpp, Exllama, or vLLM depending upon your system. If you have a normal, consumer Nvidia GPU, I'd go with Exllama. If you have an AMD GPU that supports ROCm 5.7 or 6.0, I'd go with vLLM. For anything else, including just running it on your CPU or M-series Mac, I'd go with Llama.cpp. TensorRT-LLM only makes sense if you have an AI Nvidia GPU like the A100, V100, A10, H100, etc.

[1] https://ai.meta.com/blog/large-language-model-llama-meta-ai/

[2] https://github.com/huggingface/transformers

[3] https://github.com/ggerganov/llama.cpp

[4] https://github.com/bitsandbytes-foundation/bitsandbytes

[5] https://github.com/vllm-project/vllm

[6] https://ai.meta.com/blog/llama-2/

[7] https://www.nomic.ai/gpt4all

[8] http://ollama.ai/

[9] https://github.com/turboderp/exllamav2

[10] https://github.com/NVIDIA/TensorRT-LLM

[11] https://mistral.ai/news/mixtral-of-experts/

[12] https://cohere.com/blog/command-r-plus-microsoft-azure

[13] https://ai.meta.com/blog/meta-llama-3/

[14] https://blog.google/technology/developers/google-gemma-2/

[15] https://github.com/open-webui/open-webui
