Hardware Acceleration of LLMs: A comprehensive survey and comparison

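The paper "Hardware Acceleration of LLMs: A comprehensive survey and comparison" surveys research on accelerating the transformer networks behind large language models (LLMs) with hardware accelerators. The authors compare the proposed frameworks qualitatively and quantitatively by technology, processing platform (FPGA, ASIC, in-memory, GPU), speedup, performance (GOPs), and energy efficiency (GOPs/W). Because each scheme is implemented on a different process technology, a fair comparison is difficult. To address this, the authors extrapolate each design's performance and energy efficiency to the same process technology, and they implement parts of the LLMs on several FPGA chips to support that extrapolation. The goal of the study is to identify which hardware acceleration approaches for LLMs are most effective for natural language processing workloads.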
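To make the idea of comparing designs on a common process technology concrete, here is a minimal sketch. The summary above does not state the scaling rule the authors use, so the power-law exponents `perf_exp` and `power_exp`, the `Accelerator` record, and the example numbers are all hypothetical placeholders rather than the paper's method or data.

```python
from dataclasses import dataclass


@dataclass
class Accelerator:
    name: str       # hypothetical label, not a design from the survey
    node_nm: float  # process node the result was reported on
    gops: float     # reported throughput in GOPs
    watts: float    # reported power draw in W


def energy_efficiency(acc: Accelerator) -> float:
    """Energy efficiency in GOPs/W."""
    return acc.gops / acc.watts


def extrapolate(acc: Accelerator, target_nm: float,
                perf_exp: float = 1.0, power_exp: float = 1.0) -> Accelerator:
    """Project a design onto another node.

    ASSUMPTION: throughput grows and power shrinks as a power of the
    node ratio. The exponents are illustrative, not taken from the paper.
    """
    s = acc.node_nm / target_nm  # s > 1 when moving to a smaller node
    return Accelerator(acc.name, target_nm,
                       acc.gops * s ** perf_exp,
                       acc.watts / s ** power_exp)


if __name__ == "__main__":
    reported = Accelerator("design-A", node_nm=28, gops=100.0, watts=10.0)
    projected = extrapolate(reported, target_nm=14)
    print(energy_efficiency(reported), energy_efficiency(projected))
```

With a rule like this, two designs reported on different nodes can be projected onto one node before their GOPs and GOPs/W figures are ranked; the ranking is only as trustworthy as the assumed exponents.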

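The article also discusses the memory bottleneck in large language models (LLMs), which stems from the widening gap between improvements in CPU speed and improvements in memory bandwidth. To address it, newer techniques such as compute-in-memory (CIM) and processing-in-memory (PIM) are being considered. These techniques operate on data directly in memory instead of first moving it into CPU registers, which can reduce latency, lower power consumption, and potentially overcome the "memory wall". Comparisons between such designs need care when performance is extrapolated, however, especially for ASICs and FPGAs built on different semiconductor processes. Notably, inefficient use of gates and non-optimal logic paths are trade-offs of the cited designs.

Despite these challenges, the authors propose a novel architecture called BitGrid, which uses a two-dimensional array of cells for highly parallel computation. Compared with conventional approaches, BitGrid has potential advantages such as better energy efficiency, lower complexity, and ease of implementation. Practical difficulties arise when memory and compute elements are combined, though, because the two are manufactured with different processes; the result is poorer performance, higher production cost, and less adaptability to changing requirements. The authors also question the in-memory computing claims of companies such as Mythic. Flash is ruled out for in-memory computing because of its slow write speed, and Mythic's approach instead uses analog circuits for model inference. D-Matrix and Cerebras are given as examples of companies doing true in-memory computing, albeit at high cost. Finally, the authors note that flash serves mainly as configuration memory that holds a model's constant weights, not as working memory for ongoing computation. In short, in-memory computing is promising, but it faces significant technical hurdles and its widespread adoption remains uncertain.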
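The "memory wall" described above can be illustrated with the standard roofline model, in which attainable throughput is the smaller of the compute peak and memory bandwidth times arithmetic intensity. This is a generic sketch rather than anything taken from the paper, and the function names and all numbers are hypothetical.

```python
def attainable_gops(peak_gops: float, bandwidth_gb_s: float,
                    ops_per_byte: float) -> float:
    """Roofline bound: min(compute peak, bandwidth * arithmetic intensity)."""
    return min(peak_gops, bandwidth_gb_s * ops_per_byte)


def is_memory_bound(peak_gops: float, bandwidth_gb_s: float,
                    ops_per_byte: float) -> bool:
    """True when memory bandwidth, not compute, limits throughput."""
    return bandwidth_gb_s * ops_per_byte < peak_gops


if __name__ == "__main__":
    # Hypothetical device: 1000 GOPs peak, 100 GB/s memory bandwidth.
    peak, bandwidth = 1000.0, 100.0
    # Low intensity (few operations per byte fetched): bandwidth-limited.
    print(attainable_gops(peak, bandwidth, ops_per_byte=2.0))
    # High intensity (heavy reuse of fetched data): compute-limited.
    print(attainable_gops(peak, bandwidth, ops_per_byte=20.0))
```

Computing in or near memory aims to raise the effective bandwidth term, which lifts the bound in the low-intensity case without changing the compute peak.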

[Submitted on 5 Sep 2024]

Authors: Nikoletta Koilia and Christoforos Kachris

Abstract: Large Language Models (LLMs) have emerged as powerful tools for natural language processing tasks, revolutionizing the field with their ability to understand and generate human-like text. In this paper, we present a comprehensive survey of the several research efforts that have been presented for the acceleration of transformer networks for Large Language Models using hardware accelerators.
The survey presents the frameworks that have been proposed and then performs a qualitative and quantitative comparison regarding the technology, the processing platform (FPGA, ASIC, In-Memory, GPU), the speedup, the performance (GOPs), and the energy efficiency (GOPs/W) of each framework. The main challenge in the comparison is that every proposed scheme is implemented on a different process technology, which makes a fair comparison hard. The main contribution of this paper is that we extrapolate the results of the performance and the energy efficiency on the same technology to make a fair comparison; one theoretical and one more practical. We implement part of the LLMs on several FPGA chips to extrapolate the results to the same process technology and then we make a fair comparison of the performance.

From: Christoforos Kachris
[v1] Thu, 5 Sep 2024 09:43:25 UTC (1,209 KB)
