Google's First Tensor Processing Unit: Architecture

Original link: https://thechipletter.substack.com/p/googles-first-tpu-architecture

Title: Google Tensor Processing Unit: Overview

Google's Tensor Processing Unit (TPU), introduced in 2016, was the tech giant's first custom ASIC for machine-learning inference. Developed in just 15 months, the TPU was designed to deliver a significant cost-performance improvement on machine-learning tasks compared with conventional GPUs. What follows looks at the origins and architecture of Google's TPU, with particular attention to its motivation, design goals, and distinctive features.

**Background**: Recognizing both the opportunities presented by services built on deep learning and the enormous computational demands those services would bring, Google set out to create an Application-Specific Integrated Circuit (ASIC) that would cut inference costs tenfold relative to GPUs while maintaining high performance and delivering results for new workloads out of the box, all within budget.

**Tensor operations**: Central to understanding the role of Google's TPU is tensor processing, from which it takes its name: tensors describe relationships between arrays and the results of multidimensional mathematical computations. Matrix multiplication plays a crucial role here and forms the foundation on which neural networks are built.

**Architecture and performance**: The TPU's architecture relies heavily on the concept of a **systolic array**, in which a network of processing elements sets the rhythm of computation and passes data through the system. This allows complex matrix operations to be executed efficiently, yielding significant improvements in throughput and energy consumption. The TPU's highly parallel design allows many smaller operations to be processed simultaneously, markedly improving overall performance, and its on-chip local memory minimizes latency and improves bandwidth utilization. Through these architectural decisions, the TPU outperformed contemporary GPUs while keeping energy consumption low.

Google didn't want to share the IP (because it was valuable) and didn't want to pay licensing fees (because Google is cheap). So they decided to keep the design and manufacturing to themselves, with Broadcom handling only the IO and the mapping onto TSMC's toolset. That's true, but not the whole picture. Google appears to have filed patents on the systolic array components before working with Broadcom, and then shut the door. The patent filings mention that Intel was a co-filer, and the collaboration agreement refers to a specific Intel transaction relating to this IP. There is a court case about this currently in progress, so who knows. The truth seems to lie somewhere in between. I haven't looked closely, but from the articles I've seen, this IP was primarily Google's own from the start, and they licensed it to Broadcom for design and manufacturing. It appears Intel may have provided some funding for Google's work, possibly in return for access to some of the results later.

Intel's contribution to the TPU was much greater than that. Intel and Google jointly assigned several patents, with Intel listed as an assignee, that describe features present in Google's TPU. Those patents explicitly list Google employees alongside Intel employees as inventors. Moreover, the works cited by the Google paper are wide-ranging and include papers written by Intel employees. Although Intel formally dropped out of the race because the TPU design could not be commercialized at scale, they contributed significantly to the hardware's development, even if Google owned and controlled it.

My mistake, you're completely right. Intel made some major contributions. Here is an example that makes clear Intel was responsible for working out how to fit matrix multiplication efficiently onto the chip: https://ieeexplore.ieee.org/document/7677938 Thank you very much for the correction; you're completely right. My error, and apologies for the earlier mistaken assertions. Intel did make major contributions to the TPU's development, and I misread the nature of the partnership between Google and Broadcom. I'm just an engineer, not an expert on the historical record. But based on everything I've read about Google…

Original article

… we say tongue-in-cheek that TPU v1 “launched a thousand chips.”

In Google’s First Tensor Processing Unit - Origins, we saw why and how Google developed the first Tensor Processing Unit (or TPU v1) in just 15 months, starting in late 2013.

Today’s post will look in more detail at the architecture that emerged from that work and at its performance.


A quick reminder of the objectives of the TPU v1 project. As Google saw not only the opportunities provided by a new range of services using Deep Learning but also the huge scale and the cost of the hardware that would be needed to power these services, the aims of the project would be … 

… to develop an Application Specific Integrated Circuit (ASIC) that would generate a 10x cost-performance advantage on inference when compared to GPUs.

and to … 

  • Build it quickly
  • Achieve high performance …
  • … at scale …
  • … for new workloads out-of-the-box …
  • … all while being cost-effective

Before we look at the TPU v1 that emerged from the project in more detail, a brief reminder of the Tensor operations that give the TPU its name.

Why is a Tensor Processing Unit so called? Because it is designed to speed up operations involving tensors. Precisely which operations, though? The operations are referred to … as a “map (multilinear relationship) between different objects such as vectors, scalars, and even other tensors”.

Let’s take a simple example. A two-dimensional array can describe a multilinear relationship between two one-dimensional arrays. The mathematically inclined will recognize the process of getting from one vector to the other as multiplying a vector by a matrix to get another vector.

This can be generalized to tensors representing the relationship between higher dimensional arrays. However, although tensors describe the relationship between arbitrary higher-dimensional arrays, in practice the TPU hardware that we will consider is designed to perform calculations associated with one and two-dimensional arrays. Or, more specifically, vector and matrix operations.

Let’s look at one of these operations, matrix multiplication. If we take two 2x2 matrices (2x2 arrays) then we multiply them together to get another 2x2 matrix by multiplying the elements as follows.
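Written out in full, the product is:

$$
\begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix}
\begin{pmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{pmatrix}
=
\begin{pmatrix} a_{11}b_{11}+a_{12}b_{21} & a_{11}b_{12}+a_{12}b_{22} \\ a_{21}b_{11}+a_{22}b_{21} & a_{21}b_{12}+a_{22}b_{22} \end{pmatrix}
$$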

Why are matrix multiplications key to the operation of neural networks? We can look at a simple neural network with four layers as follows (only the connections from the first node in each later layer are shown for simplicity): 

Where ‘f’ here is the activation function.

So each hidden layer, and the output layer, is the result of applying the activation function to each element of the vector produced by multiplying the vector of input values by the matrix of weights. With a number of data inputs, this is equivalent to applying the activation function to each entry in the matrix that results from a matrix multiplication.
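As a minimal sketch of one layer in code (NumPy, with made-up layer sizes, and ReLU standing in for the generic activation ‘f’):

import numpy as np

def relu(z):
    return np.maximum(z, 0.0)          # a common choice for the activation 'f'

x = np.array([1.0, 2.0, 3.0, 4.0])     # vector of input values (4 input nodes)
W = np.random.rand(4, 3)               # weight matrix: 4 inputs -> 3 hidden nodes
hidden = relu(x @ W)                   # hidden layer = f(inputs x weights)
print(hidden.shape)                    # (3,) - one value per hidden node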

As we’ve seen, the approach adopted by the TPU v1 team was an architecture first set out by H. T. Kung and Charles E. Leiserson in their 1978 paper Systolic Arrays (for VLSI).

A systolic system is a network of processors which rhythmically compute and pass data through the system….In a systolic computer system, the function of a processor is analogous to that of the heart. Every processor regularly pumps data in and out, each time performing some short computation so that a regular flow of data is kept up in the network.

So how is the systolic approach used in the TPU v1 to efficiently perform matrix multiplications? Let’s return to our 2x2 matrix multiplication example.

If we have a 2x2 array of multiplication units connected in a simple grid, and we feed the elements of the matrices that we are multiplying into the grid in the right order, then the results of the matrix multiplication will naturally emerge from the array.

The calculation can be represented in the following diagram. The squares in each corner represent a multiply / accumulate unit (MAC) that can perform a multiplication and addition operation.

In this diagram, the values in yellow are the inputs that are fed into the matrix from the top and the left. The light blue values are the partial sums that are stored. The dark blue values are the final results.

Let’s take it step by step.

Step 1:

Values a11 and b11 are loaded into the top left multiply/accumulate unit (MAC). They are multiplied together and the result is stored.

Step 2:

Values a12 and b21 are loaded into the top left MAC. They are multiplied together and added to the previously calculated result. This gives the top left value of the results matrix.

Meanwhile, b11 is transferred to the top right MAC where it is multiplied by the newly loaded value a21 and the result is stored. Also, a11 is transferred to the bottom left MAC where it is multiplied by the newly loaded value b12, and the result is stored.

Step 3:

b21 is transferred to the top right MAC where it is multiplied by the newly loaded value a22 and the result is added to the previously stored result. Also, a12 is transferred to the bottom left MAC where it is multiplied by the newly loaded value b22, and the result is added to the previously stored result. In this step, we have calculated the top right and bottom left values of the results matrix.

Meanwhile, a21 and b12 are transferred to the bottom right MAC where they are multiplied and the result is stored.

Step 4: 

Finally, a22 and b22 are transferred to the bottom right MAC where they are multiplied and the result is added to the previously stored value giving the bottom right value of the results matrix.

So the results of the matrix multiplication emerge down a moving ‘diagonal’ in the matrix of MACs. 

In our example, it takes 4 steps to do a 2 x 2 matrix multiplication, but only because some of the MACs are not utilized at the start and end of the calculation. In practice, a new matrix multiplication would start top left as soon as the MAC is free. As a result the unit is capable of a new matrix multiplication every two cycles.

This is a simplified representation of how a systolic array works and we’ve glossed over some of the details of the implementation of the systolic array in TPU v1. I hope that the principles of how this architecture works are clear though.

This is the simplest possible matrix multiplication, but the approach can be extended to bigger matrices using larger arrays of multiplication units.

The key point is that if data is fed into the systolic array in the right order then the flow of values and results through the system will ensure that the required results emerge from the array over time.

Crucially there is no need to store and fetch intermediate results from a ‘main memory’ area. Intermediate results are automatically available when needed due to the structure of the matrix multiply unit and the order in which inputs are fed into the unit.
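To make the dataflow concrete, here is a small Python simulation of an output-stationary systolic array (my own sketch, not Google's implementation; it follows the common convention of A entering from the left and B from the top, which is transposed relative to the step-by-step example above, but the principle is identical):

import numpy as np

def systolic_matmul(A, B):
    """Simulate an output-stationary systolic array computing C = A @ B.

    Each cell is a MAC holding a stationary partial sum. A values enter
    at the left edge and move right; B values enter at the top edge and
    move down. Feeding row i of A and column j of B skewed by one step
    each makes matching operands meet at cell (i, j) at the right time.
    """
    n = A.shape[0]                     # square matrices, for simplicity
    C = np.zeros((n, n))               # stationary partial sums (one per MAC)
    a_reg = np.zeros((n, n))           # A operand currently held in each cell
    b_reg = np.zeros((n, n))           # B operand currently held in each cell

    for step in range(3 * n - 2):      # enough steps for the array to drain
        # Shift operands one cell right / down (edge values fall off).
        a_reg[:, 1:] = a_reg[:, :-1].copy()
        b_reg[1:, :] = b_reg[:-1, :].copy()
        # Feed skewed inputs at the edges; zeros outside the valid range.
        for i in range(n):
            k = step - i               # which element enters the array now
            a_reg[i, 0] = A[i, k] if 0 <= k < n else 0.0
            b_reg[0, i] = B[k, i] if 0 <= k < n else 0.0
        C += a_reg * b_reg             # every MAC multiplies and accumulates

    return C

A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[5.0, 6.0], [7.0, 8.0]])
assert np.allclose(systolic_matmul(A, B), A @ B)

Note that the intermediate results never leave the grid: each partial sum stays put in its MAC while the operands flow past it.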

Of course, the matrix multiply unit does not sit in isolation and the simplest presentation of the complete system is as follows:

The first thing to note is that the TPU v1 relies on communication with the host computer over a PCIe (high-speed serial bus) interface. It also has direct access to its own DDR3 Dynamic RAM storage.

We can expand this to a more detailed presentation of the design:

Let’s pick some key elements from this presentation of the design, starting at the top and moving (broadly) clockwise:

  • DDR3 DRAM / Weight FIFO: Weights are stored in DDR3 RAM chips connected to the TPU v1 via DDR3-2133 interfaces. Weights are ‘pre-loaded’ onto these chips from the host computer’s memory via PCIe and can then be transferred into the ‘Weight FIFO’ memory ready for use by the matrix multiply unit.

  • Matrix Multiply Unit: This is a ‘systolic’ array of 256 x 256 multiply/accumulate units, fed by 256 ‘weight’ values from the top and 256 data inputs from the left (the arithmetic note after this list gives a sense of the scale).

  • Accumulators: The results emerge from the systolic matrix unit at the bottom and are stored in ‘accumulator’ memory storage.

  • Activation: The activation functions described in the neural network above are applied here.

  • Unified Buffer / Systolic Data Setup: The results of applying the activation functions are stored in a ‘unified buffer’ memory where they are ready to be fed back as inputs to the Matrix Multiply Unit to calculate the values needed for the next layer.
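To put some numbers on the Matrix Multiply Unit: it contains 256 x 256 = 65,536 MACs, and Google’s TPU paper reports a 700 MHz clock. Counting the multiply and the add as separate operations, peak throughput works out to:

$$65{,}536 \ \text{MACs} \times 2 \ \frac{\text{ops}}{\text{MAC}\cdot\text{cycle}} \times 700 \times 10^{6} \ \frac{\text{cycles}}{\text{s}} \approx 92 \times 10^{12} \ \text{ops/s}$$

which matches the 92 TeraOps/s peak figure given in the paper.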

So far we haven’t specified the nature of the multiplications performed by the matrix multiply unit. TPU v1 performs 8-bit x 8-bit integer multiplications, making use of quantization to avoid the need for more die-area-hungry floating-point calculations.
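As a rough sketch of how quantization makes this possible (simple symmetric int8 quantization, chosen here for illustration; not necessarily the exact scheme TPU v1 used):

import numpy as np

def quantize(x):
    """Map float values onto int8 using a single symmetric scale factor."""
    scale = np.abs(x).max() / 127.0    # map the largest magnitude to 127
    return np.round(x / scale).astype(np.int8), scale

W = np.random.randn(4, 4).astype(np.float32)   # float weights
x = np.random.randn(4).astype(np.float32)      # float activations

W_q, w_scale = quantize(W)
x_q, x_scale = quantize(x)

# 8-bit x 8-bit integer multiplies, accumulated in a wider integer type,
acc = x_q.astype(np.int32) @ W_q.astype(np.int32)
# then rescaled back to floating point at the end.
y = acc * (x_scale * w_scale)

print(np.abs(y - x @ W).max())         # small quantization error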

The TPU v1 uses a CISC (Complex Instruction Set Computer) design with only about 20 instructions. It’s important to note that these instructions are sent to it by the host computer over the PCIe interface, rather than being fetched from memory.

The five key instructions are as follows:

Read_Host_Memory

Reads input values from the host computer’s memory into the Unified Buffer over PCIe.

Read_Weights

Read weights from the weight memory into the Weight FIFO. Note that the weight memory will already have been loaded with weights read from the host computer’s main memory over PCIe.

Matrix_Multiply / Convolve

From the paper this instruction

… causes the Matrix Unit to perform a matrix multiply or a convolution from the Unified Buffer into the Accumulators. A matrix operation takes a variable-sized B*256 input, multiplies it by a 256x256 constant weight input, and produces a B*256 output, taking B pipelined cycles to complete.

This is the instruction that implements the systolic array matrix multiply. It can also perform convolution calculations needed for Convolutional Neural Networks.

Activate

From the paper this instruction

Performs the nonlinear function of the artificial neuron, with options for ReLU, Sigmoid, and so on. Its inputs are the Accumulators, and its output is the Unified Buffer.

If we go back to our simple neural network model, the values in the hidden layers are the result of applying an ‘activation function’ to the sum of the weights multiplied by the inputs. ReLU and Sigmoid are two of the most popular activation functions. Having these implemented in hardware would have provided a useful speed-up in the application of the activation functions.
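For reference, the standard definitions of these two functions are:

$$\mathrm{ReLU}(z) = \max(0, z), \qquad \mathrm{Sigmoid}(z) = \frac{1}{1 + e^{-z}}$$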

Write_Host_Memory

Writes results to the host computer’s memory from the Unified Buffer over PCIe.

It’s probably worth pausing for a moment to reflect on the elegance of these five instructions in providing an almost complete implementation of inference in the TPU v1. In pseudo-code, we could describe the operation of the TPU v1 broadly as follows:

Read_Host_Memory       # inputs -> Unified Buffer, over PCIe
Read_Weights           # weight memory -> Weight FIFO
Loop_Start             # once per layer of the network
    Matrix_Multiply    # Unified Buffer x weights -> Accumulators
    Activate           # f(Accumulators) -> Unified Buffer
Loop_End
Write_Host_Memory      # Unified Buffer -> host memory, over PCIe

It’s also useful to emphasize the importance of the systolic unit in making this possible and efficient. As described by the TPU v1 team (and as we’ve already seen):

.. the matrix unit uses systolic execution to save energy by reducing reads and writes of the Unified Buffer …. It relies on data from different directions arriving at cells in an array at regular intervals where they are combined. … data flows in from the left, and the weights are loaded from the top. A given 256-element multiply-accumulate operation moves through the matrix as a diagonal wavefront.

The TPU v1’s hardware would be of little use without a software stack to support it. Google had developed and was already using TensorFlow, so the main step needed was to create ‘drivers’ that would allow TensorFlow to work with the TPU v1.

The TPU software stack had to be compatible with those developed for CPUs and GPUs so that applications could be ported quickly to the TPU. The portion of the application run on the TPU is typically written in TensorFlow and is compiled into an API that can run on GPUs or TPUs.

Like GPUs, the TPU stack is split into a User Space Driver and a Kernel Driver. The Kernel Driver is lightweight and handles only memory management and interrupts. It is designed for long-term stability. The User Space driver changes frequently. It sets up and controls TPU execution, reformats data into TPU order, translates API calls into TPU instructions, and turns them into an application binary.

As we saw in our earlier post, the TPU v1 was fabricated by TSMC using a relatively ‘mature’ 28nm TSMC process. Google has said that its die area is less than half that of the Intel Haswell CPU and Nvidia K80 GPU chips that Google was using in its data centers at the time, each of which was built with a more advanced process.

We have already seen how simple the TPU v1’s instruction set was, with only about 20 CISC instructions. The simplicity of the ISA leads to a very low ‘overhead’ in the TPU v1’s die for decoding and related activities, with just 2% of the die area dedicated to what is labeled as ‘control’.

By contrast, 24% of the die area is dedicated to the Matrix Multiply Unit and 29% to the ‘Unified Buffer’ memory that stores inputs and intermediate results.

At this point, it’s useful to remind ourselves that the TPU v1 was designed to make inference (that is, the use of already-trained models in real-world services provided at Google’s scale) more efficient. It was not designed to improve the speed or efficiency of training. Although they have some features in common, inference and training pose quite different challenges when developing specialized hardware.

So how did the TPU v1 do?

At the time, the key comparisons for the TPU v1 were with Intel’s Haswell CPU and Nvidia’s K80 GPU.

And, crucially, the TPU v1 was much more energy efficient than GPUs.

In the first post on the TPU v1, we focused on the fact that an organization like Google could marshal the resources to build the TPU v1 quickly.

In this post we’ve seen how the custom architecture of the TPU v1 was crucial in enabling it to generate much better performance with much lower energy use than contemporary CPUs and GPUs.

The TPU v1 was only the start of the story. It was designed quickly and with the sole objective of making inference faster and more power efficient. It had a number of clear limitations and was not designed for training. Both inside Google and at other firms, work would soon start on how the TPU v1 could be improved. We’ll look at some of its successors in later posts.

After the paywall, a small selection of further reading and viewing on Google’s TPU v1.
