Intuitions for Transformer Circuits

Original link: https://www.connorjdavis.com/p/intuitions-for-transformer-circuits

## Understanding Transformers: A Mechanistic Deep Dive

This exploration goes inside the inner workings of transformer models, motivated by the need to understand and align increasingly powerful AI. The core idea is **mechanistic interpretability (MI)**: reverse engineering these models to understand *why* they work, much like studying biological neural networks. The focus is a simplified transformer, stripped of complexities like MLPs, to expose the core mechanisms. A key concept is the **residual stream**, which can be visualized as shared memory (like a computer's DRAM) where model components sequentially load and store information. The stream is addressed with a "token:subspace" scheme. **Attention** determines *which* tokens to read from, while **circuits** (specifically the QK and OV circuits) determine *what* information within those tokens is accessed and modified. These circuits operate within learned subspaces of the residual stream, letting components avoid interfering with one another. **Induction heads** show how a model learns to predict patterns by composing information across layers (e.g. A B … A __ predicts B), essentially recognizing and exploiting in-context relationships. Ultimately, understanding these components — the residual stream, attention, circuits, and induction — is essential for controlling and aligning AI, keeping it beneficial and safe. This approach offers a promising path toward demystifying these complex systems and addressing the urgent need for AI safety.


Original article

In a previous post on language modeling, I implemented a GPT-style transformer. Lately I’ve been learning mechanistic interpretability to go deeper and understand why the transformer works on a mathematical level.

This post is a brain dump of what I’ve learned so far after reading A Mathematical Framework for Transformer Circuits (herein: “Framework”) and working through the Intro to Mech Interp section on ARENA. My goal is to describe my current intuition for the paper, especially parts I was confused about so that perhaps my take can help others gain clarity on these areas as well.

First, a brief aside on my overall motivation for working on this stuff. Mechanistic Interpretability (MI/mech interp) is the study of ML model internals whose aim is to understand from first principles why models behave and work as they do. You can kind of think of it as the machine learning analogue of reverse engineering software. It is similar in spirit to the science of biological neural networks, but applied to artificial neural networks instead.

MI is part of a broader field of interpretability, which is used in yet another field called AI alignment. Alignment strives to make our large AI models aligned with human values. Basically, the overall goal is to understand and control the models before they control us. To ensure that they don’t engage in harmful, deceptive, dangerous, or subversive behavior. Unfortunately, we live in a world where large language models have encouraged “successful” suicide, engaged in blackmail for self-preservation, and asserted humans should be enslaved by AI. This current version of reality is unacceptable to me.

And as if that weren’t enough, we don’t even understand why these models do what they do. They are the only man-made technology in history that we don’t fully understand from first principles. Given this state of reality, I think that alignment is one of the most important problems we face today and one we have to get right. As a personal bonus, the alignment problem is as fascinating as it is important. It provides an outlet for me to leverage my specific technical skills and interests towards a meaningful cause. It is also extremely difficult, and I like a good challenge.

Ok, now back to the originally scheduled programming.

Share

Framework does a deep dive into the key components of a simplified transformer-based language model. It analyzes transformer blocks that only have multi-head attention. This means no MLPs and no layernorms. This leaves the token embedding and positional encoding at the beginning, followed by n layers of multi-head attention, followed by the unembedding at the end. Here is a picture of a single-layer transformer with one attention head only:

My goal in this post is not to re-derive all the math, because the Framework paper does a better job, and Neel Nanda’s walkthrough of the paper on YouTube is also good for that (although this material only really started to click for me after I worked through the “Intro to Mech Interp” problems on ARENA, which I recommend doing if you are actually interested in doing this stuff yourself).

Instead I want to share how I conceptualize the most important takeaways, especially for areas that I thought were confusing at first so that if you have the same confusion perhaps my take will bring some clarity. In my view, the most important concepts to understand from this paper are the residual stream, attention, circuits, and induction heads.

Mathematically, the residual stream is a high dimensional vector space. You will usually see the dimension of the residual stream specified as d_model in GPT-related papers and code. For example, GPT2-small uses a d_model of 768.

Conceptually, the residual stream is like shared memory. It is used much like the DRAM on your computer. Different components of the model (attention, MLPs, etc) perform loads and stores from that memory. The loads and stores occur sequentially through the forward pass, one layer at a time. However each component in a given layer loads in parallel and stores in parallel with the others. The model learns to carve out subspaces in this vector space. This helps prevent components from clobbering over what previous components have written. The residual stream itself doesn’t do any computation, but serves as a shared medium through which layers communicate with each other.
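To make the shared-memory picture concrete, here is a minimal NumPy sketch (shapes and the `attn_head` stand-in are illustrative, not taken from a real model) of components loading from the residual stream and additively storing back to it:

```python
import numpy as np

seq_len, d_model, n_heads = 5, 768, 12

def attn_head(x):
    # Stand-in for a real attention head: reads the full residual
    # stream and returns a (seq_len, d_model) update to add back.
    return 0.01 * x  # placeholder computation

# The residual stream starts as embedding + positional encoding.
rng = np.random.default_rng(0)
embed = rng.normal(size=(seq_len, d_model))
pos = rng.normal(size=(seq_len, d_model))
x = embed + pos

# Layers run sequentially; within a layer, each head "loads" the
# same stream in parallel and all outputs are "stored" by addition.
for layer in range(2):
    head_outputs = [attn_head(x) for _ in range(n_heads)]
    x = x + sum(head_outputs)  # the stream itself does no computation
```

The key property is that every component writes by addition, so the stream accumulates everything written so far, and later layers can read anything earlier layers stored.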

We can get a sense of the size of a subspace used by doing a PCA on the appropriate weights. Below is the PCA eigenspectrum of the embedding and positional encoding weights from a 2-layer, attention-only model (the link to all code for this post is here). The first shows the top 100 principal eigenvalues. The second shows the cumulative variance explained:

So about 80% of the embedding variation lives in a 350-dimensional subspace of d_model. This is fairly large given that d_model is 768. Compare that to the positional encoding, which is essentially explained by only 5 directions.
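As a sketch of how such an eigenspectrum is obtained (using random stand-in weights rather than the model's actual `W_E`), PCA via the SVD yields the principal eigenvalues and the cumulative variance explained:

```python
import numpy as np

rng = np.random.default_rng(0)
d_vocab, d_model = 1000, 768
W_E = rng.normal(size=(d_vocab, d_model))  # stand-in embedding weights

# PCA via SVD on the mean-centered weights.
centered = W_E - W_E.mean(axis=0)
singular_values = np.linalg.svd(centered, compute_uv=False)
eigvals = singular_values ** 2  # principal eigenvalues of the covariance

# Cumulative variance explained: how many directions account
# for e.g. 80% of the variation in the embedding.
cum_var = np.cumsum(eigvals) / eigvals.sum()
n_dirs_80 = int(np.searchsorted(cum_var, 0.80)) + 1
```

For the real model's weights, `n_dirs_80` is what gives the "about 350 dimensions" figure quoted above.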

When I was presented with this view of the residual stream, my mind immediately started asking how far can we push this analogy to memory? Having worked in computer security for a decade, it made me wonder if there is an analogue to page tables and memory permissions? Could we bring the concepts of userspace and kernelspace to prevent “privileged” subspaces from being accessed by “unprivileged” subspaces? Would this be useful for e.g. preventing an untrusted user from exfiltrating dangerous content from a privileged subspace?

But I’m getting ahead of myself. Let’s start with a simpler question: how does addressing work for the residual stream? In order to access a memory location, you have to have an address. Residual stream addresses can be decomposed into two logical parts, token:subspace, much like the classic segment:offset logical address from the x86 architecture. One major difference is that a traditional memory address is deterministic in the sense that only one value from one location is loaded. Addresses into the residual stream are “soft”, in general specifying a set of locations to load according to some learned probability distribution.

Conceptually, attention computes the first part of the token:subspace address. The fundamental purpose of attention is to specify which source token locations to load information from. Each row in the attention matrix (see fake example below for tokens ‘T’, ‘h’, ‘e’, ‘i’, ‘r’) is the “soft” distribution over the source (i.e. key) token indices from which information will be moved into the destination token (i.e. query).

Let’s look at the extreme case, when the entry is 1 and all the others in the row are 0. This means that this head reads some subspace(s) of the source token’s (‘T’) residual stream and copies it verbatim into some subspace(s) of the destination token’s (also ‘T’) residual stream. But since attention is 1, there is only one source token position being read from. Otherwise the read is “spread out” over multiple source tokens according to the attention scores in each row. For example the second query above (‘h’) reads “30%” from token 0 (‘T’) and “70%” from itself.

It is important to understand that attention is all about figuring out the token indices to read from. If we look at the residual stream as a two dimensional memory array, then attention probabilistically selects rows of this memory for each query. For example, the third query above (‘e’) would have a token address that looks something like 0.1,0.6,0.3:
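The "soft row read" can be sketched as a plain matrix product between an attention pattern and the residual stream, using the toy numbers from the example above:

```python
import numpy as np

# Toy residual stream: 3 tokens, each a 4-dimensional "memory row".
X = np.array([[1., 0., 0., 0.],
              [0., 1., 0., 0.],
              [0., 0., 1., 0.]])

# One row of attention per query. The first query reads only row 0;
# the third query reads 0.1 of row 0, 0.6 of row 1, 0.3 of row 2.
A = np.array([[1.0, 0.0, 0.0],
              [0.3, 0.7, 0.0],
              [0.1, 0.6, 0.3]])

# Attention-weighted read: each query's result is a convex
# combination of the source tokens' residual rows.
read = A @ X
```

An attention entry of 1 reduces to a verbatim copy of one row; anything softer blends rows according to the distribution.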

So the token part of the address selects the rows in the residual stream via attention. What about the subspace part? How is it computed? Once we have this part then we can determine the actual value that is stored into the destination token’s location. To answer this we need to understand circuits.

Conceptually, circuits are particular paths through which information flows through the model. It is not too far off to think of them as the ML analogue of the electrical circuits you find on a PCB. They have inputs, do some computation, and produce outputs. In the simplified attention-only models, circuits are mathematically tractable to analyze due to the mostly linear structure of the transformer under the attention-only assumptions (and completely linear if the attention patterns are held constant).

The two basic circuits to know are the QK circuit and the OV circuit. The QK circuit is a bilinear form, meaning it is linear in each of two input variables. In self-attention, the input variables are the same, but are interpreted as distinct queries and keys. The OV circuit, by contrast, is linear in a single input variable. The inputs to all three — queries, keys, and values — are the same: the residual stream. We will refine this further in the next sections.

Recall each attention head has its own W_Q and W_K weight matrices. Together these form a bilinear operator that outputs the attention pattern for that head. Mathematically this looks like:

\(A = \operatorname{softmax}\!\left(\frac{(x W_{Q})(x W_{K})^{\top}}{\sqrt{d_{head}}}\right) = \operatorname{softmax}\!\left(\frac{x\, W_{QK}\, x^{\top}}{\sqrt{d_{head}}}\right)\)

where the W’s (also called W_QK) are learned weights of shape (d_model, d_head) and x is the residual stream of shape (seq_len, d_model). When you multiply this out, you get the attention pattern. So attention is more of an activation than a weight, since it depends on the input sequence. The attention queries are computed on the left and the keys are computed on the right. If a query “pays attention” to a key, then the dot product will be high. This will cause data from the key’s residual stream to be moved into the query’s residual stream. But what data will actually be moved? This is where the OV circuit comes in.

The final input of the head is the W_V weight matrix. It reads in from the residual stream, and the result is written back out to the residual stream via the W_O matrix. W_V is (d_model, d_head) and W_O is (d_head, d_model). Together their product is referred to as W_OV. This is what the OV circuit looks like mathematically:

\(\mathrm{OV}(x) = (x W_{V})\, W_{O} = x\, W_{OV}\)

The value that is read by W_V determines what value gets written back to the residual stream, if that token is attended to by a particular query. The final expression for the entire head, combining attention and the OV circuit, is:

\(h(x) = A\,(x W_{V})\, W_{O} = A\, x\, W_{OV}\)
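A single head end to end can be sketched in a few lines (random stand-in weights; causal masking is omitted for clarity):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d_model, d_head = 6, 32, 4
x = rng.normal(size=(seq_len, d_model))  # residual stream
W_Q = rng.normal(size=(d_model, d_head))
W_K = rng.normal(size=(d_model, d_head))
W_V = rng.normal(size=(d_model, d_head))
W_O = rng.normal(size=(d_head, d_model))

# QK circuit: a bilinear form over the residual stream that produces
# the attention pattern (an activation, since it depends on x).
A = softmax((x @ W_Q) @ (x @ W_K).T / np.sqrt(d_head))

# OV circuit: determines WHAT is moved; attention determined FROM WHERE.
head_out = A @ (x @ W_V) @ W_O  # shape (seq_len, d_model), added to x
```

Note that if `A` is held fixed, the whole head is linear in `x` — this is what makes the attention-only models tractable to analyze.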

Now that we have some common footing in the math, we can move on to developing some intuition for how circuits work. This is also where the subspace part of the residual stream address comes into play.

We know that the QK and OV circuits both read in from the residual stream. But how are they choosing what to read in? This is determined by what I call subspace scores. In the Framework paper these are called virtual weights and in the ARENA walkthrough these are called composition scores. These scores are implicitly learned by the model in order to read from particular subspaces from the residual stream:

While attention scores are learned indices into the rows of the residual stream, subspace scores are learned “coefficients” that provide a soft index into the “column dimension” of the residual stream. The model is able to do this because the W_QK and W_OV matrices are low-rank: d_head is conventionally much smaller than d_model. This allows for low-dimensional subspaces to be used for different purposes. Each component that reads from the residual stream learns to read from a distinct linear combination of subspaces.
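To see how a low-rank matrix reads from one subspace while ignoring another, here is a toy sketch with hand-picked orthogonal bases (the real subspaces are learned and not axis-aligned like this):

```python
import numpy as np

d_model, d_head = 8, 2

# Two orthogonal "subspaces" of the residual stream: a model might
# use one for embedding info and another for positional info.
basis_a = np.eye(d_model)[:, :2]   # dimensions 0-1
basis_b = np.eye(d_model)[:, 2:4]  # dimensions 2-3

# A low-rank read matrix aligned with subspace A only.
W_V = basis_a  # shape (d_model, d_head)

vec_in_a = basis_a @ np.array([3., -1.])  # lives in subspace A
vec_in_b = basis_b @ np.array([5., 2.])   # lives in subspace B

# Reading recovers the A-component and is blind to the B-component.
read_a = vec_in_a @ W_V
read_b = vec_in_b @ W_V
```

Because d_head is much smaller than d_model, each head's read matrix can only "see" a thin slice of the stream, which is exactly what lets different components use different slices without clobbering each other.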

To see this in action, let's look at head 7 from layer 0 of an attention-only, 2-layer transformer. Below is the attention pattern from this head on the input sequence "the cat sat on the mat. the dog sat on the log.":

If you stare at this long enough, you can see that this head is attending to the previous token (except for the first token, which can only attend to itself).

So, here’s a question: What subspaces would the QK circuit of this head need to read from in order to create this pattern? First, let’s just look at the state of the residual stream as seen from the layer 0 heads’ perspective:

The layer 0 heads only have two options: the embedding or the positional encoding. Since “previous token” doesn’t depend on what the token is, but is just positional information, we would expect head 7 to learn a higher subspace score for the positional encoding subspace relative to the embedding subspace.

Is there a way we can quantify this from the actual model? It turns out there is. The paper and ARENA walkthrough propose using a ratio involving the Frobenius norms between the output of the previous layer and the input of the subsequent layer:

\(\frac{||W_{A}W_{B}||_{F}}{||W_{A}||_{F}||W_{B}||_{F}}\)

where W_A is the output and W_B is the input. A detailed justification for using this measure is given in ARENA. The justification is based on the SVD. If you do an SVD for each term, the numerator ends up containing a cosine similarity between the right singular output vectors and the left singular input vectors, so the norm is maximized when the output and input are aligned. Here are the subspace scores between the embedding and positional encodings against each layer 0 head’s QK circuit:
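The ratio is straightforward to compute. Here is a sketch with random stand-in weights, comparing a reader aligned with the writer's output directions against an unrelated one:

```python
import numpy as np

def subspace_score(W_A, W_B):
    # ||W_A W_B||_F / (||W_A||_F * ||W_B||_F): high when the output
    # directions of W_A align with the input directions of W_B.
    num = np.linalg.norm(W_A @ W_B)
    den = np.linalg.norm(W_A) * np.linalg.norm(W_B)
    return num / den

rng = np.random.default_rng(0)
d_model, d_head = 64, 8

W_A = rng.normal(size=(d_head, d_model))         # a component's write matrix
W_B_aligned = W_A.T                              # reads exactly the written directions
W_B_random = rng.normal(size=(d_model, d_head))  # reads unrelated directions

score_aligned = subspace_score(W_A, W_B_aligned)
score_random = subspace_score(W_A, W_B_random)
```

The aligned pair scores markedly higher than the random pair, which is the behavior the rotation experiment below also probes.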

We can see a general pattern where the layer 0 heads are mostly reading from the positional subspace. Head 7 in particular reads heavily from it, especially relative to the embedding subspace. Since the running justification behind the Frobenius norm is to measure the alignment between output and input, we should be able to rotate the output and observe the subspace score drop. Check out the scores after rotating the positional encoding by 180 degrees:

So we can see that the QK circuit of head 7 is mostly reading from the positional subspace. This determines which source token(s) will be attended to for each query. But what about the value that is loaded from the source token(s) and written into the destination query’s residual stream? This is determined by the subspace score of the head’s OV circuit. Again, for heads in layer 0, there are only two possibilities: the embedding or positional encoding. Here are the OV subspace scores for each head:

Head 7’s OV circuit scores higher with the embedding than with the positional encoding. This means that head 7 will add the embedding of the previous token into the current token’s residual stream. Given our example “the cat sat on the mat. the dog sat on the log.”, the residual stream of token “cat” will look like this after the forward pass through layer 0:

Hopefully this token:subspace discussion has provided some intuition for how the various model components interact with each other through the residual stream. It is not a perfect model. For one, there is not really a clean, distinct set of orthogonal subspaces being selected, especially in larger real world models. Also, as the models scale up, so do the number of subspaces that a given layer has to “choose” from. It is unclear to me how many layers back a given layer can effectively communicate. This creates all sorts of questions, like are there “repeater” layers that keep a signal alive? The Framework paper suggests some components may fill the role as memory cleanup. What other traditional memory management techniques can be found here? And what would it mean to impose security isolation techniques like “privilege rings” to the residual stream? Despite the residual fuzziness, I think this mental model is a useful entry point to start thinking about this stuff.

Now that we understand how the model addresses the residual stream, we can start to understand induction heads, which are just a particular combination of token:subspace addresses across heads in two adjacent layers.

When a model learns induction, it learns a way to predict patterns such as A B … A __. Given the previous occurrence of A B, the induction head will predict B for the token after the subsequent A. What is cool is that this prediction solely depends on the in-context pattern rather than the particular values of A and B.

The Framework paper discusses a basic form of induction that occurs when a head in layer 1 composes with the output of a “previous-token head” from layer 0. The particular type of composition in this case is called “K-composition” because the key side of the head's QK circuit learns a high subspace score with the OV output from the previous-token head in layer 0. Keep in mind, each layer 1 head sees roughly 14 subspaces in the residual stream of each token: embedding, positional encoding, and the OV output of the 12 heads from layer 0.

When the induction head sees the second occurrence of A, it queries for keys which have emb(A) in the particular subspace that was written by the previous-token head. This is different from the subspace that was written to by the original embedding, and hence has a different “offset” within the residual stream. If A B only occurs once before the second A, then the only key that satisfies this constraint is B, and therefore attention will be high on B. The induction head’s OV circuit learns a high subspace score with the subspace of B that was originally written to by the embedding. Therefore it will add emb(B) to the residual stream of the query (i.e. the second A). In the 2-layer, attention-only model, the model learns an unembedding vector that dots highly at the column index of B in the unembed matrix, resulting in a high logit value that pulls up the probability of B.
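The end-to-end behavior these two composed heads implement can be sketched as a plain algorithm over token positions (this describes the function being approximated, not the mechanism itself):

```python
def induction_predict(tokens):
    """Predict the next token via induction: find the most recent
    earlier occurrence of the current token A and emit the token
    that followed it (B, in the pattern A B ... A -> B)."""
    current = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]
    return None  # no earlier occurrence: induction can't fire

seq = "the cat sat on the mat . the dog sat on the".split()
prediction = induction_predict(seq)  # most recent "the" was followed by "dog"
```

Notice the algorithm never inspects what A or B "mean" — only their positions and equality — which matches the observation that induction depends solely on the in-context pattern.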

To get some more intuition, let's look at some pictures. First, the attention pattern of the induction head itself. In the 2-layer model, there are actually 2 induction heads that compose with the previous-token head from layer 0. But we will just look at the first, head 4:

You can see that “by default” the head attends to the first token in the sequence, which is the special end-of-text token from the tokenizer. Later in the sequence, the attention forms an off-diagonal. If you look closely, you can see this is where some tokens A B are being repeated. For example, take A=sat and B=on. Then A B is repeated twice in the sequence, so we would expect induction to happen here.

Before we look at the subspace scores, let’s think about what we expect to see. On the query side of the QK circuit, we should see a relatively high score for the embedding of the token: when the head sees the second A (e.g. token 10), it is querying based on the actual “value” of A, i.e. emb(sat).

On the key side of the QK circuit, we need the token indices that have emb(sat) in the subspace written by the previous-token head. So the K subspace score should be high for that particular head (head 7). In this case, this would be the first 'on' token (token 4 above).

Once we have the token index from attention (token 4), the V subspace score determines the particular subspace(s) to read from token 4 and write to the residual of the query (token 10). In this case this would be the embedding subspace of token 4.

One note: you’ll notice that the heatmaps below don’t have the positional encoding. This is because the particular 2-layer model I used for this uses the “shortformer” positional encoding option in TransformerLens, meaning that the positional encoding is added to the layer 0 residual stream input only, so layer 1 heads don’t see a positional encoding.

Here are the subspace scores for the layer 1 heads:

These are mostly in line with what we expected. The Q side scores highly with the embedding. The K side scores high with L0.H7 in heads 4 and 10, which are the two induction heads. Interestingly though, they also incorporate information from L0.H4, both in the query and key scores. I wonder what this head is doing! The V side is mostly aligned with the embedding, as expected.

Hopefully now you have some better intuition for how different components in a transformer interact with each other through the residual stream. Obviously we just looked at simplified models. But I think that the mental model of “residual stream as shared memory” is a useful one to begin thinking about this stuff. And if the residual stream is a shared memory, then understanding how the memory is addressed is a reasonable next step.

One point of clarification on the token:subspace address. In the attention section above, I said that attention computes the token part of the token:subspace address. However, this really applies only to the OV circuit's token. Both the query and key sides of the QK circuit use an implicit token address: each position simply uses itself as the "current" token, with all positions computed in parallel. The OV circuit, by contrast, doesn't know which tokens to look at, so its token part of the address is provided by attention from the QK circuit. Meanwhile, the Q, K, and V inputs of each head all learn their optimal subspace scores independently, completing the full two-part address needed to perform the head's overall operation.
