Is Matrix Multiplication Ugly?

Original link: https://mathenchant.wordpress.com/2025/11/21/is-matrix-multiplication-ugly/

## A Misunderstanding of Matrix Multiplication

The author writes in response to a New Yorker article by Stephen Witt about large language models and the operation at their core: matrix multiplication. The author praises Witt for spotlighting the importance of matrices, which matter in fields from ecology to artificial intelligence, but strongly disagrees with Witt's claim, framed via the mathematician G. H. Hardy's preference for "beautiful" mathematics, that matrix multiplication lacks elegance.

The author sees this as a fundamental misunderstanding. Matrix algebra *is* the language of symmetry and transformation, and the non-commutativity of matrix multiplication (a × b ≠ b × a) is not a defect but a core feature: it reflects the fact that applying transformations in different orders yields different results, just as when assembling a salad or composing a melody.

The perceived "ugliness" stems from the tedium of hand computation, not from any lack of beauty in the concept itself. The author stresses that matrix multiplication, while computationally intensive, is a powerful and elegant tool, foundational to many scientific disciplines and a cornerstone of modern mathematics. It is a generalized form of multiplication, and dismissing it merely because it is laborious to compute misses the point entirely.

## Is Matrix Multiplication Clumsy? - Hacker News Discussion Summary

A recent blog post sparked a Hacker News discussion questioning the "elegance" of matrix multiplication, especially given its pervasive use in modern AI (e.g., LLMs). The original New Yorker piece compared it to "a man hammering a nail into a board," suggesting it lacks intrinsic beauty despite its usefulness.

Many commenters disagreed, arguing that elegance lies not in aesthetics but in effectiveness. Some noted that skilled matrix multiplication, like a master craftsman's hammering, *can* be beautiful, especially once optimized (e.g., FFTs, sparse matrix factorizations). Others stressed that the perceived "clumsiness" comes from the sheer scale and brute-force character of matrix operations in LLMs, in contrast to more structured applications.

The conversation extended to alternative approaches (such as low-rank and block-diagonal matrices, citing the Monarch paper), the importance of hardware parallelism, and the nature of computation itself, with the suggestion that all complex phenomena are built from simple components. Ultimately, the debate centered on whether it is even relevant to judge a tool by its aesthetic qualities when it solves problems, even problems that demand enormous resources.

## Original Article

A few weeks ago I was minding my own business, peacefully reading a well-written and informative article about artificial intelligence, when I was ambushed by a passage in the article that aroused my pique. That’s one of the pitfalls of knowing too much about a topic a journalist is discussing; journalists often make mistakes that most readers wouldn’t notice but that raise the hackles or at least the blood pressure of those in the know.

The article in question appeared in The New Yorker. The author, Stephen Witt, was writing about the way that your typical Large Language Model, starting from a blank slate, or rather a slate full of random scribbles, is able to learn about the world, or rather the virtual world called the internet. Throughout the training process, billions of numbers called weights get repeatedly updated so as to steadily improve the model’s performance. Picture a tiny chip with electrons racing around in etched channels, and slowly zoom out: there are many such chips in each server node and many such nodes in each rack, with racks organized in rows, many rows per hall, many halls per building, many buildings per campus. It’s a sort of computer-age version of Borges’ Library of Babel. And the weight-update process that all these countless circuits are carrying out depends heavily on an operation known as matrix multiplication.

Witt explained this clearly and accurately, right up to the point where his essay took a very odd turn.

HAMMERING NAILS

Here’s what Witt went on to say about matrix multiplication:

“‘Beauty is the first test: there is no permanent place in the world for ugly mathematics,’ the mathematician G. H. Hardy wrote, in 1940. But matrix multiplication, to which our civilization is now devoting so many of its marginal resources, has all the elegance of a man hammering a nail into a board. It is possessed of neither beauty nor symmetry: in fact, in matrix multiplication, a times b is not the same as b times a.”

The last sentence struck me as a bizarre non sequitur, somewhat akin to saying “Number addition has neither beauty nor symmetry, because when you write two numbers backwards, their new sum isn’t just their original sum written backwards; for instance, 17 plus 34 is 51, but 71 plus 43 isn’t 15.”

The next day I sent the following letter to the magazine:

“I appreciate Stephen Witt shining a spotlight on matrices, which deserve more attention today than ever before: they play important roles in ecology, economics, physics, and now artificial intelligence (“Information Overload”, November 3). But Witt errs in bringing Hardy’s famous quote (“there is no permanent place in the world for ugly mathematics”) into his story. Matrix algebra is the language of symmetry and transformation, and the fact that a followed by b differs from b followed by a is no surprise; to expect the two transformations to coincide is to seek symmetry in the wrong place — like judging a dog’s beauty by whether its tail resembles its head. With its two-thousand-year-old roots in China, matrix algebra has secured a permanent place in mathematics, and it passes the beauty test with flying colors. In fact, matrices are commonplace in number theory, the branch of pure mathematics Hardy loved most.”

Confining my reply to 150 words required some finesse. Notice for instance that the opening sentence does double duty: it leavens my many words of negative criticism with a few words of praise, and it stresses the importance of the topic, preëmptively¹ rebutting editors who might be inclined to dismiss my correction as too arcane to merit publication.

I haven’t heard back from the editors, and I don’t expect to. Regardless, Witt’s misunderstanding deserves a more thorough response than 150 words can provide. Let’s see what I can do with 1500 words and a few pictures.

THE GEOMETRY OF TRANSFORMATIONS

As static objects, matrices are “just” rectangular arrays of numbers, but that doesn’t capture what they’re really about. If I had to express the essence of matrices in a single word, that word would be “transformation”.

One example of a transformation is the operation f that takes an image in the plane and flips it from left to right, as if in a vertical mirror.


Another example is the operation g that takes an image in the plane and reflects it across a diagonal line that goes from lower left to upper right.


The key thing to notice here is that the effect of f followed by g is different from the effect of g followed by f. To see why, write a capital R on one side of a square piece of paper (preferably using a dark marker and/or translucent paper, so that you can still see the R even when the paper has been flipped over) and apply f followed by g; you’ll get the original R rotated by 90 degrees clockwise. But if instead, starting from that original R, you were to apply g followed by f, you’d get the original R rotated by 90 degrees counterclockwise.

Same two operations, different outcomes! Symbolically we write g ◦ f ≠ f ◦ g, where g ◦ f means “First do f, then do g” and f ◦ g means “First do g, then f”.² The symbol ◦ denotes the meta-operation (operation-on-operations) called composition.
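A quick sketch (not from the essay) makes this concrete. The standard matrices for the two reflections are assumed here: f, the left-right flip, sends (x, y) to (−x, y), and g, reflection across the diagonal y = x, sends (x, y) to (y, x). Composing them in the two orders gives two different rotations, matching the paper-flipping experiment above.

```python
def matmul(A, B):
    """Multiply two 2-by-2 matrices given as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

F = [[-1, 0], [0, 1]]  # f: flip left-right, (x, y) -> (-x, y)
G = [[0, 1], [1, 0]]   # g: reflect across y = x, (x, y) -> (y, x)

# "First do f, then do g" corresponds to the product G·F;
# "first do g, then do f" corresponds to F·G.
print(matmul(G, F))  # [[0, 1], [-1, 0]] -- rotation by 90° clockwise
print(matmul(F, G))  # [[0, -1], [1, 0]] -- rotation by 90° counterclockwise
```

The two products disagree, which is exactly the statement g ◦ f ≠ f ◦ g.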

The fact that the order in which transformations are applied can affect the outcome shouldn’t surprise you. After all, when you’re composing a salad, if you forget to pour on salad dressing until after you’ve topped the base salad with grated cheese, your guests will have a different dining experience than if you’d remembered to pour on the dressing first. Likewise, when you’re composing a melody, a C-sharp followed by a D is different from a D followed by a C-sharp. And as long as mathematicians used the word “composition” rather than “multiplication”, nobody found it paradoxical that in many contexts, order matters.

THE ALGEBRA OF MATRICES

If we use the usual x, y coordinates in the plane, the geometric operation f can be understood as the numerical operation that sends the pair (x, y) to the pair (−x, y), which we can represent via the 2-by-2 array

$$\begin{pmatrix} -1 & 0 \\ 0 & 1 \end{pmatrix}$$

where more generally the array

$$\begin{pmatrix} a & b \\ c & d \end{pmatrix}$$

stands for the transformation that sends the pair (x, y) to the pair (ax+by, cx+dy). This kind of array is called a matrix, and when we want to compose two operations like f and g together, all we have to do is combine the associated matrices under the rule that says that the matrix

$$\begin{pmatrix} a & b \\ c & d \end{pmatrix}$$

composed with the matrix

$$\begin{pmatrix} p & q \\ r & s \end{pmatrix}$$

equals the matrix

$$\begin{pmatrix} ap+br & aq+bs \\ cp+dr & cq+ds \end{pmatrix}$$
For more about where this formula comes from, see my Mathematical Enchantments essay “What Is A Matrix?”.
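The composition rule can be checked mechanically: applying one matrix and then another to a point gives the same answer as applying their product once. The matrices and the test point below are arbitrary illustrative choices, not taken from the essay.

```python
def apply(M, v):
    """Apply a 2-by-2 matrix to a point: (x, y) -> (ax + by, cx + dy)."""
    (a, b), (c, d) = M
    x, y = v
    return (a*x + b*y, c*x + d*y)

def compose(M, N):
    """The matrix product: entry (i, j) is sum over k of M[i][k] * N[k][j]."""
    return [[sum(M[i][k] * N[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

A = [[2, 1], [0, 3]]
B = [[1, -1], [4, 5]]
v = (7, -2)

# Applying B, then A, agrees with applying the single product matrix A·B.
print(apply(A, apply(B, v)))      # (36, 54)
print(apply(compose(A, B), v))    # (36, 54)
```

This agreement for every point v is precisely where the matrix-multiplication formula comes from.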

There’s nothing special about 2-by-2 matrices; you could compose two 3-by-3 matrices, or even two 1000-by-1000 matrices. Going in the other direction (smaller instead of bigger), if you look at 1-by-1 matrices, the composition of

$$\begin{pmatrix} a \end{pmatrix}$$

and

$$\begin{pmatrix} b \end{pmatrix}$$

is just

$$\begin{pmatrix} ab \end{pmatrix}$$

so ordinary number-multiplication arises as a special case of matrix composition; turning this around, we can see matrix-composition as a sort of generalized multiplication. So it was natural for mid-19th-century mathematicians to start using words like “multiply” and “product” instead of words like “compose” and “composition”, at roughly the same time they stopped talking about “substitutions” and “tableaux” and started to use the word “matrices”.
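The 1-by-1 case can be checked in a line or two (a trivial illustrative sketch, not from the essay):

```python
def matmul1(A, B):
    # A 1-by-1 "matrix" is just a number in brackets: [[a]] · [[b]] = [[a*b]].
    return [[A[0][0] * B[0][0]]]

print(matmul1([[6]], [[7]]))  # [[42]] -- ordinary multiplication, recovered
```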

In importing the centuries-old symbolism for number multiplication into the new science of linear algebra, the 19th century algebraists were saying “Matrices behave kind of like numbers,” with the proviso “except when they don’t”. Witt is right when he says that when A and B are matrices, A times B is not always equal to B times A. Where he’s wrong is in asserting that this is a blemish on linear algebra. Many mathematicians regard linear algebra as one of the most elegant sub-disciplines of mathematics ever devised, and it often serves as a role model for the kind of sleekness that a new mathematical discipline should strive to achieve. If you dislike matrix multiplication because AB isn’t always equal to BA, it’s because you haven’t yet learned what matrix multiplication is good for in math, physics, and many other subjects. It’s ironic that Witt invokes the notion of symmetry to disparage matrix multiplication, since matrix theory and an allied discipline called group theory are the tools mathematicians use in fleshing out our intuitive ideas about symmetry that arise in art and science.

So how did an intelligent person like Witt go so far astray?

PROOFS VS CALCULATIONS

I’m guessing that part of Witt’s confusion arises from the fact that actually multiplying matrices of numbers to get a matrix of bigger numbers can be very tedious, and tedium is psychologically adjacent to distaste and a perception of ugliness. But the tedium of matrix multiplication is tied up with its symmetry (whose existence Witt mistakenly denies). When you multiply two n-by-n matrices A and B in the straightforward way, you have to compute n² numbers in the same unvarying fashion, and each of those n² numbers is the sum of n terms, and each of those n terms is the product of an element of A and an element of B in a simple way. It’s only human to get bored and inattentive and then make mistakes because the process is so repetitive. We tend to think of symmetry and beauty as synonyms, but sometimes excessive symmetry breeds ennui; repetition in excess can be repellent. Picture the Library of Babel and the existential dread the image summons.
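The bookkeeping in that paragraph can be made concrete with a sketch of the straightforward algorithm, instrumented (purely for illustration) to count the scalar multiplications it performs:

```python
def naive_matmul(A, B):
    """Straightforward n-by-n matrix multiplication, counting the
    scalar multiplications performed along the way."""
    n = len(A)
    mults = 0
    C = [[0] * n for _ in range(n)]
    for i in range(n):           # n² output entries...
        for j in range(n):
            for k in range(n):   # ...each a sum of n products
                C[i][j] += A[i][k] * B[k][j]
                mults += 1
    return C, mults

n = 5
I = [[1 if i == j else 0 for j in range(n)] for i in range(n)]
_, count = naive_matmul(I, I)
print(count)  # n³ = 125 scalar multiplications for two 5-by-5 matrices
```

Those n³ identical multiply-and-add steps are exactly the repetitive labor the paragraph describes.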

G. H. Hardy, whose famous remark Witt quotes, was in the business of proving theorems, and he favored conceptual proofs over calculational ones. If you showed him a proof of a theorem in which the linchpin of your argument was a 5-page verification that a certain matrix product had a particular value, he’d say you didn’t really understand your own theorem; he’d assert that you should find a more conceptual argument and then consign your brute-force proof to the trash. But Hardy’s aversion to brute force was specific to the domain of mathematical proof, which is far removed from math that calculates optimal pricing for annuities or computes the wind-shear on an airplane wing or fine-tunes the weights used by an AI. Furthermore, Hardy’s objection to your proof would focus on the length of the calculation, and not on whether the calculation involved matrices. If you showed him a proof that used 5 turgid pages of pre-19th-century calculation that never mentioned matrices once, he’d still say “Your proof is a piece of temporary mathematics; it convinces the reader that your theorem is true without truly explaining why the theorem is true.”

If you forced me at gunpoint to multiply two 5-by-5 matrices together, I’d be extremely unhappy, and not just because you were threatening my life; the task would be inherently unpleasant. But the same would be true if you asked me to add together a hundred random two-digit numbers. It’s not that matrix-multiplication or number-addition is ugly; it’s that such repetitive tasks are the diametrical opposite of the kind of conceptual thinking that Hardy loved and I love too. Any kind of mathematical content can be made stultifying when it’s stripped of its meaning and reduced to mindless toil. But that casts no shade on the underlying concepts. When we outsource number-addition or matrix-multiplication to a computer, we rightfully delegate the soul-crushing part of our labor to circuitry that has no soul. If we could peer into the innards of the circuits doing all those matrix multiplications, we would indeed see a nightmarish, Borgesian landscape, with billions of nails being hammered into billions of boards, over and over again. But please don’t confuse that labor with mathematics.

Join the discussion of this essay over at Hacker News!

This essay is related to chapter 10 (“Out of the Womb”) of a book I’m writing, tentatively called “What Can Numbers Be?: The Further, Stranger Adventures of Plus and Times”. If you think this sounds interesting and want to help me make the book better, check out http://jamespropp.org/readers.pdf. And as always, feel free to submit comments on this essay at the Mathematical Enchantments WordPress site!

ENDNOTES

#1. Note the New Yorker-ish diaeresis in “preëmptively”: as long as I’m being critical, I might as well be diacritical.

#2. I know this convention may seem backwards on first acquaintance, but this is how ◦ is defined. Blame the people who first started writing things like “log x” and “cos x”, with the x coming after the name of the operation. This led to the notation f(x) for the result of applying the function f to the number x. Then the symbol for the result of applying g to the result of applying f to x is g(f(x)); even though f is performed first, “f” appears to the right of “g”. From there, it became natural to write the function that sends x to g(f(x)) as “g ◦ f”.
