The Training Example Lie Bracket

Original link: https://pbement.com/posts/lie_brackets/

This work investigates how the order of training examples affects what a neural network learns, a phenomenon an ideal Bayesian model would not exhibit. The core idea is to treat each training example as a vector field indicating the direction of the parameter update. By computing the Lie bracket of these vector fields, a mathematical operation that measures their non-commutativity, one can quantify the effect of swapping two training examples. The Lie bracket captures the difference in final parameter values depending on update order, and its magnitude scales with the square of the learning rate, so even small reorderings can accumulate over training. The study explicitly computes these Lie brackets for a convolutional network (a modified MXResNet) trained on the CelebA dataset. The results show that swapping examples *does* perturb the network's predictions (logits) on test data, and that the size of this perturbation changes over the course of training. Building on prior work connecting the Lie bracket to implicit biases in neural networks, this work provides a concrete way to analyze order dependence in practice.


Original article


An ideal machine learning model would not care what order training examples appeared in its training process. From a Bayesian perspective, the training dataset is unordered data and all updates based on seeing one additional example should commute with each other. For neural nets trained by gradient descent, however, this is not the case. This webpage will explain how to compute the effects of swapping the order of two training examples on a per-parameter level, and show the results of computing these quantities for a simple convnet model.

To get started, we just need to recognize one simple mathematical fact:

Training Examples are Vector Fields

If we are training a neural network with parameters $\theta \in \Theta = \mathbb{R}^\text{num params}$, then we can treat each training example as a vector field. In particular, if $x$ is a training example and $\mathcal{L}^{(x)}$ is the per-example loss for the training example $x$, then this vector field is:

$$ v^{(x)}(\theta) = -\nabla_{\theta} \mathcal{L}^{(x)} $$

In other words, for a specific training example, the arrows of the resulting vector field point in the direction that the parameters should be updated.

In this view, a gradient update is simply a step along this vector field, scaled by the learning rate $\epsilon$:

$$ \theta' = \theta + \epsilon v^{(x)}(\theta). $$
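As a concrete illustration (our own toy example, not from the post), consider squared-error regression, where the per-example field has a closed form and a gradient step is one move along it:

```python
import numpy as np

def example_field(a, b):
    """Vector field for one training example (a, b) under the
    per-example loss L(theta) = 0.5 * (a @ theta - b)**2.
    Returns v(theta) = -grad L = -(a @ theta - b) * a."""
    def v(theta):
        return -(a @ theta - b) * a
    return v

# One gradient step moves theta along the field by the learning rate.
vx = example_field(np.array([1.0, 2.0]), b=3.0)
theta = np.zeros(2)
eps = 0.1
theta_next = theta + eps * vx(theta)  # theta' = theta + eps * v(theta)
```

Here `example_field` is a hypothetical helper; any differentiable per-example loss gives such a field.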

The Training Example Lie Bracket

One thing we can do with vector fields is to compute their Lie bracket. So if $x, y$ are training examples, we may compute:

$$ [v^{(x)}, v^{(y)}] = (v^{(x)}\cdot \nabla_\theta) v^{(y)} - (v^{(y)}\cdot \nabla_\theta) v^{(x)} $$

We can compute the Lie bracket of any two vector fields on $\Theta$, and so we can certainly compute the Lie bracket of the vector fields arising from two training examples. The Lie bracket of two training examples tells us about the order dependence of training on those examples. The Lie bracket of two vector fields is itself a vector field, and so, just like a gradient, we get a Lie bracket tensor for each parameter tensor, of the same shape as that parameter tensor.
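For vector fields available only as black-box callables, the two directional derivatives in the bracket can be approximated by central differences. This is a sketch under our own assumptions (the function name and step size `h` are our choices, not the post's method, which works with autodiff on a real network):

```python
import numpy as np

def lie_bracket(u, v, theta, h=1e-5):
    """Approximate [u, v](theta) = (u . grad) v - (v . grad) u
    via central differences along each field's direction."""
    du_v = (v(theta + h * u(theta)) - v(theta - h * u(theta))) / (2 * h)
    dv_u = (u(theta + h * v(theta)) - u(theta - h * v(theta))) / (2 * h)
    return du_v - dv_u

# Sanity check on linear fields u = A theta, v = B theta,
# whose bracket is exactly (B A - A B) theta.
A = np.array([[0.0, 1.0], [0.0, 0.0]])
B = np.array([[0.0, 0.0], [1.0, 0.0]])
u = lambda th: A @ th
v = lambda th: B @ th
theta = np.array([1.0, 2.0])
approx = lie_bracket(u, v, theta)
exact = (B @ A - A @ B) @ theta
```

For linear fields the central difference is exact up to floating point, which makes this a convenient correctness check.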

The Lie Bracket Tells us About Order-Dependence

We can interpret this quantity as the difference between updating on $x$ before $y$ vs after. Let's Taylor expand to see this. If $\epsilon$ is the learning rate, we'll want to expand to $O(\epsilon^2)$:

$$\theta' = \theta + \epsilon v^{(x)}(\theta)$$ $$ \theta'' = \theta' + \epsilon v^{(y)}(\theta') $$ $$= \theta + \epsilon v^{(x)}(\theta) + \epsilon v^{(y)}(\theta) + \epsilon^2 (v^{(x)}(\theta) \cdot \nabla_\theta) v^{(y)}(\theta) + O(\epsilon^3)$$

Now if we update $x,y$ in the other order, we get an $O(\epsilon^2)$ difference in the resulting parameters $\theta''$. Namely:

$$ \Delta \theta'' = \epsilon^2 \left( (v^{(x)}(\theta) \cdot \nabla_\theta) v^{(y)}(\theta) - (v^{(y)}(\theta) \cdot \nabla_\theta) v^{(x)}(\theta) \right) $$ $$ \Delta \theta'' = \epsilon^2 [v^{(x)}, v^{(y)}] (\theta) $$

So here we can see the significance of the Lie bracket: It tells us the difference in where our parameters end up based on which order we show the training examples in.
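This $O(\epsilon^2)$ claim can be checked numerically on toy squared-error fields, where each field is affine and its Jacobian is known in closed form. Everything below (the example vectors, the helper names) is an illustrative assumption, not the post's actual experiment:

```python
import numpy as np

def field(a, b):
    """Per-example update field for the loss 0.5 * (a @ theta - b)**2."""
    return lambda theta: -(a @ theta - b) * a

ax, ay = np.array([1.0, 0.5]), np.array([0.3, 1.0])
vx, vy = field(ax, 2.0), field(ay, -1.0)

def two_steps(theta, first, second, eps):
    """Apply a gradient step on `first`, then one on `second`."""
    theta = theta + eps * first(theta)
    return theta + eps * second(theta)

theta0 = np.array([0.2, -0.4])
eps = 1e-3
delta = two_steps(theta0, vx, vy, eps) - two_steps(theta0, vy, vx, eps)

# Compare with eps**2 * [vx, vy](theta0). Each field has constant
# Jacobian J_a = -a a^T, so (vx . grad) vy = J_y vx, etc.
Jx, Jy = -np.outer(ax, ax), -np.outer(ay, ay)
bracket = Jy @ vx(theta0) - Jx @ vy(theta0)
```

Because these fields are affine, the Taylor expansion terminates and `delta` matches `eps**2 * bracket` exactly, up to floating point; for a real network the agreement would hold only to $O(\epsilon^3)$.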

Note that since a minibatch gradient averages the per-example gradients and the Lie bracket is bilinear, the effect of swapping two minibatches is the average of the brackets over all pairs of examples drawn from the two minibatches.
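Concretely, writing each minibatch field as the mean of its per-example fields, bilinearity of the bracket gives:

$$ \left[ \frac{1}{n}\sum_{i=1}^{n} v^{(x_i)},\; \frac{1}{m}\sum_{j=1}^{m} v^{(y_j)} \right] = \frac{1}{nm} \sum_{i=1}^{n} \sum_{j=1}^{m} \left[ v^{(x_i)}, v^{(y_j)} \right] $$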

Prior Work

When searching the literature for work on the Lie brackets of training examples, the earliest description we found was by Dherin (2023), who connects the bracket's ability to measure the non-commutativity of updates to implicit biases in neural net training.

We go further here by explicitly computing bracket values at various checkpoints during the training of an actual convnet.

Experiment Details

We replicate the MXResNet architecture (without attention layers) and train it on the CelebA dataset for 5000 steps at a batch size of 32, saving weight checkpoints from time to time. The optimizer is Adam, with the following parameters:


lr = 5e-3
betas = (0.8, 0.999)

The CelebA dataset has 40 binary attributes (such as Male or Black_Hair) and the neural net is tasked with predicting each of these independently and simultaneously (averaged binary classification loss).
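The averaged multi-label objective can be sketched as follows. This is a minimal NumPy version written from the description above; the post does not give its actual loss code:

```python
import numpy as np

def multilabel_bce(logits, targets):
    """Mean binary cross-entropy over all attributes and the batch.
    logits: (batch, 40) raw scores; targets: (batch, 40) in {0, 1}."""
    # Numerically stable log-sigmoid: log(sigmoid(z)) = -log(1 + exp(-z))
    log_p = -np.logaddexp(0.0, -logits)
    log_not_p = -np.logaddexp(0.0, logits)
    return -np.mean(targets * log_p + (1 - targets) * log_not_p)

logits = np.zeros((2, 40))    # sigmoid(0) = 0.5 for every attribute
targets = np.ones((2, 40))
loss = multilabel_bce(logits, targets)  # -log(0.5), about 0.693
```

Averaging over the 40 attributes keeps the per-example loss a single scalar, so the vector-field picture above applies unchanged.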

We evaluated each checkpoint of the model on a batch of 32 examples from the test set. We computed Lie brackets between only the first 6 of these test examples to limit disk space usage, as each individual Lie bracket has the same size as a full checkpoint of the model. For each of these brackets representing a swap of two examples, we show how all 40 logits for all 32 test examples in the batch are perturbed when the two examples are swapped.

We have some things to say about the results, but first try exploring them yourself! The slider controls which checkpoint from the training process we're examining, and you can click on the buttons to see data about particular Lie brackets. $[u_i, u_j] = -[u_j, u_i]$ so brackets across the diagonal from each other are just negatives of each other.
