有效样本量

有效样本量
The Effective Sample Size

原始链接: https://alex.smola.org/posts/40-effective-sample-size/

通过数据重加权（使用权重 $\beta_i = p(x_i)/q(x_i)$）来校正协变量偏移，虽然消除了偏差，但却显著增加了方差。当权重不均匀时，少数数据点会主导估计结果，导致大部分样本实际上变得无效。这一现象可以通过**基什有效样本量（Kish’s effective sample size）**（$n_{\mathrm{eff}} = 1/\sum \alpha_i^2$）来量化，它衡量了加权样本的实际“信息含量”。无论是通过加权和的方差还是通过尾部概率的霍夫丁不等式（Hoeffding’s inequality）进行分析，结论都是一致的：加权估计量的统计表现等同于样本量为 $n_{\mathrm{eff}}$ 的无加权估计量。这一概念在离线强化学习等领域至关重要，通过监测 $n_{\mathrm{eff}}$ 可以诊断回放缓冲区（replay buffer）的“陈旧程度”。当当前策略偏离行为策略时，权重会变得集中，$n_{\mathrm{eff}}$ 随之骤降，这表明缓冲区的有效信息已耗尽。将 $n_{\mathrm{eff}}$ 视为一种诊断控制信号，有助于进行更稳健的算法调整，例如在粒子滤波中触发重采样，或在强化学习中校准更新步长。

对不起。

原文

A cartoon row of birds perched on a power line: three plump, oversized birds dominate the wire while several small birds sit between them, a visual metaphor for a few heavily weighted observations dwarfing many lightly weighted ones.

Years ago I wrote about correcting covariate shift by reweighting your data. Your features come from the wrong distribution $q$, you care about a target $p$, so you weight every observation by $\beta_i = p(x_i)/q(x_i)$ and your estimates are unbiased again. I ended that post by admitting the weights “can be quite a bit off,” and waved at fixing it another day.

Here is the more basic question I skipped. Even when the weights are exactly right, what do they cost you? Reweighting buys you less bias. It charges you variance, and the currency of that charge has a name.

Why correct weights still hurt

The intuition is in the picture below. Data from $N(0,1)$, target $N(\mu,1)$, so the correct weight is $w(x) = e^{\mu x - \mu^2/2}$. It is wildly uneven: the rare points sitting where $p$ has its mass get enormous weight, and everything else gets almost none. At a shift of only $\mu = 2$, the heaviest 1% of observations carry 37% of the total weight, and the effective fraction of usable data has already fallen below 2%.

Left: effective sample size ratio n_eff/n = e^(-mu^2) vs covariate shift mu, collapsing near zero by mu=2. Right: Lorenz-style curve showing the heaviest 1% of importance weights carrying 37% of total weight at mu=2.

When a few points dominate the estimate, the rest of the data is effectively ignored, and any error in that handful, an outlier, a mislabeled point, a slightly wrong weight, runs straight into your answer with nothing to average it out. You did not get $n$ observations. You got a few, dressed up as many. So you would like to know: how many did you actually get?

Counting what you actually got

Normalize the weights so that $\|\alpha\|_1 = \sum_i \alpha_i = 1$, and define

\[ n_{\mathrm{eff}} = \frac{1}{\|\alpha\|_2^2} = \frac{1}{\sum_i \alpha_i^2}. \]

Equal weights $\alpha_i = 1/n$ give $\sum_i \alpha_i^2 = 1/n$, hence $n_{\mathrm{eff}} = n$: nothing wasted. All the weight on one point gives $n_{\mathrm{eff}} = 1$: a sample of size one. Everything else lands in between. This is Kish’s effective sample size. The only thing left to explain is why that function is the right one. Here are two derivations, from opposite ends of the field, that both produce the same outcome.

Variance of a sum of Normal random variables

Let the $x_i$ be iid $N(0,1)$ and form the weighted average $\bar x = \sum_i \alpha_i x_i$. By independence,

\[ \mathrm{Var}[\bar x] = \sum_i \alpha_i^2 \,\mathrm{Var}[x_i] = \|\alpha\|_2^2. \]

A plain average of $m$ iid unit-variance variables has variance $1/m$. Set $1/m = \|\alpha\|_2^2$ and you get $m = \|\alpha\|_2^{-2} = n_{\mathrm{eff}}$. The weighted average is exactly as noisy as an unweighted average over $n_{\mathrm{eff}}$ fresh draws. That is the quickest route to the definition: one line of variance algebra.

Hoeffding’s inequality

Variance is an average-case statement. The same quantity controls the worst case. Let the $x_i$ be iid in $[0,1]$ with mean $\mathbb{E}[x_i]$, and take $\bar x = \sum_i \alpha_i x_i$ again. Each term lives in an interval of width $\alpha_i$, so Hoeffding’s inequality, the Chernoff bounding argument for bounded variables, gives

\[ \Pr\big(|\bar x - \mathbb{E}[\bar x]| \ge t\big) \le 2\exp\!\left(-\frac{2t^2}{\sum_i \alpha_i^2}\right) = 2\exp\!\left(-2\,n_{\mathrm{eff}}\,t^2\right). \]

The textbook bound for an equal-weight average is $2\exp(-2 n t^2)$. The two are identical except that $n$ has become $n_{\mathrm{eff}}$. Whether you measure spread by a variance or by a tail probability, the concentration of a reweighted sum is set not by how many points you have but by how many you effectively have.

Replay buffer

The effective sample size is the knob you want in off-policy reinforcement learning. A replay buffer is data collected under earlier policies, but you want to improve the policy you are running now. The correction is the same one as covariate shift: weight each stored transition by the ratio $\pi/b$ of the current policy $\pi$ to the behaviour policy $b$ that generated it. As the current policy pulls away from the buffer, those weights concentrate, and the effective sample size of the buffer, measured against the policy you actually care about, collapses along a curve like the one above.

In this case $n_{\mathrm{eff}}$ is not a number you read off after the fact. It becomes a diagnostic control signal: how much real information the buffer still holds, when the data has gone too stale to reuse, and how large an update the current batch can support. Calibrating the algorithm to its own effective sample size is exactly what P3O does, and what we implement in FeynRL. That is the next post.

There are many more applications of the effective sample size. For instance, in Sequential Monte Carlo, aka the Particle Filter, this is used as a diagnostic to decide when it’s time to resample the current distribution to obtain a more evenly weighted set of particles. But that’s a story for another day.