In the previous chapter, we assumed that we can uniformly randomly sample our domain. However, it’s not obvious how to actually do so—in fact, how can a deterministic computer even generate random numbers?
Pseudo-Random Numbers
Fortunately, Monte Carlo methods don’t need truly random numbers. Instead, we can use a pseudo-random number generator (PRNG). A PRNG produces a deterministic stream of numbers that look uniformly random:
By “look uniformly random,” we mean the sequence exhibits certain statistical properties:
- Uniformity: samples are evenly distributed.
- Independence: previous samples cannot be used to predict future samples.
- Aperiodicity: the sequence of samples does not repeat.
Deterministic generators cannot fully achieve these properties, but can get pretty close, in a precise sense. Here, we will use the PCG family of generators, which are performant, small, and statistically robust.
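To make this concrete, here is a minimal Python sketch of a 32-bit PCG generator (the "XSH RR" variant). The multiplier, seeding procedure, and output permutation follow the publicly documented reference implementation, but treat this as an illustration of how little machinery a good PRNG needs, not as a tuned drop-in replacement.

MASK64 = (1 << 64) - 1
MASK32 = (1 << 32) - 1
MULT = 6364136223846793005

class PCG32:
    def __init__(self, seed, seq=0):
        # The stream constant must be odd; seeding mirrors the reference code.
        self.inc = ((seq << 1) | 1) & MASK64
        self.state = 0
        self.next_u32()
        self.state = (self.state + seed) & MASK64
        self.next_u32()

    def next_u32(self):
        old = self.state
        self.state = (old * MULT + self.inc) & MASK64   # LCG state transition
        xorshifted = (((old >> 18) ^ old) >> 27) & MASK32
        rot = old >> 59
        # Output function: xorshift, then rotate by the top bits of the state.
        return ((xorshifted >> rot) | (xorshifted << ((-rot) & 31))) & MASK32

    def uniform(self):
        # Map the 32-bit output to a float in [0, 1).
        return self.next_u32() / 2**32

rng = PCG32(seed=42)
samples = [rng.uniform() for _ in range(4)]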
PRNGs give us uniformly random scalars, but we ultimately want to sample complex, high-dimensional domains. Fortunately, we can build up samplers for interesting distributions using a few simple algorithms.
Uniform Rejection Sampling
Rejection sampling transforms a sampler for a simple domain $$D$$ into a sampler for a complex domain $$\Omega$$, where $$\Omega \subseteq D$$. All we need is a function $$\text{accept}$$ that indicates whether a point $$\mathbf{x} \in D$$ is also contained in $$\Omega$$.
Let’s build a rejection sampler for the two-dimensional unit disk. First, we’ll choose $$D = [-1,1]\times[-1,1]$$, which clearly encloses $$\Omega$$. We may use a PRNG to produce a sequence of uniform samples of $$[-1,1]$$, denoted as $$\xi_i$$. Taking each pair $$D_i = (\xi_{2i},\xi_{2i+1})$$ then provides samples of $$D$$.
Second, we’ll define $$\text{accept}(\mathbf{x})$$—for the unit disk, we may check that $$||\mathbf{x}|| \le 1$$. Now, the rejection sampler:
def Ω():
    x = D()
    if accept(x):
        return x
    return Ω()
In other words, sample $$D$$, and if the result is not in $$\Omega$$, just try again!
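Here is the disk sampler written out as runnable Python—a sketch that stands in for $$D$$ and $$\text{accept}$$ using the built-in random module as the uniform scalar source:

import random

def sample_square():
    # Uniform sample of D = [-1, 1] x [-1, 1].
    return (random.uniform(-1.0, 1.0), random.uniform(-1.0, 1.0))

def accept(p):
    # Is the point also inside Ω, the unit disk?
    x, y = p
    return x * x + y * y <= 1.0

def sample_disk():
    # Same logic as above, written as a loop: resample until accepted.
    while True:
        p = sample_square()
        if accept(p):
            return p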
Intuitively, rejection sampling filters out samples that aren’t in $$\Omega$$. Hence, if we start with uniform samples of $$D$$, we should be left with uniform samples of $$\Omega$$.
To formalize our reasoning, let’s derive our sampler’s PDF, denoted $$f_\text{rej}$$. To produce a sample $$\mathbf{x}$$, we must first sample it from $$f_D$$, then accept it. Therefore $$f_\text{rej}(\mathbf{x})$$ is equivalent to $$f_D(\mathbf{x})$$ conditioned on $$\mathbf{x}$$ being accepted.
\[ \begin{align*} f_\text{rej}(\mathbf{x}) &= f_{D\ |\ \text{accept}}(\mathbf{x}) \\ &= \frac{\mathbb{P}\left\{\text{accept}\ |\ \mathbf{x}\right\}f_{D}(\mathbf{x})}{\mathbb{P}\left\{\text{accept}\right\}} \tag{Bayes' rule}\\ &= \frac{1\cdot \frac{1}{\text{Vol}(D)}}{\frac{\text{Vol}(\Omega)}{\text{Vol}(D)}} = \frac{1}{\text{Vol}(\Omega)} \tag{$f_D$ is uniform} \end{align*} \]
$$\text{Vol}$$ indicates the volume (in two dimensions, area) of a domain. Therefore $$f_\text{rej}$$ is indeed uniform on $$\Omega$$.
Non-Uniform Rejection Sampling
As we saw with Monte Carlo integration, rejection sampling can be straightforwardly extended to work with non-uniform distributions.
Let’s say the PDF of our distribution on $$D$$ is $$f_D(\mathbf{x})$$, and we want to use it to sample from $$\Omega$$ with PDF $$f_\Omega(\mathbf{x})$$. We already know that $$\Omega \subseteq D$$, but we’ll also need to check a slightly stricter condition—that the ratio between our PDFs has a finite upper bound, denoted $$c$$.
\[ \begin{align*} c = \sup_{\mathbf{x}\in\Omega}\frac{f_\Omega(\mathbf{x})}{f_D(\mathbf{x})} \end{align*} \]
Above, we required $$\Omega \subseteq D$$ because it would otherwise be impossible to sample all parts of $$\Omega$$. Here, we need a finite $$c$$ for essentially the same reason—it rules out regions of $$\Omega$$ that $$f_D$$ samples vanishingly rarely relative to $$f_\Omega$$.
Once we have $$c$$, we just need to update $$\text{accept}$$. Now, we will accept a sample $$\mathbf{x}$$ with probability $$\frac{f_\Omega(\mathbf{x})}{cf_D(\mathbf{x})}$$, which is always at most $$1$$.
def accept(x):
    return random(0, 1) < f_Ω(x) / (c * f_D(x))
Intuitively, we’re transforming the probability density at $$\mathbf{x}$$ from $$f_D(\mathbf{x})$$ to $$f_\Omega(\mathbf{x})$$ by accepting $$\mathbf{x}$$ with probability proportional to $$\frac{f_\Omega(\mathbf{x})}{f_D(\mathbf{x})}$$. Note that if $$f_D$$ is uniform, we directly accept $$\mathbf{x}$$ with probability proportional to $$f_\Omega(\mathbf{x})$$.
For example, given uniform $$f_D$$ and $$f_\Omega(\mathbf{x}) \propto \frac{1}{1+||\mathbf{x}||^2}$$:
As you’d expect, we see a greater proportion of accepted samples towards the center, where $$f_\Omega(\mathbf{x})$$ is largest.
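Here is that example as a Python sketch. Note that the target density is left unnormalized—rejection sampling tolerates this, since the acceptance test only needs the target up to a constant factor; the names below are illustrative.

import random

F_D = 0.25                      # uniform density on the 2x2 square D = [-1, 1]^2
C = 1.0 / F_D                   # bounds target / F_D, since target(x) <= 1

def target(p):
    # Unnormalized target density, proportional to 1 / (1 + ||x||^2).
    x, y = p
    return 1.0 / (1.0 + x * x + y * y)

def sample_target():
    while True:
        p = (random.uniform(-1.0, 1.0), random.uniform(-1.0, 1.0))
        # Accept with probability target(p) / (C * F_D).
        if random.random() < target(p) / (C * F_D):
            return p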
Finally, let’s check that our sampler’s PDF is actually $$f_\Omega(\mathbf{x})$$. Like above, the PDF is equivalent to $$f_D(\mathbf{x})$$ conditioned on $$\mathbf{x}$$ being accepted.
\[ \begin{align*} f_\text{rej}(\mathbf{x}) &= f_{D\ |\ \text{accept}}(\mathbf{x}) \\ &= \frac{\mathbb{P}\left\{\text{accept}\ |\ \mathbf{x}\right\}f_{D}(\mathbf{x})}{\mathbb{P}\left\{\text{accept}\right\}} \tag{Bayes' rule}\\ &= \frac{\frac{f_\Omega(\mathbf{x})}{cf_D(\mathbf{x})} f_{D}(\mathbf{x})}{\int_D \frac{f_\Omega(\mathbf{x})}{cf_D(\mathbf{x})} f_D(\mathbf{x}) \, d\mathbf{x}} \tag{$\mathbb{P}\left\{\text{accept}\right\} = \mathbb{E}\left[\frac{f_\Omega(\mathbf{x})}{cf_D(\mathbf{x})}\right]$}\\ &= \frac{f_\Omega(\mathbf{x})}{\int_D f_\Omega(\mathbf{x}) \, d\mathbf{x}} \tag{Algebra}\\ &= f_\Omega(\mathbf{x}) \tag{$\int_{D\supseteq\Omega} f_\Omega = 1$} \end{align*} \]
In the second step, we obtain the probability of accepting an arbitrary sample by computing the expected probability of accepting $$\mathbf{x}$$ over all $$\mathbf{x} \in D$$. In the fourth, note that we define $$f_\Omega = 0$$ outside of $$\Omega$$.
Sample Efficiency
Many practical problems can be solved using only rejection sampling and uniform Monte Carlo integration. Choosing $$D$$ to be a box enclosing $$\Omega$$ works in any number of dimensions—boxes are always easy to sample, as every dimension is independent.
However, rejection sampling is only efficient when $$f_\Omega$$ can make use of a significant proportion of the probability density in $$f_D$$. Each sample of $$f_\Omega$$ requires a geometrically distributed number of samples of $$f_D$$, with success probability $$\mathbb{P}\{\text{accept}\}$$:
\[\begin{align*} \mathbb{P}\left\{\text{accept}\right\} &= \mathbb{E}\left[\frac{f_\Omega(\mathbf{x})}{cf_D(\mathbf{x})}\right] \\ &= \int_D \frac{f_\Omega(\mathbf{x})}{cf_D(\mathbf{x})} f_D(\mathbf{x}) \, d\mathbf{x} \\ &= \frac{1}{c}\int_D f_\Omega(\mathbf{x})\, d\mathbf{x} = \frac{1}{c} \end{align*}\]
Since we have a $$\frac{1}{c}$$ chance of accepting each sample, we should expect to generate $$c$$ samples of $$f_D$$ for each sample of $$f_\Omega$$. Intuitively, a large $$c$$ means $$f_D$$ rarely samples regions that $$f_\Omega$$ samples frequently.
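For instance, in the uniform unit-disk sampler above, both PDFs are constant, so the bound reduces to a ratio of volumes:

\[ c = \sup_{\mathbf{x}\in\Omega}\frac{1/\text{Vol}(\Omega)}{1/\text{Vol}(D)} = \frac{\text{Vol}(D)}{\text{Vol}(\Omega)} = \frac{4}{\pi} \approx 1.27 \]

That is, we expect to draw about $$1.27$$ candidates from the square per accepted disk sample.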
For example, you may not want to use rejection sampling when $$\Omega$$ doesn’t cover much of $$D$$:
So, we’ll need to devise a more efficient sampling algorithm.
Inversion Sampling
Inversion sampling is a method for sampling any one-dimensional distribution with an invertible cumulative distribution function (CDF). The CDF of a random variable $$X$$, denoted as $$F_X(x)$$, measures the probability that a sample is less than $$x$$.
Intuitively, the CDF maps $$x$$ to the fraction of probability mass lying below $$x$$:
Hence, the inverse CDF $$F^{-1}_X(p)$$ maps a fraction of probability mass back to the corresponding $$x$$.
We may define an inversion sampler $$\text{Inv}$$ by uniformly sampling $$p$$ from $$[0,1]$$ and computing $$F_X^{-1}(p)$$. To characterize our sampler’s behavior, we can find its PDF. The probability that $$\text{Inv} = F_X^{-1}(p)$$ falls within a range of outcomes $$dx$$ is equivalent to the probability that $$p$$ falls within the corresponding range $$dp$$:
We sampled $$p$$ uniformly, so the probability that $$p$$ falls in $$dp$$ is the length of $$dp$$. The average probability density on $$dx$$ is then $$\frac{dp}{dx}$$. In the limit, this ratio is the slope of $$F_X$$—that is, its derivative!
\[\begin{align*} f_\text{Inv}(x) &= \frac{dp}{dx}\\ &= \frac{dF_X(x)}{dx} \tag{$p = F_X(x)$}\\ &= f_X(x) \end{align*}\]
Therefore, we have $$f_\text{Inv} = f_X$$. That implies inversion sampling works, but more rigorously, we can check that the CDF of our sampler matches $$F_X$$.
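Sketching that check—assuming $$F_X$$ is strictly increasing, so that $$F_X^{-1}$$ is well-defined and order-preserving:

\[\begin{align*} F_\text{Inv}(x) &= \mathbb{P}\left\{F_X^{-1}(p) \le x\right\} \\ &= \mathbb{P}\left\{p \le F_X(x)\right\} \\ &= F_X(x) \tag{$p$ is uniform on $[0,1]$} \end{align*}\]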
Since their CDFs are equivalent, we indeed have $$\text{Inv} \sim X$$.
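For a concrete one-dimensional example: the exponential distribution with rate $$\lambda$$ has CDF $$F_X(x) = 1 - e^{-\lambda x}$$, which inverts in closed form to $$F_X^{-1}(p) = -\frac{\ln(1-p)}{\lambda}$$, so inversion sampling becomes a one-liner. A minimal sketch:

import math
import random

def sample_exponential(lam=1.0):
    # F_X(x) = 1 - exp(-lam * x)  =>  F_X^{-1}(p) = -ln(1 - p) / lam
    p = random.random()          # uniform sample of [0, 1)
    return -math.log(1.0 - p) / lam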
Marginal Inversion Sampling
As stated, inversion sampling only applies to one-dimensional distributions. Fortunately, we can extend inversion to higher dimensions by iteratively sampling each dimension’s marginal distribution.
Let’s derive an inversion sampler for the two-dimensional distribution $$f_{XY}(x,y)$$. First, we’ll define the marginal distribution $$f_X(x)$$, which computes the total probability density at $$x$$ across all choices for $$y$$.
This distribution is one-dimensional, so we can use inversion sampling to choose a sample $$X$$. Second, we’ll compute the distribution of $$y$$ conditioned on $$X$$, which must be proportional to $$f_{XY}(X,y)$$.
\[\begin{align*} f_{Y\ |\ X}(y) &= \frac{f_{XY}(X,y)}{\int_{-\infty}^\infty f_{XY}(X,y)\, dy}\\ &= \frac{f_{XY}(X,y)}{f_X(X)} \end{align*}\]
Finally, we can apply inversion again to sample $$Y$$. Intuitively, $$f_{Y\ |\ X}$$ selects the correct distribution for $$y$$ given our choice of $$X$$. We will more rigorously explore why inversion sampling works in the next section.
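As a sketch, take the example joint density $$f_{XY}(x,y) = x + y$$ on $$[0,1]^2$$, chosen because both CDFs invert in closed form: the marginal CDF is $$F_X(x) = \frac{x^2 + x}{2}$$, and given $$X$$, the conditional CDF is $$F_{Y\,|\,X}(y) = \frac{Xy + y^2/2}{X + 1/2}$$.

import math
import random

def sample_xy():
    # Invert F_X(x) = (x^2 + x) / 2: solve x^2 + x - 2p = 0 for the root in [0, 1].
    p = random.random()
    x = (-1.0 + math.sqrt(1.0 + 8.0 * p)) / 2.0

    # Invert F_{Y|X}(y) = (x*y + y^2/2) / (x + 1/2):
    # solve y^2 + 2*x*y - p*(2*x + 1) = 0 for the root in [0, 1].
    p = random.random()
    y = -x + math.sqrt(x * x + p * (2.0 * x + 1.0))
    return (x, y)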
Changes of Coordinates
While marginal inversion sampling can build up arbitrarily high-dimensional distributions, it’s often not necessary in practice. That’s because inversion sampling is a special case of a more general technique for transforming random variables.
To illustrate, let’s attempt to define a uniform sampler for the unit disk. Unlike rejection sampling, we first need to choose a parameterization of our domain. A natural choice is polar coordinates, where $$\theta$$ is the angle with respect to the x-axis and $$r$$ is the distance from the origin.
\[\begin{align*} \Phi(r,\theta) &= (r\cos\theta, r\sin\theta)\\ \Phi^{-1}(x,y) &= (\sqrt{x^2+y^2}, \text{atan2}(y,x)) \end{align*}\]
The unit disk is hence described by $$\mathcal{S} = \Phi(\mathcal{R})$$, where $$\mathcal{R} = [0,1]\times[0,2\pi]$$. To produce a sampler for $$\mathcal{S}$$, we could try mapping uniform samples of $$\mathcal{R}$$ onto $$\mathcal{S}$$:
But uniform samples of $$\mathcal{R}$$ don’t become uniform samples of $$\mathcal{S}$$. That’s because transforming from polar to rectangular coordinates didn’t preserve area—smaller radii contain less area, yet we weighted all $$r$$ equally.
To determine what went wrong, let’s find the PDF of this sampler. The key observation is that a sample $$\mathbf{s}$$ falls within a circular patch $$d\mathcal{S}$$ if and only if $$\mathbf{r} = \Phi^{-1}(\mathbf{s})$$ falls within the corresponding rectangle $$d\mathcal{R}$$.
Hence, the probabilities of sampling either region must be equivalent.
\[ \int_{d\mathcal{S}} f_\mathcal{S}(\mathbf{s})\, d\mathbf{s} = \int_{d\mathcal{R}} f_\mathcal{R}(\mathbf{r})\, d\mathbf{r} \]
In the limit, these integrals reduce to the respective PDF times the area of the patch.
\[\begin{align*} && f_\mathcal{S}(\mathbf{s})\cdot|d\mathcal{S}| &= f_\mathcal{R}(\mathbf{r})\cdot|d\mathcal{R}|\\ &\implies& f_\mathcal{S}(\mathbf{s}) &= f_\mathcal{R}(\mathbf{r})\cdot\frac{|d\mathcal{R}|}{|d\mathcal{S}|} \end{align*}\]
Intuitively, the ratio of areas $$\frac{|d\mathcal{R}|}{|d\mathcal{S}|}$$ tells us how much $$d\mathcal{S}$$ is squashed or stretched when mapped onto $$d\mathcal{R}$$. For example, if $$d\mathcal{S}$$ is scaled down by a factor of two, $$d\mathcal{R}$$ must contain twice the probability density.
Finally, since $$\Phi^{-1}$$ maps $$\mathcal{S}$$ to $$\mathcal{R}$$, the area scaling factor is given by its derivative:
\[\begin{align*} f_\mathcal{S}(\mathbf{s}) &= f_\mathcal{R}(\mathbf{r})\cdot\frac{|d\mathcal{R}|}{|d\mathcal{S}|} \\ &= f_\mathcal{R}(\Phi^{-1}(\mathbf{s}))\cdot\left|\frac{d\Phi^{-1}(\mathcal{S})}{d\mathcal{S}}\right| \tag{$\mathcal{R} = \Phi^{-1}(\mathcal{S})$}\\ &=f_\mathcal{R}(\Phi^{-1}(\mathbf{s}))\cdot |D\Phi^{-1}| \end{align*}\]
Here, $$|D\Phi^{-1}|$$ denotes the (absolute value of the) determinant of the Jacobian of $$\Phi^{-1}$$, evaluated at $$\mathbf{s}$$.
Sampling via Change of Coordinates
Now that we know the relationship between $$f_\mathcal{S}$$ and $$f_\mathcal{R}$$, we can choose a different, non-uniform $$f_\mathcal{R}$$ that will produce a uniform $$f_\mathcal{S}$$. Our new PDF will need to cancel out the factor of $$|D\Phi^{-1}|$$:
\[\begin{align*} |D\Phi^{-1}| &= \left|\begin{bmatrix}\frac{\partial r}{\partial x} & \frac{\partial \theta}{\partial x}\\ \frac{\partial r}{\partial y} & \frac{\partial \theta}{\partial y}\end{bmatrix}\right|\\ &= \frac{\partial r}{\partial x}\cdot\frac{\partial \theta}{\partial y} - \frac{\partial \theta}{\partial x}\cdot\frac{\partial r}{\partial y} \\ &= \frac{x}{\sqrt{x^2+y^2}} \cdot \frac{x}{x^2+y^2} + \frac{y}{x^2+y^2} \cdot \frac{y}{\sqrt{x^2+y^2}}\\ &= \frac{1}{\sqrt{x^2+y^2}} = \frac{1}{r} \end{align*}\]
Proportionality with $$\frac{1}{r}$$ makes sense—our misbehaving sampler produced too many samples near the origin. If we instead sample $$\mathcal{R}$$ according to $$f_\mathcal{R}(r,\theta) = \frac{r}{\pi}$$ (which integrates to one over $$\mathcal{R}$$), we’ll end up with a uniform $$f_\mathcal{S}$$.
\[\begin{align*} f_\mathcal{S}(x,y) &= f_\mathcal{R}(\Phi^{-1}(x,y))\cdot|D\Phi^{-1}| \\ &= \frac{r}{\pi}\cdot \frac{1}{r} = \frac{1}{\pi} \end{align*}\]
This is indeed the uniform density on the unit disk, whose area is $$\pi$$.
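Putting the pieces together as a code sketch: under $$f_\mathcal{R}(r,\theta) = \frac{r}{\pi}$$, the marginal density of $$r$$ is $$2r$$, whose CDF $$r^2$$ inverts to $$r = \sqrt{p}$$, while $$\theta$$ is uniform on $$[0,2\pi]$$; we then map through $$\Phi$$.

import math
import random

def sample_disk_polar():
    # Inversion-sample r from density 2r (CDF r^2), sample θ uniformly,
    # then map through Φ(r, θ) = (r cos θ, r sin θ).
    r = math.sqrt(random.random())
    theta = random.uniform(0.0, 2.0 * math.pi)
    return (r * math.cos(theta), r * math.sin(theta))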
In the previous section, we applied a change of coordinates in one dimension. That is, by taking $$\Phi = F_X^{-1}$$, we transformed the uniform distribution on $$[0,1]$$ to have our desired PDF.
\[\begin{align*} f_{\text{Inv}}(x) &= f_U(F_X(x)) \cdot \left|D\left(F_X^{-1}\right)^{-1}(x)\right|\\ &= 1 \cdot |DF_X(x)|\\ &= f_X(x) \end{align*}\]
In practice, many useful distributions can be efficiently sampled via the proper change of coordinates. However, doing so requires a parameterization of the domain, which is sometimes infeasible to construct. In such cases, we may turn to methods like Markov Chain Monte Carlo, to be discussed in a future chapter.