Continuous latent variable models and variational autoencoders


Part 1 of this series introduced latent variable models with discrete latent variables, the Gaussian mixture model (GMM), and the EM algorithm for fitting this model to data. Part 2 covers latent variable models with continuous latent variables for modeling more complex data, such as natural images, together with a Bayesian inference technique that can be combined with stochastic optimization algorithms.

Consider a natural image of size 100 \times 100 with a single channel. This image is a point in 10,000-dimensional space. Natural images are usually not uniformly distributed in this space but reside on a much lower-dimensional manifold within this high-dimensional space. The lower dimensionality of the manifold is related to the limited degrees of freedom in these images, e.g. only a limited number of pixel value combinations are actually perceived as natural images.


Modeling natural images with latent variable models whose continuous latent variables represent locations on the manifold can be a useful approach that is also discussed here. As in part 1, a model with one latent variable \mathbf{t}_i per observation \mathbf{x}_i is used, but now the latent variables are continuous rather than discrete variables. Therefore, summations over latent variable states are now replaced by integrals and these are often intractable for more complex models.


Observations, i.e. images \mathbf{X} = \{ \mathbf{x}_1, \ldots, \mathbf{x}_N \}, are again described with a probabilistic model p(\mathbf{x} \lvert \boldsymbol{\theta}). The goal is to maximize the data likelihood p(\mathbf{X} \lvert \boldsymbol{\theta}) w.r.t. \boldsymbol{\theta} and to obtain approximate posterior distributions over continuous latent variables. The joint distribution over an observed variable \mathbf{x} and a latent variable \mathbf{t} is defined as the product of the conditional distribution over \mathbf{x} given \mathbf{t} and the prior distribution over \mathbf{t}.


p(x,tθ)=p(xt,θ)p(tθ)(1)p(\mathbf{x}, \mathbf{t} \lvert \boldsymbol{\theta}) = p(\mathbf{x} \lvert \mathbf{t}, \boldsymbol{\theta}) p(\mathbf{t} \lvert \boldsymbol{\theta}) \tag{1}

We obtain the marginal distribution over \mathbf{x} by integrating over \mathbf{t}.

p(xθ)=p(xt,θ)p(tθ)dt(2)p(\mathbf{x} \lvert \boldsymbol{\theta}) = \int p(\mathbf{x} \lvert \mathbf{t}, \boldsymbol{\theta}) p(\mathbf{t} \lvert \boldsymbol{\theta}) d\mathbf{t} \tag{2}
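
Before turning to why this is problematic, it may help to see Eq. (2) as code. The following is a minimal NumPy sketch that estimates the integral by plain Monte Carlo sampling from the prior, assuming a toy linear-Gaussian model (all names, sizes and the toy model itself are illustrative assumptions, not part of the original article):

import numpy as np

rng = np.random.default_rng(0)

# Toy linear-Gaussian model: prior p(t) = N(0, I),
# likelihood p(x | t, theta) = N(x | W t, I), with theta = W.
D_t, D_x, L = 2, 5, 10000
W = rng.normal(size=(D_x, D_t))
x = rng.normal(size=D_x)

# Naive Monte Carlo estimate of Eq. (2): draw t from the prior
# and average the likelihood p(x | t, theta).
t_samples = rng.normal(size=(L, D_t))
means = t_samples @ W.T
log_lik = -0.5 * (np.sum((x - means) ** 2, axis=1) + D_x * np.log(2 * np.pi))
p_x_estimate = np.mean(np.exp(log_lik))
print(p_x_estimate)

For such a low-dimensional toy model this works, but for high-dimensional data like images and a neural-network likelihood the variance of this estimator explodes, which is one way to see the intractability discussed next.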

This integral is usually intractable for even moderately complex conditional probabilities p(\mathbf{x} \lvert \mathbf{t}, \boldsymbol{\theta}), and consequently the true posterior is intractable as well.

p(tx,θ)=p(xt,θ)p(tθ)p(xθ)(3)p(\mathbf{t} \lvert \mathbf{x}, \boldsymbol{\theta}) = {p(\mathbf{x} \lvert \mathbf{t}, \boldsymbol{\theta}) p(\mathbf{t} \lvert \boldsymbol{\theta}) \over p(\mathbf{x} \lvert \boldsymbol{\theta})} \tag{3}

This means that the E-step of the EM algorithm becomes intractable. Recall from part 1 that the lower bound of the log marginal likelihood is given by

L(θ,q)=logp(Xθ)KL(q(TX)p(TX,θ))(4)\mathcal{L}(\boldsymbol{\theta}, q) = \log p(\mathbf{X} \lvert \boldsymbol{\theta}) - \mathrm{KL}(q(\mathbf{T} \lvert \mathbf{X}) \mid\mid p(\mathbf{T} \lvert \mathbf{X}, \boldsymbol{\theta})) \tag{4}

In the E-step, the lower bound is maximized w.r.t. q while \boldsymbol{\theta} is held fixed. If the true posterior is tractable, we can set q to the true posterior so that the KL divergence becomes 0, which maximizes the lower bound for the current value of \boldsymbol{\theta}. If the true posterior is intractable, approximations must be used.

Here, we will use stochastic variational inference, a Bayesian inference method that also scales to large datasets[1]. Numerous other approximate inference approaches exist but these are not discussed here to keep the article focused.

Stochastic variational inference

The field of mathematics that covers the optimization of a functional w.r.t. a function, like \mathrm{argmax}_q \mathcal{L}(\boldsymbol{\theta}, q) in our example, is the calculus of variations, hence the name variational inference. In this context, q is called a variational distribution and \mathcal{L}(\boldsymbol{\theta}, q) a variational lower bound.


We will approximate the true posterior with a parametric variational distribution q(\mathbf{t} \lvert \mathbf{x}, \boldsymbol{\phi}) and try to find a value of \boldsymbol{\phi} that minimizes the KL divergence between this distribution and the true posterior. Using q(\mathbf{t} \lvert \mathbf{x}, \boldsymbol{\phi}) we can formulate the variational lower bound for a single observation \mathbf{x}_i as


L(θ,q;xi)=logp(xiθ)KL(q(tixi,ϕ)p(tixi,θ))=logp(xiθ)q(tixi,ϕ)logq(tixi,ϕ)p(tixi,θ)dti=q(tixi,ϕ)logp(xiθ)p(tixi,θ)q(tixi,ϕ)dti=q(tixi,ϕ)logp(xiti,θ)p(tiθ)q(tixi,ϕ)dti=q(tixi,ϕ)logp(xiti,θ)dtiKL(q(tixi,ϕ)p(tiθ))=Eq(tixi,ϕ)logp(xiti,θ)KL(q(tixi,ϕ)p(tiθ))(5)\begin{align*} \mathcal{L}(\boldsymbol{\theta}, q; \mathbf{x}_i) &= \log p(\mathbf{x}_i \lvert \boldsymbol{\theta}) - \mathrm{KL}(q(\mathbf{t}_i \lvert \mathbf{x}_i, \boldsymbol{\phi}) \mid\mid p(\mathbf{t}_i \lvert \mathbf{x}_i, \boldsymbol{\theta})) \\ &= \log p(\mathbf{x}_i \lvert \boldsymbol{\theta}) - \int q(\mathbf{t}_i \lvert \mathbf{x}_i, \boldsymbol{\phi}) \log {q(\mathbf{t}_i \lvert \mathbf{x}_i, \boldsymbol{\phi}) \over p(\mathbf{t}_i \lvert \mathbf{x}_i, \boldsymbol{\theta})} d\mathbf{t}_i \\ &= \int q(\mathbf{t}_i \lvert \mathbf{x}_i, \boldsymbol{\phi}) \log {p(\mathbf{x}_i \lvert \boldsymbol{\theta}) p(\mathbf{t}_i \lvert \mathbf{x}_i, \boldsymbol{\theta}) \over q(\mathbf{t}_i \lvert \mathbf{x}_i, \boldsymbol{\phi})} d\mathbf{t}_i \\ &= \int q(\mathbf{t}_i \lvert \mathbf{x}_i, \boldsymbol{\phi}) \log {p(\mathbf{x}_i \lvert \mathbf{t}_i, \boldsymbol{\theta}) p(\mathbf{t}_i \lvert \boldsymbol{\theta}) \over q(\mathbf{t}_i \lvert \mathbf{x}_i, \boldsymbol{\phi})} d\mathbf{t}_i \\ &= \int q(\mathbf{t}_i \lvert \mathbf{x}_i, \boldsymbol{\phi}) \log p(\mathbf{x}_i \lvert \mathbf{t}_i, \boldsymbol{\theta}) d\mathbf{t}_i - \mathrm{KL}(q(\mathbf{t}_i \lvert \mathbf{x}_i, \boldsymbol{\phi}) \mid\mid p(\mathbf{t}_i \lvert \boldsymbol{\theta})) \\ &= \mathbb{E}_{q(\mathbf{t}_i \lvert \mathbf{x}_i, \boldsymbol{\phi})} \log p(\mathbf{x}_i \lvert \mathbf{t}_i, \boldsymbol{\theta}) - \mathrm{KL}(q(\mathbf{t}_i \lvert \mathbf{x}_i, \boldsymbol{\phi}) \mid\mid p(\mathbf{t}_i \lvert \boldsymbol{\theta})) \end{align*} \tag{5}

We assume that the integral \int q(\mathbf{t}_i \lvert \mathbf{x}_i, \boldsymbol{\phi}) \log p(\mathbf{x}_i \lvert \mathbf{t}_i, \boldsymbol{\theta}) d\mathbf{t}_i is intractable, but we can choose a functional form of q(\mathbf{t}_i \lvert \mathbf{x}_i, \boldsymbol{\phi}) from which we can easily sample, so that the expectation of \log p(\mathbf{x}_i \lvert \mathbf{t}_i, \boldsymbol{\theta}) w.r.t. q(\mathbf{t}_i \lvert \mathbf{x}_i, \boldsymbol{\phi}) can be approximated with L samples from q.


L(θ,q;xi)1Ll=1Llogp(xiti,l,θ)KL(q(tixi,ϕ)p(tiθ))(6)\mathcal{L}(\boldsymbol{\theta}, q; \mathbf{x}_i) \approx {1 \over L} \sum_{l=1}^L \log p(\mathbf{x}_i \lvert \mathbf{t}_{i,l}, \boldsymbol{\theta}) - \mathrm{KL}(q(\mathbf{t}_i \lvert \mathbf{x}_i, \boldsymbol{\phi}) \mid\mid p(\mathbf{t}_i \lvert \boldsymbol{\theta})) \tag{6}

where \mathbf{t}_{i,l} \sim q(\mathbf{t}_i \lvert \mathbf{x}_i, \boldsymbol{\phi}). We will also choose the functional form of q(\mathbf{t}_i \lvert \mathbf{x}_i, \boldsymbol{\phi}) and p(\mathbf{t}_i \lvert \boldsymbol{\theta}) such that integration of the KL divergence can be done analytically; hence, no samples are needed to evaluate the KL divergence. With these choices, an approximate evaluation of the variational lower bound is possible. But in order to optimize the lower bound w.r.t. \boldsymbol{\theta} and \boldsymbol{\phi} we need to approximate the gradients w.r.t. these parameters.

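As a concrete illustration of Eq. (6), here is a small NumPy sketch for a toy linear-Gaussian decoder with a diagonal Gaussian q and a standard normal prior; the closed-form KL expression used below is the standard result for this choice of q and prior and reappears in the implementation section. All names, parameter values and the toy model are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(0)

# Toy decoder: log p(x | t, theta) = log N(x | W t, I), with theta = W.
D_t, D_x, L = 2, 5, 100
W = rng.normal(size=(D_x, D_t))
x = rng.normal(size=D_x)

# Parameters of a diagonal Gaussian q(t | x, phi); arbitrary values here.
mu = 0.1 * np.ones(D_t)
log_var = -0.2 * np.ones(D_t)

def log_p_x_given_t(x, t):
    d = x - W @ t
    return -0.5 * (d @ d + D_x * np.log(2 * np.pi))

# First term of Eq. (6): Monte Carlo average over L samples from q
t_samples = mu + np.exp(0.5 * log_var) * rng.normal(size=(L, D_t))
expected_ll = np.mean([log_p_x_given_t(x, t) for t in t_samples])

# Second term of Eq. (6): analytic KL(q || N(0, I)) for a diagonal Gaussian q
kl = -0.5 * np.sum(1 + log_var - mu ** 2 - np.exp(log_var))

lower_bound = expected_ll - kl
print(lower_bound)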

Stochastic gradients


We first assume that the analytical expression of the KL divergence, the second term on the RHS of Eq. (5), is differentiable w.r.t. \boldsymbol{\phi} and \boldsymbol{\theta} so that deterministic gradients can be computed. The gradient of the first term on the RHS of Eq. (5) w.r.t. \boldsymbol{\theta} is


θEq(tixi,ϕ)logp(xiti,θ)=Eq(tixi,ϕ)θlogp(xiti,θ)(7)\nabla_{\boldsymbol{\theta}} \mathbb{E}_{q(\mathbf{t}_i \lvert \mathbf{x}_i, \boldsymbol{\phi})} \log p(\mathbf{x}_i \lvert \mathbf{t}_i, \boldsymbol{\theta}) = \mathbb{E}_{q(\mathbf{t}_i \lvert \mathbf{x}_i, \boldsymbol{\phi})} \nabla_{\boldsymbol{\theta}} \log p(\mathbf{x}_i \lvert \mathbf{t}_i, \boldsymbol{\theta}) \tag{7}

Here, \nabla_{\boldsymbol{\theta}} can be moved inside the expectation as q(\mathbf{t}_i \lvert \mathbf{x}_i, \boldsymbol{\phi}) doesn't depend on \boldsymbol{\theta}. Assuming that p(\mathbf{x}_i \lvert \mathbf{t}_i, \boldsymbol{\theta}) is differentiable w.r.t. \boldsymbol{\theta}, unbiased estimates of the gradient can be obtained by sampling from q(\mathbf{t}_i \lvert \mathbf{x}_i, \boldsymbol{\phi}).


θEq(tixi,ϕ)logp(xiti,θ)1Ll=1Lθlogp(xiti,l,θ)(8)\nabla_{\boldsymbol{\theta}} \mathbb{E}_{q(\mathbf{t}_i \lvert \mathbf{x}_i, \boldsymbol{\phi})} \log p(\mathbf{x}_i \lvert \mathbf{t}_i, \boldsymbol{\theta}) \approx {1 \over L} \sum_{l=1}^L \nabla_{\boldsymbol{\theta}} \log p(\mathbf{x}_i \lvert \mathbf{t}_{i,l}, \boldsymbol{\theta}) \tag{8}

We will later implement p(\mathbf{x}_i \lvert \mathbf{t}_i, \boldsymbol{\theta}) as a neural network and use TensorFlow to compute \nabla_{\boldsymbol{\theta}} \log p(\mathbf{x}_i \lvert \mathbf{t}_{i,l}, \boldsymbol{\theta}). The gradient w.r.t. \boldsymbol{\phi} is a bit more tricky as \nabla_{\boldsymbol{\phi}} cannot be moved inside the expectation because q(\mathbf{t}_i \lvert \mathbf{x}_i, \boldsymbol{\phi}) depends on \boldsymbol{\phi}. But if we can decompose q(\mathbf{t}_i \lvert \mathbf{x}_i, \boldsymbol{\phi}) into an auxiliary distribution p(\boldsymbol\epsilon) that doesn't depend on \boldsymbol{\phi} and a deterministic, differentiable function g(\boldsymbol\epsilon, \mathbf{x}, \boldsymbol{\phi}) where \mathbf{t}_i = g(\boldsymbol\epsilon, \mathbf{x}_i, \boldsymbol{\phi}) and \boldsymbol\epsilon \sim p(\boldsymbol\epsilon), then we can re-formulate the gradient w.r.t. \boldsymbol{\phi} as


ϕEq(tixi,ϕ)logp(xiti,θ)=ϕEp(ϵ)logp(xig(ϵ,xi,ϕ),θ)=Ep(ϵ)ϕlogp(xig(ϵ,xi,ϕ),θ)\begin{align*} \nabla_{\boldsymbol{\phi}} \mathbb{E}_{q(\mathbf{t}_i \lvert \mathbf{x}_i, \boldsymbol{\phi})} \log p(\mathbf{x}_i \lvert \mathbf{t}_i, \boldsymbol{\theta}) &= \nabla_{\boldsymbol{\phi}} \mathbb{E}_{p(\boldsymbol\epsilon)} \log p(\mathbf{x}_i \lvert g(\boldsymbol\epsilon, \mathbf{x}_i, \boldsymbol{\phi}), \boldsymbol{\theta}) \\ &= \mathbb{E}_{p(\boldsymbol\epsilon)} \nabla_{\boldsymbol{\phi}} \log p(\mathbf{x}_i \lvert g(\boldsymbol\epsilon, \mathbf{x}_i, \boldsymbol{\phi}), \boldsymbol{\theta}) \tag{9} \end{align*}

Unbiased estimates of the gradient w.r.t. \boldsymbol{\phi} can then be obtained by sampling from p(\boldsymbol\epsilon).


ϕEq(tixi,ϕ)logp(xiti,θ)1Ll=1Lϕlogp(xiti,l,θ)(10)\nabla_{\boldsymbol{\phi}} \mathbb{E}_{q(\mathbf{t}_i \lvert \mathbf{x}_i, \boldsymbol{\phi})} \log p(\mathbf{x}_i \lvert \mathbf{t}_i, \boldsymbol{\theta}) \approx {1 \over L} \sum_{l=1}^L \nabla_{\boldsymbol{\phi}} \log p(\mathbf{x}_i \lvert \mathbf{t}_{i,l}, \boldsymbol{\theta}) \tag{10}

where \mathbf{t}_{i,l} = g(\boldsymbol\epsilon_l, \mathbf{x}_i, \boldsymbol{\phi}) and \boldsymbol\epsilon_l \sim p(\boldsymbol\epsilon). This so-called reparameterization trick can be applied to a wide range of probability distributions, including Gaussian distributions. Furthermore, stochastic gradients w.r.t. \boldsymbol{\phi} obtained with this trick have much smaller variance than those obtained with alternative approaches (not shown here).

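The following TensorFlow sketch shows the trick in isolation for a one-dimensional toy model: the sample is written as \mathbf{t} = \boldsymbol\mu + \boldsymbol\sigma \odot \boldsymbol\epsilon with \boldsymbol\epsilon \sim \mathcal{N}(0, 1), so that gradients of the Monte Carlo objective flow into both the decoder parameter (playing the role of \boldsymbol{\theta}) and the variational parameters (playing the role of \boldsymbol{\phi}), as in Eqs. (8) and (10). The toy Gaussian decoder and all names are illustrative assumptions:

import tensorflow as tf

tf.random.set_seed(0)

x = tf.constant([1.0])        # single 1-D observation
theta = tf.Variable([0.5])    # decoder parameter (role of theta)
mu = tf.Variable([0.0])       # variational mean (role of phi)
log_var = tf.Variable([0.0])  # variational log variance (role of phi)

L = 64  # number of samples from q

with tf.GradientTape() as tape:
    # Reparameterization: t = g(eps, x, phi) = mu + sigma * eps, eps ~ N(0, 1)
    eps = tf.random.normal(shape=(L, 1))
    t = mu + tf.exp(0.5 * log_var) * eps
    # Toy Gaussian decoder: log p(x | t, theta) = log N(x | theta * t, 1), up to a constant
    log_p = -0.5 * tf.square(x - theta * t)
    # Monte Carlo estimate of the expected log-likelihood term
    objective = tf.reduce_mean(log_p)

# Unbiased stochastic gradient estimates w.r.t. theta, mu and log_var
grads = tape.gradient(objective, [theta, mu, log_var])
print([g.numpy() for g in grads])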

Mini-batches


The above approximations for the variational lower bound and its gradients have been formulated for a single training example \mathbf{x}_i, but this can be easily extended to mini-batches \mathbf{X}^M = \{ \mathbf{x}_1, \ldots, \mathbf{x}_M \} with M random samples from a dataset \mathbf{X} of N i.i.d. observations. The lower bound of the full dataset \mathcal{L}(\boldsymbol{\theta}, q; \mathbf{X}) can then be approximated as


L(θ,q;X)NMi=1ML(θ,q;xi)=LM(θ,q;XM)\begin{align*} \mathcal{L}(\boldsymbol{\theta}, q; \mathbf{X}) &\approx {N \over M} \sum_{i=1}^M \mathcal{L}(\boldsymbol{\theta}, q; \mathbf{x}_i) \\ &= \mathcal{L}^M(\boldsymbol{\theta}, q; \mathbf{X}^M) \tag{11} \end{align*}

Gradients of \mathcal{L}^M(\boldsymbol{\theta}, q; \mathbf{X}^M) can be obtained as described above, together with averaging over the mini-batch, and used in combination with optimizers like Adam, for example, to update the parameters of the latent variable model. Sampling from the variational distribution q and usage of mini-batches lead to noisy gradients, hence the term stochastic variational inference.


If M is sufficiently large, for example M = 100, then L can even be set to 1, i.e. a single sample from the variational distribution per training example is sufficient to get a good gradient estimate on average.

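The relation between the mini-batch estimator in Eq. (11) and the per-example average that the implementation below actually optimizes is just a constant factor N; a tiny NumPy sketch with placeholder values (all numbers are hypothetical):

import numpy as np

N, M = 60000, 100  # dataset size and mini-batch size

# Hypothetical per-example lower bounds L(theta, q; x_i) for one mini-batch
rng = np.random.default_rng(0)
per_example_lb = rng.normal(loc=-150.0, scale=5.0, size=M)

# Eq. (11): estimate of the lower bound of the full dataset
lb_full_estimate = N / M * per_example_lb.sum()

# Normalizing by N gives the per-example average used later in the article
lb_normalized = lb_full_estimate / N  # == per_example_lb.mean()
print(lb_full_estimate, lb_normalized)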

Variational autoencoder


From the perspective of a generative model, q(\mathbf{t}_i \lvert \mathbf{x}_i, \boldsymbol{\phi}) is a probabilistic encoder because it generates a latent code \mathbf{t}_i for input image \mathbf{x}_i, and p(\mathbf{x}_i \lvert \mathbf{t}_i, \boldsymbol{\theta}) is a probabilistic decoder because it generates or reconstructs an image \mathbf{x}_i from latent code \mathbf{t}_i. Optimizing the variational lower bound w.r.t. parameters \boldsymbol{\theta} and \boldsymbol{\phi} can therefore be regarded as training a probabilistic autoencoder or variational autoencoder (VAE)[1].

In this context, the first term on the RHS of Eq. (5) can be interpreted as expected negative reconstruction error. The second term is a regularization term that encourages the variational distribution to be close to the prior over latent variables. If the regularization term is omitted, the variational distribution would collapse to a delta function and the variational autoencoder would degenerate to a “usual” deterministic autoencoder.

Implementation

For implementing a variational autoencoder, we make the following choices:

  • The variational distribution q(\mathbf{t}_i \lvert \mathbf{x}_i, \boldsymbol{\phi}) is a multivariate Gaussian \mathcal{N}(\mathbf{t}_i \lvert \boldsymbol\mu(\mathbf{x}_i, \boldsymbol{\phi}), \boldsymbol\sigma^2(\mathbf{x}_i, \boldsymbol{\phi})) with a diagonal covariance matrix where mean vector \boldsymbol\mu and the covariance diagonal \boldsymbol\sigma^2 are functions of \mathbf{x}_i and \boldsymbol{\phi}. These functions are implemented as a neural network and learned during optimization of the lower bound w.r.t. \boldsymbol{\phi}. After reparameterization, samples from q(\mathbf{t}_i \lvert \mathbf{x}_i, \boldsymbol{\phi}) are obtained via the deterministic function g(\boldsymbol\epsilon, \mathbf{x}_i, \boldsymbol{\phi}) = \boldsymbol\mu(\mathbf{x}_i, \boldsymbol{\phi}) + \boldsymbol\sigma(\mathbf{x}_i, \boldsymbol{\phi}) \odot \boldsymbol\epsilon and an auxiliary distribution p(\boldsymbol\epsilon) = \mathcal{N}(\boldsymbol\epsilon \lvert \mathbf{0}, \mathbf{I}).

  • The conditional distribution p(\mathbf{x}_i \lvert \mathbf{t}_i, \boldsymbol{\theta}) is a multivariate Bernoulli distribution \text{Ber}(\mathbf{x}_i \lvert \mathbf{k}(\mathbf{t}_i, \boldsymbol{\theta})) where parameter \mathbf{k} is a function of \mathbf{t}_i and \boldsymbol{\theta}. This distribution models the binary training data, i.e. monochrome (= binarized) MNIST images in our example. Function \mathbf{k} computes for each pixel its expected value. It is also implemented as a neural network and learned during optimization of the lower bound w.r.t. \boldsymbol{\theta}. Taking the (negative) logarithm of \text{Ber}(\mathbf{x}_i \lvert \mathbf{k}(\mathbf{t}_i, \boldsymbol{\theta})) gives a sum over pixel-wise binary cross entropies, as shown in Eq. (12).

  • Prior p(\mathbf{t}_i \lvert \boldsymbol{\theta}) is a multivariate Gaussian distribution \mathcal{N}(\mathbf{t}_i \lvert \mathbf{0}, \mathbf{I}) with zero mean and unit covariance matrix. With the chosen functional forms of the prior and the variational distribution q, \mathrm{KL}(q(\mathbf{t}_i \lvert \mathbf{x}_i, \boldsymbol{\phi}) \mid\mid p(\mathbf{t}_i \lvert \boldsymbol{\theta})) can be integrated analytically to -{1 \over 2} \sum_{d=1}^D (1 + \log \sigma_{i,d}^2 - \mu_{i,d}^2 - \sigma_{i,d}^2) where D is the dimensionality of the latent space and \mu_{i,d} and \sigma_{i,d} are the d-th elements of \boldsymbol\mu(\mathbf{x}_i, \boldsymbol{\phi}) and \boldsymbol\sigma(\mathbf{x}_i, \boldsymbol{\phi}), respectively (a short derivation sketch is given after this list).
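
The closed-form KL term quoted in the last bullet follows from standard Gaussian algebra. A per-dimension derivation sketch, with q(t) = \mathcal{N}(t \lvert \mu, \sigma^2) and prior p(t) = \mathcal{N}(t \lvert 0, 1), using \mathbb{E}_q[(t - \mu)^2] = \sigma^2 and \mathbb{E}_q[t^2] = \mu^2 + \sigma^2:

\begin{align*} \mathrm{KL}(q \mid\mid p) &= \mathbb{E}_q \left[ \log q(t) - \log p(t) \right] \\ &= \mathbb{E}_q \left[ -{1 \over 2} \log (2 \pi \sigma^2) - {(t - \mu)^2 \over 2 \sigma^2} + {1 \over 2} \log (2 \pi) + {t^2 \over 2} \right] \\ &= -{1 \over 2} \log \sigma^2 - {1 \over 2} + {1 \over 2} \left( \mu^2 + \sigma^2 \right) \\ &= -{1 \over 2} \left( 1 + \log \sigma^2 - \mu^2 - \sigma^2 \right) \end{align*}

Summing this expression over the D latent dimensions gives the term used above and implemented later as the negative regularization term.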

Using these choices and setting L = 1, the variational lower bound for a single image \mathbf{x}_i can be approximated as

L(θ,q;xi)c(xi,clogki,c+(1xi,c)log(1ki,c))+12d(1+logσi,d2μi,d2σi,d2)\begin{align*} \mathcal{L}(\boldsymbol{\theta}, q; \mathbf{x}_i) \approx &- \sum_c \left(x_{i,c} \log k_{i,c} + (1 - x_{i,c}) \log (1 - k_{i,c})\right) \\ &+ {1 \over 2} \sum_d (1 + \log \sigma_{i,d}^2 - \mu_{i,d}^2 - \sigma_{i,d}^2) \tag{12} \end{align*}

where x_{i,c} is the value of pixel c in image \mathbf{x}_i and k_{i,c} its expected value. The negative value of the lower bound is used as loss during training. The following figure outlines the architecture of the variational autoencoder.

(Figure: architecture of the variational autoencoder)

The definitions of the encoder and decoder neural networks were taken from [2]. Here, the encoder computes the logarithm of the variance, instead of the variance directly, for reasons of numerical stability.

from tensorflow.keras import layers
from tensorflow.keras.models import Model


def create_encoder(latent_dim):
    """
    Creates a convolutional encoder for MNIST images.

    Args:
        latent_dim: dimensionality of latent space.
    """
    encoder_input = layers.Input(shape=(28, 28, 1))

    x = layers.Conv2D(32, 3, padding='same', activation='relu')(encoder_input)
    x = layers.Conv2D(64, 3, padding='same', activation='relu', strides=(2, 2))(x)
    x = layers.Conv2D(64, 3, padding='same', activation='relu')(x)
    x = layers.Conv2D(64, 3, padding='same', activation='relu')(x)
    x = layers.Flatten()(x)
    x = layers.Dense(32, activation='relu')(x)

    q_mean = layers.Dense(latent_dim)(x)
    q_log_var = layers.Dense(latent_dim)(x)

    return Model(encoder_input, [q_mean, q_log_var], name='encoder')


def create_decoder(latent_dim):
    """
    Creates a convolutional decoder for MNIST images.

    Args:
        latent_dim: dimensionality of latent space.
    """
    decoder_input = layers.Input(shape=(latent_dim,))

    x = layers.Dense(12544, activation='relu')(decoder_input)
    x = layers.Reshape((14, 14, 64))(x)
    x = layers.Conv2DTranspose(32, 3, padding='same', activation='relu', strides=(2, 2))(x)
    k = layers.Conv2D(1, 3, padding='same', activation='sigmoid')(x)

    return Model(decoder_input, k, name='decoder')
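
As a quick check of these definitions (a usage sketch, not part of the original walkthrough), the encoder should map a batch of 28×28×1 images to two latent_dim-sized outputs and the decoder back to 28×28×1 probabilities:

enc = create_encoder(latent_dim=2)
dec = create_decoder(latent_dim=2)

enc.summary()  # two outputs of shape (None, 2): q_mean and q_log_var
dec.summary()  # output of shape (None, 28, 28, 1): Bernoulli parameters k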

These definitions are used to implement a VariationalAutoencoder model class.

import tensorflow as tf

class VariationalAutoencoder(Model):
    def __init__(self, latent_dim=2):
        """
        Creates a variational autoencoder Keras model.

        Args:
            latent_dim: dimensionality of latent space.
        """
        super().__init__()
        self.latent_dim = latent_dim
        self.encoder = create_encoder(latent_dim)
        self.decoder = create_decoder(latent_dim)

    def encode(self, x):
        """
        Computes variational distribution q statistics from
        input image x.

        Args:
            x: input image, shape (M, 28, 28, 1).

        Returns:
            Mean, shape (M, latent_dim), and log variance,
            shape (M, latent_dim), of multivariate Gaussian
            distribution q.
        """
        q_mean, q_log_var = self.encoder(x)
        return q_mean, q_log_var

    def sample(self, q_mean, q_log_var):
        """
        Samples latent code from variational distribution q.

        Args:
            q_mean: mean of q, shape (M, latent_dim).
            q_log_var: log variance of q, shape (M, latent_dim).

        Returns:
            Latent code sample, shape (M, latent_dim).
        """
        eps = tf.random.normal(shape=q_mean.shape)
        return q_mean + tf.exp(q_log_var * .5) * eps

    def decode(self, t):
        """
        Computes expected pixel values (= probabilities k) from
        latent code t.

        Args:
            t: latent code, shape (M, latent_dim).

        Returns:
            Probabilities k of multivariate Bernoulli
            distribution p, shape (M, 28, 28, 1).
        """
        k = self.decoder(t)
        return k

    def call(self, x):
        """
        Computes expected pixel values (= probabilities k) of a
        reconstruction of input image x.

        Args:
            x: input image, shape (M, 28, 28, 1).

        Returns:
            Probabilities k of multivariate Bernoulli
            distribution p, shape (M, 28, 28, 1).
        """
        q_mean, q_log_var = self.encode(x)
        t = self.sample(q_mean, q_log_var)
        return self.decode(t)
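
A quick smoke test of the class on a random batch (a usage sketch with arbitrary input and hypothetical names; shapes only):

_vae = VariationalAutoencoder(latent_dim=2)
_x = tf.random.uniform((4, 28, 28, 1))
print(_vae(_x).shape)  # (4, 28, 28, 1): expected pixel values of the reconstructions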

The variational_lower_bound function is implemented using Eq. (12) and Eq. (11), but instead of estimating the lower bound for the full dataset it is normalized by the dataset size N.

from tensorflow.keras.losses import binary_crossentropy

def variational_lower_bound(model, x):
    """
    Computes normalized variational lower bound.

    Args:
        x: input images, shape (M, 28, 28, 1)

    Returns:
        Variational lower bound averaged over M
        samples in batch and normalized by dataset
        size N.
    """
    q_mean, q_log_var = model.encode(x)
    t = model.sample(q_mean, q_log_var)
    x_rc = model.decode(t)

    # Expected negative reconstruction error
    neg_rc_error = -tf.reduce_sum(binary_crossentropy(x, x_rc), axis=[1, 2])

    # Regularization term (negative KL divergence)
    neg_kl_div = 0.5 * tf.reduce_sum(1 + q_log_var
                                     - tf.square(q_mean)
                                     - tf.exp(q_log_var), axis=-1)

    # Average over mini-batch (of size M)
    return tf.reduce_mean(neg_rc_error + neg_kl_div)

The training procedure uses the negative value of the variational lower bound as loss to compute stochastic gradient estimates. These are used by the optimizer to update model parameters θ\boldsymbol\theta and ϕ\boldsymbol\phi. The normalized variational lower bound of the test set is computed at the end of each epoch and printed.

@tf.function
def train_step(model, optimizer, x):
    """Trains VAE on mini-batch x using optimizer.
    """
    with tf.GradientTape() as tape:
        # Compute neg. variational lower bound as loss
        loss = -variational_lower_bound(model, x)

    # Compute gradients from neg. variational lower bound
    gradients = tape.gradient(loss, model.trainable_variables)
    # Apply gradients to model parameters theta and phi
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))
    return loss

def train(model, optimizer, ds_train, ds_test, epochs):
    """Trains VAE on training dataset ds_train using
    optimizer for given number of epochs.
    """
    for epoch in range(1, epochs + 1):
        for x in ds_train:
            train_step(model, optimizer, x)

        vlb_mean = tf.keras.metrics.Mean()
        for x in ds_test:
            vlb_mean(variational_lower_bound(model, x))
        vlb = vlb_mean.result()
        print(f'Epoch: {epoch:02d}, Test set VLB: {vlb:.2f}')

Since the data are modelled with a multivariate Bernoulli distribution, the MNIST images are first binarized to monochrome images so that their pixel values are either 0 or 1. The training batch size is set to 100 to get reliable stochastic gradient estimates.

from tensorflow.keras.datasets import mnist

(x_train, _), (x_test, y_test) = mnist.load_data()

x_train = (x_train > 127.5).astype('float32') # binarize
x_train = x_train.reshape(-1, 28, 28, 1)

x_test = (x_test > 127.5).astype('float32') # binarize
x_test = x_test.reshape(-1, 28, 28, 1)

batch_size = 100

ds_train = tf.data.Dataset.from_tensor_slices(x_train).shuffle(x_train.shape[0]).batch(batch_size)
ds_test = tf.data.Dataset.from_tensor_slices(x_test).shuffle(x_test.shape[0]).batch(batch_size)

We choose a two-dimensional latent space so that it can be easily visualized. Training the variational autoencoder with RMSprop as optimizer at a learning rate of 1e-3 for 20 epochs already gives reasonable results. This takes a few minutes on a single GPU.

vae = VariationalAutoencoder(latent_dim=2)
opt = tf.keras.optimizers.RMSprop(learning_rate=1e-3)
train(model=vae,
      optimizer=opt,
      ds_train=ds_train,
      ds_test=ds_test,
      epochs=20)
Epoch: 01, Test set VLB: -166.56
Epoch: 02, Test set VLB: -158.25
Epoch: 03, Test set VLB: -154.44
Epoch: 04, Test set VLB: -152.20
Epoch: 05, Test set VLB: -150.47
Epoch: 06, Test set VLB: -148.30
Epoch: 07, Test set VLB: -148.63
Epoch: 08, Test set VLB: -146.66
Epoch: 09, Test set VLB: -145.61
Epoch: 10, Test set VLB: -147.64
Epoch: 11, Test set VLB: -148.42
Epoch: 12, Test set VLB: -143.86
Epoch: 13, Test set VLB: -143.31
Epoch: 14, Test set VLB: -145.67
Epoch: 15, Test set VLB: -143.78
Epoch: 16, Test set VLB: -143.29
Epoch: 17, Test set VLB: -142.25
Epoch: 18, Test set VLB: -142.99
Epoch: 19, Test set VLB: -143.39
Epoch: 20, Test set VLB: -143.31
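
Before looking at the latent space, here is a quick usage sketch (not part of the original walkthrough) that reconstructs a few binarized test digits with the trained model; the top row shows inputs, the bottom row the expected pixel values of their reconstructions:

import matplotlib.pyplot as plt

n = 8
x_sample = x_test[:n]
x_rec = vae(x_sample).numpy()  # expected pixel values k of the reconstructions

fig, axes = plt.subplots(2, n, figsize=(1.5 * n, 3))
for i in range(n):
    axes[0, i].imshow(x_sample[i, ..., 0], cmap='Greys_r')
    axes[1, i].imshow(x_rec[i, ..., 0], cmap='Greys_r')
    axes[0, i].axis('off')
    axes[1, i].axis('off')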

The following figure shows the locations of test set images in latent space. Here, the mean vectors of the variational distributions are plotted. The latent space is organized by structural similarity of digits i.e. structurally similar digits have a smaller distance in latent space than structurally dissimilar digits. For example, digits 4 and 9 usually differ only by a horizontal bar or curve at the top of the image and are therefore in proximity.

import matplotlib.pyplot as plt

%matplotlib inline

# Compute mean vectors of variational distributions (= latent code locations)
q_test_mean, _ = vae.encode(x_test)

# Use a discrete colormap
cmap = plt.get_cmap('viridis', 10)

# Plot latent code locations colored by the digit value on input images
im = plt.scatter(q_test_mean[:, 0], q_test_mean[:, 1], c=y_test, cmap=cmap,
                 vmin=-0.5, vmax=9.5, marker='x', s=0.2)

plt.colorbar(im, ticks=range(10));

(Figure: latent space locations of test set images, colored by digit class)

When we sample locations in latent space (with density proportional to the prior density over latent variables) and decode these locations, we can get a nice overview of how MNIST digits are organized by structural similarity in latent space. Each digit is plotted with its expected pixel values k instead of using a sample from the corresponding multivariate Bernoulli distribution.

import numpy as np

from scipy.stats import norm

# Number of samples per latent space dimension
samples_per_dim = 20

# Size of plotted digits
digit_size = 28

# Sampling grid coordinates. Grid point density is
# proportional to density of latent variable prior.
grid_x = norm.ppf(np.linspace(0.05, 0.95, samples_per_dim))
grid_y = norm.ppf(np.linspace(0.05, 0.95, samples_per_dim))

figure = np.zeros((digit_size * samples_per_dim,
                   digit_size * samples_per_dim))

for i, x in enumerate(grid_x):
    for j, y in enumerate(grid_y):
        t_ij = np.array([[x, y]])
        x_ij = vae.decode(t_ij)
        digit = x_ij.numpy().reshape(digit_size, digit_size)
        figure[j * digit_size: (j + 1) * digit_size,
               i * digit_size: (i + 1) * digit_size] = digit

plt.figure(figsize=(10, 10))
plt.imshow(figure, cmap='Greys_r');

(Figure: grid of decoded digits over the latent space)

References

[1] Diederik P. Kingma, Max Welling. Auto-Encoding Variational Bayes.
[2] François Chollet. Deep Learning with Python.