【Abstract】 Stochastic variational inference is a scalable algorithm for approximating posterior distributions. We develop this technique for a general class of probabilistic models and demonstrate its usefulness with two probabilistic topic models: latent Dirichlet allocation and the hierarchical Dirichlet process topic model. Using stochastic variational inference, we analyze several large collections of documents: 300,000 articles from Nature, 1.8 million articles from The New York Times, and 3.8 million articles from Wikipedia. The results show that stochastic variational inference easily handles data sets of this size and outperforms traditional variational inference, which can only handle smaller data sets. We also show that the Bayesian nonparametric topic model outperforms its parametric counterpart.

【Source】 Hoffman, M., Blei, D. M., Wang, C., & Paisley, J. (2013). Stochastic Variational Inference. arXiv: http://arxiv.org/abs/1206.7051

1 Problem Statement

Modern data analysis requires computation with massive amounts of data. Consider the following cases:

(1) We have an archive of the raw text of two million books, scanned and stored online. We want to discover the topics in the texts, organize the books by topic, and ultimately provide readers with a browser for exploring the collection.

(2) We have data from an online shopping site, containing the purchase histories of millions of users along with descriptions of every item in the catalog. We want to use this information to recommend items to users.

(3) We continuously collect data from an online photo stream and want to build a classifier from these data.

(4) We have measured the gene sequences of millions of people. We want to form hypotheses about the connections between the observed genes and other traits.

These problems illustrate some of the challenges facing modern data analysis. Our data are complex and high-dimensional; we can form assumptions, based on science, intuition, or other data analyses, about latent structure that exists in the data but cannot be observed directly; and the data sets are simply too large, possibly even arriving endlessly as a stream.

Research in statistical machine learning has addressed some of these challenges by developing probabilistic models, a field that provides an elegant approach to developing new methods of data analysis (Pearl, 1988; Jordan, 1999; Bishop, 2006; Koller and Friedman, 2009; Murphy, 2012). In particular, probabilistic graphical models give us a visual language for expressing assumptions about data and the latent structure within them, and the corresponding posterior inference algorithms let us analyze data under those assumptions, inferring the hidden structure that best explains our observations.

In descriptive tasks, like problems #1 and #4 above, graphical models help us explore the data — the organization of books or the connections between genes and traits — with the hidden structure probabilistically “filled in.” In predictive tasks, like problems #2 and #3, we use models to form predictions about new observations. For example, we can make recommendations to users or predict the class labels of new images. With graphical models, we enjoy a powerful suite of probability models to connect and combine; and we have general-purpose computational strategies for connecting models to data and estimating the quantities needed to use them.

The problem we face is scale. Inference algorithms of the 1990s and 2000s used to be considered scalable, but they cannot easily handle the amount of data that we described in the four examples above. This is the problem we address here. We present an approach to computing with graphical models that is appropriate for massive data sets, data that might not fit in memory or even be stored locally. Our method does not require clusters of computers or specialized hardware, though it can be further sped up with these amenities.

As an example of this approach to data analysis, consider topic models. Topic models are probabilistic models of text used to uncover the hidden thematic structure in a collection of documents (Blei, 2012). The main idea in a topic model is that there are a set of topics that describe the collection and each document exhibits those topics with different degrees. As a probabilistic model, the topics and how they relate to the documents are hidden structure and the main computational problem is to infer this hidden structure from an observed collection. Figure 1 illustrates the results of our algorithm on a probabilistic topic model. These are two sets of topics, weighted distributions over the vocabulary, found in 1.8M articles from the New York Times and 300,000 articles from Nature. Topic models are motivated by applications that require analyzing massive collections of documents like this, but traditional algorithms for topic model inference do not easily scale to collections of this size.
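To make the idea concrete, here is a minimal sketch (in Python with NumPy) of the generative process behind a simple topic model such as latent Dirichlet allocation; the dimensions and hyperparameter values are illustrative choices, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
V, K, D, N = 1000, 20, 100, 50      # vocabulary size, topics, documents, words per document

# Each topic is a distribution over the vocabulary.
topics = rng.dirichlet(np.full(V, 0.01), size=K)          # shape (K, V)

documents = []
for d in range(D):
    # Each document exhibits the topics with different proportions.
    proportions = rng.dirichlet(np.full(K, 0.1))           # shape (K,)
    words = []
    for n in range(N):
        z = rng.choice(K, p=proportions)                   # choose a topic for this word
        words.append(rng.choice(V, p=topics[z]))           # choose a word from that topic
    documents.append(words)
```

Inference runs in the opposite direction: given only `documents`, we want the posterior over `topics` and the per-document `proportions`, which is exactly the hidden structure described above.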

Our algorithm builds on variational inference, a method that transforms complex inference problems into high-dimensional optimization problems (Jordan et al., 1999; Wainwright and Jordan, 2008). Traditionally, the optimization is solved with a coordinate ascent algorithm, iterating between re-analyzing every data point in the data set and re-estimating its hidden structure. This is inefficient for large data sets, however, because it requires a full pass through the data at each iteration.
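For contrast with the stochastic algorithm developed below, here is a schematic of that traditional batch pattern in Python; `fit_local` and `estimate_global` are hypothetical placeholder functions standing in for the model-specific coordinate updates, not an API from the paper.

```python
def batch_coordinate_ascent(data, init_global, fit_local, estimate_global, num_iters=100):
    """Traditional variational inference: every iteration passes through the whole data set."""
    lam = init_global
    for _ in range(num_iters):
        phi = fit_local(data, lam)                    # re-analyze every data point
        lam = estimate_global(data, phi, len(data))   # re-estimate the global parameters
    return lam
```

The cost of each iteration grows linearly with the size of the data set, which is what makes this pattern impractical for the collections described above.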

In this paper we derive a more efficient algorithm by using stochastic optimization (Robbins and Monro, 1951), a technique that follows noisy estimates of the gradient of the objective. When used in variational inference, we show that this gives an algorithm which iterates between subsampling the data and adjusting the hidden structure based only on the subsample. This is much more efficient than traditional variational inference. We call our method stochastic variational inference.

We will derive stochastic variational inference for a large class of graphical models. We will study its performance on two kinds of probabilistic topic models. In particular, we demonstrate stochastic variational inference on latent Dirichlet allocation (Blei et al., 2003), a simple topic model, and the hierarchical Dirichlet process topic model (Teh et al., 2006a), a more flexible model where the number of discovered topics grows with the data. (This latter application demonstrates how to use stochastic variational inference in a variety of Bayesian nonparametric settings.) Stochastic variational inference can efficiently analyze massive data sets with complex probabilistic models.

Technical summary. We now turn to the technical context of our method. In probabilistic modeling, we use hidden variables to encode hidden structure in observed data; we articulate the relationship between the hidden and observed variables with a factorized probability distribution (i.e., a graphical model); and we use inference algorithms to estimate the posterior distribution, the conditional distribution of the hidden structure given the observations.
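In symbols (generic notation chosen here for illustration, writing $x$ for the observations and $z$ for the hidden variables): the model specifies a factorized joint distribution, and inference targets the conditional

$$
p(z, x) = p(z)\,p(x \mid z), \qquad p(z \mid x) = \frac{p(z, x)}{\int p(z, x)\,dz}.
$$

The normalizing integral in the denominator is the quantity that is typically intractable for the models of interest.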

Consider a graphical model of hidden and observed random variables for which we want to compute the posterior. For many models of interest, this posterior is not tractable to compute and we must appeal to approximate methods. The two most prominent strategies in statistics and machine learning are Markov chain Monte Carlo (MCMC) sampling and variational inference. In MCMC sampling, we construct a Markov chain over the hidden variables whose stationary distribution is the posterior of interest (Metropolis et al., 1953; Hastings, 1970; Geman and Geman, 1984; Gelfand and Smith, 1990; Robert and Casella, 2004). We run the chain until it has (hopefully) reached equilibrium and collect samples to approximate the posterior. In variational inference, we define a flexible family of distributions over the hidden variables, indexed by free parameters (Jordan et al., 1999; Wainwright and Jordan, 2008). We then find the setting of the parameters (i.e., the member of the family) that is closest to the posterior. Thus we solve the inference problem by solving an optimization problem.
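In the standard formulation (again in generic notation), with a variational family $q(z; \lambda)$ indexed by free parameters $\lambda$, variational inference seeks

$$
\lambda^{*} = \arg\min_{\lambda}\, \mathrm{KL}\!\left(q(z; \lambda)\,\|\,p(z \mid x)\right),
$$

which is equivalent to maximizing the evidence lower bound (ELBO)

$$
\mathcal{L}(\lambda) = \mathbb{E}_{q}\!\left[\log p(x, z)\right] - \mathbb{E}_{q}\!\left[\log q(z; \lambda)\right] \le \log p(x).
$$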

Neither MCMC nor variational inference scales easily to the kinds of settings described in the first paragraph. Researchers have proposed speed-ups of both approaches, but these usually are tailored to specific models or compromise the correctness of the algorithm (or both). Here, we develop a general variational method that scales.

As we mentioned above, the main idea in this work is to use stochastic optimization (Robbins and Monro, 1951; Spall, 2003). In stochastic optimization, we find the maximum of an objective function by following noisy (but unbiased) estimates of its gradient. Under the right conditions, stochastic optimization algorithms provably converge to an optimum of the objective. Stochastic optimization is particularly attractive when the objective (and therefore its gradient) is a sum of many terms that can be computed independently. In that setting, we can cheaply compute noisy gradients by subsampling only a few of these terms.
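A minimal sketch of this pattern in Python, assuming an objective that is a sum of per-term functions whose individual gradients we can compute; the function name, the single-term subsampling, and the step-size schedule are illustrative choices rather than the paper's algorithm.

```python
import numpy as np

def stochastic_gradient_ascent(grad_term, num_terms, dim,
                               num_iters=10_000, tau=1.0, kappa=0.7, seed=0):
    """Maximize f(lam) = sum_n f_n(lam) by following noisy but unbiased gradient estimates.

    grad_term(n, lam) should return the gradient of the n-th term f_n at lam.
    """
    rng = np.random.default_rng(seed)
    lam = np.zeros(dim)
    for t in range(1, num_iters + 1):
        n = rng.integers(num_terms)                   # subsample a single term
        noisy_grad = num_terms * grad_term(n, lam)    # scaling by num_terms keeps the estimate unbiased
        rho = (t + tau) ** (-kappa)                   # decreasing step size
        lam = lam + rho * noisy_grad
    return lam
```

The step sizes here satisfy the usual Robbins-Monro conditions ($\sum_t \rho_t = \infty$, $\sum_t \rho_t^2 < \infty$ for $\kappa \in (0.5, 1]$), which is the kind of condition under which the convergence guarantees mentioned above hold.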

Variational inference is amenable to stochastic optimization because the variational objective decomposes into a sum of terms, one for each data point in the analysis. We can cheaply obtain noisy estimates of the gradient by subsampling the data and computing a scaled gradient on the subsample. If we sample independently then the expectation of this noisy gradient is equal to the true gradient. With one more detail — the idea of a natural gradient (Amari, 1998) — stochastic variational inference has an attractive form (a code sketch follows the list below):

  1. Subsample one or more data points from the data.

  2. Analyze the subsample using the current variational parameters.

  3. Implement a closed-form update of the variational parameters.

  4. Repeat.
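A schematic of those four steps in Python, reusing the hypothetical `fit_local` and `estimate_global` placeholders from the batch sketch above; the convex-combination update shown here is only a sketch of the general pattern, with the exact closed-form updates derived later in the paper.

```python
import numpy as np

def stochastic_vi(data, init_global, fit_local, estimate_global,
                  num_iters=1000, batch_size=1, tau=1.0, kappa=0.7, seed=0):
    """Skeleton of the stochastic variational inference loop."""
    rng = np.random.default_rng(seed)
    N = len(data)
    lam = init_global
    for t in range(1, num_iters + 1):
        # 1. Subsample one or more data points.
        idx = rng.choice(N, size=batch_size, replace=False)
        batch = [data[i] for i in idx]
        # 2. Analyze the subsample with the current variational parameters.
        phi = fit_local(batch, lam)
        # 3. Update the variational parameters: step toward the estimate implied by the
        #    subsample, as if the batch (scaled up to N points) were the whole data set.
        lam_hat = estimate_global(batch, phi, N)
        rho = (t + tau) ** (-kappa)                   # decreasing step size
        lam = (1.0 - rho) * lam + rho * lam_hat
        # 4. Repeat.
    return lam
```

Only the subsampled `batch` is touched in each iteration, which is why the per-iteration cost no longer depends on the size of the full data set.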

While traditional algorithms require repeatedly analyzing the whole data set before updating the variational parameters, this algorithm only requires that we analyze randomly sampled subsets. We will show how to use this algorithm for a large class of graphical models.

2 Related Work

Variational inference for probabilistic models was pioneered in the mid-1990s. In Michael Jordan’s lab, the seminal papers of Saul et al. (1996), Saul and Jordan (1996), and Jaakkola (1997) grew out of reading the statistical physics literature (Peterson and Anderson, 1987; Parisi, 1988). In parallel, the mean-field methods explained in Neal and Hinton (1999) (originally published in 1993) and Hinton and Van Camp (1993) led to variational algorithms for mixtures of experts (Waterhouse et al., 1996).

In subsequent years, researchers began to understand the potential for variational inference in more general settings and developed generic algorithms for conjugate exponential-family models (Attias, 1999, 2000; Wiegerinck, 2000; Ghahramani and Beal, 2001; Xing et al., 2003). These innovations led to automated variational inference, allowing a practitioner to write down a model and immediately use variational inference to estimate its posterior (Bishop et al., 2003). For good reviews of variational inference see Jordan et al. (1999) and Wainwright and Jordan (2008).

In this paper, we develop scalable methods for generic Bayesian inference by solving the variational inference problem with stochastic optimization (Robbins and Monro, 1951). Our algorithm builds on the earlier approach of Sato (2001), whose algorithm only applies to the limited set of models that can be fit with the EM algorithm (Dempster et al., 1977). Specifically, we generalize his approach to the much wider set of probabilistic models that are amenable to closed-form coordinate ascent inference. Further, in the sense that EM itself is a mean-field method (Neal and Hinton, 1999), our algorithm builds on the stochastic optimization approach to EM (Cappé and Moulines, 2009). Finally, we note that stochastic optimization was also used with variational inference in Platt et al. (2008) for fast approximate inference in a specific model of web service activity.

For approximate inference, the main alternative to variational methods is Markov chain Monte Carlo (MCMC) (Robert and Casella, 2004). Despite its popularity in Bayesian inference, relatively little work has focused on developing MCMC algorithms that can scale to very large data sets. One exception is sequential Monte Carlo, although these typically lack strong convergence guarantees (Doucet et al., 2001). Another is the stochastic gradient Langevin method of Welling and Teh (2011), which enjoys asymptotic convergence guarantees and also takes advantage of stochastic optimization. Finally, in topic modeling, researchers have developed several approaches to parallel MCMC (Newman et al., 2009; Smola and Narayanamurthy, 2010; Ahmed et al., 2012).

The organization of this paper. In Section 2, we review variational inference for graphical models and then derive stochastic variational inference.

In Section 3, we review probabilistic topic models and Bayesian nonparametric models and then derive the stochastic variational inference algorithms in these settings.

In Section 4, we study stochastic variational inference on several large text data sets.

3 Stochastic Variational Inference

Appendix A. Markov Random Fields

Appendix B. Bayesian Networks

Appendix C. Natural Gradients