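comment:: A method that uses neural networks to infer uncertainty in classification tasks. Its main feature is a theoretical extension of the Bayesian formulation that makes the previously implicit distributional uncertainty explicit: marginalizing over the model parameters first yields an estimate of distributional uncertainty, and marginalizing over the distributional uncertainty then yields the predictive distribution.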

1 Overview

1.1 Background

  • Bayesian Neural Networks have been computationally more demanding and conceptually more complicated than non-Bayesian networks.

  • Monte-Carlo Dropout estimates predictive uncertainty from an ensemble of multiple stochastic forward passes, computing the mean and spread of the ensemble.

  • Deep Ensembles yield uncertainty estimates competitive with MC Dropout.

  • Another class of approaches explicitly trains a model in a multi-task fashion to minimize its Kullback-Leibler (KL) divergence to both a sharp in-domain predictive posterior and a flat out-of-domain predictive posterior. The out-of-domain inputs are sampled during training either from a synthetic noise distribution or from a different dataset. These models are explicitly trained to detect out-of-distribution inputs and have the advantage of being more computationally efficient.

1.2 Open Problems

  • The problem: these approaches conflate different aspects of predictive uncertainty, namely model uncertainty, data uncertainty and distributional uncertainty.

    • Model uncertainty is reducible as the size of the training data increases.
    • Data uncertainty is irreducible uncertainty that arises from the natural complexity of the data; it can be considered a 'known-unknown'.
    • Distributional uncertainty is an 'unknown-unknown': the model is unfamiliar with the test data and thus cannot confidently make predictions.
  • The approaches discussed above

    • either conflate distributional uncertainty with data uncertainty,
    • or leave distributional uncertainty implicit inside model uncertainty.

  • However, the ability to separate the three types of predictive uncertainty is very important, because the model can take different actions depending on the source of uncertainty.

1.3 Key Idea

This paper addresses

  • the explicit prediction of each of the three types of predictive uncertainty (extending the work of [[@Malinin2017-IncorporatingUncertaintyIntoDeepLearningForSpokenLanguageAssessment(PriorNetwork)]]), while taking inspiration from Bayesian approaches.

  • scope:

    • This work focuses on classification tasks and presents a discussion of uncertainty metrics

1.4 Contributions

  • Proposes a new framework for modeling predictive uncertainty called Prior Networks (PNs), in which distributional uncertainty is treated as distinct from both data uncertainty and model uncertainty.
  • Evaluates the framework on identifying out-of-distribution (OOD) samples and detecting misclassification, obtaining state-of-the-art results.

2 Current Approaches to Uncertainty Estimation

2.1 The Bayesian Approach

Consider a distribution $\mathrm{p}(\boldsymbol{x}, y)$ over input features $\boldsymbol{x}$ and labels $y$. For image classification, $\boldsymbol{x}$ corresponds to images and $y$ to object labels. In a Bayesian framework:

  • The predictive uncertainty of a classification model $\mathrm{P}\left(\omega_{c} \mid \boldsymbol{x}^{*}, \mathcal{D}\right)$ trained on a finite dataset $\mathcal{D}=\left\{\boldsymbol{x}_{j}, y_{j}\right\}_{j=1}^{N} \sim \mathrm{p}(\boldsymbol{x}, y)$ results from data (aleatoric) uncertainty and model (epistemic) uncertainty.
  • Data uncertainty is described by the posterior distribution over class labels given a set of model parameters $\boldsymbol{\theta}$ (the predictive distribution). This is the likelihood term in eq. 1.
  • Model uncertainty is described by the posterior distribution over the parameters given the data. This is the posterior term in eq. 1.

$$\mathrm{P}\left(\omega_{c} \mid \boldsymbol{x}^{*}, \mathcal{D}\right)=\int \underbrace{\mathrm{P}\left(\omega_{c} \mid \boldsymbol{x}^{*}, \boldsymbol{\theta}\right)}_{\text {Data }} \underbrace{\mathrm{p}(\boldsymbol{\theta} \mid \mathcal{D})}_{\text {Model }} d \boldsymbol{\theta} \tag{1}$$

Uncertainty in the model parameters induces a distribution over the distributions $\mathrm{P}\left(\omega_{c} \mid \boldsymbol{x}^{*}, \boldsymbol{\theta}\right)$. Thus $\mathrm{P}\left(\omega_{c} \mid \boldsymbol{x}^{*}, \boldsymbol{\theta}\right)$ is not deterministic but random.

The expected distribution $\mathrm{P}\left(\omega_{c} \mid \boldsymbol{x}^{*}, \mathcal{D}\right)$ is obtained by marginalizing out the parameters $\boldsymbol{\theta}$.

The expected distribution here means the categorical distribution of the output $\omega_{c}^{*}$. It can be understood as the point estimate of the predictive distribution.

2.2 Problems with the Bayesian Approach

Problem 1: the posterior is hard to compute

The true posterior $\mathrm{p}(\boldsymbol{\theta} \mid \mathcal{D})$ is intractable, so a variational distribution $\mathrm{q}(\boldsymbol{\theta})$ is needed to approximate it:

$$\mathrm{p}(\boldsymbol{\theta} \mid \mathcal{D}) \approx \mathrm{q}(\boldsymbol{\theta}) \tag{2}$$

Here the true posterior is the posterior distribution of the parameters $\boldsymbol{\theta}$.

Problem 2: even with an approximate posterior, the final marginalization integral is hard to compute

Furthermore, the integral in eq. 1 is also intractable for neural networks and is typically approximated via sampling (eq. 3), using approaches such as Monte-Carlo dropout, Langevin Dynamics or explicit ensembling.

Thus,

$$\mathrm{P}\left(\omega_{c} \mid \boldsymbol{x}^{*}, \mathcal{D}\right) \approx \frac{1}{M} \sum_{i=1}^{M} \mathrm{P}\left(\omega_{c} \mid \boldsymbol{x}^{*}, \boldsymbol{\theta}^{(i)}\right), \quad \boldsymbol{\theta}^{(i)} \sim \mathrm{q}(\boldsymbol{\theta}) \tag{3}$$
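The sampling approximation in eq. 3 can be sketched in a few lines. This is only an illustration under stated assumptions: `mc_predictive` and `toy_stochastic_model` are hypothetical names, and the toy model is a stand-in for one stochastic forward pass (for example one dropout mask), not a real network.

```python
import random


def mc_predictive(sample_predict, num_samples):
    """Average the categorical outputs of num_samples stochastic passes (eq. 3)."""
    draws = [sample_predict() for _ in range(num_samples)]
    num_classes = len(draws[0])
    return [sum(d[c] for d in draws) / num_samples for c in range(num_classes)]


def toy_stochastic_model(rng):
    """Hypothetical stand-in for P(w_c | x*, theta_i) with theta_i ~ q(theta)."""
    def predict():
        # Jitter the first class probability to mimic sampling new parameters.
        p = min(max(rng.gauss(0.8, 0.05), 0.0), 1.0)
        return [p, 1.0 - p]
    return predict


rng = random.Random(0)
probs = mc_predictive(toy_stochastic_model(rng), num_samples=200)
print(probs)  # expected predictive distribution, close to [0.8, 0.2]
```

Each element of `draws` is one point on the simplex; only their mean survives in the returned value, which is why the spread has to be measured separately.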

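2.3 Interpretation on the Simplex

  • Each term $\mathrm{P}\left(\omega_{c} \mid \boldsymbol{x}^{*}, \boldsymbol{\theta}^{(i)}\right)$ in the sum of eq. 3 is a categorical distribution derived from the sample $\boldsymbol{\theta}^{(i)}$, and that distribution corresponds to one point on the simplex.

  • Taken together, the categorical distributions from all samples, $\left\{\mathrm{P}\left(\omega_{c} \mid \boldsymbol{x}^{*}, \boldsymbol{\theta}^{(i)}\right)\right\}_{i=1}^{M}$ (see Figure 1a), form a distribution on the simplex (Figure 1b). Read the other way round, the categorical distribution produced by each sample is itself a draw from some distribution on the simplex, and that distribution represents the uncertainty introduced by the model parameters.

Figure 1: Distributions on the simplex

  • In the conventional Bayesian approach, choosing a prior and an inference algorithm eventually yields the posterior predictive distribution.

    • When the test point matches the training data, its distribution on the simplex should be consistent with the training data;
    • When the test point is far from the training data, the lack of information about out-of-distribution data produces a more dispersed geometry;
    • In other words, when the test point is close to the training data the output uncertainty appears as a sharp distribution on the simplex, and when it is far from the training data the output uncertainty appears as a flat distribution.
  • The entropy of the distribution on the simplex can be taken as an indicator of predictive uncertainty, but entropy alone can hardly tell data uncertainty from distributional uncertainty. It is therefore necessary to measure the spread of the samples, for example with mutual information, to assess whether the uncertainty comes from the model.

This explains one property in terms of the simplex: the location and shape of the distribution on the simplex can be used to tell out-of-distribution samples from in-distribution samples.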

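2.4 Summary of the Bayesian Approach

  • Strength: analysing the distribution on the simplex makes it possible to tell in-distribution samples from out-of-distribution samples.
  • Problems:
    • The simplex analysis above is implicit: it can be grasped intuitively but not written down, so an explicit modelling method is needed.
    • The posterior, and hence the distribution on the simplex, is available only after a model prior and an approximate inference method have been chosen.

2.5 Problems with Non-Bayesian Approaches

  • Strength: for out-of-distribution samples, a trained deep neural network produces a high-entropy posterior predictive distribution.
  • Problem: high entropy does not necessarily mean an out-of-distribution sample. In-distribution samples can also have high entropy, so the two are unlikely to be separated robustly.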

3 Prior Networks

3.1 Basic Idea

The idea is to borrow from the Bayesian approach: model explicitly the distribution on the simplex that the Bayesian approach only induces implicitly, and build the loss function from the ideal shape for in-distribution samples (a sharp distribution) and the ideal shape for out-of-distribution samples (a flat, diffuse distribution). The explicit modelling tool is a Deep Density Network (DDN).

  • This work uses a DDN to explicitly parameterize a distribution over distributions on a simplex, $\mathrm{p}(\boldsymbol{\mu} \mid \boldsymbol{x}^{*}, \boldsymbol{\theta})$. This DDN is called a Prior Network.
  • The DDN is trained to behave like the implicit distribution in the Bayesian approach.

According to the analysis of distributions on the simplex, a Prior Network must have the following abilities:

  • For in-distribution inputs, it should yield a sharp distribution:
    • When it is confident in its prediction, a Prior Network should yield a sharp distribution centered on one of the corners of the simplex (fig 2a).
    • When the input has a high degree of noise or class overlap (data uncertainty), a Prior Network should yield a sharp distribution focused on the center of the simplex, which corresponds to being confident in predicting a flat categorical distribution over class labels. The model confidently outputs a uniform categorical distribution, indicating that it cannot tell which class the input belongs to; this is the 'known-unknown' (fig 2b).
  • For out-of-distribution inputs, it should yield a flat distribution:
    • The distribution is flat and diffuse over the simplex, so the network is not confident in any particular categorical distribution and its mean sits at the center of the simplex. This indicates large uncertainty in the mapping $\boldsymbol{x} \mapsto y$, i.e. the 'unknown-unknown' (fig 2c).

3.2 Technical Approach

  • In the Bayesian framework, distributional uncertainty is considered a part of model uncertainty.
  • In this work it is treated as a separate source of uncertainty: Prior Networks are explicitly constructed to capture data uncertainty and distributional uncertainty.
  • In a Prior Network, data uncertainty is described by the point-estimate categorical distribution $\boldsymbol{\mu}$ (its position on the simplex), and distributional uncertainty is described by the distribution over predictive categoricals $\mathrm{p}(\boldsymbol{\mu} \mid \boldsymbol{x}^{*}, \boldsymbol{\theta})$ (how diffuse or flat that distribution is).
  • The parameters $\boldsymbol{\theta}$ of the Prior Network must encapsulate knowledge both about the in-domain distribution and about the decision boundary that separates the in-domain region from everything else.

The relationship between the uncertainties is made explicit:

  • model uncertainty affects the estimates of distributional uncertainty,
  • which in turn affect the estimates of data uncertainty.

$$\mathrm{P}\left(\omega_{c} \mid \boldsymbol{x}^{*}, \mathcal{D}\right)=\iint \underbrace{\mathrm{p}\left(\omega_{c} \mid \boldsymbol{\mu}\right)}_{\text {Data }} \underbrace{\mathrm{p}\left(\boldsymbol{\mu} \mid \boldsymbol{x}^{*}, \boldsymbol{\theta}\right)}_{\text {Distributional }} \underbrace{\mathrm{p}(\boldsymbol{\theta} \mid \mathcal{D})}_{\text {Model }} d \boldsymbol{\mu} \, d \boldsymbol{\theta} \tag{4}$$

In other words:

A large degree of model uncertainty will yield a large variation in the distributional uncertainty $\mathrm{p}(\boldsymbol{\mu} \mid \boldsymbol{x}^{*}, \boldsymbol{\theta})$, and large uncertainty in $\mathrm{p}(\boldsymbol{\mu} \mid \boldsymbol{x}^{*}, \boldsymbol{\theta})$ will lead to large uncertainty in the estimates of data uncertainty.

There are now three layers of uncertainty:

  • the posterior over classes (the predictive categorical distribution, indicating the resulting data uncertainty)
  • the per-data prior distribution (the distribution over the simplex, indicating the properties of the input)
  • the global posterior distribution over model parameters (indicating model uncertainty)

Similar constructions have appeared before in non-neural Bayesian models, for example Latent Dirichlet Allocation. There, however, the extra layer of uncertainty is added mainly to increase the flexibility of the model, and predictions are obtained by marginalizing or sampling.

In this work, the additional level of uncertainty is added in order to extract additional measures of uncertainty, depending on how the model is marginalized.

For example, one can marginalize out $\boldsymbol{\mu}$ and obtain:

$$\int\left[\int \mathrm{p}\left(\omega_{c} \mid \boldsymbol{\mu}\right) \mathrm{p}\left(\boldsymbol{\mu} \mid \boldsymbol{x}^{*}, \boldsymbol{\theta}\right) d \boldsymbol{\mu}\right] \mathrm{p}(\boldsymbol{\theta} \mid \mathcal{D}) d \boldsymbol{\theta}=\int \mathrm{p}\left(\omega_{c} \mid \boldsymbol{x}^{*}, \boldsymbol{\theta}\right) \mathrm{p}(\boldsymbol{\theta} \mid \mathcal{D}) d \boldsymbol{\theta} \tag{5}$$

This marginalization loses the information carried by $\boldsymbol{\mu}$ and leads directly to the resulting predictive distribution, so we do not know how sharp or flat the distribution is around the point estimate. If it is flat, we do not know whether that is due to data uncertainty or distributional uncertainty. Methods that use this marginalization must therefore rely on an external measurement of the spread of the predictive distribution (such as an MC ensemble or mutual information) to establish the source of uncertainty, and even then distributional uncertainty remains hard to distinguish.

Prior Networks can be viewed as an 'extra tool in the uncertainty toolbox', explicitly crafted to capture the effects of distributional mismatch in a probabilistically interpretable way. Marginalizing out $\boldsymbol{\theta}$ instead yields expected estimates of data and distributional uncertainty given model uncertainty:

$$\int \mathrm{p}\left(\omega_{c} \mid \boldsymbol{\mu}\right)\left[\int \mathrm{p}\left(\boldsymbol{\mu} \mid \boldsymbol{x}^{*}, \boldsymbol{\theta}\right) \mathrm{p}(\boldsymbol{\theta} \mid \mathcal{D}) d \boldsymbol{\theta}\right] d \boldsymbol{\mu}=\int \mathrm{p}\left(\omega_{c} \mid \boldsymbol{\mu}\right) \mathrm{p}\left(\boldsymbol{\mu} \mid \boldsymbol{x}^{*}, \mathcal{D}\right) d \boldsymbol{\mu} \tag{6}$$

The model is redefined as $\mathrm{p}\left(\omega_{c} \mid \boldsymbol{\mu}\right)$, and the distribution over model parameters becomes $\mathrm{p}\left(\boldsymbol{\mu} \mid \boldsymbol{x}^{*}, \mathcal{D}\right)$, which is conditioned on both the training data $\mathcal{D}$ and the test input $\boldsymbol{x}^{*}$. This explicitly yields the distribution over the simplex that the Bayesian approach induces implicitly, and the properties of this distribution can be analysed to distinguish the source of uncertainty.

The marginalization $\int \mathrm{p}\left(\boldsymbol{\mu} \mid \boldsymbol{x}^{*}, \boldsymbol{\theta}\right) \mathrm{p}(\boldsymbol{\theta} \mid \mathcal{D}) d \boldsymbol{\theta}$ in eq. 6 is generally intractable, but it can be approximated via Bayesian MC methods. For simplicity, this work assumes that a point estimate of the parameters (eq. 7) is sufficient:

$$\mathrm{p}(\boldsymbol{\theta} \mid \mathcal{D})=\delta(\boldsymbol{\theta}-\hat{\boldsymbol{\theta}}) \Longrightarrow \mathrm{p}\left(\boldsymbol{\mu} \mid \boldsymbol{x}^{*} ; \mathcal{D}\right) \approx \mathrm{p}\left(\boldsymbol{\mu} \mid \boldsymbol{x}^{*} ; \hat{\boldsymbol{\theta}}\right) \tag{7}$$

The distributional uncertainty estimated in this work therefore uses point estimates of the parameters, for simplicity.

3.3 Dirichlet Prior Networks

A Prior Network for classification parametrizes a distribution over a simplex, such as a Dirichlet, a Mixture of Dirichlet distributions, or a Logistic-Normal distribution. In this work, the Dirichlet distribution is chosen.

The Dirichlet distribution is a prior distribution over categorical distributions. It is parameterized by its concentration parameters $\boldsymbol{\alpha}$, where $\alpha_{0}$ (the sum of all $\alpha_{c}$) is called the precision of the Dirichlet distribution. Higher values of $\alpha_{0}$ lead to sharper distributions.

$$\operatorname{Dir}(\boldsymbol{\mu} \mid \boldsymbol{\alpha})=\frac{\Gamma\left(\alpha_{0}\right)}{\prod_{c=1}^{K} \Gamma\left(\alpha_{c}\right)} \prod_{c=1}^{K} \mu_{c}^{\alpha_{c}-1}, \quad \alpha_{c}>0, \quad \alpha_{0}=\sum_{c=1}^{K} \alpha_{c} \tag{8}$$

A Prior Network that parametrizes a Dirichlet is a Dirichlet Prior Network (DPN). A DPN generates the concentration parameters $\boldsymbol{\alpha}$ of the Dirichlet distribution.

$$\mathrm{p}\left(\boldsymbol{\mu} \mid \boldsymbol{x}^{*} ; \hat{\boldsymbol{\theta}}\right)=\operatorname{Dir}(\boldsymbol{\mu} \mid \boldsymbol{\alpha}), \quad \boldsymbol{\alpha}=\boldsymbol{f}\left(\boldsymbol{x}^{*} ; \hat{\boldsymbol{\theta}}\right) \tag{9}$$

The predictive distribution over class labels is given by the mean of the Dirichlet:

$$\mathrm{P}\left(\omega_{c} \mid \boldsymbol{x}^{*} ; \hat{\boldsymbol{\theta}}\right)=\int \mathrm{p}\left(\omega_{c} \mid \boldsymbol{\mu}\right) \mathrm{p}\left(\boldsymbol{\mu} \mid \boldsymbol{x}^{*} ; \hat{\boldsymbol{\theta}}\right) d \boldsymbol{\mu}=\frac{\alpha_{c}}{\alpha_{0}} \tag{10}$$

If an exponential output function is used for the DPN (where $\alpha_{c}=e^{z_{c}}$), then the expected posterior probability of a label $\omega_{c}$ is given by the output of the softmax:

$$\mathrm{P}\left(\omega_{c} \mid \boldsymbol{x}^{*} ; \hat{\boldsymbol{\theta}}\right)=\frac{e^{z_{c}\left(\boldsymbol{x}^{*}\right)}}{\sum_{k=1}^{K} e^{z_{k}\left(\boldsymbol{x}^{*}\right)}} \tag{11}$$

Standard DDNs for classification with a softmax output function can be viewed as predicting the expected categorical distribution under a Dirichlet prior. The mean is insensitive to arbitrary scaling of the $\alpha_{c}$, so the precision $\alpha_{0}$ is degenerate under standard cross-entropy training. The cost function must therefore be changed to train a DPN to yield a sharp or flat prior distribution around the expected categorical.
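A small numerical check of eqs. 10 and 11 may help here. The sketch below assumes the exponential output function $\alpha_c = e^{z_c}$ from the text; the function names and the logit values are made up for illustration. It shows that adding a constant to every logit leaves the Dirichlet mean (the softmax) unchanged while changing the precision $\alpha_0$, which is the degeneracy described above.

```python
import math


def dirichlet_from_logits(logits):
    """Exponential output function: alpha_c = exp(z_c)."""
    return [math.exp(z) for z in logits]


def dirichlet_mean(alpha):
    """Mean of Dir(alpha), i.e. alpha_c / alpha_0 (eq. 10)."""
    alpha0 = sum(alpha)
    return [a / alpha0 for a in alpha]


def softmax(logits):
    """Numerically stable softmax (eq. 11)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 0.5, -1.0]               # illustrative values only
shifted = [z + 3.0 for z in logits]     # same softmax, larger precision

alpha = dirichlet_from_logits(logits)
alpha_shifted = dirichlet_from_logits(shifted)

print(dirichlet_mean(alpha), softmax(logits))   # identical up to rounding
print(sum(alpha), sum(alpha_shifted))           # precision differs by exp(3)
```

Cross-entropy only sees the first line of output, so it cannot constrain the second.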

3.4 Training Dirichlet Prior Networks

The DPN is explicitly trained in a multi-task fashion to minimize the KL divergence between the model and a sharp Dirichlet distribution for in-distribution data, and the KL divergence between the model and a flat Dirichlet distribution for out-of-distribution data.

$$\mathcal{L}(\boldsymbol{\theta})=\mathbb{E}_{\mathrm{p}_{\text {in }}(\boldsymbol{x})}\left[KL\left[\operatorname{Dir}(\boldsymbol{\mu} \mid \hat{\boldsymbol{\alpha}}) \,\|\, \mathrm{p}(\boldsymbol{\mu} \mid \boldsymbol{x} ; \boldsymbol{\theta})\right]\right]+\mathbb{E}_{\mathrm{p}_{\text {out }}(\boldsymbol{x})}\left[KL\left[\operatorname{Dir}(\boldsymbol{\mu} \mid \tilde{\boldsymbol{\alpha}}) \,\|\, \mathrm{p}(\boldsymbol{\mu} \mid \boldsymbol{x} ; \boldsymbol{\theta})\right]\right] \tag{12}$$
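Each term of eq. 12 is a KL divergence between two Dirichlet distributions, which has a standard closed form. The sketch below is an illustration rather than the authors' implementation: `digamma` is a small hand-written helper (recurrence plus asymptotic series) so that the example needs only the standard library, and the concentration values are made up.

```python
import math


def digamma(x):
    """psi(x) for x > 0: shift x up by recurrence, then use the asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    series = inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))
    return result + math.log(x) - 0.5 * inv - series


def kl_dirichlet(alpha, beta):
    """KL[Dir(alpha) || Dir(beta)] in closed form."""
    a0, b0 = sum(alpha), sum(beta)
    log_norm = (math.lgamma(a0) - sum(math.lgamma(a) for a in alpha)
                - math.lgamma(b0) + sum(math.lgamma(b) for b in beta))
    expect = sum((a - b) * (digamma(a) - digamma(a0))
                 for a, b in zip(alpha, beta))
    return log_norm + expect


# Out-of-distribution term of eq. 12: the target comes first, the model second.
flat_target = [1.0, 1.0, 1.0]
sharp_prediction = [20.0, 1.0, 1.0]      # hypothetical network output
near_flat_prediction = [1.2, 1.0, 1.0]   # hypothetical network output

print(kl_dirichlet(flat_target, sharp_prediction))      # large penalty
print(kl_dirichlet(flat_target, near_flat_prediction))  # small penalty
```

A network that stays sharp on an out-of-distribution input is penalized far more than one that already outputs a nearly flat Dirichlet.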

To train with this loss function, the target distributions ($\hat{\boldsymbol{\alpha}}$ for the sharp distribution and $\tilde{\boldsymbol{\alpha}}$ for the flat distribution) must be defined.

  • The flat target $\tilde{\boldsymbol{\alpha}}$ is specified by setting all $\tilde{\alpha}_{c}=1$.
  • The sharp target $\hat{\boldsymbol{\alpha}}$ is not set directly. It is reparameterized into the precision $\hat{\alpha}_{0}$ and the means $\hat{\mu}_{c}=\frac{\hat{\alpha}_{c}}{\hat{\alpha}_{0}}$, where $\hat{\alpha}_{0}$ is a hyper-parameter and the means are the 1-hot targets used for classification.

The difficulty: learning a sparse "1-hot" continuous distribution under the KL loss defined above is very challenging. There are two solutions:

  • smooth the target means, redistributing a small amount of probability density to the other corners of the Dirichlet
  • teacher-student training, which can be used to specify non-sparse target means $\hat{\boldsymbol{\mu}}$

This paper adopts the first solution. In addition, cross-entropy can be used as an auxiliary loss for in-distribution data.

$$\hat{\mu}_{c}= \begin{cases}1-(K-1) \epsilon & \text { if } \delta\left(y=\omega_{c}\right)=1 \\ \epsilon & \text { if } \delta\left(y=\omega_{c}\right)=0\end{cases} \tag{13}$$
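Putting eq. 13 together with the reparameterization $\hat{\alpha}_c = \hat{\mu}_c \hat{\alpha}_0$ gives the sharp target concentrations. The following is a minimal sketch: the function name is hypothetical, and the values chosen for the precision and for $\epsilon$ are illustrative assumptions, not the settings used in the paper.

```python
def smoothed_target_alpha(label, num_classes, precision, eps=0.01):
    """Return (mu_hat, alpha_hat) for one in-distribution example.

    mu_hat follows eq. 13; alpha_hat = mu_hat * precision.
    eps must be positive so that every alpha_hat_c > 0, and below
    1 / num_classes so that the true class keeps the largest mean.
    """
    if not 0.0 < eps < 1.0 / num_classes:
        raise ValueError("eps must lie in (0, 1 / num_classes)")
    mu_hat = [eps] * num_classes
    mu_hat[label] = 1.0 - (num_classes - 1) * eps
    alpha_hat = [m * precision for m in mu_hat]
    return mu_hat, alpha_hat


# Illustrative values: 4 classes, true class index 2, target precision 100.
mu_hat, alpha_hat = smoothed_target_alpha(label=2, num_classes=4,
                                          precision=100.0, eps=0.01)
print(mu_hat)
print(alpha_hat)
```

Without the smoothing, the off-target means would be exactly zero and the corresponding concentrations would violate the constraint $\alpha_c > 0$ in eq. 8.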

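The multi-task training objective (eq. 12) requires samples $\tilde{\boldsymbol{x}}$ from the out-of-domain distribution $\mathrm{p}_{\text{out}}(\boldsymbol{x})$. The true out-of-domain distribution is unknown, however, and samples from it are not available. Possible solutions:

  • use a generative model to synthesize points on the boundary of the in-domain region
  • use a different, real dataset as a source of out-of-domain samples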

4 Uncertainty measures

This section explores a range of measures for quantifying uncertainty:

4.1 Max Probability and Entropy (of the predictive distribution)

The first class of measures derives uncertainty from the expected predictive categorical $\mathrm{P}\left(\omega_{c} \mid \boldsymbol{x}^{*} ; \mathcal{D}\right)$, which can be approximated either with a point estimate of the parameters $\hat{\boldsymbol{\theta}}$ or with a Bayesian MC ensemble.

  • The first measure is the probability of the predicted class (the mode), or max probability, which is the confidence in the prediction.

$$\mathcal{P}=\max _{c} \mathrm{P}\left(\omega_{c} \mid \boldsymbol{x}^{*} ; \mathcal{D}\right)$$

  • The second measure is the entropy of the predictive distribution. It represents the uncertainty encapsulated in the entire distribution.

$$\mathcal{H}\left[\mathrm{P}\left(y \mid \boldsymbol{x}^{*} ; \mathcal{D}\right)\right]=-\sum_{c=1}^{K} \mathrm{P}\left(\omega_{c} \mid \boldsymbol{x}^{*} ; \mathcal{D}\right) \ln \left(\mathrm{P}\left(\omega_{c} \mid \boldsymbol{x}^{*} ; \mathcal{D}\right)\right)$$

Max probability and entropy can be seen as measures of the total uncertainty in predictions.
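Both measures are one-liners once the predictive categorical is available. A minimal sketch with made-up probability vectors (the function names are illustrative):

```python
import math


def max_probability(probs):
    """Confidence in the predicted class: max_c P(w_c | x*; D)."""
    return max(probs)


def entropy(probs):
    """Entropy in nats of a categorical distribution; 0 * ln(0) is taken as 0."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


confident = [0.97, 0.01, 0.01, 0.01]   # illustrative prediction
uniform = [0.25, 0.25, 0.25, 0.25]     # illustrative prediction

print(max_probability(confident), entropy(confident))
print(max_probability(uniform), entropy(uniform))   # entropy equals ln(4)
```

Neither number says why the prediction is uncertain: a uniform output looks the same whether it comes from class overlap or from an unfamiliar input.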

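4.2 Mutual Information (between $y$ and $\boldsymbol{\theta}$)

{ i.e. marginalizing out $\boldsymbol{\mu}$ in eq. 4 }

  • The mutual information between $y$ and the model parameters $\boldsymbol{\theta}$ measures the spread of the ensemble caused by model uncertainty. The effect of distributional uncertainty is therefore included only implicitly and cannot be distinguished.
  • This MI can be expressed as the difference between the total uncertainty (captured by the entropy of the expected distribution) and the expected data uncertainty (captured by the expected entropy of each member of the ensemble). In other words, it is the entropy of the average categorical distribution minus the average of the entropies of the categorical distributions.

Mutual information is a useful measure in information theory. It can be viewed as the amount of information that one random variable contains about another, or equivalently as the reduction in uncertainty about one random variable that comes from knowing the other.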
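The "entropy of the average minus the average of the entropies" description can be computed directly from the members of an ensemble. The sketch below uses made-up ensemble outputs; `ensemble_mutual_information` is an illustrative name.

```python
import math


def entropy(probs):
    """Entropy in nats of a categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def ensemble_mutual_information(member_probs):
    """Entropy of the mean prediction minus the mean entropy of the members."""
    num_members = len(member_probs)
    num_classes = len(member_probs[0])
    mean = [sum(p[c] for p in member_probs) / num_members
            for c in range(num_classes)]
    total = entropy(mean)                                        # total uncertainty
    expected_data = sum(entropy(p) for p in member_probs) / num_members
    return total - expected_data


# Members agree on a flat prediction: high data uncertainty, no spread.
agree = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]
# Members are individually confident but disagree: large spread.
disagree = [[0.99, 0.01], [0.01, 0.99]]

print(ensemble_mutual_information(agree))
print(ensemble_mutual_information(disagree))
```

Both ensembles have the same mean prediction and hence the same total uncertainty; only the mutual information separates them.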

4.3 Mutual Information (between $y$ and $\boldsymbol{\mu}$) and Differential Entropy (of the DPN)

{ i.e. marginalizing out $\boldsymbol{\theta}$ in eq. 4 }

  • The mutual information between $y$ and $\boldsymbol{\mu}$ measures the spread of the predictive distribution caused by distributional uncertainty.

For reference, the decomposition of the mutual information between $y$ and $\boldsymbol{\theta}$ described in 4.2 is:

$$\underbrace{\mathcal{I}\left[y, \boldsymbol{\theta} \mid \boldsymbol{x}^{*}, \mathcal{D}\right]}_{\text {Model Uncertainty }}=\underbrace{\mathcal{H}\left[\mathbb{E}_{\mathrm{p}(\boldsymbol{\theta} \mid \mathcal{D})}\left[\mathrm{P}\left(y \mid \boldsymbol{x}^{*}, \boldsymbol{\theta}\right)\right]\right]}_{\text {Total Uncertainty }}-\underbrace{\mathbb{E}_{\mathrm{p}(\boldsymbol{\theta} \mid \mathcal{D})}\left[\mathcal{H}\left[\mathrm{P}\left(y \mid \boldsymbol{x}^{*}, \boldsymbol{\theta}\right)\right]\right]}_{\text {Expected Data Uncertainty }}$$

The mutual information between $y$ and $\boldsymbol{\mu}$ has the same form, with $\boldsymbol{\mu}$ and $\mathrm{p}\left(\boldsymbol{\mu} \mid \boldsymbol{x}^{*} ; \mathcal{D}\right)$ taking the place of $\boldsymbol{\theta}$ and $\mathrm{p}(\boldsymbol{\theta} \mid \mathcal{D})$.

Another measure of uncertainty is the differential entropy of the DPN. This measure is maximized when all categorical distributions are equiprobable, which happens when the Dirichlet distribution is flat. Differential entropy is well suited to measuring distributional uncertainty, because it can be low even when the expected categorical distribution under the Dirichlet prior has high entropy, and it also captures data uncertainty.

$$\mathcal{H}\left[\mathrm{p}\left(\boldsymbol{\mu} \mid \boldsymbol{x}^{*} ; \mathcal{D}\right)\right]=-\int_{\mathcal{S}^{K-1}} \mathrm{p}\left(\boldsymbol{\mu} \mid \boldsymbol{x}^{*} ; \mathcal{D}\right) \ln \left(\mathrm{p}\left(\boldsymbol{\mu} \mid \boldsymbol{x}^{*} ; \mathcal{D}\right)\right) d \boldsymbol{\mu}$$
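For a Dirichlet both measures in this subsection have closed forms in the concentration parameters. The sketch below uses the standard expressions for the differential entropy of a Dirichlet and for the expected entropy of a categorical drawn from it; it is an illustration, not code from the paper. `digamma` is a hand-written helper so that only the standard library is needed, and the three example Dirichlets are made up.

```python
import math


def digamma(x):
    """psi(x) for x > 0: shift x up by recurrence, then use the asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    series = inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))
    return result + math.log(x) - 0.5 * inv - series


def dirichlet_differential_entropy(alpha):
    """H[Dir(alpha)] = ln B(alpha) + (a0 - K) psi(a0) - sum_c (a_c - 1) psi(a_c)."""
    alpha0 = sum(alpha)
    log_beta = sum(math.lgamma(a) for a in alpha) - math.lgamma(alpha0)
    return (log_beta + (alpha0 - len(alpha)) * digamma(alpha0)
            - sum((a - 1.0) * digamma(a) for a in alpha))


def dirichlet_mutual_information(alpha):
    """I[y, mu]: entropy of the mean categorical minus the expected entropy."""
    alpha0 = sum(alpha)
    return -sum((a / alpha0) * (math.log(a / alpha0)
                                - digamma(a + 1.0) + digamma(alpha0 + 1.0))
                for a in alpha)


flat = [1.0, 1.0, 1.0]            # out-of-distribution style output
sharp_center = [50.0, 50.0, 50.0]  # confident about a uniform categorical
sharp_corner = [100.0, 1.0, 1.0]   # confident about one class

for name, alpha in [("flat", flat), ("sharp_center", sharp_center),
                    ("sharp_corner", sharp_corner)]:
    print(name, dirichlet_differential_entropy(alpha),
          dirichlet_mutual_information(alpha))
```

`flat` and `sharp_center` share the same uniform mean, so the entropy of the predictive distribution cannot separate them, whereas both measures here are clearly lower for `sharp_center`.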

4.4 Full Uncertainty

The final class of measures uses the full eq. 4 and assesses the spread of $\mathrm{p}(\boldsymbol{\mu} \mid \boldsymbol{x}^{*} ; \boldsymbol{\theta})$ due to model uncertainty via the MI between $\boldsymbol{\mu}$ and $\boldsymbol{\theta}$. This measure can be computed via Bayesian ensemble approaches.

6 Conclusion

  • This work treats out-of-distribution (OOD) inputs as a separate source of uncertainty, called distributional uncertainty.

  • This work presents a novel framework, called Prior Networks (PNs), which allows data, distributional and model uncertainty to be treated separately within a consistent, probabilistically interpretable framework.

  • PNs are applied to classification.

  • Dirichlet Prior Networks (DPNs) are shown to yield more accurate estimates of distributional uncertainty than MC Dropout and standard DDNs.

  • DPNs also outperform other methods on the task of misclassification detection.

  • A range of uncertainty measures is presented and analyzed:

    • Measures of total uncertainty, such as max probability or the entropy of the predictive distribution, yield the best results on misclassification detection.
    • The differential entropy of the DPN was the best measure of uncertainty for OOD detection, especially when classes are less distinct.
  • Uncertainty measures can be calculated analytically at test time for DPNs, reducing computational cost relative to ensemble approaches.