Gal, Yarin, “Uncertainty in Deep Learning,” PhD thesis, University of Cambridge, 2016.

# Main contributions of the thesis

(p15) We will thus concentrate on the development of practical techniques to obtain model
confidence in deep learning, techniques which are also well rooted within the theoretical
foundations of probability theory and Bayesian modelling. Specifically, we will make use
of stochastic regularisation techniques (SRTs).

These techniques adapt the model output stochastically as a way of model regularisation (hence the name stochastic
regularisation). This results in the loss becoming a random quantity, which is optimised
using tools from the stochastic non-convex optimisation literature. Popular SRTs include
dropout [Hinton et al., 2012], multiplicative Gaussian noise [Srivastava et al., 2014],
dropConnect [Wan et al., 2013], and countless other recent techniques.
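
A minimal numpy sketch of how two of these SRTs perturb a layer's activations during training (the layer sizes and drop probability below are arbitrary placeholders, not values from the thesis):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(h, p_drop=0.5):
    """Bernoulli dropout: zero each unit with probability p_drop, rescale the rest."""
    mask = rng.binomial(1, 1.0 - p_drop, size=h.shape)
    return h * mask / (1.0 - p_drop)

def multiplicative_gaussian_noise(h, p_drop=0.5):
    """Multiplicative Gaussian noise: multiply each unit by N(1, p/(1-p)) noise."""
    std = np.sqrt(p_drop / (1.0 - p_drop))
    return h * rng.normal(1.0, std, size=h.shape)

h = rng.normal(size=(4, 8))   # a batch of hidden activations
print(dropout(h).shape, multiplicative_gaussian_noise(h).shape)
```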

# The author's discussion of NNs

## ¶CNN

Convolutional neural networks (CNNs). CNNs [LeCun et al., 1989; Rumelhart
et al., 1985] are popular deep learning tools for image processing, which can solve tasks
that until recently were considered to lie beyond our reach [Krizhevsky et al., 2012;
Szegedy et al., 2014]. The model is made of a recursive application of convolution and
pooling layers, followed by inner product layers at the end of the network (simple NNs
as described above). A convolution layer is a linear transformation that preserves spatial
information in the input image (depicted in figure 1.1). Pooling layers simply take the
output of a convolution layer and reduce its dimensionality (by taking the maximum of
each (2, 2) block of pixels for example). The convolution layer will be explained in more
detail in section §3.4.1.
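
A minimal sketch of this conv → pool → inner-product structure, written with tf.keras (not the thesis's code); all layer sizes are illustrative placeholders:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(16, (3, 3), activation="relu",
                           input_shape=(28, 28, 1)),   # convolution: preserves spatial layout
    tf.keras.layers.MaxPooling2D((2, 2)),              # pooling: reduce each (2, 2) block to its max
    tf.keras.layers.Conv2D(32, (3, 3), activation="relu"),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation="relu"),      # inner-product ("simple NN") layers
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.summary()
```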

## ¶RNN

Recurrent neural networks (RNNs). RNNs [Rumelhart et al., 1985; Werbos, 1988]
are sequence-based models of key importance for natural language understanding, language
generation, video processing, and many other tasks [Kalchbrenner and Blunsom,
2013; Mikolov et al., 2010; Sundermeyer et al., 2012; Sutskever et al., 2014].

## ¶PILCO

PILCO [Deisenroth and Rasmussen, 2011], for example, is a data-efficient probabilistic
model-based policy search algorithm. PILCO analytically propagates uncertain state
distributions through a Gaussian process dynamics model. This is done by recursively
feeding the output state distribution (output uncertainty) of one time step as the input
state distribution (input uncertainty) of the next time step, until a fixed time horizon T.
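
The recursion itself can be sketched in a few lines; here `dynamics_moments` is a hypothetical stand-in for the GP moment-matching step (PILCO computes it analytically, the placeholder below just mimics its signature):

```python
import numpy as np

def dynamics_moments(mean, var):
    """Hypothetical one-step model: maps an input state distribution N(mean, var)
    to the output state distribution N(mean', var').
    PILCO computes this analytically from a GP dynamics model."""
    # placeholder dynamics: slight drift plus growing uncertainty
    return 0.9 * mean + 0.1, var * 0.81 + 0.05

def propagate(mean0, var0, T):
    """Recursively feed the output distribution of step t as the input of step t+1."""
    mean, var = mean0, var0
    for _ in range(T):
        mean, var = dynamics_moments(mean, var)
    return mean, var

print(propagate(mean0=np.array([0.0]), var0=np.array([1e-2]), T=10))
```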

## ¶Relation to GPs

(p14) Even though modern deep learning models used in practice do not capture model
confidence, they are closely related to a family of probabilistic models which induce
probability distributions over functions: the Gaussian process.
Given a neural network,
by placing a probability distribution over each weight (a standard normal distribution for
example), a Gaussian process can be recovered in the limit of infinitely many weights (see
Neal [1995] or Williams [1997]). For a finite number of weights, model uncertainty can still
be obtained by placing distributions over the weights—these models are called Bayesian
neural networks.

# Basics of Bayesian modelling

Prior: $p(\bm{\omega})$

Likelihood: $p(\bm{y}|\bm{x},\bm{\omega})$

## ¶Posterior and predictive distribution

Posterior: $p(\bm{\omega} | \bm{X},\bm{Y}) = \frac{p(\bm{Y} | \bm{X},\bm{\omega}) p(\bm{\omega})}{p(\bm{Y}|\bm{X})}$

Predictive distribution: $p(\bm{y}^*|\bm{x}^*,\bm{X},\bm{Y}) = \int p(\bm{y}^*|\bm{x}^*,\bm{\omega})\, p(\bm{\omega}|\bm{X},\bm{Y})\, {\rm d}\bm{\omega}$ … (eq-1)
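
The integral in (eq-1) is rarely tractable; the usual workaround is a Monte Carlo estimate with samples from the (approximate) posterior. A minimal sketch, where `sample_posterior` and `predict` are hypothetical placeholders for drawing $\bm{\omega}_t$ and evaluating $p(\bm{y}^*|\bm{x}^*,\bm{\omega}_t)$:

```python
import numpy as np

def predictive_mc(x_star, sample_posterior, predict, T=100):
    """Monte Carlo estimate of eq-1: average predict(x*, omega_t) over
    omega_t drawn from (an approximation to) p(omega | X, Y)."""
    samples = [predict(x_star, sample_posterior()) for _ in range(T)]
    return np.mean(samples, axis=0)
```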

## ¶Variational inference (Bayesian) (first time I have come across this)

• Explain and fit the observed data distribution as well as possible
• Stay as close as possible to the prior distribution (this plays the role of an Occam's razor)

• A trade-off between model complexity and explaining the data
• Yields a probabilistic model that includes model uncertainty

This technique does not scale to large data (evaluating $\int q_\theta(\bm{\omega}) \log p(\bm{Y}|\bm{X},\bm{\omega})\, {\rm d}\bm{\omega}$ requires calculations over the entire dataset), and the approach does not adapt to complex models (models in which this last integral cannot be evaluated analytically). Recent advances in VI allow us to circumvent these difficulties, and we will get back to this topic later in §3.1.

Kullback–Leibler (KL) divergence:
${\rm KL}(q_\theta(\bm{\omega})||p(\bm{\omega}|\bm{X},\bm{Y})) = \int q_\theta(\bm{\omega}) \log \frac{q_\theta(\bm{\omega})}{p(\bm{\omega}|\bm{X},\bm{Y})}\, {\rm d}\bm{\omega}$

Evidence Lower Bound (ELBO):
$\mathcal{L}_{\rm VI} = \int q_\theta(\bm{\omega}) \log p(\bm{Y}|\bm{X},\bm{\omega})\, {\rm d}\bm{\omega} - {\rm KL}( q_\theta(\bm{\omega})||p(\bm{\omega}) )$

The ELBO equals

$\mathcal{L}_{\rm VI} = \log p(\bm{Y}|\bm{X}) - {\rm KL}(q_\theta(\bm{\omega})||p(\bm{\omega}|\bm{X},\bm{Y}))$

so maximising it minimises the KL to the true posterior. It can equivalently be written as

$\mathcal{L}_{\rm VI} = \mathbb{E}_{q_\theta}[\log p(\bm{\omega},\bm{Y}|\bm{X})] + \mathbb{H}[q_\theta(\bm{\omega})]$
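
To see why the first form holds, expand the KL to the true posterior using Bayes' rule $p(\bm{\omega}|\bm{X},\bm{Y}) = p(\bm{Y}|\bm{X},\bm{\omega})\,p(\bm{\omega})/p(\bm{Y}|\bm{X})$:

$$\begin{aligned} {\rm KL}(q_\theta(\bm{\omega})||p(\bm{\omega}|\bm{X},\bm{Y})) &= \int q_\theta(\bm{\omega}) \log \frac{q_\theta(\bm{\omega})\, p(\bm{Y}|\bm{X})}{p(\bm{Y}|\bm{X},\bm{\omega})\, p(\bm{\omega})}\, {\rm d}\bm{\omega} \\ &= \log p(\bm{Y}|\bm{X}) - \int q_\theta(\bm{\omega}) \log p(\bm{Y}|\bm{X},\bm{\omega})\, {\rm d}\bm{\omega} + {\rm KL}(q_\theta(\bm{\omega})||p(\bm{\omega})) \\ &= \log p(\bm{Y}|\bm{X}) - \mathcal{L}_{\rm VI} \end{aligned}$$

The second form follows by writing $\mathcal{L}_{\rm VI} = \mathbb{E}_{q_\theta}[\log p(\bm{Y}|\bm{X},\bm{\omega})] + \mathbb{E}_{q_\theta}[\log p(\bm{\omega})] - \mathbb{E}_{q_\theta}[\log q_\theta(\bm{\omega})]$ and grouping the last term into the entropy $\mathbb{H}[q_\theta(\bm{\omega})]$.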

Xitong YANG, “Understanding the Variational Lower Bound”, discusses the computational considerations behind why one usually chooses to maximise $\mathcal{L}$.

# Bayesian neural networks

Hinton and Van Camp proposed factorising $q(\bm{\omega})$ into a product of independent distributions.

Neal proposed using HMC to sample directly from the posterior distribution.

## ¶HMC

Hamiltonian Monte Carlo (HMC), also known as hybrid Monte Carlo, was suggested for posterior inference; it is a technique based on dynamical simulation that does not rely on any prior assumptions about the form of the posterior distribution.

HMC makes use of Hamiltonian dynamics in
MCMC [Duane et al., 1987], following Newton’s laws of motion [Newton, 1687].

HMC, for example, even though shown to obtain good results, does not scale to large data [Neal, 1995], and it is difficult to explain the technique to non-experts.

Neal [1995] further studied different prior distributions in Bayesian NNs, and showed that in the limit of the number of units the model would converge to various stable processes, depending on the prior used (for example, the model would converge to a Gaussian process when a Gaussian prior is used).

# Bayesian Deep Learning

## ¶Monte Carlo estimators in variational inference

$I(\theta) = \frac{\partial}{\partial \theta} \int f(x) p_\theta(x)\, {\rm d}x$

• score function estimator $\hat{I}_1$ (likelihood ratio estimator). It has high variance; in practice it is usually combined with variance reduction techniques (see the sketch after this list).

$\frac{\partial}{\partial \theta} \int f(x) p_\theta(x)\, {\rm d}x = \int f(x) \frac{\partial}{\partial \theta} p_\theta(x)\, {\rm d}x = \int f(x) \frac{\partial \log p_\theta(x)}{\partial \theta} p_\theta(x)\, {\rm d}x$

• pathwise derivative estimator $\hat{I}_2$ (re-parametrisation trick, infinitesimal perturbation analysis, stochastic backpropagation). Assumes $p_\theta(x)$ can be re-expressed through a parameter-free distribution $p(\epsilon)$, so that $x = g(\theta,\epsilon)$ for a deterministic, differentiable transformation $g$ of both arguments.

$$\begin{aligned} \frac{\partial}{\partial \theta} \int f(x) p_\theta(x)\, {\rm d}x &= \frac{\partial}{\partial \theta} \int f(x) \left( \int p_\theta(x,\epsilon)\, {\rm d}\epsilon \right)\, {\rm d}x \\ &= \frac{\partial}{\partial \theta} \iint f(x) p_\theta(x|\epsilon) p(\epsilon)\, {\rm d}\epsilon\, {\rm d}x \\ &= \frac{\partial}{\partial \theta} \int \left( \int f(x)\, \delta\left(x-g(\theta,\epsilon)\right)\, {\rm d}x \right) p(\epsilon)\, {\rm d}\epsilon \\ &= \frac{\partial}{\partial \theta} \int f(g(\theta,\epsilon))\, p(\epsilon)\, {\rm d}\epsilon \\ &= \int f'(g(\theta,\epsilon))\, \frac{\partial}{\partial \theta} g(\theta,\epsilon)\, p(\epsilon)\, {\rm d}\epsilon \end{aligned}$$

• characteristic function estimator $\hat{I}_3$. Relies on the characteristic function of the Gaussian distribution. The estimator for $\frac{\partial}{\partial \mu}$ is identical to $\hat{I}_2$, while for $\frac{\partial}{\partial \sigma}$:

$\frac{\partial}{\partial \sigma} \int f(x) p_\theta(x)\, {\rm d}x = 2\sigma \cdot \frac{1}{2} \int f''(x) p_\theta (x)\, {\rm d}x$
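
A small numpy comparison of $\hat{I}_1$ and $\hat{I}_2$ on a toy problem, $f(x)=x^2$ with $p_\theta(x)=\mathcal{N}(x;\mu,\sigma^2)$, where the true gradient w.r.t. $\mu$ is $2\mu$ (a sketch for intuition, not code from the thesis):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, T = 1.0, 2.0, 100_000
f = lambda x: x**2                      # d/dmu E[f(x)] = 2*mu = 2.0

eps = rng.normal(size=T)
x = mu + sigma * eps                    # x ~ N(mu, sigma^2) via reparametrisation

# score function (likelihood ratio) estimator:  E[ f(x) * d log p(x)/d mu ]
score_grad = np.mean(f(x) * (x - mu) / sigma**2)

# pathwise derivative (reparametrisation) estimator:  E[ f'(g(theta, eps)) * dg/dmu ]
pathwise_grad = np.mean(2 * x)          # f'(x) = 2x, dg/dmu = 1

print(score_grad, pathwise_grad)        # both approach 2.0; the score-function one is noisier
```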

(There should be two figures here, but I have not managed to insert images with pandoc's markdown syntax without breaking the build, so they are left out for now.)

## ¶Model Uncertainty in BNN

(Predictive variance and posterior variance.) It is important to note the difference between the variance of the approximating distribution $q_\theta(\bm{\omega})$ and the variance of the predictive distribution $q_\theta(\bm{y}|\bm{x})$ (eq. (3.15)).

### ¶For regression problems:

We will perform moment-matching and estimate the first two moments of the predictive
distribution empirically. The first moment can be estimated as follows:

(Proof on pp. 47–48.)

(Uses $\mathbb{E}[x^2] = \mathbb{E}[x]^2 + {\rm Var}(x)$ to obtain the variance from the empirical first and second moments.)
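
A sketch of this empirical moment-matching for regression, using the estimators proved on pp. 47–48: collect T stochastic forward passes and combine them with the model precision $\tau$. Here `model_stochastic_forward` is a hypothetical placeholder for a dropout network evaluated with dropout left on at test time.

```python
import numpy as np

def predictive_moments(x_star, model_stochastic_forward, tau, T=100):
    """Empirical first two moments of the dropout predictive distribution.
    model_stochastic_forward(x) must apply dropout at test time, so each
    call returns a different sample y_hat_t."""
    y_hat = np.stack([model_stochastic_forward(x_star) for _ in range(T)])  # shape (T, D)
    mean = y_hat.mean(axis=0)
    # second moment minus squared first moment, plus the inverse model precision
    var = 1.0 / tau + (y_hat ** 2).mean(axis=0) - mean ** 2
    return mean, var
```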

• When applying this to the current orbit prediction problem, should $\tau$ be a constant?
• If not, how could it be extended to $\tau(\Delta t)$?

### ¶Several practical limitations:

1. Training time is unchanged, but test time grows by a factor of T; on a GPU the impact is small because the T stochastic forward passes can be batched together (mini-batches);
2. There is no good way to calibrate the model's uncertainty, so in practice the uncertainty behaves differently across datasets (I did not fully understand this);
3. VI is known to underestimate the predictive variance.

# Studying the open-source code

## ¶4.3 Quantitative comparison

https://github.com/yaringal/DropoutUncertaintyExps

YearPredictionMSD

The repo does not include this data: "data too big to fit in repo; please get in touch for dataset".

bostonHousing

• from keras.models import Sequential
• from keras.layers.core import Dense, Dropout, Activation
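
A minimal sketch of the kind of dropout regression network those imports build (the hyperparameters below are placeholders, not the values used in DropoutUncertaintyExps):

```python
from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation

# one-hidden-layer dropout network for regression (placeholder sizes)
model = Sequential()
model.add(Dropout(0.05, input_shape=(13,)))   # e.g. 13 features for bostonHousing
model.add(Dense(50))
model.add(Activation('relu'))
model.add(Dropout(0.05))
model.add(Dense(1))                           # scalar regression output
model.compile(loss='mean_squared_error', optimizer='adam')
```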

Gal's code is the 2016 version: DropoutUncertaintyExps.

For the Bayesian Optimization of hyperparameters, there is a protobuf version dependency issue between spearmint and tensorflow=1.1.0. (Not resolved yet; I do not know how to set up the environment.)

## ¶4.2

Gal mainly implements four VI algorithms (approximating distributions):

• Bernoulli approximating distribution, implemented as dropout
• Multiplicative Gaussian approximating distribution, implemented as multiplicative Gaussian noise (MGN)
• fully factorised Gaussian distribution
• mixture of Gaussians (MoG) with two mixture components, factorised over the rows of the weight matrices
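
The four choices differ mainly in how a weight matrix is sampled from $q_\theta(\bm{W})$; a rough numpy sketch of each sampler (shapes and hyperparameters are illustrative, not Gal's code):

```python
import numpy as np

rng = np.random.default_rng(0)
K_in, K_out, p = 5, 3, 0.5
M = rng.normal(size=(K_in, K_out))           # variational mean weights

# 1. Bernoulli (dropout): drop whole rows of M with probability p
def sample_bernoulli():
    z = rng.binomial(1, 1.0 - p, size=(K_in, 1))
    return M * z

# 2. Multiplicative Gaussian noise: scale rows by N(1, p/(1-p)) noise
def sample_mgn():
    z = rng.normal(1.0, np.sqrt(p / (1.0 - p)), size=(K_in, 1))
    return M * z

# 3. Fully factorised Gaussian: independent Gaussian per weight
log_std = np.full((K_in, K_out), -3.0)
def sample_factorised_gaussian():
    return M + np.exp(log_std) * rng.normal(size=(K_in, K_out))

# 4. Mixture of two Gaussians, factorised over the rows of the weight matrix
M2 = np.zeros((K_in, K_out))                 # second component's mean (e.g. zero)
def sample_mog():
    pick = rng.binomial(1, 0.5, size=(K_in, 1))   # choose a component per row
    means = pick * M + (1 - pick) * M2
    return means + np.exp(log_std) * rng.normal(size=(K_in, K_out))
```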

## ¶4.6 Heteroscedastic uncertainty

https://github.com/yaringal/HeteroscedasticDropoutUncertainty

Homoscedastic regression assumes identical observation noise for every input
point x. Heteroscedastic regression, on the other hand, assumes that observation noise
can vary with input x [Le et al., 2005].

• Homoscedastic: identical observation noise for every input
• Heteroscedastic: observation noise varies with the input

With heteroscedastic uncertainty, each data point $\bm{x}$ has its own observation noise.
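
One common way to implement this is to let the network output a per-input mean and log-variance and train with the corresponding Gaussian negative log-likelihood; a minimal numpy sketch of that loss (an assumption about the setup, not the repo's code):

```python
import numpy as np

def heteroscedastic_nll(y, mean, log_var):
    """Gaussian negative log-likelihood with input-dependent noise:
    each prediction carries its own observation-noise estimate."""
    precision = np.exp(-log_var)
    return 0.5 * np.mean(precision * (y - mean) ** 2 + log_var)
```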