Gal, Yarin, “Uncertainty in Deep Learning,” Doctor of Philosophy, University of Cambridge, 2016.
(p15) We will thus concentrate on the development of practical techniques to obtain model
confidence in deep learning, techniques which are also well rooted within the theoretical
foundations of probability theory and Bayesian modelling. Specifically, we will make use
of stochastic regularisation techniques (SRTs).
These techniques adapt the
model output stochastically as a way of model regularisation (hence the name stochastic
regularisation). This results in the loss becoming a random quantity, which is optimised
using tools from the stochastic non-convex optimisation literature. Popular SRTs include
dropout [Hinton et al., 2012], multiplicative Gaussian noise [Srivastava et al., 2014],
dropConnect [Wan et al., 2013], and countless other recent techniques.
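To make the "loss becomes a random quantity" remark concrete, here is a minimal sketch assuming a one-hidden-layer ReLU network, a squared-error loss, and inverted dropout on the hidden units. The function names, layer sizes, and dropout rate are illustrative choices, not taken from the thesis.

```python
import numpy as np

def dropout_forward(x, W1, W2, p_drop, rng):
    """One stochastic forward pass of a one-hidden-layer network with dropout."""
    h = np.maximum(0.0, x @ W1)               # ReLU hidden layer
    mask = rng.random(h.shape) >= p_drop      # keep each unit with probability 1 - p_drop
    h = h * mask / (1.0 - p_drop)             # inverted-dropout rescaling (an assumed convention)
    return h @ W2

def stochastic_loss(x, y, W1, W2, p_drop, rng):
    """Mean squared error under one random dropout mask."""
    pred = dropout_forward(x, W1, W2, p_drop, rng)
    return float(np.mean((pred - y) ** 2))

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 4))     # toy inputs
y = rng.normal(size=(32, 1))     # toy targets
W1 = rng.normal(size=(4, 16))
W2 = rng.normal(size=(16, 1))

# Same data and same weights, yet each evaluation draws a fresh mask,
# so the loss value changes from call to call.
losses = [stochastic_loss(x, y, W1, W2, 0.5, rng) for _ in range(5)]
print(losses)
```

With `p_drop=0.0` the mask keeps every unit and the loss is deterministic again, which is one way to see that the randomness comes entirely from the regulariser.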
The author's discussion of NNs
Convolutional neural networks (CNNs). CNNs [LeCun et al., 1989; Rumelhart
et al., 1985] are popular deep learning tools for image processing, which can solve tasks
that until recently were considered to lie beyond our reach [Krizhevsky et al., 2012;
Szegedy et al., 2014]. The model is made of a recursive application of convolution and
pooling layers, followed by inner product layers at the end of the network (simple NNs
as described above). A convolution layer is a linear transformation that preserves spatial
information in the input image (depicted in figure 1.1). Pooling layers simply take the
output of a convolution layer and reduce its dimensionality (by taking the maximum of
each (2, 2) block of pixels for example). The convolution layer will be explained in more
detail in section §3.4.1.
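The two layer types described above can be sketched in a few lines of NumPy. This is a rough illustration for a single-channel image, assuming a "valid" cross-correlation and non-overlapping (2, 2) max pooling; the helper names are hypothetical and the thesis's own formulation is the one in its section §3.4.1.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Valid cross-correlation of a 2-D image with a 2-D kernel (a linear map)."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # Each output pixel depends only on a local patch, so spatial layout is preserved.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool_2x2(fmap):
    """Maximum over each non-overlapping (2, 2) block; odd trailing rows/columns are dropped."""
    h, w = fmap.shape[0] // 2 * 2, fmap.shape[1] // 2 * 2
    blocks = fmap[:h, :w].reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.ones((2, 2)) / 4.0       # simple averaging filter
fmap = conv2d_valid(image, kernel)   # shape (4, 4)
pooled = max_pool_2x2(fmap)          # shape (2, 2)
print(fmap.shape, pooled.shape)
```

Pooling halves each spatial dimension here, which is the dimensionality reduction the passage refers to.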
Recurrent neural networks (RNNs). RNNs [Rumelhart et al., 1985; Werbos, 1988]
are sequence-based models of key importance for natural language understanding, language
generation, video processing, and many other tasks [Kalchbrenner and Blunsom,
2013; Mikolov et al., 2010; Sundermeyer et al., 2012; Sutskever et al., 2014].
PILCO [Deisenroth and Rasmussen, 2011], for example, is a data-efficient probabilistic
model-based policy search algorithm. PILCO analytically propagates uncertain state
distributions through a Gaussian process dynamics model. This is done by recursively
feeding the output state distribution (output uncertainty) of one time step as the input
state distribution (input uncertainty) of the next time step, until a fixed time horizon T.
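The recursion itself (output distribution of one step becomes the input distribution of the next, up to horizon T) can be illustrated with a deliberately simplified stand-in. PILCO uses a Gaussian process dynamics model; the sketch below swaps in a scalar linear-Gaussian model, for which a Gaussian state can be propagated exactly in closed form. All names and parameter values are hypothetical.

```python
def propagate(mean, var, a, b, noise_var):
    """Push N(mean, var) through x' = a * x + b + eps, with eps ~ N(0, noise_var)."""
    return a * mean + b, a * a * var + noise_var

def rollout(mean0, var0, horizon, a=0.9, b=0.1, noise_var=0.01):
    """Feed each step's output distribution back in as the next step's input."""
    trajectory = [(mean0, var0)]
    mean, var = mean0, var0
    for _ in range(horizon):
        mean, var = propagate(mean, var, a, b, noise_var)
        trajectory.append((mean, var))
    return trajectory

# Start from a known state (zero variance) and roll out to T = 10.
traj = rollout(mean0=0.0, var0=0.0, horizon=10)
for t, (m, v) in enumerate(traj):
    print(t, round(m, 4), round(v, 4))
```

Even from a perfectly known initial state, the predicted variance grows with each step, which is the accumulated uncertainty a planner would need to account for.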
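Relationship to GPs
With infinitely many neurons and a Gaussian distribution placed on every weight, the network becomes a GP.
With a finite number of weights, it is a BNN.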
(p14) Even though modern deep learning models used in practice do not capture model
confidence, they are closely related to a family of probabilistic models which induce
probability distributions over functions: the Gaussian process.
Given a neural network,
by placing a probability distribution over each weight (a standard normal distribution for
example), a Gaussian process can be recovered in the limit of infinitely many weights (see
Neal or Williams). For a finite number of weights, model uncertainty can still
be obtained by placing distributions over the weights; these models are called Bayesian
neural networks.
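A small numerical experiment can make the infinite-width limit plausible. The sketch below assumes a one-hidden-layer tanh network with standard normal weights and biases, and output weights scaled by 1/sqrt(width) so the output variance stays finite; that scaling and all names are assumptions of this sketch. It checks only that the output at one fixed input looks increasingly Gaussian as the width grows, which is a necessary consequence of the GP limit and not a proof of it.

```python
import numpy as np

def sample_outputs(x, width, n_draws, rng):
    """Outputs f(x) of n_draws independently sampled one-hidden-layer tanh networks."""
    w = rng.standard_normal((n_draws, width))   # input-to-hidden weights
    b = rng.standard_normal((n_draws, width))   # hidden biases
    v = rng.standard_normal((n_draws, width))   # hidden-to-output weights
    hidden = np.tanh(w * x + b)
    # Assumed 1/sqrt(width) scaling keeps the output variance independent of the width.
    return (v * hidden).sum(axis=1) / np.sqrt(width)

def excess_kurtosis(samples):
    """Zero for a Gaussian; positive for heavier-tailed distributions."""
    z = (samples - samples.mean()) / samples.std()
    return float(np.mean(z ** 4) - 3.0)

rng = np.random.default_rng(0)
narrow = sample_outputs(x=1.0, width=1, n_draws=20000, rng=rng)
wide = sample_outputs(x=1.0, width=256, n_draws=20000, rng=rng)
print(excess_kurtosis(narrow), excess_kurtosis(wide))
```

The narrow network's output is visibly non-Gaussian, while the wide network's excess kurtosis should sit close to zero, up to sampling noise.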
Some basics of Bayesian modelling
Prior distribution:
Likelihood function (likelihood distribution):
It reflects the probability of the observed values given the currently hypothesised function parameters.
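The formulas for these two quantities did not survive in the notes. As a hedged reconstruction, the block below writes them in a common notation: parameters $\omega$, inputs $X$, and outputs $Y$ are symbols assumed here, not recovered from the notes. It also adds the posterior and predictive distributions that follow from them by Bayes' theorem.

```latex
\begin{align}
  \text{prior:}      \quad & p(\omega) \\
  \text{likelihood:} \quad & p(y \mid x, \omega) \\
  \text{posterior:}  \quad & p(\omega \mid X, Y)
      = \frac{p(Y \mid X, \omega)\, p(\omega)}{p(Y \mid X)},
      \qquad p(Y \mid X) = \int p(Y \mid X, \omega)\, p(\omega)\, \mathrm{d}\omega \\
  \text{prediction:} \quad & p(y^{*} \mid x^{*}, X, Y)
      = \int p(y^{*} \mid x^{*}, \omega)\, p(\omega \mid X, Y)\, \mathrm{d}\omega
\end{align}
```

The normaliser $p(Y \mid X)$ is the model evidence. For neural networks this integral is generally intractable, which is why approximate inference is needed in practice.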