Reading Notes | PRML (Bishop 2006)

Christopher M. Bishop, Pattern recognition and machine learning, New York: Springer, 2006.

This reading note is served as a quick reference and short summary of the PRML book.
Read the book first.
It is incorrect to trying to understand the book through reading this post.

Basically, I was just trying to simplify the book by extracting only useful definitions, equations, formulas, and explanations.
So, to understand the context here, one should study the PRML book first.

(p23) This book places a strong emphasis on the Bayesian viewpoint, reflecting the huge growth in the practical importance of Bayesian methods in the past few years, while also discussing useful frequentist concepts as required.

TODO:

  • see (#15-decision-theory), something about loss functions
  • see (#23-gaussian-distribution)
  • see (#35-evidence-approximation)
  • see all sections marked as (to read).
  • see (#57-bayesian-neural-networks)
  • see (#64-gaussian-processes)
  • see (#9-mixture-models-and-em-expectation-maximization)

1. Introduction

1.2 Probability Theory

Bayes’ theorem

p(YX)=p(XY)p(Y)p(X).p(Y|X) = \frac{p(X|Y)p(Y)}{p(X)}.

The denominator can be expanded as a summation

p(X)=Yp(XY)p(Y),p(X) = \sum_Y p(X|Y) p(Y),

which can be viewed as the normalization constant required to ensure that the sum of the conditional probability p(YX)p(Y|X) over all values of YY equals one.

Define p(Dw)p(\mathcal{D}|w) as likelihood function, Bayes’ theorem is stated in words

posteriorlikelihood×prior,(1.44)posterior \propto likelihood \times prior, \tag{1.44}

where all of these quantities are viewed as functions of w\bm{w}.

In both the Bayesian and frequentist paradigms, the likelihood function p(Dw)p(\mathcal{D}|\bm{w}) plays a central role. However, the manner in which it is used is fundamentally different in the two approaches:

  • Frequentist setting, w\bm{w} is considered to be a fixed parameter; its value is determined by an ‘estimator’, and error bars on this estimate are obtained by considering the distribution of possible data sets D\mathcal{D}.
    • a widely used frequentist estimator, maximum likelihood: w\bm{w} is set to maximize the likelihood function p(Dw)p(\mathcal{D}|\bm{w}).
    • one approach to determine frequentist error bars is the bootstrap: resampling LL times from the original NN data points with replacement.
  • Bayesian viewpoint, there is only a single data set D\mathcal{D} (the one observed), the uncertainty is expressed through a probability distribution over w\bm{w}.

Not quite understand yet… (as of 2019/06/07)

(p23) There has been much controversy and debate associated with the relative merits of the frequentist and Bayesian paradigms, which have not been helped by the fact that there is no unique frequentist, or even Bayesian, viewpoint.

This book places a strong emphasis on the Bayesian viewpoint, reflecting the huge growth in the practical importance of Bayesian methods in the past few years, while also discussing useful frequentist concepts as required.

1.2.4 The Gaussian ditribution

The maximum likelihood approach systematically underestimates the variance of the ditribution.
This is an example of a phenomenon called bias and is related to the problem of over-fitting encountered in the context of polynomial curve fitting.

Note that the bias of the maximum likelihood solution becomes less significant as the number NN of data points increases, and in the limit NliminfN\lim\inf the maximum likelihood solution for the variance equals the true variance of the distribution that generated the data.

Adopting a Bayesian approach will automatically lead to an unbiased estimation of the variance.

1.2.5 Curve fitting re-visited

The sum-of-squares error function has arisen as a consequence of maximizing likelihood under the assumption of a Gaussian noise distribution.

Introduce a prior distribution over the polynomial coefficients w\bm{w}, for simplicity, a Gaussian distritution

p(wα)=N(w0,α1I)(1.65)p(\bm{w}|\alpha) = \mathcal{N}(\bm{w}|0,\alpha^{-1}\bm{I}) \tag{1.65}

Variables such as α\alpha, which controls the distribution of model parameters, are called hyperparameters.
Then, using Bayes’ theorem, the posterior distribution is

p(wx,t,α,β)p(tx,w,β)p(wα)(1.66)p(\bm{w}|\bm{x},\bm{t},\alpha,\beta) \propto p(\bm{t}|\bm{x},\bm{w},\beta) p(\bm{w}|\alpha) \tag{1.66}

Finding the most probable value of w\bm{w} given the data by maximing the posterior distribution is called maximum posterior, MAP.

1.2.6 Bayesian curve fitting

Discuss about how to make prediction using the Bayesian fitting in Sec. 1.2.5.

The first term in (1.71) represents the uncertainty in the predicted value of t\bm{t} due to the noise on the target variables and was expressed already in the maximum likelihood predictive distribution (1.64) through βM1L\beta^{-1}_ML.
However, the second term arises from the uncertainty in the parameters w\bm{w} and is a consequence of the Bayesian treatment.

1.3 Model Selection

One major drawback of cross-validation is that the number of training runs that must be performed is increased by a factor of SS, and this can prove problematic for models in which the training is itself computationally expensive.

We need a better approach. Ideally, this should rely only on the training data and should allow multiple hyperparameters and model types to be compared in a single training run.
We therefore need to find a measure of performance which depends only on the training data and
which does not suffer from bias due to over-fitting.

Historically various ‘information criteria’ have been proposed:

  • Akaike information criterion, AIC, lnp(DwML)M\ln p(\mathcal{D}|\bm{w}_{\rm ML}) - M
  • Bayesian information criterion, BIC (see Sec.4.4.1)

  • Such criteria do not take account of the uncertainty in the mdoel parameters; tend to favour overly simple models.

In Sec. 3.4, a fully Bayesian approach lead to that complexity penalties arise in a natural and principled way.

[THIS IS EXACTLY THE TECHNIQUE WE NEED FOR THE MACHINE LEARNING APPROACH STUDY.]

1.4 The Curse of Dimensionality

1.5 Decision Theory

From 1.5.1 to 1.5.4 discuss decision theory in the context of calssification problem.

1.5.1 Minimizing the misclassification rate

1.5.2 Minimizing the expected loss

1.5.3 The reject option

1.5.4 Inference and decision

1.5.5 Loss functions for regression

1.6 Information Theory

Suppose that a sender wishes to transmit the value of a random variable to a receiver.
The average amount of information that they transmit in the process is obtained by taking the expectation of (1.92) with respect to the distribution p(x)p(x) and is given by

H[x]=p(x)log2p(x).(1.93)H[x] = -\sum p(x) \log_2 p(x). \tag{1.93}

This important quantity is called the entropy of the random variable xx.

This relation between entropy and shortest coding length is a general one.T
he noiseless coding theorem (Shannon, 1948) states that the entropy is a lower bound on the number of bits needed to transmit the state of a random variable.

H=1NlnW=1NlnN!1Nilnni!(1.95)H = \frac{1}{N}\ln W = \frac{1}{N}\ln N! - \frac{1}{N}\sum_i\ln n_i ! \tag{1.95}

H=limN1N(NlnNN)1Ni(nilnnini)=limNlnN1Ninilnni=limNiniNlnniN=ipilnpi(1.97)\begin{aligned} H &= \lim_{N\rightarrow\infin} \frac{1}{N} (N\ln N - N) - \frac{1}{N}\sum_i (n_i \ln n_i - n_i) \\ &= \lim_{N\rightarrow\infin} \ln N - \frac{1}{N} \sum_i n_i \ln n_i\\ &= - \lim_{N\rightarrow\infin} \sum_i \frac{n_i}{N} \ln\frac{n_i}{N} = -\sum_i p_i\ln p_i \end{aligned} \tag{1.97}

The entropy of the random variable XX is defined as

H[p]=ip(xi)lnp(xi)(1.98)H[p] = - \sum_i p(x_i) \ln p(x_i) \tag{1.98}

For a density defined over multiple continuous variables, denoted collectively by the vector xx, the differential entropy is given by

H[x]=p(x)lnp(x])dx(1.104)H[\bm{x}] = - \int p(\bm{x}) \ln p(\bm{x]}) {\rm d}x \tag{1.104}

Conditional entropy of y\bm{y} given x\bm{x}:

H[xy]=H[yx]+H[x]H[\bm{x}|\bm{y}] = H[\bm{y}|\bm{x}] + H[\bm{x}]

1.6.1 RElative entropy and mutual information

Relate the ideas of information theory to pattern recognition.

Unknown p(x)p(\bm{x}), modelled using an approximation q(x)q(\bm{x}).
If construct a coding scheme for the purpose of transmitting values of x\bm{x} to a receiver, then the average additional amount of information required to specify the value of x\bm{x} as a result of using q(x)q(\bm{x}) instead of p(x)p(\bm{x}) is given by
relative entropy, or Kullback-Leibler divergence, KL divergence:

KL(pq)=p(x)ln{q(x)p(x)}dx(1.113){\rm KL}(p||q) = - \int p(\bm{x}) \ln \left\{\frac{q(\bm{x})}{p(\bm{x})}\right\} {\rm d}\bm{x} \tag{1.113}

  • KL divergence is not symmetric.

Approximate the unknown p(x)p(\bm{x}) with some parametric q(xθ)q(\bm{x}|\bm{\theta}).
Minimize the KL divergence between pp and qq with respect to θ\bm{\theta} to determine θ\bm{\theta}.
Observed a finite set of xn\bm{x}_n, for 1,,N1,\dots,N, drawn from p(x)p(\bm{x}), then the expectation with respect to pp can be approximated by a finite sum over xn\bm{x}_n, so that

KL(pq)n=1N{lnq(xnθ)+lnp(xn)}(1.119){\rm KL}(p||q) \approx \sum_{n=1}^N \left\{-\ln q(\bm{x}_n|\bm{\theta})+\ln p(\bm{x}_n)\right\} \tag{1.119}

Minimizing this Kullback-Leibler divergence is equivalent to maximizing the likelihood function.

2 PROBABILITY DISTRIBUTIONS

2.3 Gaussian Distribution

(p79) This section will be rather more technically involved than some of the earlier sections, and will require familiarity with various matrix identities.
However, we strongly encourage the reader to become proficient in manipulating Gaussian distributions using the techniques presented here as this will prove invaluable in understanding the more complex models presented in later chapters.

Gaussian Distribution:

N(xμ,Σ)=1(2π)D/21Σ1/2exp[12(xμ)TΣ1(xμ)](2.42)\mathcal{N}(\bm{x}|\bm{\mu},\Sigma) = \frac{1}{(2\pi)^{D/2}} \frac{1}{|\Sigma|^{1/2}} \exp \left[ -\frac{1}{2}(\bm{x}-\bm{\mu})^T \Sigma^{-1} (\bm{x}-\bm{\mu}) \right] \tag{2.42}

  • DD is dimension of x\bm{x}.

These equations are used for ``completing the square’':

12(xμ)TΣ1(xμ)=12(xaμa)TΛaa1(xaμa)12(xaμa)TΛab1(xbμb)=12(xbμb)TΛba1(xaμa)12(xbμb)TΛbb1(xbμb)(2.70)\begin{aligned} -\frac{1}{2}(\bm{x}-\bm{\mu})^T\Sigma^{-1}(\bm{x}-\bm{\mu}) &= -\frac{1}{2}(\bm{x}_a-\bm{\mu}_a)^T\Lambda_{aa}^{-1}(\bm{x}_a-\bm{\mu}_a) -\frac{1}{2}(\bm{x}_a-\bm{\mu}_a)^T\Lambda_{ab}^{-1}(\bm{x}_b-\bm{\mu}_b) \\ &\phantom{=} -\frac{1}{2}(\bm{x}_b-\bm{\mu}_b)^T\Lambda_{ba}^{-1}(\bm{x}_a-\bm{\mu}_a) -\frac{1}{2}(\bm{x}_b-\bm{\mu}_b)^T\Lambda_{bb}^{-1}(\bm{x}_b-\bm{\mu}_b) \end{aligned} \tag{2.70}

12(xμ)TΣ1(xμ)=12xTΣ1x+xTΣ1μ+terms independent of x(2.71)-\frac{1}{2}(\bm{x}-\bm{\mu})^T\Sigma^{-1}(\bm{x}-\bm{\mu}) = -\frac{1}{2} \bm{x}^T\Sigma^{-1}\bm{x} + \bm{x}^T\Sigma^{-1}\bm{\mu} + \text{terms independent of } \bm{x} \tag{2.71}

Conditional Gaussian distributions

Some properties of Gaussian distributions are most naturally expressed in terms of the covariance, whereas others take a simpler form when viewed in terms of the precision. (p.85)

Covariance Matrix:

Σ=Λ1=[ΣaaΣabΣbaΣbb]\Sigma = \Lambda^{-1} = \left[\begin{matrix} \Sigma_{aa} & \Sigma_{ab} \\ \Sigma_{ba} & \Sigma_{bb}\end{matrix}\right]

Precision Matrix:

Λ=Σ1=[ΛaaΛabΛbaΛbb]\Lambda = \Sigma^{-1} = \left[\begin{matrix} \Lambda_{aa} & \Lambda_{ab} \\ \Lambda_{ba} & \Lambda_{bb}\end{matrix}\right]

Conversion between covariance and precision matrixes.

[ABCD]1=[MMBD1D1CMD1+D1CMBD1]\left[\begin{matrix} A&B \\ C&D \end{matrix}\right]^{-1} = \left[\begin{matrix} M & -MBD^{-1} \\ -D^{-1}CM & D^{-1}+D^{-1}CMBD^{-1} \end{matrix}\right]

  • M=(ABD1C)1M = (A-BD^{-1}C)^{-1}

Consider the functional dependence of (2.70) on xa\bm{x}_a in which xb\bm{x}_b is regarded as a constant:

  • pick out all terms that are second order in xa\bm{x}_a
  • consider all of the terms that are linear in xa\bm{x}_a

μab=μaΛaa1Λab(xbμb)=μa+ΣabΣbb1(xbμb)(2.81)\bm{\mu}_{a|b} = \bm{\mu}_a-\Lambda_{aa}^{-1}\Lambda_{ab}(\bm{x}_b-\bm{\mu}_b) = \bm{\mu}_a + \Sigma_{ab}\Sigma_{bb}^{-1}(\bm{x}_b-\bm{\mu}_b) \tag{2.81}

Σab=Λaa1=ΣaaΣabΣbb1Σba(2.82)\Sigma_{a|b} = \Lambda_{aa}^{-1} = \Sigma_{aa}-\Sigma_{ab}\Sigma_{bb}^{-1}\Sigma_{ba} \tag{2.82}

p(xaxb)=N(μab,Λaa1)(2.96)p(\bm{x}_a|\bm{x}_b) = \mathcal{N}(\bm{\mu}_{a|b},\Lambda_{aa}^{-1}) \tag{2.96}

Marginal Gaussian distributions

For a marginal distribution, the mean and covariance are most simply expressed
in terms of the partitioned covariance matrix, in contrast to the conditional
distribution for which the partitioned precision matrix gives rise to simpler expressions. (p.89)

MARGINAL distribution:

p(xa)=p(xa,xb)dxbp(\bm{x}_a) = \int p(\bm{x}_a,\bm{x}_b) \cdot \rm{d}\bm{x}_b

Procedure:

  • Pick out terms in (2.70) that involves xb\bm{x}_b: 12xbTΛbbxb+xbTm-\frac{1}{2}\bm{x}_b^T\Lambda_{bb}\bm{x}_b + \bm{x}_b^T\bm{m}, where m=ΛbbμbΛba(xaμa)\bm{m}=\Lambda_{bb}\bm{\mu}_b-\Lambda_{ba}(\bm{x}_a-\bm{\mu}_a).
  • Completing the square:12(xbΛbb1μb)Λbb(xbΛbb1μb)+12mTΛbb1m-\frac{1}{2}(\bm{x}_b-\Lambda_{bb}^{-1}\bm{\mu}_b)\Lambda_{bb}(\bm{x}_b-\Lambda_{bb}^{-1}\bm{\mu}_b) + \frac{1}{2}\bm{m}^T\Lambda_{bb}^{-1}\bm{m}
  • Integrating exponential of this quadratic form over xb\bm{x}_b:
    • first term is an unnormalized Gaussian, results is reciprocal of the normalization coefficient
    • second term goes into p(xa)p(\bm{x}_a) as following
  • Combining the second term with the terms from (2.70) that depend on xa\bm{x}_a, then complete the square.

Σaa=(ΛaaΛabΛbb1Λba)1(2.88)\Sigma_{aa} = (\Lambda_{aa}-\Lambda_{ab}\Lambda_{bb}^{-1}\Lambda_{ba})^{-1} \tag{2.88}

p(xa)=N(xaμa,Σaa)(2.98)p(\bm{x}_a) = \mathcal{N}(\bm{x}_a|\bm{\mu}_a,\Sigma_{aa}) \tag{2.98}

2.3.3 Bayes’ theorem for Gaussian variables

GIVEN:

  • Gaussian marginal distribution p(x)=N(xμ,Λ1)p(\bm{x}) = \mathcal{N}(\bm{x}|\bm{\mu},\Lambda^{-1})
  • Gaussian conditional distribution p(yx)=N(yAx+b,L1)p(\bm{y}|\bm{x}) = \mathcal{N}(\bm{y}|A\bm{x}+\bm{b},L^{-1})
    • AA and LL are PRECISION matrixes

WANT:

  • Marginal distribution

p(y)=N(yAμ+b,L1+AΛ1AT)(2.111 or 2.115)p(\bm{y}) = \mathcal{N}(\bm{y}|A\bm{\mu}+\bm{b},L^{-1}+A\Lambda^{-1}A^T) \tag{2.111 or 2.115}

  • Conditional distribution

p(xy)=N(Σ[ATL(yb)+Aμ],Σ))(2.112 or 2.116)p(\bm{x}|\bm{y}) = \mathcal{N}(\Sigma[A^TL(\bm{y}-\bm{b})+A\bm{\mu}],\Sigma))\tag{2.112 or 2.116}

Σ=(Λ+ATLA)1\Sigma = (\Lambda+A^TLA)^{-1}

2.3.4 Maximum likelihood for the Gaussian

Given a dataset X={x1,,xN}T\bm{X}=\{\bm{x}_1,\dots,\bm{x}_N\}^T drawn independently from a multivariate Gaussian distribution.

Likelihood:

lnp(Xμ,Σ)(2.118 left)\ln p(\bm{X}|\bm{\mu},\Sigma) \tag{2.118 left}

Maximum likelihood estimations:

μML=1Nn=1Nxn(2.121)\bm{\mu}_{\rm ML} = \frac{1}{N} \sum_{n=1}^N \bm{x}_n \tag{2.121}

ΣML=1Nn=1N(xnμML)(xnμML)T\Sigma_{\rm ML} = \frac{1}{N} \sum_{n=1}^N (\bm{x}_n-\bm{\mu}_{\rm ML}) (\bm{x}_n-\bm{\mu}_{\rm ML})^T

Expectation of the maximum likelihood estimate:

E[μML]=1NE[xn]=E[xn](2.123)\mathbb{E}[\bm{\mu}_{\rm ML}] = \frac{1}{N} \sum \mathbb{E}[\bm{x}_n] = \mathbb{E}[\bm{x}_n] \tag{2.123}

E[ΣML]=E[1Nn(xn22μMLxn+μML2)]=1NE[nxn22NμML2+NμML2]=1NE[nxn2NμML2]=E[xn2]E[μML2]=E[xn2]1N2n,mE[xnxm]=E[xn2]1N2NE[xn2]=N1NE[xn2]=N1NΣ(2.124 + derivation)\begin{aligned} \mathbb{E}[\Sigma_{\rm ML}] &= \mathbb{E}\left[ \frac{1}{N} \sum_n \left( \bm{x}_n^2 - 2\bm{\mu}_{\rm ML}\bm{x}_n + \bm{\mu}_{\rm ML}^2 \right) \right]\\ &= \frac{1}{N} \mathbb{E}\left[ \sum_n \bm{x}_n^2 - 2N\bm{\mu}_{\rm ML}^2 + N\bm{\mu}_{\rm ML}^2 \right]\\ &= \frac{1}{N} \mathbb{E}\left[ \sum_n \bm{x}_n^2 - N\bm{\mu}_{\rm ML}^2 \right]\\ &= \mathbb{E}[\bm{x}_n^2] - \mathbb{E}[\bm{\mu}_{\rm ML}^2] \\ &= \mathbb{E}[\bm{x}_n^2] - \frac{1}{N^2}\sum_{n,m}\mathbb{E}\left[\bm{x}_n\bm{x}_m\right] \\ &= \mathbb{E}[\bm{x}_n^2] - \frac{1}{N^2}N\mathbb{E}[\bm{x}_n^2]\\ &= \frac{N-1}{N} \mathbb{E}[\bm{x}_n^2] = \frac{N-1}{N} \Sigma \end{aligned} \tag{2.124 + derivation}

Correct the biased estimate in Eq. (2.124) as:

Σ~=1N1n(xnμML)(xnμML)T\tilde{\Sigma} = \frac{1}{N-1} \sum_n (\bm{x}_n-\bm{\mu}_{\rm ML}) (\bm{x}_n-\bm{\mu}_{\rm ML})^T

(p97) Maximum likelihood framework gave point estimates for the parameters μ\bm{\mu} and Σ\Sigma.

2.3.5 Sequential estimation

2.3.6 Bayesian inference for the Gaussian (p.97)

Did not finish this part of reading.

2.3.7 Student’s t-distribution

Student’s t-distribution is obtained by adding up an infinite number of Gaussian distributions having the same means but different precisions.
This can be interpreted as an infinite mixture of Gaussians.
This gives the t-distribution an important property called robustness, which means that it is much less sensitive than the Gaussian to the presence of a few data points which are outliers.

2.3.8 Periodic variables

2.3.9 Mixtures of Gaussians

By using a sufficient number of Gaussians and by adjusting their means and covariances as well as the coefficients in the linear combination, almost any continuous density can be approximated to arbitrary accuracy. (Why it is “almost”? An counterexample?)

lnp(Xπ,μ,Σ)=(2.193)\ln p(\bm{X}|\bm{\pi},\bm{\mu},\bm{\Sigma}) = \dots \tag{2.193}

The likelihood function is now much more complex than with a single Gaussian, due to the presence of the summation over kk inside the logarithm.

The maximum likelihood solution for the parameters no longer has a closed-form analytical solution. Alternatively,

  • to sue iterative numerical optimization techniques
  • expectation maximization (see Chapter 6).

All the probatility distributions above are specific examples of a broad class of distributions called the exponential family.

2.4. The Exponential Family

The probability distributions that we have studied so far in this chapter (with the exception of the Gaussian mixture) are specific examples of a broad class of distributions called the exponential family (Duda and Hart, 1973; Bernardo and Smith, 1994).

Members of the exponential family have many important properties in common, and it is illuminating to discuss these properties in some generality.

2.5 Nonparametric Methods (p.120)

Throughout this chapter, we have focussed on the use of probability distributions having specific functional forms governed by a small number of parameters whose values are to be determined from a data set. This is called the parametric approach to density modelling.

Two widely used nonparametric techniques for desnity estimation:

  • kernel estimators
  • nearest neighbours

Both the K-nearest-neighbour method and the kernel density estimator require the entire training data set to be stored.

These nonparametric methods are still severely limited.
Simple parametric models are very restricted in terms of the forms of distribution that they can represent.
Subsequent chapters will show how to achieve density models that are very flexible and yet for which the complexity of the models can be controlled independently of the size of the training set.
[TODO: summarize how the following chapters achieves this !]

3 Linear Models for Regression

From a probabilistic perspective, we aim to model the predictive distribution p(tx)p(t|\bm{x}) because this expresses our uncertainty about the value of tt for each value of x\bm{x}.
From this conditional distribution we can make predictions of tt, for any new value of x\bm{x}, in such a way as to minimize the expected value of a suitably chosen loss function.

  • So, the ultimate goal of regression is to find p(tx)p(t|\bm{x}) !

3.1 Linear Basis Function Model

y=wTϕ(x)(3.3)y=\bm{w}^T\bm{\phi}(\bm{x}) \tag{3.3}

where w=(w0,ω1,,wM1)T\bm{w}=(w_0,\omega_1,\dots,w_{M-1})^T, ϕ=(ϕ0,ϕ1,,ϕM1)T\bm{\phi}=(\phi_0,\phi_1,\dots,\phi_{M-1})^T

3.1.1 Maximum likelihood and least squares

t=y(x,w)+ϵ(3.7)t=y(\bm{x},\bm{w})+\epsilon \tag{3.7}

p(tx,w,β)=N(ty(x,w),β1)(3.8)p(t|\bm{x},\bm{w},\beta) = \mathcal{N}(t|y(\bm{x},\bm{w}),\beta^{-1}) \tag{3.8}

Above two equations imply an assumed priori distribution, the observation model.

Now consider a data set of inputs X={x1,,xN}X = \{\bm{x}_1,\dots,\bm{x}_N\} with corresponding target values t={t1,,tN}\bm{t}=\{t_1,\dots,t_N\}.

ASSUMPTION: data points in (X,t)(X, \bm{t}) are drawn independently from the distribution (3.8).

LIKELIHOOD function:

p(tX,w,β)=n=1NN(tnwTϕ(xn),β1)(3.10)p(\bm{t}|\bm{X},\bm{w},\beta) = \prod_{n=1}^{N} \mathcal{N}(t_n|\bm{w}^T\bm{\phi}(\bm{x}_n),\beta^{-1}) \tag{3.10}

  • XX is usually omitted for simplicity

Log-likelihood function:

lnp(tw,β)=n=1Nln(12πβ1/2exp[β2(tnwTϕ(xn))T(tnwTϕ(xn))])\ln p(\bm{t}|\bm{w},\beta) = \sum_{n=1}^N \ln \left( \frac{1}{2\pi\beta^{-1/2}} \exp\left[ -\frac{\beta}{2} (\bm{t}_n-\bm{w}^T\bm{\phi}(\bm{x}_n))^T(\bm{t}_n-\bm{w}^T\bm{\phi}(\bm{x}_n)) \right] \right)

=N2ln(2π)+N2lnββ2n=1NtnwTϕ(xn)2(3.11 + 3.12)= -\frac{N}{2}\ln(2\pi) + \frac{N}{2}\ln\beta - \frac{\beta}{2} \sum_{n=1}^{N}\|\bm{t}_n-\bm{w}^T\bm{\phi}(\bm{x}_n)\|^2 \tag{3.11 + 3.12}

Note that in supervised learning problems such as regressions, we are not seeking to model the distribution of the input variables.
Thus we will drop the explicit x\bm{x} from expressions.

Having written down the likelihood function, we can use maximum likelihood to determine w\bm{w} and β\beta.
(Take gradient, and then set this gradient to zero gives)

MAX LIKELIHOOD results:

Weights wML\bm{w}_\text{ML}:

wML=(ΦTΦ)1ΦTt(3.15)\bm{w}_\text{ML}=(\Phi^T\Phi)^{-1}\Phi^T\bm{t} \tag{3.15}

where

Φ(ΦTΦ)1ΦT(3.17)\Phi^\dagger \equiv (\Phi^T\Phi)^{-1}\Phi^T \tag{3.17}

is known as the Mossre-Penrose pseudo-inverse of the matrix Φ\Phi.

Precision parameter βML\beta_{\rm ML}:

1β=1Nn=1N{tnwMLTϕ(xn)}2(3.21)\frac{1}{\beta} = \frac{1}{N} \sum_{n=1}^N \{t_n-\bm{w}_{\rm ML}^T\phi(\bm{x_n})\}^2 \tag{3.21}

3.1.2 Geomertry of least square

3.1.3 Sequential learning

3.1.4 Regularized least squares

sum-of-squares of the weight vector elements:
weight decay in machine learning literature
parameter shrinkage in statistics

other regularizers:

  • lasso

3.1.5 Multiple outputs

(p.147) In above discussions, assumed the form and number of basis functions are both fixed.

3.2 The Bias-Variance Decomposition

As we have seen in earlier chapters (FORGOT WHERE…), the phenomenon of over-fitting is really an unfortunate property of maximum likelihood and does not arise when we marginalize over parameters in a Bayesian setting.

3.3 Bayesian Linear Regression

We therefore turn to a Bayesian treatment of linear regression, which will
avoid the over-fitting problem of maximum likelihood, and which will also lead to
automatic methods of determining model complexity using the training data alone. (p.152)

CONJUGATE PRIOR distribution of w\bm{w} w.r.t. the LIKELIHOOD in Eq.3.10 (conjugate means both prior and posterior are the same distribution):

p(w)=N(wm0,S0)(3.48)p(\bm{w})=\mathcal{N}(\bm{w}|\bm{m}_0,S_0) \tag{3.48}

POSTERIOR distribution: (completing the square, or directly using Eq.2.116) (is Gaussian due to the choice of CONJUGATE prior)

p(wt)=N(wmN,SN)(3.49)p(\bm{w}|\bm{t}) = \mathcal{N}(\bm{w}|\bm{m}_N,S_N) \tag{3.49}

mN=SN(S01m0+βΦTt)\bm{m}_N = S_N(S_0^{-1}\bm{m}_0+\beta\Phi^T\bm{t})

SN1=S01+βΦTΦS_N^{-1}=S_0^{-1}+\beta\Phi^T\Phi

MAXIMUM POSTERIOR WEIGHT: (The posterior distribution is Gaussian, its mode coincides with its mean.)

wMAP=mN\bm{w}_\text{MAP} = \bm{m}_N

[Or, solve lnp(wt)/w=0\partial \ln p(\bm{w}|\bm{t})/\partial \bm{w} = 0 to get the MAP weights.]

If S0S_0 is infinitely broad, S010S_0^{-1}\rightarrow0, then the mean mN\bm{m}_N reduces to the maximum likelihood weights wML\bm{w}_\text{ML} in Eq.3.15.

(Wikipedia: Maximum a posteriori estimation) In Bayesian statistics, a maximum a posteriori probability (MAP) estimate is an estimate of an unknown quantity, that equals the mode of the posterior distribution. The MAP can be used to obtain a point estimate of an unobserved quantity on the basis of empirical data. It is closely related to the method of maximum likelihood (ML) estimation, but employs an augmented optimization objective which incorporates a prior distribution (that quantifies the additional information available through prior knowledge of a related event) over the quantity one wants to estimate. MAP estimation can therefore be seen as a regularization of ML estimation.

While only mild conditions are required for MAP estimation to be a limiting case of Bayes estimation (under the 0-1 loss function),[1] it is not very representative of Bayesian methods in general. This is because MAP estimates are point estimates, whereas Bayesian methods are characterized by the use of distributions to summarize data and draw inferences: thus, Bayesian methods tend to report the posterior mean or median instead, together with credible intervals.

一种点估计方法。

另一个参考博客:MLE vs MAP

In the following book, usually takes:

p(wα)=N(w0,α1I)(3.52)p(\bm{w}|\alpha) = \mathcal{N}(\bm{w}|\bm{0},\alpha^{-1}\bm{I}) \tag{3.52}

LOG POSTERIOR of w\bm{w}:

lnp(wt)=β2n=1N{tnwTϕ(xn)}2α2wTw+const w.r.t. w,(3.55)\ln p(\bm{w}|\bm{t})=-\frac{\beta}{2}\sum_{n=1}^N \{t_n-\bm{w}^T\bm{\phi}(\bm{x}_n)\}^2 - \frac{\alpha}{2}\bm{w}^T\bm{w} + \text{const w.r.t. } \bm{w}, \tag{3.55}

which is equivalent to add a quadratic regularization term with coefficient λ=α/β\lambda=\alpha/\beta.

Maximization of this posterior distribution with respect to w is therefore equivalent
to the minimization of the sum-of-squares error function with the addition of a
quadratic regularization term, corresponding to Eq.3.27 with λ=α/β\lambda=\alpha/\beta.

Then, we have [Similar to how we get Eq.3.49, let gradient be zero, or use Gaussian distribution’s property.]

mN=βSNΦTt(3.53)\bm{m}_N = \beta S_N\Phi^T\bm{t} \tag{3.53}

SN1=αI+βΦTΦ(3.54)S_N^{-1}=\alpha\bm{I}+\beta\Phi^T\Phi \tag{3.54}

PREDICTED distribution: (input vectors XX are omitted in the conditioning statements to simplify the notation) (x\bm{x} is the new value, tt is its prediction)

p(tt,α,β)=p(tw,β)p(wt,α,β)dw(3.57)p(t|\bm{t},\alpha,\beta)=\int p(t|\bm{w},\beta)p(\bm{w}|\bm{t},\alpha,\beta)\cdot {\rm d}\bm{w} \tag{3.57}

=N(tmNTϕ(x),σN2(x))(3.58)= \mathcal{N}(t|\bm{m}_N^T\bm{\phi}(\bm{x}),\sigma_N^2(\bm{x})) \tag{3.58}

where

σN2(x)=β1+ϕ(x)TSNϕ(x)(3.59)\sigma_N^2(\bm{x}) = \beta^{-1} + \bm{\phi}(\bm{x})^T S_N \bm{\phi}(\bm{x}) \tag{3.59}

The first term in (3.59) represents the noise on the data.

The second term reflects the uncertainty associated with the parameters w\bm{w}.

Because the noise process and the distribution of w\bm{w} are independent Gaussians, their variances are additive.

3.4 Bayesian Model Comparison

The model evidence is sometimes also called the marginal likelihood because it can be viewed as a likelihood function over the space of models, in which the parameters have been marginalized out. (p.162)

Our uncertainty is expressed through a prior probability distribution p(Mi)p(\mathcal{M}_i). Given a training set D\mathcal{D}, we then wish to evaluate the posterior distribution

p(MiD)p(DMi)p(Mi)(3.66)p(\mathcal{M}_i|\mathcal{D}) \propto p(\mathcal{D}|\mathcal{M_i}) p(\mathcal{M}_i) \tag{3.66}

The model evidence p(DMi)p(\mathcal{D}|\mathcal{M}_i) expresses the preference shown by the data for different models.

3.5 Evidence Apparoximation

[wMAP\bm{w}_{\rm MAP} 是可以解析得到的,但是 α\alphaβ\beta 需要用某种参数估计的方法得到 point estimation,便是这一章节要解决的问题。]

Full Bayesian treatment of the linear basis function model, we introduce prior distributions over the hyperparameters α\alpha and β\beta and make predictions by marginalizing with respect to these hyperparameters as well as with respect to the parameters w\bm{w}.

However, although we can integrate analytically over either w\bm{w} or over the hyperparameters, the complete marginalization over all of these variables is analytically intractable. Here we discuss an approximation in which we set the hyperparameters to specific values determined by maximizing the marginal likelihood function obtained by first integrating over the parameters w\bm{w}.

通过最大化把权重积分掉的marginal likelihood设置所有的超参数(学习过程),最终预测时对权重进行积分。

Empirical Bayes (statistics literature)
== type 2 maximum likelihood (statistics)
== generalized maximum likelihood (statistics)
== evidence approximation (machine learning literature)

There are two approaches to the maximization of the log marginal likelihood (log evidence): (p.166)
(These two approaches converge to the same solution.)

  1. Evaluate the evidence function analytically and then set its derivative equal to zero to obtain re-estimation equations for α\alpha and β\beta (in Section 3.5.2).
  2. Use a technique called the expectation maximization (EM) algorithm (in Section 9.3.4).

3.5.1 Evaluation of the evidence function

MARGINAL LIKELIHOOD function over the weight parameters w\bm{w}:
see Eq.(3.8) and Eq.(3.52);
using Eq.(2.115)

p(tα,β)=p(tw,β)p(wα)dw(3.77)p(\bm{t}|\alpha,\beta)=\int p(\bm{t}|\bm{w},\beta) p(\bm{w}|\alpha)\cdot {\rm d}\bm{w} \tag{3.77}

=(β2π)N/2(α2π)M/2exp{E(w)}dw(3.78)=\left(\frac{\beta}{2\pi}\right)^{N/2} \left(\frac{\alpha}{2\pi}\right)^{M/2} \int \exp\{-E(\bm{w})\} \cdot {\rm d}\bm{w} \tag{3.78}

where

E(w)=βED(w)+αEW(w)=β2tΦw2+α2wTw(3.79)E(\bm{w}) = \beta E_D(\bm{w}) + \alpha E_W(\bm{w}) = \frac{\beta}{2}\|\bm{t}-\Phi\bm{w}\|^2 + \frac{\alpha}{2}\bm{w}^T\bm{w} \tag{3.79}

completing the square

E(w)=E(mN)+12(wmN)TA(wmN)(3.80)E(\bm{w}) = E(\bm{m_N}) + \frac{1}{2}(\bm{w}-\bm{m}_N)^T A (\bm{w}-\bm{m}_N) \tag{3.80}

where

A=αI+βΦTΦ(3.81)A = \alpha\bm{I} + \beta\Phi^T\Phi \tag{3.81}

E(mN)=β2tΦw2+α2mNTmN(3.82)E(m_N)=\frac{\beta}{2}\|\bm{t}-\Phi\bm{w}\|^2+\frac{\alpha}{2}\bm{m}_N^T\bm{m}_N \tag{3.82}

mN=βA1ΦTt(3.84)\bm{m}_N=\beta A^{-1}\Phi^T\bm{t} \tag{3.84}

Seeing A=SN1A=S_N^{-1} in Eq.(3.54), Eq.(3.84) is equivalent to Eq.(3.53).

Then,

exp{E(w)}dw=exp{E(mN)}(2π)M/2A1/2(3.85)\int \exp\{-E(\bm{w})\} \cdot{\rm d}\bm{w} = \exp\{-E(\bm{m}_N)\}(2\pi)^{M/2}|A|^{-1/2} \tag{3.85}

LOG MARGINAL LIKELIHOOD function:

lnp(tα,β)=M2lnα+N2lnβE(w)12lnAN2ln2π(3.86)\ln p(\bm{t}|\alpha,\beta) = \frac{M}{2}\ln\alpha + \frac{N}{2}\ln\beta - E(\bm{w}) - \frac{1}{2}\ln|A| - \frac{N}{2}\ln{2\pi} \tag{3.86}

3.5.2 Maximizing the evidence function

α\alpha is determined by purely looking at the training data.

define the eigenvector equation: (βΦTΦ)ui=λiui(3.87)\text{define the eigenvector equation: } \left(\beta\Phi^T\Phi\right)\bm{u}_i = \lambda_i \bm{u}_i \tag{3.87}

γ=iγiα+γi(3.91)\gamma = \sum_i \frac{\gamma_i}{\alpha+\gamma_i} \tag{3.91}

α=γmNTmN(3.92)\alpha = \frac{\gamma}{\bm{m}_N^T\bm{m_N}} \tag{3.92}

This is an implicit solution for α\alpha and it can be solved iteratively:
initial guess -> mN\bm{m}_N from Eq.(3.53) -> λi\lambda_i from Eq.(3.87) -> γ\gamma from EQ.(3.91) -> re-estimate α\alpha.

Similar for β\beta.

3.6 Limitations of Fixed Basis Functions

models comprising linear combinations of fixed, nonlinear basis functions

significant short comings:
the number of basis functions needs to grow rapidly with the dimensionality of the input space, often exponentially

SVM and NN both use some good properties to alleviate this problem.

  • data vectors typically lie close to a nonlinear manifold whose intrinsic dimensionality is smaller
    • support vector machine, relevence vector machine, and neural network all use this one
  • target variables may have significant dependente on only a small number of possible directions
    • neural netowok uses this one

4. Linear Models for classification (not read yet)

5. Neural Network (not read yet)

5.7 Bayesian Neural Networks (to read)

6. Kernel Methods (not read yet)

For recent btetbooks on kernel methods, see Scholkopf and Smola (2002), Herbrich (2002), and Shawe-Taylor and Cristianini (2004).

6.1. Dual Representations

The advantage of the dual formulation is that it is expressed entirely in terms of the kernel function k(x,x)k(\bm{x},\bm{x}'). We can therefore work directly in terms of kernels and avoid the explicit introduction of the feature vector ϕ(x)\phi(\bm{x}), which allows us implicitly to use feature spaces of high, even infinite, dimensionality.

6.2. Constructing Kernels

6.3. Radial Basis Function Networks

6.4 Gaussian Processes (to read)

Here we extend the role of kernels to probabilistic discriminative models, leading to the framework of Gaussian processes.

For a linear regression models of the form y(x,w)=wTϕ(x)y(\bm{x},\bm{w}) = \bm{w}^T\phi(\bm{x}),
a prior distribution over w\bm{w} induced a corresponding prior distribution over functions y(x,w)y(\bm{x},\bm{w}).
Evaluating the posterior distribution over w\bm{w} obtains the corresponding posterior distribution over regression functions.

In the Gaussian process viewpoint, we dispense with the parametric model and instead define a prior probability distribution over functions directly.

Models equivalent to Gaussian processes in different fields:

  • kriging in geostatistics literature
  • ARMA models
  • Kalman filters
  • radial basis function networks

6.4.1. Linear regression revisited

A linear combination of MM basis functions ϕ={ϕi}\phi = \{\phi_i\}

y(x)=wTϕ(x)(6.49)y(\bm{x}) = \bm{w}^T \phi(\bm{x}) \tag{6.49}

Set a prior distribution over w\bm{w}

p(w)=N(w0,α1I)(6.50)p(\bm{w}) = \mathcal{N}(\bm{w}|\bm{0},\alpha^{-1}\bm{I}) \tag{6.50}

6.4.2. Gaussian processes for regression

6.4.3. Learning the hyperparameters

7. Sparse Kernel Mathod (not read yet)

7.2 Relevance Vector Machines

However, the key difference in the RVM is that we introduce a separate hyperparameter αi for each of the weight parameters wi instead of a single shared hyperparameter. (p.346)

p(wα)=i=1MN(wi0,αi1)(7.80)p(\bm{w}|\bm{\alpha}) = \prod_{i=1}^M \mathcal{N}(w_i|0,\alpha_i^{-1}) \tag{7.80}

where αi\alpha_i represents the precision of the corresponding parameter wiw_i.

p(wt,X,α,β)=N(wm,Σ)(7.81)p(\bm{w}|\bm{t},X,\bm{\alpha},\beta) = \mathcal{N}(\bm{w}|\bm{m},\Sigma) \tag{7.81}

where

m=βΣΦTt(7.82)\bm{m} = \beta\Sigma\Phi^T\bm{t} \tag{7.82}

Σ=(A+βΦTΦ)1(7.83)\Sigma = (A+\beta\Phi^T\Phi)^{-1} \tag{7.83}

and they are determined by maximizing the marginal likelihood function

p(tX,α,β)=p(tX,β,w)p(wα)dw(7.84)p(\bm{t}|X,\bm{\alpha},\beta) = \int p(\bm{t}|X,\bm{\beta},\bm{w})p(\bm{w}|\bm{\alpha}) \cdot {\rm d}\bm{w} \tag{7.84}

logp(tX,α,β)=lnN(t0,C)=12[Nln(2π)+lnC+tTC1t](7.85)\log p(\bm{t}|X,\bm{\alpha},\beta) = \ln \mathcal{N}(\bm{t}|0,C) = \frac{1}{2} \left[ N\ln(2\pi) + \ln|C| + \bm{t}^TC^{-1}\bm{t} \right] \tag{7.85}

where

C=β1I+ΦA1ΦT(7.86)C = \beta^{-1}I + \Phi A^{-1} \Phi^T \tag{7.86}

The values of α\bm{\alpha} and β\beta are determined using type-2 maximum likelihood, also known as the evidence approximation, in which we maximize the marginal likelihood function obtained by integrating out the weight parameters. (p.347)

It is worth emphasizing, however, that this mechanism (of automatic relevance determination) for achieving sparsity in probabilistic models through automatic relevance determination is quite general and can be applied to any model expressed as an adaptive linear combination of basis functions.

(稀疏性来自于应用automatic relevance determination, ARD算法;与SVM类似)

PREDICTIVE distribution: (after finding optimum α\bm{\alpha}^* and β\beta^*)

p(tx,X,t,α,β)=p(tx,w,β)p(wX,t,α,β)dw=N(mTϕ(x)),σ2(x)(7.90)\begin{aligned} p(t|\bm{x},X,\bm{t},\bm{\alpha}^*,\beta^*) &= \int p(t|\bm{x},\bm{w},\beta^*) p(\bm{w}|X,\bm{t},\bm{\alpha}^*,\beta^*) \cdot {\rm d}\bm{w}\\ &= \mathcal{N}(\bm{m}^T\bm{\phi}(\bm{x})),\sigma^2(\bm{x}) \tag{7.90} \end{aligned}

where

σ2(x)=(β)1+ϕ(x)TΣϕ(x)(7.91)\sigma^2(\bm{x})=(\beta^{*})^{-1} + \bm{\phi}(\bm{x})^T\Sigma\bm{\phi}(\bm{x}) \tag{7.91}

The principal disadvantage of the RVM compared to the SVM is that training
involves optimizing a nonconvex function, and training times can be longer than for a
comparable SVM.

More significantly, in the relevance vector
machine the parameters governing complexity and noise variance are determined
automatically from a single training run

8. Graphical Models (to read)

8.4. Inference in Graphical Models (to read)

9. Mixture Models and EM (expectation maximization)

9.4. The EM Algorithm in General

The expectation maximization algorithm, or EM algorithm, is a general technique for finding maximum likelihood solutions for probabilistic models having latent variables (Dempster et al., 1977; McLachlan and Krishnan, 1997). (p.450)

denote all of the observed variables by X\bm{X}
and all of the hidden variables by Z\bm{Z} (also referred to as latent variables).
The joint distribution p(X,Zθ)p(\bm{X},\bm{Z}|\bm{\theta}) is governed by a set of parameters denoted θ\bm{\theta}.
The likelihood function is

p(Xθ)=Zp(X,Zθ)(9.69)p(\bm{X}|\bm{\theta}) = \sum_{\bm{Z}} p(\bm{X},\bm{Z}|\bm{\theta}) \tag{9.69}

where we suppose that direct optimization of p(Xθ)p(\bm{X}|\bm{\theta}) is difficult, but the optimization
of the complete-data likelihood function p(X,Zθ)p(\bm{X},\bm{Z}|\bm{\theta}) is significantly easier.

Introduce a distribution q(Z)q(\bm{Z}) defined over the latent variables, the following decomposition hold

lnp(Xθ)=L(q,θ)+KL(qp)(9.70)\ln p(\bm{X}|\bm{\theta}) = \mathcal{L}(q,\bm{\theta}) + {\rm KL}(q\|p) \tag{9.70}

where

L(q,θ)=Zq(Z)ln{p(X,Zθ)q(Z)}(9.71)\mathcal{L}(q,\bm{\theta}) = \sum_{\rm Z} q(\bm{Z}) \ln\left\{ \frac{p(\bm{X},\bm{Z}|\bm{\theta})}{q(\bm{Z})} \right\} \tag{9.71}

KL(qp)=Zq(Z)ln{p(ZX,θ)q(Z)}(9.72){\rm KL}(q\|p) = -\sum_{\bm{Z}} q(\bm{Z})\ln \left\{ \frac{p(\bm{Z}|\bm{X},\bm{\theta})}{q(\bm{Z})} \right\} \tag{9.72}

where KL is the Kullback-Leibler divergence between
q(Z)q(\bm{Z}) and the posterior distribution p(ZX,θ)p(\bm{Z}|\bm{X}, \bm{\theta}).

The EM algorithm is a two-stage iterative optimization technique for finding maximum likelihood solutions. (p.451)

  • In the E step, the lower bound L(q,θold)\mathcal{L}(q, \bm{\theta}^{\rm old}) is maximized with respect to q(Z)q(\bm{Z}) while holding θold\bm{\theta}^{\rm old} fixed.
  • In the M step, the distribution q(Z)q(\bm{Z}) is held fixed and the lower bound L(q,θ)\mathcal{L}(q, \bm{\theta}) is maximized with respect to θ\bm{\theta} to give some new value θnew\bm{\theta}^{\rm new}.

The EM algorithm breaks down the potentially difficult problem of maximizing the likelihood function into two stages, the E step and the M step, each of which will often prove simpler to implement. (p.454)

  • The generalized EM, or GEM, algorithm addresses the problem of an intractable
    M step.
  • We can similarly generalize the E step of the EM algorithm by performing a
    partial, rather than complete, optimization of L(q, θ) with respect to q(Z) (Neal and
    Hinton, 1999).

在需要使用 EM 时,还需要仔细研究这部分内容,理解不够深刻,还没有掌握到精髓所在。

10. Approximate Inference

(p24) More recently, highly efficient deterministic approximation schemes ouch as variational Bayes and expectation propagation have been developed. These offer a complementary alternative to sampling methods and have allowed Bayesian techniques to be used in large-scale applications.

Tasks:

  • Evaluation of the posterior distribution p(XX)p(\bm{X}|\bm{X}) given the observed data X\bm{X}.
  • Evaluation of the expectations with respect to p(XZ)p(\bm{X}|\bm{Z}).

Difficulties:

  • The dimensionality of the latent space is too high.
  • The posterior distribution is very complex such that expectations are not analytically tractable.

Approximateion schemes:

  • (Sec. 10) Deterministic (never exact results; some scale well to large applications)
    • Laplace approximateion in Sec. 4.4
    • Variational Bayes (or, variational inference): minimize KL(qp)KL(q||p)
    • Expectation propagation: minimize KL(pq)KL(p||q)
  • (Sec. 11) Stochastic (exact results given infinite computational resources; computationally demanding, small-scale problem.)
    • MCMC

10.1. Variational Inference

Entropy H[p]=p(x)lnp(x)dtH[p] = \int p(x)\ln p(x)\, dt is a functional.

The approximation l
Restricting the range of functions over which the optimization of the functional is performed.

L(q)=q(Z)ln{p(X,Z)q(Z)}(10.3)\mathcal{L}(q) = \int q(\bm{Z}) \ln\left\{ \frac{p(\bm{X},\bm{Z})}{q(\bm{Z})} \right\} \tag{10.3}

q(Z)=i=1Mqi(Zi)(10.5)q(\bm{Z}) = \prod_{i=1}^M q_i(\bm{Z}_i) \tag{10.5}

Substitute (10.5) into (10.3) and then dissect out the dependence on one of the fastors qj(Zj)q_j(\bm{Z}_j),

11. Sampling Methods

Approximate inference methods based on numerical sampling is also known as Monte Carlo techniques.

For most situations, posterior (/pɑ’stɪrɪɚ/) distribution is required primarily for the purpose of evaluating expectations, for example, in order to make predictions.

E[f]=f(z)p(z)dz(11.1)\mathbb{E}[f] = \int f(\bm{z}) p(\bm{z})\, \rm{d}\bm{z} \tag{11.1}

11.1 Basic Sampling Algorithms

11.1.1 transform using the inverse accumulative distribution function

11.1.2 rejection sampling

11.1.3 adaptive rejection sampling

adaptive rejection Metropolis sampling

Rejection sampling
- exponential decrease of acceptance rate with dimensionality
- play a role as a subroutine in more sophisticated algorithms for sampling in high dimensional spaces\

11.1.4 importance sampling

Use a proposal distribution q(z)q(\bm{z}), easy to draw samples, to calculate Eq. 11.1:

E[f]=f(z)p(z)dz=f(z)p(z)q(z)q(z)dz1Ll=1Lp(z(l))q(z(l))f(z(l))(11.19)\begin{aligned} \mathbb{E}[f] = \int f(\bm{z}) p(\bm{z})\, {\rm d}\bm{z} = \int f(\bm{z}) \frac{p(\bm{z})}{q(\bm{z})} q(\bm{z})\, {\rm d}\bm{z} \approx \frac{1}{L}\sum_{l=1}^L \frac{p(\bm{z}^{(l)})}{q(\bm{z}^{(l)})} f(\bm{z}^{(l)}) \end{aligned} \tag{11.19}

where rl=p(z(l))/q(z(l))r_l=p(\bm{z}^{(l)})/q(\bm{z}^{(l)}) is the importance weights.

When p(z)p(\bm{z}) can only be evaluated up to a normalized constant, p(z)=p~(z)/Zpp(\bm{z})=\tilde{p}(\bm{z})/Z_p, using a similar proposal distribuion q(z)=q~(z)/Zqq(\bm{z})=\tilde{q}(\bm{z})/Z_q, then we have,

E[f]=f(z)p(z)dz=ZqZpf(z)p~(z)q~(z)q~(z)dzZqZp1Ll=1Lr~lf(z(l)).(11.20)\mathbb{E}[f] = \int f(\bm{z}) p(\bm{z})\, {\rm d}\bm{z} = \frac{Z_q}{Z_p} \int f(\bm{z}) \frac{\tilde{p}(\bm{z})}{\tilde{q}(\bm{z})} \tilde{q}(\bm{z})\, {\rm d}\bm{z} \approx \frac{Z_q}{Z_p} \frac{1}{L}\sum_{l=1}^L \tilde{r}_l f(\bm{z}^{(l)}). \tag{11.20}

And because

ZpZq=1Zqp~(z)dz=1Zqp~(z)q~(z)q~(z)dz=p~(z)q~(z)q(z)dz1Ll=1Lr~l(11.21)\frac{Z_p}{Z_q} = \frac{1}{Z_q} \int \tilde{p}(\bm{z}){\rm d}\bm{z} = \frac{1}{\textcolor{red}{Z_q}} \int \frac{\tilde{p}(\bm{z})}{\tilde{q}(\bm{z})} \textcolor{red}{\tilde{q}(\bm{z})}\, {\rm d}\bm{z} = \int \frac{\tilde{p}(\bm{z})}{\tilde{q}(\bm{z})} \textcolor{red}{q(\bm{z})}\, {\rm d}\bm{z} \approx \frac{1}{L}\sum_{l=1}^L \tilde{r}_l \tag{11.21}

we then have

E[f]=l=1Lωlf(z(l))(11.22)\mathbb{E}[f] = \sum_{l=1}^L \omega_l f(\bm{z}^{(l)}) \tag{11.22}

with

ωl=r~lmr~m(11.23)\omega_l = \frac{\tilde{r}_l}{\sum_m \tilde{r}_m} \tag{11.23}

Other improvements:
- likelihood weighted sampling
- self-importance sampling

11.1.5 Sampling-importance-resampling

11.1.6 Sampling and the EM algorithm

Forget EM algorithm? See here.

Monte Carlo EM algorithm: stochastic EM

IP algorithm

11.2 Markov Chain Monte Carlo

(p24) Monte Carlo methods are very flexible and can be applied to a wide range of models. However, they are computationally intensive and have mainly been used for small-scale problems.

MCMC is a big framework: sample from many classes of distributions; scales well with the space dimension.

A central goal in designing MCMC methods is to avoid random walk behaviour.

basic Metropolis algorithm:
assume proposal distribution is symmetric, q(zAzB)=q(zBzA)q(\bm{z}_A|\bm{z}_B) = q(\bm{z}_B|\bm{z}_A),
candidate sample is accepted with probability A(z)=min(1,p~(z)p~(z(τ))A(\bm{z}^*)=\min \left( 1,\frac{\tilde{p}(\bm{z}^*)}{\tilde{p}(\bm{z}^{(\tau)}} \right)
at each step τ\tau, if the candiate z\bm{z}^* is acceptable, z(τ+1)=z\bm{z}^{(\tau+1)}=\bm{z}^*, otherwise, z(τ+1)=z(τ)\bm{z}^{(\tau+1)}=\bm{z}^{(\tau)}.

11.2.1 Markov chains

11.2.2 The Metropolis-Hastings algorithm

11.3 Gibbs Sampling

Gibbs sampling is a MCMC algorithm; a special case of the Metropolis-Hasting algorithm.

11.4 Slice Sampling

11.5 The Hybrid Monte Carlo Algorithm

Key points:
- Hamiltonian dynamics
- leapfrog integration
- Metropolis algorithm

这章与很多文献中的 Hamiltonian Monte Carlo 的关系是什么?
感觉是一样的。
是同样的东西,see MCMC: Hamiltonian Monte Carlo (a.k.a. Hybrid Monte Carlo)
.

11.5.1 Dynamical systems

11.5.2 Hybrid Monte Carlo

see MCMC: Hamiltonian Monte Carlo (a.k.a. Hybrid Monte Carlo)

11.6 Estimating the Partition Function

If we write

pE(z)=1ZEexp(E(z))p_E(\bm{z}) = \frac{1}{Z_E} \exp(-E(\bm{z}))

then the normalization constant ZEZ_E is known as the partition function.
K
nowledge of the value of ZEZ_E can be useful for Bayesian model comparison since it represents the model evidence (the probability of the observed data given the model).

(stopped here last time)

12. Continuous Latent Variables

12.4. Nonlinear Latent Variable Models (to read)

13. Sequential Data (not read for now)

14. Combining Models (not read for now)

14.1. Bayesian Model Averaging

14.2. Committees

14.3. Boosting

14.4. Tree-based Models


Some Terms

(p25) The maximum of a distribution is known as its mode.

(p28) for consistency with the notation in later chapters, we define a precision parameter parameter β\beta corresponding to the inverse variance of the distribution.

(p30) Variables such as α\alpha, which controls the distribution of model parameters, are called hyperparameters.

(p30) Finding the most probable value of w\bm{w} given the data by maximing the posterior distribution is called maximum posterior, MAP.

(p120) Parametric approach to density modelling:
the use of probability distributions having specific functional forms governed by a small number of parameters whose values are to be determined from a data set.

(p120) Nonparametric approaches to density estimation that make few assumptions about the form of the distribution.

(p165) empirical Bayes (statistics) == type 2 maximum likelihood (statistics) == generalized maximum likelihood (statistics) == evidence approximation (machine learning)

疑问

  1. complete-data log likelihood, p536
  2. complete-data parameter posterior p(θZ,X)p(\bm{\theta}|\bm{Z},\bm{X}), p537