Christopher M. Bishop, Pattern recognition and machine learning, New York: Springer, 2006.
This reading note serves as a quick reference and short summary of the PRML book.
Read the book first.
Do not try to understand the book by reading this post alone.
Basically, I was just trying to simplify the book by extracting only the useful definitions, equations, formulas, and explanations.
So, to understand the context here, one should study the PRML book first.
(p23) This book places a strong emphasis on the Bayesian viewpoint, reflecting the huge growth in the practical importance of Bayesian methods in the past few years, while also discussing useful frequentist concepts as required.
- see (#15-decision-theory), something about loss functions
- see (#23-gaussian-distribution)
- see (#35-evidence-approximation)
- see all sections marked as (to read).
- see (#57-bayesian-neural-networks)
- see (#64-gaussian-processes)
- see (#9-mixture-models-and-em-expectation-maximization)
¶1.2 Probability Theory
The denominator p(X) can be expanded as a summation, p(X) = Σ_Y p(X|Y) p(Y),
which can be viewed as the normalization constant required to ensure that the sum of the conditional probability p(Y|X) over all values of Y equals one.
Defining p(D|w) as the likelihood function, Bayes’ theorem can be stated in words as posterior ∝ likelihood × prior,
where all of these quantities are viewed as functions of w.
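In symbols (Eqs. 1.43–1.44 in the book):

$$
p(\mathbf{w}\mid\mathcal{D}) = \frac{p(\mathcal{D}\mid\mathbf{w})\,p(\mathbf{w})}{p(\mathcal{D})},
\qquad \text{posterior} \propto \text{likelihood} \times \text{prior}.
$$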
In both the Bayesian and frequentist paradigms, the likelihood function plays a central role. However, the manner in which it is used is fundamentally different in the two approaches:
- Frequentist setting: w is considered to be a fixed parameter; its value is determined by some ‘estimator’, and error bars on this estimate are obtained by considering the distribution of possible data sets D.
- a widely used frequentist estimator, maximum likelihood: w is set to the value that maximizes the likelihood function p(D|w).
- one approach to determining frequentist error bars is the bootstrap: draw N points, with replacement, from the original N data points, repeating this to create multiple bootstrap data sets (see the sketch after this list).
- Bayesian viewpoint: there is only a single data set D (the one actually observed), and the uncertainty in the parameters is expressed through a probability distribution over w.
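A minimal bootstrap sketch (NumPy); the choice of the sample mean as the estimator is my own, purely for illustration:

```python
import numpy as np

def bootstrap_error_bars(data, estimator=np.mean, n_resamples=1000, seed=0):
    """Estimate frequentist error bars by resampling the data with replacement."""
    rng = np.random.default_rng(seed)
    n = len(data)
    # Draw n_resamples bootstrap data sets, each of the original size n.
    estimates = np.array([
        estimator(rng.choice(data, size=n, replace=True))
        for _ in range(n_resamples)
    ])
    return estimates.std()  # spread of the estimator across bootstrap data sets

# Example: error bar on the mean of 50 noisy observations.
data = np.random.default_rng(1).normal(loc=0.5, scale=1.0, size=50)
print(bootstrap_error_bars(data))
```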
I do not quite understand this yet… (as of 2019/06/07)
(p23) There has been much controversy and debate associated with the relative merits of the frequentist and Bayesian paradigms, which have not been helped by the fact that there is no unique frequentist, or even Bayesian, viewpoint.
This book places a strong emphasis on the Bayesian viewpoint, reflecting the huge growth in the practical importance of Bayesian methods in the past few years, while also discussing useful frequentist concepts as required.
¶1.2.4 The Gaussian distribution
The maximum likelihood approach systematically underestimates the variance of the distribution.
This is an example of a phenomenon called bias and is related to the problem of over-fitting encountered in the context of polynomial curve fitting.
Note that the bias of the maximum likelihood solution becomes less significant as the number N of data points increases, and in the limit N → ∞ the maximum likelihood solution for the variance equals the true variance of the distribution that generated the data.
Adopting a Bayesian approach will automatically lead to an unbiased estimation of the variance.
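The bias and its correction (Eqs. 1.58–1.59):

$$
\mathbb{E}[\sigma^2_{\text{ML}}] = \frac{N-1}{N}\,\sigma^2,
\qquad
\tilde{\sigma}^2 = \frac{N}{N-1}\,\sigma^2_{\text{ML}} = \frac{1}{N-1}\sum_{n=1}^{N}(x_n - \mu_{\text{ML}})^2 .
$$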
¶1.2.5 Curve fitting re-visited
The sum-of-squares error function has arisen as a consequence of maximizing likelihood under the assumption of a Gaussian noise distribution.
Introduce a prior distribution over the polynomial coefficients w; for simplicity, a Gaussian distribution p(w|α) = N(w | 0, α⁻¹I).
Variables such as α, which control the distribution of model parameters, are called hyperparameters.
Then, using Bayes’ theorem, the posterior distribution is
Finding the most probable value of w given the data by maximizing the posterior distribution is called maximum a posteriori, or MAP.
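For reference (Eqs. 1.66–1.67): the posterior is proportional to the product below, and maximizing it is equivalent to minimizing the regularized sum-of-squares error:

$$
p(\mathbf{w}\mid\mathbf{x},\mathbf{t},\alpha,\beta) \propto p(\mathbf{t}\mid\mathbf{x},\mathbf{w},\beta)\,p(\mathbf{w}\mid\alpha),
\qquad
\frac{\beta}{2}\sum_{n=1}^{N}\{y(x_n,\mathbf{w}) - t_n\}^2 + \frac{\alpha}{2}\mathbf{w}^{\mathrm T}\mathbf{w}.
$$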
¶1.2.6 Bayesian curve fitting
Discusses how to make predictions using the Bayesian curve fitting of Sec. 1.2.5.
The first term in (1.71) represents the uncertainty in the predicted value of t due to the noise on the target variables; this uncertainty was already expressed in the maximum likelihood predictive distribution (1.64) through β_ML⁻¹.
However, the second term arises from the uncertainty in the parameters w and is a consequence of the Bayesian treatment.
¶1.3 Model Selection
One major drawback of cross-validation is that the number of training runs that must be performed is increased by a factor of S (the number of folds), and this can prove problematic for models in which the training is itself computationally expensive.
We need a better approach. Ideally, this should rely only on the training data and should allow multiple hyperparameters and model types to be compared in a single training run.
We therefore need to find a measure of performance which depends only on the training data and which does not suffer from bias due to over-fitting.
Historically various ‘information criteria’ have been proposed:
- Akaike information criterion, AIC,
- Bayesian information criterion, BIC (see Sec.4.4.1)
Such criteria do not take account of the uncertainty in the model parameters and tend to favour overly simple models.
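For example, the AIC chooses the model for which the quantity below is largest, where M is the number of adjustable parameters (Eq. 1.73):

$$
\ln p(\mathcal{D}\mid\mathbf{w}_{\text{ML}}) - M .
$$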
In Sec. 3.4, a fully Bayesian approach is shown to make complexity penalties arise in a natural and principled way. [THIS IS EXACTLY THE TECHNIQUE WE NEED FOR THE MACHINE LEARNING APPROACH STUDY.]
¶1.4 The Curse of Dimensionality
¶1.5 Decision Theory
Sections 1.5.1 to 1.5.4 discuss decision theory in the context of classification problems.
¶1.5.1 Minimizing the misclassification rate
¶1.5.2 Minimizing the expected loss
¶1.5.3 The reject option
¶1.5.4 Inference and decision
¶1.5.5 Loss functions for regression
¶1.6 Information Theory
Suppose that a sender wishes to transmit the value of a random variable x to a receiver.
The average amount of information that they transmit in the process is obtained by taking the expectation of (1.92) with respect to the distribution p(x) and is given by the expression below.
This important quantity is called the entropy of the random variable x.
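For a discrete variable (Eq. 1.93):

$$
\mathrm H[x] = -\sum_x p(x)\log_2 p(x).
$$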
This relation between entropy and shortest coding length is a general one. The noiseless coding theorem (Shannon, 1948) states that the entropy is a lower bound on the number of bits needed to transmit the state of a random variable.
The entropy of the random variable x is defined as H[x] = −Σ_x p(x) ln p(x).
For a density defined over multiple continuous variables, denoted collectively by the vector x, the differential entropy is given below, together with the conditional entropy of y given x.
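For reference (Sec. 1.6):

$$
\mathrm H[\mathbf{x}] = -\int p(\mathbf{x}) \ln p(\mathbf{x})\,\mathrm d\mathbf{x},
\qquad
\mathrm H[\mathbf{y}\mid\mathbf{x}] = -\iint p(\mathbf{y},\mathbf{x}) \ln p(\mathbf{y}\mid\mathbf{x})\,\mathrm d\mathbf{y}\,\mathrm d\mathbf{x}.
$$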
¶1.6.1 Relative entropy and mutual information
Relate the ideas of information theory to pattern recognition.
Unknown distribution p(x), modelled using an approximating distribution q(x).
If we construct a coding scheme for the purpose of transmitting values of x to a receiver, then the average additional amount of information required to specify the value of x as a result of using q(x) instead of the true distribution p(x) is given by the
relative entropy, or Kullback-Leibler divergence, KL divergence:
- KL divergence is not symmetric.
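For reference (Eq. 1.113):

$$
\mathrm{KL}(p\Vert q) = -\int p(\mathbf{x}) \ln\!\left\{\frac{q(\mathbf{x})}{p(\mathbf{x})}\right\}\mathrm d\mathbf{x}.
$$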
Approximate the unknown p(x) with some parametric distribution q(x|θ).
Minimize the KL divergence between p(x) and q(x|θ) with respect to θ to determine q.
Having observed a finite set of training points x_n, for n = 1, …, N, drawn from p(x), the expectation with respect to p(x) can be approximated by a finite sum over these points, so that
Minimizing this Kullback-Leibler divergence is equivalent to maximizing the likelihood function.
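The finite-sum approximation (Eq. 1.119); the first term is the negative log likelihood of θ under q, which is why minimizing the KL divergence maximizes the likelihood:

$$
\mathrm{KL}(p\Vert q) \simeq \frac{1}{N}\sum_{n=1}^{N}\bigl\{-\ln q(\mathbf{x}_n\mid\boldsymbol\theta) + \ln p(\mathbf{x}_n)\bigr\}.
$$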
2 Probability Distributions
¶2.3 Gaussian Distribution
(p79) This section will be rather more technically involved than some of the earlier sections, and will require familiarity with various matrix identities.
However, we strongly encourage the reader to become proficient in manipulating Gaussian distributions using the techniques presented here as this will prove invaluable in understanding the more complex models presented in later chapters.
- D is the dimensionality of x.
These equations are used for “completing the square”:
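The key identity (Eq. 2.71): the exponent of a general Gaussian can be written as

$$
-\tfrac{1}{2}(\mathbf{x}-\boldsymbol\mu)^{\mathrm T}\boldsymbol\Sigma^{-1}(\mathbf{x}-\boldsymbol\mu)
= -\tfrac{1}{2}\mathbf{x}^{\mathrm T}\boldsymbol\Sigma^{-1}\mathbf{x} + \mathbf{x}^{\mathrm T}\boldsymbol\Sigma^{-1}\boldsymbol\mu + \text{const},
$$

so the second-order term identifies the (inverse) covariance and the linear term then identifies the mean.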
¶Conditional Gaussian distributions
Some properties of Gaussian distributions are most naturally expressed in terms of the covariance, whereas others take a simpler form when viewed in terms of the precision. (p.85)
Conversion between covariance and precision matrices.
Consider the functional dependence of (2.70) on x_a, in which x_b is regarded as a constant:
- pick out all of the terms that are second order in x_a
- consider all of the terms that are linear in x_a
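The resulting conditional mean and covariance, expressed in terms of the partitioned precision matrix (Eqs. 2.81–2.82):

$$
\boldsymbol\mu_{a|b} = \boldsymbol\mu_a - \boldsymbol\Lambda_{aa}^{-1}\boldsymbol\Lambda_{ab}(\mathbf{x}_b - \boldsymbol\mu_b),
\qquad
\boldsymbol\Sigma_{a|b} = \boldsymbol\Lambda_{aa}^{-1}.
$$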
¶Marginal Gaussian distributions
For a marginal distribution, the mean and covariance are most simply expressed
in terms of the partitioned covariance matrix, in contrast to the conditional
distribution for which the partitioned precision matrix gives rise to simpler expressions. (p.89)
- Pick out the terms in (2.70) that involve x_b: −½ x_bᵀ Λ_bb x_b + x_bᵀ m, where m = Λ_bb μ_b − Λ_ba (x_a − μ_a).
- Completing the square in x_b:
- Integrating the exponential of this quadratic form over x_b:
- the first term is an unnormalized Gaussian in x_b, so its integral is the reciprocal of the normalization coefficient
- the second term, ½ mᵀ Λ_bb⁻¹ m, does not depend on x_b but carries the dependence on x_a as follows
- Combining this second term with the terms from (2.70) that depend on x_a, then complete the square.
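The result (Sec. 2.3.2): the marginal of a partitioned Gaussian is again Gaussian, with mean and covariance read directly off the partitioned covariance matrix,

$$
p(\mathbf{x}_a) = \mathcal{N}(\mathbf{x}_a \mid \boldsymbol\mu_a,\, \boldsymbol\Sigma_{aa}).
$$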
¶2.3.3 Bayes’ theorem for Gaussian variables
- Given a Gaussian marginal distribution p(x) = N(x | μ, Λ⁻¹)
- and a Gaussian conditional distribution p(y|x) = N(y | Ax + b, L⁻¹), whose mean is a linear function of x,
- where Λ and L are PRECISION matrices,
- the marginal distribution p(y)
- and the conditional distribution p(x|y) are as summarized below.
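For reference, the “Marginal and Conditional Gaussians” results (Eqs. 2.113–2.117):

$$
p(\mathbf{y}) = \mathcal{N}\!\left(\mathbf{y} \mid \mathbf{A}\boldsymbol\mu + \mathbf{b},\; \mathbf{L}^{-1} + \mathbf{A}\boldsymbol\Lambda^{-1}\mathbf{A}^{\mathrm T}\right),
$$

$$
p(\mathbf{x}\mid\mathbf{y}) = \mathcal{N}\!\left(\mathbf{x} \mid \boldsymbol\Sigma\{\mathbf{A}^{\mathrm T}\mathbf{L}(\mathbf{y}-\mathbf{b}) + \boldsymbol\Lambda\boldsymbol\mu\},\; \boldsymbol\Sigma\right),
\qquad
\boldsymbol\Sigma = (\boldsymbol\Lambda + \mathbf{A}^{\mathrm T}\mathbf{L}\mathbf{A})^{-1}.
$$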
¶2.3.4 Maximum likelihood for the Gaussian
Given a data set X = (x_1, …, x_N)ᵀ drawn independently from a multivariate Gaussian distribution.
Maximum likelihood estimates:
Expectation of the maximum likelihood estimate:
Correct the biased estimate in Eq. (2.124) as follows:
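For reference (Eqs. 2.121–2.125):

$$
\boldsymbol\mu_{\text{ML}} = \frac{1}{N}\sum_{n=1}^{N}\mathbf{x}_n,
\qquad
\boldsymbol\Sigma_{\text{ML}} = \frac{1}{N}\sum_{n=1}^{N}(\mathbf{x}_n - \boldsymbol\mu_{\text{ML}})(\mathbf{x}_n - \boldsymbol\mu_{\text{ML}})^{\mathrm T},
$$

$$
\mathbb{E}[\boldsymbol\Sigma_{\text{ML}}] = \frac{N-1}{N}\boldsymbol\Sigma,
\qquad
\widetilde{\boldsymbol\Sigma} = \frac{1}{N-1}\sum_{n=1}^{N}(\mathbf{x}_n - \boldsymbol\mu_{\text{ML}})(\mathbf{x}_n - \boldsymbol\mu_{\text{ML}})^{\mathrm T}.
$$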
(p97) The maximum likelihood framework gave point estimates for the parameters μ and Σ.
¶2.3.5 Sequential estimation
¶2.3.6 Bayesian inference for the Gaussian (p.97)
Did not finish this part of reading.
¶2.3.7 Student’s t-distribution
Student’s t-distribution is obtained by adding up an infinite number of Gaussian distributions having the same mean but different precisions.
This can be interpreted as an infinite mixture of Gaussians.
This gives the t-distribution an important property called robustness, which means that it is much less sensitive than the Gaussian to the presence of a few data points which are outliers.
¶2.3.8 Periodic variables
¶2.3.9 Mixtures of Gaussians
By using a sufficient number of Gaussians and by adjusting their means and covariances as well as the coefficients in the linear combination, almost any continuous density can be approximated to arbitrary accuracy. (Why is it “almost”? A counterexample?)
The likelihood function is now much more complex than with a single Gaussian, due to the presence of the summation over k inside the logarithm.
The maximum likelihood solution for the parameters no longer has a closed-form analytical solution. Alternatively,
- use iterative numerical optimization techniques
- expectation maximization (see Chapter 9).
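For reference, the Gaussian mixture density (Eq. 2.188):

$$
p(\mathbf{x}) = \sum_{k=1}^{K}\pi_k\,\mathcal{N}(\mathbf{x}\mid\boldsymbol\mu_k,\boldsymbol\Sigma_k),
\qquad \sum_{k=1}^{K}\pi_k = 1,\quad 0 \le \pi_k \le 1.
$$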
All the probability distributions above are specific examples of a broad class of distributions called the exponential family.
¶2.4. The Exponential Family
The probability distributions that we have studied so far in this chapter (with the exception of the Gaussian mixture) are specific examples of a broad class of distributions called the exponential family (Duda and Hart, 1973; Bernardo and Smith, 1994).
Members of the exponential family have many important properties in common, and it is illuminating to discuss these properties in some generality.
¶2.5 Nonparametric Methods (p.120)
Throughout this chapter, we have focussed on the use of probability distributions having specific functional forms governed by a small number of parameters whose values are to be determined from a data set. This is called the parametric approach to density modelling.
Two widely used nonparametric techniques for density estimation:
- kernel estimators
- nearest neighbours
Both the K-nearest-neighbour method and the kernel density estimator require the entire training data set to be stored.
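A minimal Gaussian kernel density estimator sketch (Sec. 2.5.1); the bandwidth h and the example data are my own assumptions for illustration:

```python
import numpy as np

def gaussian_kde(x, data, h=0.3):
    """Gaussian kernel density estimate p(x): the average of Gaussian kernels
    of width h centred on each of the N stored training points."""
    data = np.atleast_2d(data)          # shape (N, D)
    x = np.atleast_2d(x)                # shape (M, D)
    N, D = data.shape
    sq_dist = ((x[:, None, :] - data[None, :, :]) ** 2).sum(-1)        # (M, N)
    kernels = np.exp(-sq_dist / (2 * h ** 2)) / (2 * np.pi * h ** 2) ** (D / 2)
    return kernels.mean(axis=1)

# Example: density estimate from 200 one-dimensional samples.
rng = np.random.default_rng(0)
samples = rng.normal(size=(200, 1))
print(gaussian_kde(np.array([[0.0], [2.0]]), samples))
```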
These nonparametric methods are still severely limited.
Simple parametric models are very restricted in terms of the forms of distribution that they can represent.
Subsequent chapters will show how to achieve density models that are very flexible and yet for which the complexity of the models can be controlled independently of the size of the training set.
[TODO: summarize how the following chapters achieve this!]
3 Linear Models for Regression
From a probabilistic perspective, we aim to model the predictive distribution p(t|x), because this expresses our uncertainty about the value of t for each value of x.
From this conditional distribution we can make predictions of t, for any new value of x, in such a way as to minimize the expected value of a suitably chosen loss function.
- So, the ultimate goal of regression is to find p(t|x)!
¶3.1 Linear Basis Function Model
¶3.1.1 Maximum likelihood and least squares
The above two equations specify the assumed observation model: a deterministic function y(x, w) plus zero-mean Gaussian noise.
Now consider a data set of inputs X = {x_1, …, x_N} with corresponding target values t_1, …, t_N.
ASSUMPTION: the data points are drawn independently from the distribution (3.8).
- x is usually omitted from the conditioning for simplicity
Note that in supervised learning problems such as regressions, we are not seeking to model the distribution of the input variables.
Thus we will drop the explicit x from such expressions.
Having written down the likelihood function, we can use maximum likelihood to determine w and β.
(Taking the gradient and setting it to zero gives)
MAX LIKELIHOOD results:
Φ† ≡ (ΦᵀΦ)⁻¹Φᵀ is known as the Moore-Penrose pseudo-inverse of the design matrix Φ.
Precision parameter β:
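For reference (Eqs. 3.15 and 3.21):

$$
\mathbf{w}_{\text{ML}} = (\boldsymbol\Phi^{\mathrm T}\boldsymbol\Phi)^{-1}\boldsymbol\Phi^{\mathrm T}\mathbf{t},
\qquad
\frac{1}{\beta_{\text{ML}}} = \frac{1}{N}\sum_{n=1}^{N}\{t_n - \mathbf{w}_{\text{ML}}^{\mathrm T}\boldsymbol\phi(\mathbf{x}_n)\}^2 .
$$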
¶3.1.2 Geometry of least squares
¶3.1.3 Sequential learning
¶3.1.4 Regularized least squares
sum-of-squares of the weight vector elements:
weight decay in machine learning literature
parameter shrinkage in statistics
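With this regularizer, the total error and its minimizer are (Eqs. 3.27–3.28):

$$
\frac{1}{2}\sum_{n=1}^{N}\{t_n - \mathbf{w}^{\mathrm T}\boldsymbol\phi(\mathbf{x}_n)\}^2 + \frac{\lambda}{2}\mathbf{w}^{\mathrm T}\mathbf{w},
\qquad
\mathbf{w} = (\lambda\mathbf{I} + \boldsymbol\Phi^{\mathrm T}\boldsymbol\Phi)^{-1}\boldsymbol\Phi^{\mathrm T}\mathbf{t}.
$$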
¶3.1.5 Multiple outputs
(p.147) In the above discussion, the form and the number of basis functions are both assumed to be fixed.
¶3.2 The Bias-Variance Decomposition
As we have seen in earlier chapters (FORGOT WHERE…), the phenomenon of over-fitting is really an unfortunate property of maximum likelihood and does not arise when we marginalize over parameters in a Bayesian setting.
¶3.3 Bayesian Linear Regression
We therefore turn to a Bayesian treatment of linear regression, which will
avoid the over-fitting problem of maximum likelihood, and which will also lead to
automatic methods of determining model complexity using the training data alone. (p.152)
CONJUGATE PRIOR distribution of w w.r.t. the LIKELIHOOD in Eq. 3.10 (conjugate means the prior and the posterior have the same functional form):
POSTERIOR distribution: (completing the square, or directly using Eq.2.116) (is Gaussian due to the choice of CONJUGATE prior)
MAXIMUM POSTERIOR WEIGHT: (The posterior distribution is Gaussian, its mode coincides with its mean.)
[Or, set the gradient of the log posterior with respect to w to zero and solve to get the MAP weights.]
If the prior is infinitely broad, S_0 → ∞, then the posterior mean m_N reduces to the maximum likelihood weights w_ML in Eq. 3.15.
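For reference, the posterior (Eqs. 3.49–3.51):

$$
p(\mathbf{w}\mid\mathbf{t}) = \mathcal{N}(\mathbf{w}\mid\mathbf{m}_N, \mathbf{S}_N),
\qquad
\mathbf{m}_N = \mathbf{S}_N(\mathbf{S}_0^{-1}\mathbf{m}_0 + \beta\boldsymbol\Phi^{\mathrm T}\mathbf{t}),
\qquad
\mathbf{S}_N^{-1} = \mathbf{S}_0^{-1} + \beta\boldsymbol\Phi^{\mathrm T}\boldsymbol\Phi.
$$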
(Wikipedia: Maximum a posteriori estimation) In Bayesian statistics, a maximum a posteriori probability (MAP) estimate is an estimate of an unknown quantity, that equals the mode of the posterior distribution. The MAP can be used to obtain a point estimate of an unobserved quantity on the basis of empirical data. It is closely related to the method of maximum likelihood (ML) estimation, but employs an augmented optimization objective which incorporates a prior distribution (that quantifies the additional information available through prior knowledge of a related event) over the quantity one wants to estimate. MAP estimation can therefore be seen as a regularization of ML estimation.
While only mild conditions are required for MAP estimation to be a limiting case of Bayes estimation (under the 0-1 loss function), it is not very representative of Bayesian methods in general. This is because MAP estimates are point estimates, whereas Bayesian methods are characterized by the use of distributions to summarize data and draw inferences: thus, Bayesian methods tend to report the posterior mean or median instead, together with credible intervals.
In the rest of the book, the prior is usually taken to be a zero-mean isotropic Gaussian, p(w|α) = N(w | 0, α⁻¹I) (Eq. 3.52).
LOG POSTERIOR of w:
which is equivalent to adding a quadratic regularization term with coefficient λ = α/β.
Maximization of this posterior distribution with respect to w is therefore equivalent to the minimization of the sum-of-squares error function with the addition of a quadratic regularization term, corresponding to (3.27) with λ = α/β.
Then, we have [similar to how we obtained Eq. 3.49: set the gradient to zero, or use the Gaussian distribution’s properties.]
PREDICTIVE distribution: (input vectors are omitted from the conditioning statements to simplify the notation) (x is the new input value, t is the corresponding prediction)
The first term in (3.59) represents the noise on the data.
The second term reflects the uncertainty associated with the parameters w.
Because the noise process and the distribution of w are independent Gaussians, their variances are additive.
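For reference (Eqs. 3.58–3.59):

$$
p(t\mid\mathbf{t},\alpha,\beta) = \mathcal{N}\!\left(t \mid \mathbf{m}_N^{\mathrm T}\boldsymbol\phi(\mathbf{x}),\; \sigma_N^2(\mathbf{x})\right),
\qquad
\sigma_N^2(\mathbf{x}) = \frac{1}{\beta} + \boldsymbol\phi(\mathbf{x})^{\mathrm T}\mathbf{S}_N\boldsymbol\phi(\mathbf{x}).
$$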
¶3.4 Bayesian Model Comparison
The model evidence is sometimes also called the marginal likelihood because it can be viewed as a likelihood function over the space of models, in which the parameters have been marginalized out. (p.162)
Our uncertainty about which model is correct is expressed through a prior probability distribution p(M_i). Given a training set D, we then wish to evaluate the posterior distribution p(M_i|D) ∝ p(M_i) p(D|M_i).
The model evidence expresses the preference shown by the data for different models.
¶3.5 Evidence Approximation
[The posterior over w can be obtained analytically, but α and β have to be given point estimates by some parameter-estimation method; that is the problem this section addresses.]
In a full Bayesian treatment of the linear basis function model, we introduce prior distributions over the hyperparameters α and β and make predictions by marginalizing with respect to these hyperparameters as well as with respect to the parameters w.
However, although we can integrate analytically over either w or over the hyperparameters, the complete marginalization over all of these variables is analytically intractable. Here we discuss an approximation in which we set the hyperparameters to specific values determined by maximizing the marginal likelihood function obtained by first integrating over the parameters w.
Empirical Bayes (statistics literature)
== type 2 maximum likelihood (statistics)
== generalized maximum likelihood (statistics)
== evidence approximation (machine learning literature)
There are two approaches to the maximization of the log marginal likelihood (log evidence): (p.166)
(These two approaches converge to the same solution.)
- Evaluate the evidence function analytically and then set its derivative equal to zero to obtain re-estimation equations for and (in Section 3.5.2).
- Use a technique called the expectation maximization (EM) algorithm (in Section 9.3.4).
¶3.5.1 Evaluation of the evidence function
MARGINAL LIKELIHOOD function, obtained by integrating over the weight parameters w:
see Eq.(3.8) and Eq.(3.52);
completing the square
Using Eq. (3.54), Eq. (3.84) is seen to be equivalent to Eq. (3.53).
LOG MARGINAL LIKELIHOOD function:
¶3.5.2 Maximizing the evidence function
α is determined purely by looking at the training data.
This is an implicit solution for α, and it can be solved iteratively:
initial guess of α -> m_N from Eq. (3.53) -> λ_i from Eq. (3.87) -> γ from Eq. (3.91) -> re-estimate α from Eq. (3.92).
Similarly for β (Eq. 3.95); a code sketch follows.
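A minimal sketch of this iteration; the polynomial design matrix built with np.vander is my own choice of basis, purely for illustration:

```python
import numpy as np

def evidence_maximization(Phi, t, alpha=1.0, beta=1.0, n_iters=100):
    """Re-estimate alpha and beta by maximizing the evidence (Eqs. 3.91-3.92, 3.95)."""
    N, M = Phi.shape
    PhiT_Phi = Phi.T @ Phi
    for _ in range(n_iters):
        # Posterior over w (Eqs. 3.53-3.54).
        S_N_inv = alpha * np.eye(M) + beta * PhiT_Phi
        m_N = beta * np.linalg.solve(S_N_inv, Phi.T @ t)
        # Eigenvalues of beta * Phi^T Phi (Eq. 3.87) and effective number of
        # well-determined parameters gamma (Eq. 3.91).
        lam = beta * np.linalg.eigvalsh(PhiT_Phi)
        gamma = np.sum(lam / (alpha + lam))
        # Re-estimation equations (Eqs. 3.92 and 3.95).
        alpha = gamma / (m_N @ m_N)
        beta = (N - gamma) / np.sum((t - Phi @ m_N) ** 2)
    return alpha, beta, m_N

# Example with a hypothetical polynomial basis (an assumption for illustration).
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=30)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=30)
Phi = np.vander(x, 6, increasing=True)       # polynomial basis functions
alpha, beta, m_N = evidence_maximization(Phi, t)
print(alpha, beta)
```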
¶3.6 Limitations of Fixed Basis Functions
models comprising linear combinations of fixed, nonlinear basis functions
significant shortcomings:
the number of basis functions needs to grow rapidly with the dimensionality of the input space, often exponentially
SVMs and neural networks both exploit properties of real data sets to alleviate this problem:
- data vectors typically lie close to a nonlinear manifold whose intrinsic dimensionality is smaller
- support vector machines, relevance vector machines, and neural networks all exploit this property
- target variables may have significant dependence on only a small number of possible directions
- neural networks exploit this property
4. Linear Models for Classification (not read yet)
5. Neural Networks (not read yet)
¶5.7 Bayesian Neural Networks (to read)
6. Kernel Methods (not read yet)
For recent textbooks on kernel methods, see Schölkopf and Smola (2002), Herbrich (2002), and Shawe-Taylor and Cristianini (2004).
¶6.1. Dual Representations
The advantage of the dual formulation is that it is expressed entirely in terms of the kernel function k(x, x′). We can therefore work directly in terms of kernels and avoid the explicit introduction of the feature vector φ(x), which allows us implicitly to use feature spaces of high, even infinite, dimensionality.
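For reference, the dual prediction for regularized least squares (Eq. 6.9), where K = ΦΦᵀ is the Gram matrix with elements K_nm = k(x_n, x_m) and k(x) has elements k_n(x) = k(x_n, x):

$$
y(\mathbf{x}) = \mathbf{k}(\mathbf{x})^{\mathrm T}(\mathbf{K} + \lambda\mathbf{I}_N)^{-1}\mathbf{t}.
$$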
¶6.2. Constructing Kernels
¶6.3. Radial Basis Function Networks
¶6.4 Gaussian Processes (to read)
Here we extend the role of kernels to probabilistic discriminative models, leading to the framework of Gaussian processes.
For a linear regression model of the form y(x, w) = wᵀφ(x),
a prior distribution over w induces a corresponding prior distribution over functions y(x, w).
Evaluating the posterior distribution over w gives the corresponding posterior distribution over regression functions.
In the Gaussian process viewpoint, we dispense with the parametric model and instead define a prior probability distribution over functions directly.
Models equivalent to Gaussian processes in different fields:
- kriging in geostatistics literature
- ARMA models
- Kalman filters
- radial basis function networks
¶6.4.1. Linear regression revisited
A linear combination of basis functions, y(x, w) = wᵀφ(x).
Set a prior distribution over w, p(w) = N(w | 0, α⁻¹I); this induces a Gaussian distribution over the vector of function values y, whose covariance is given by the Gram matrix.
¶6.4.2. Gaussian processes for regression
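Section 6.4.2 adds Gaussian observation noise with precision β, giving the predictive mean m(x) = kᵀ C_N⁻¹ t and variance σ²(x) = c − kᵀ C_N⁻¹ k, where C_N has elements k(x_n, x_m) + β⁻¹δ_nm (Eqs. 6.66–6.67). A minimal sketch; the squared-exponential kernel, its length scale, and the toy data are my own assumptions:

```python
import numpy as np

def rbf_kernel(X1, X2, length_scale=0.5):
    """Squared-exponential kernel k(x, x') (an assumed choice of kernel)."""
    sq_dist = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq_dist / length_scale ** 2)

def gp_predict(X_train, t_train, X_test, beta=25.0):
    """GP regression predictive mean and variance (Eqs. 6.66-6.67)."""
    K = rbf_kernel(X_train, X_train)
    C_N = K + np.eye(len(X_train)) / beta          # C_N = K + (1/beta) I
    k = rbf_kernel(X_train, X_test)                # k(x_n, x_new), shape (N, M)
    c = rbf_kernel(X_test, X_test).diagonal() + 1.0 / beta
    C_inv_k = np.linalg.solve(C_N, k)
    mean = C_inv_k.T @ t_train                     # m(x_new) = k^T C_N^{-1} t
    var = c - np.sum(k * C_inv_k, axis=0)          # sigma^2 = c - k^T C_N^{-1} k
    return mean, var

# Example: fit noisy samples of sin(2*pi*x).
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(10, 1))
t = np.sin(2 * np.pi * X[:, 0]) + rng.normal(scale=0.2, size=10)
X_new = np.linspace(0, 1, 5).reshape(-1, 1)
print(gp_predict(X, t, X_new))
```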
¶6.4.3. Learning the hyperparameters
7. Sparse Kernel Machines (not read yet)
¶7.2 Relevance Vector Machines
However, the key difference in the RVM is that we introduce a separate hyperparameter αi for each of the weight parameters wi instead of a single shared hyperparameter. (p.346)
where α_i represents the precision of the corresponding parameter w_i,
and the α_i are determined by maximizing the marginal likelihood function.
The values of and are determined using type-2 maximum likelihood, also known as the evidence approximation, in which we maximize the marginal likelihood function obtained by integrating out the weight parameters. (p.347)
It is worth emphasizing, however, that this mechanism (of automatic relevance determination) for achieving sparsity in probabilistic models through automatic relevance determination is quite general and can be applied to any model expressed as an adaptive linear combination of basis functions.
(The sparsity comes from applying the automatic relevance determination, ARD, mechanism; the resulting model is sparse in a way similar to the SVM.)
PREDICTIVE distribution: (after finding the optimal values α* and β*)
The principal disadvantage of the RVM compared to the SVM is that training involves optimizing a nonconvex function, and training times can be longer than for a comparable SVM. More significantly, in the relevance vector machine the parameters governing complexity and noise variance are determined automatically from a single training run.
8. Graphical Models (to read)
¶8.4. Inference in Graphical Models (to read)
9. Mixture Models and EM (expectation maximization)
¶9.4. The EM Algorithm in General
The expectation maximization algorithm, or EM algorithm, is a general technique for finding maximum likelihood solutions for probabilistic models having latent variables (Dempster et al., 1977; McLachlan and Krishnan, 1997). (p.450)
Denote all of the observed variables by X
and all of the hidden variables by Z (also referred to as latent variables).
The joint distribution p(X, Z|θ) is governed by a set of parameters denoted θ.
The likelihood function is p(X|θ) = Σ_Z p(X, Z|θ),
where we suppose that direct optimization of p(X|θ) is difficult, but the optimization of the complete-data likelihood function p(X, Z|θ) is significantly easier.
Introduce a distribution q(Z) defined over the latent variables; then the following decomposition holds,
where KL(q‖p) is the Kullback-Leibler divergence between q(Z) and the posterior distribution p(Z|X, θ).
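For reference (Eqs. 9.70–9.72):

$$
\ln p(\mathbf{X}\mid\boldsymbol\theta) = \mathcal{L}(q,\boldsymbol\theta) + \mathrm{KL}(q\Vert p),
$$

$$
\mathcal{L}(q,\boldsymbol\theta) = \sum_{\mathbf{Z}} q(\mathbf{Z})\ln\!\left\{\frac{p(\mathbf{X},\mathbf{Z}\mid\boldsymbol\theta)}{q(\mathbf{Z})}\right\},
\qquad
\mathrm{KL}(q\Vert p) = -\sum_{\mathbf{Z}} q(\mathbf{Z})\ln\!\left\{\frac{p(\mathbf{Z}\mid\mathbf{X},\boldsymbol\theta)}{q(\mathbf{Z})}\right\}.
$$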
The EM algorithm is a two-stage iterative optimization technique for finding maximum likelihood solutions. (p.451)
- In the E step, the lower bound L(q, θ^old) is maximized with respect to q(Z) while holding θ^old fixed.
- In the M step, the distribution q(Z) is held fixed and the lower bound L(q, θ) is maximized with respect to θ to give some new value θ^new.
The EM algorithm breaks down the potentially difficult problem of maximizing the likelihood function into two stages, the E step and the M step, each of which will often prove simpler to implement. (p.454)
- The generalized EM, or GEM, algorithm addresses the problem of an intractable M step by seeking only to increase, rather than fully maximize, L(q, θ) in the M step.
- We can similarly generalize the E step of the EM algorithm by performing a partial, rather than complete, optimization of L(q, θ) with respect to q(Z) (Neal and Hinton, 1999).
When I actually need to use EM, I will have to study this part carefully again; my understanding is not deep enough yet, and I have not grasped its essence.
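As a concrete instance of the two steps, a minimal EM sketch for a Gaussian mixture (Sec. 9.2.2); the number of components K, the initialization, and the use of scipy for the Gaussian densities are my own choices:

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K=2, n_iters=50, seed=0):
    """EM for a Gaussian mixture: the E step computes responsibilities,
    the M step re-estimates means, covariances and mixing coefficients."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    pi = np.full(K, 1.0 / K)
    mu = X[rng.choice(N, size=K, replace=False)]    # initialize means at random data points
    Sigma = np.stack([np.eye(D)] * K)
    for _ in range(n_iters):
        # E step: responsibilities gamma(z_nk).
        dens = np.column_stack([
            pi[k] * multivariate_normal.pdf(X, mu[k], Sigma[k]) for k in range(K)
        ])
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M step: re-estimate parameters using the responsibilities.
        Nk = gamma.sum(axis=0)
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]
        pi = Nk / N
    return pi, mu, Sigma
```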
10. Approximate Inference
(p24) More recently, highly efficient deterministic approximation schemes such as variational Bayes and expectation propagation have been developed. These offer a complementary alternative to sampling methods and have allowed Bayesian techniques to be used in large-scale applications.
- Evaluation of the posterior distribution p(Z|X) of the latent variables Z given the observed data X.
- Evaluation of expectations with respect to this posterior distribution.
- The dimensionality of the latent space is too high.
- The posterior distribution is very complex such that expectations are not analytically tractable.
- (Sec. 10) Deterministic (never exact results; some scale well to large applications)
- Laplace approximation in Sec. 4.4
- Variational Bayes (or variational inference): minimize KL(q‖p)
- Expectation propagation: minimize KL(p‖q)
- (Sec. 11) Stochastic (exact results given infinite computational resources; computationally demanding, mainly used for small-scale problems.)
¶10.1. Variational Inference
Entropy is a functional.
The approximation comes from restricting the range of functions over which the optimization of the functional is performed.
Substitute (10.5) into (10.3) and then dissect out the dependence on one of the factors q_j(Z_j),
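For reference, the factorized (mean-field) approximation and the resulting optimal factor (Eqs. 10.5 and 10.9):

$$
q(\mathbf{Z}) = \prod_{i=1}^{M} q_i(\mathbf{Z}_i),
\qquad
\ln q_j^{\star}(\mathbf{Z}_j) = \mathbb{E}_{i\neq j}\bigl[\ln p(\mathbf{X},\mathbf{Z})\bigr] + \text{const}.
$$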
11. Sampling Methods
Approximate inference methods based on numerical sampling are also known as Monte Carlo techniques.
In most situations, the posterior (/pɑ’stɪrɪɚ/) distribution is required primarily for the purpose of evaluating expectations, for example, in order to make predictions.
¶11.1 Basic Sampling Algorithms
¶11.1.1 transform using the inverse cumulative distribution function
¶11.1.2 rejection sampling
¶11.1.3 adaptive rejection sampling
adaptive rejection Metropolis sampling
- exponential decrease of acceptance rate with dimensionality
- play a role as a subroutine in more sophisticated algorithms for sampling in high-dimensional spaces
¶11.1.4 importance sampling
Use a proposal distribution q(z), from which it is easy to draw samples, to calculate the expectation in Eq. (11.1):
where r_l = p(z^(l)) / q(z^(l)) are the importance weights.
When p(z) can only be evaluated up to a normalizing constant, p(z) = p̃(z)/Z_p, and similarly the proposal distribution can be written q(z) = q̃(z)/Z_q, we then have
- likelihood weighted sampling
- self-importance sampling
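A minimal self-normalized importance sampling sketch (Sec. 11.1.4); the standard-normal proposal and the toy unnormalized target are my own choices:

```python
import numpy as np
from scipy.stats import norm

def importance_expectation(f, p_tilde, n_samples=10000, seed=0):
    """Estimate E_p[f(z)] when p(z) is known only up to a constant.
    Uses a standard normal proposal q(z) (an assumed choice)."""
    rng = np.random.default_rng(seed)
    z = rng.normal(size=n_samples)              # samples from the proposal q(z)
    weights = p_tilde(z) / norm.pdf(z)          # unnormalized importance weights
    weights /= weights.sum()                    # self-normalization
    return np.sum(weights * f(z))

# Toy target: unnormalized Gaussian with mean 1 and unit variance.
p_tilde = lambda z: np.exp(-0.5 * (z - 1.0) ** 2)
print(importance_expectation(lambda z: z, p_tilde))   # should be close to 1.0
```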
¶11.1.6 Sampling and the EM algorithm
Monte Carlo EM algorithm: stochastic EM
¶11.2 Markov Chain Monte Carlo
(p24) Monte Carlo methods are very flexible and can be applied to a wide range of models. However, they are computationally intensive and have mainly been used for small-scale problems.
MCMC is a big framework: sample from many classes of distributions; scales well with the space dimension.
A central goal in designing MCMC methods is to avoid random walk behaviour.
basic Metropolis algorithm:
assume the proposal distribution is symmetric, q(z_A | z_B) = q(z_B | z_A);
the candidate sample z* is accepted with probability A(z*, z^(τ)) = min(1, p̃(z*) / p̃(z^(τ)));
at each step τ, if the candidate is accepted, z^(τ+1) = z*, otherwise z^(τ+1) = z^(τ). A code sketch follows.
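A minimal Metropolis sketch; the Gaussian random-walk proposal, its step size, and the toy target are my own choices:

```python
import numpy as np

def metropolis(p_tilde, z0=0.0, n_samples=5000, step=0.5, seed=0):
    """Basic Metropolis algorithm with a symmetric Gaussian proposal."""
    rng = np.random.default_rng(seed)
    samples = np.empty(n_samples)
    z = z0
    for tau in range(n_samples):
        z_star = z + rng.normal(scale=step)         # symmetric proposal q(z*|z)
        accept_prob = min(1.0, p_tilde(z_star) / p_tilde(z))
        if rng.uniform() < accept_prob:             # accept with probability A(z*, z)
            z = z_star                              # otherwise keep the current state
        samples[tau] = z
    return samples

# Toy target: unnormalized Gaussian with mean 2.
samples = metropolis(lambda z: np.exp(-0.5 * (z - 2.0) ** 2))
print(samples.mean())   # should be close to 2.0
```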
¶11.2.1 Markov chains
¶11.2.2 The Metropolis-Hastings algorithm
¶11.3 Gibbs Sampling
Gibbs sampling is an MCMC algorithm; it is a special case of the Metropolis-Hastings algorithm.
¶11.4 Slice Sampling
¶11.5 The Hybrid Monte Carlo Algorithm
- Hamiltonian dynamics
- leapfrog integration
- Metropolis algorithm
What is the relation between this chapter and the Hamiltonian Monte Carlo that appears in much of the literature?
They are the same thing; see MCMC: Hamiltonian Monte Carlo (a.k.a. Hybrid Monte Carlo).
¶11.5.1 Dynamical systems
¶11.5.2 Hybrid Monte Carlo
¶11.6 Estimating the Partition Function
If we write the distribution in terms of an energy function, as shown below,
then the normalization constant Z_E is known as the partition function.
Knowledge of the value of Z_E can be useful for Bayesian model comparison since it represents the model evidence (the probability of the observed data given the model).
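For reference, the form referred to above (Sec. 11.6):

$$
p_E(\mathbf{z}) = \frac{1}{Z_E}\exp\{-E(\mathbf{z})\}.
$$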
(stopped here last time)
12. Continuous Latent Variables
¶12.4. Nonlinear Latent Variable Models (to read)
13. Sequential Data (not read for now)
14. Combining Models (not read for now)
¶14.1. Bayesian Model Averaging
¶14.4. Tree-based Models
(p25) The maximum of a distribution is known as its mode.
(p28) For consistency with the notation in later chapters, we define a precision parameter β corresponding to the inverse variance of the distribution.
(p30) Variables such as α, which control the distribution of model parameters, are called hyperparameters.
(p30) Finding the most probable value of w given the data by maximizing the posterior distribution is called maximum a posteriori, or MAP.
(p120) Parametric approach to density modelling:
the use of probability distributions having specific functional forms governed by a small number of parameters whose values are to be determined from a data set.
(p120) Nonparametric approaches to density estimation that make few assumptions about the form of the distribution.
(p165) empirical Bayes (statistics) == type 2 maximum likelihood (statistics) == generalized maximum likelihood (statistics) == evidence approximation (machine learning)
- complete-data log likelihood, p536
- complete-data parameter posterior, p.537