# Theory: excerpts and annotations

## From prior to posterior predictive

Bayesian inference:

Observation model: $\bm{y}|\bm{f},\phi \sim \prod_{i=1}^n p(y_i|f_i,\phi)$

GP prior: $f(\bm{x})|\theta \sim \mathcal{GP}\left(m(\bm{x}), k(\bm{x},\bm{x}'|\theta)\right)$

hyperprior: $\vartheta \triangleq [\theta,\phi] \sim p(\theta)p(\phi)$

The latent function value $f(\bm{x})$ at fixed $\bm{x}$ is called a latent variable.

Any set of function values $\bm{f} \triangleq [f_1,f_2,\dots]^T$ has a multivariate Gaussian distribution

$p(\bm{f}|\bm{X},\theta) = N(\bm{f}|\bm{0},\bm{K}_{f,f})$

To predict the values $\tilde{\bm{f}}$ at new input locations $\tilde{\bm{X}}$, we use the joint distribution

$\begin{bmatrix} \bm{f}\\ \tilde{\bm{f}} \end{bmatrix}| \bm{X}, \tilde{\bm{X}}, \theta \sim N\left(\bm{0}, \begin{bmatrix} K_{f,f} & K_{f,\tilde{f}}\\ K_{\tilde{f},f} & K_{\tilde{f},\tilde{f}} \end{bmatrix}\right)$

The conditional distribution of $\tilde{\bm{f}}$ given $\bm{f}$ is

$\tilde{\bm{f}} | \bm{f},\bm{X},\tilde{\bm{X}}, \theta \sim N(\bm{K}_{\tilde{f},f}\bm{K}_{f,f}^{-1}\bm{f},\, \bm{K}_{\tilde{f},\tilde{f}}-\bm{K}_{\tilde{f},f}\bm{K}_{f,f}^{-1}\bm{K}_{f,\tilde{f}})$

So the conditional distribution of the latent function $f(\tilde{\bm{x}})$ is also a GP with

• conditional mean function: $\textcolor{green}{\mathbb{E}_{\tilde{\bm{f}}|\bm{f},\theta,\phi}[f(\tilde{\bm{x}})]} = k(\tilde{\bm{x}},\bm{X}|\theta) \bm{K}_{f,f}^{-1} \bm{f}$
• conditional covariance function: $\textcolor{green}{\text{Cov}_{\tilde{\bm{f}}|\bm{f},\theta,\phi}[f(\tilde{\bm{x}})]} = k(\tilde{\bm{x}},\tilde{\bm{x}}'|\theta) - k(\tilde{\bm{x}},\bm{X}|\theta)\bm{K}_{f,f}^{-1}k(\bm{X},\tilde{\bm{x}}'|\theta)$ (not sure whether the notation on the left-hand side is correct)
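
A minimal NumPy sketch of the conditioning step above, assuming a zero-mean GP with a squared exponential kernel on 1-D inputs. The helper names (`sexp_kernel`, `gp_conditional`) and the small jitter term are my own illustrative choices, not GPstuff functions.

```python
import numpy as np

def sexp_kernel(a, b, magn_sigma2=1.0, length_scale=1.0):
    """Squared exponential covariance between 1-D input arrays a and b."""
    d = a[:, None] - b[None, :]
    return magn_sigma2 * np.exp(-0.5 * (d / length_scale) ** 2)

def gp_conditional(x, f, x_new, jitter=1e-9):
    """Mean and covariance of f(x_new) given latent values f at inputs x."""
    K_ff = sexp_kernel(x, x) + jitter * np.eye(len(x))  # jitter for numerical stability
    K_nf = sexp_kernel(x_new, x)
    K_nn = sexp_kernel(x_new, x_new)
    mean = K_nf @ np.linalg.solve(K_ff, f)
    cov = K_nn - K_nf @ np.linalg.solve(K_ff, K_nf.T)
    return mean, cov

x = np.array([-1.0, 0.0, 1.5])      # training inputs (made-up values)
f = np.array([0.3, -0.2, 0.8])      # latent values at x (made-up values)
mean, cov = gp_conditional(x, f, np.array([0.0, 0.7]))
```

Conditioning at an input that is already in $\bm{X}$ (here 0.0) should return the corresponding latent value with near-zero variance, which is a quick sanity check on the two formulas.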

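The first inference step is to form the conditional posterior of the latent variables $\bm{f}$ given the parameters $\vartheta$ (with $\mathcal{D}\triangleq\{\bm{X},\bm{y}\}$). (Assume for now that it has been obtained; it depends on the choice or design of the observation model. How to compute it is discussed later; in practice, everything except the classic GP needs approximate methods.)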

$p(\bm{f}|\mathcal{D},\theta,\phi) = \frac{ \overbrace{p(\bm{y}|\bm{f},\phi)}^\text{observation model} \overbrace{p(\bm{f}|\bm{X},\theta)}^\text{GP prior} }{ \int p(\bm{y}|\bm{f},\phi) p(\bm{f}|\bm{X},\theta) d\bm{f} \textcolor{green}{\triangleq p(\bm{y}|\bm{X},\vartheta)}} \tag{8}$

After this, we can marginalize over the parameters $\vartheta$ to obtain the marginal posterior distribution for the latent variables $\bm{f}$

$p(\bm{f}|\mathcal{D}) = \int \overbrace{p(\bm{f}|\mathcal{D},\theta,\phi)}^\text{see above} \overbrace{p(\theta,\phi|\mathcal{D})}^\text{parameter posterior} d\theta d\phi$

The conditional posterior predictive distribution $p(\tilde{f}|\mathcal{D},\vartheta,\tilde{\bm{x}})$ can be evaluated exactly or approximated. (Again, how to compute it is discussed later; assume for now that it has been obtained.)

$\color{red} \begin{aligned} p(\tilde{f}|\mathcal{D},\vartheta,\tilde{\bm{x}}) &= \int p(\tilde{f},\bm{f}|\mathcal{D},\vartheta,\tilde{\bm{x}}) \, d\bm{f} \\ &= \int \overbrace{ p(\tilde{f}|\bm{f},\bm{X},\vartheta,\tilde{\bm{x}}) }^\text{from the GP prior} \cdot \overbrace{ p(\bm{f}|\mathcal{D},\vartheta) }^\text{from Bayes' theorem} \, d\bm{f} \end{aligned}$

(I was not sure about this at first. The second line holds because, given $\bm{f}$ and $\vartheta$, $\tilde{f}$ is conditionally independent of $\bm{y}$: the observations depend on the latent function only through $\bm{f}$.)

The posterior predictive distribution $p(\tilde{f}|\mathcal{D},\tilde{\bm{x}})$ is obtained by marginalizing out the parameters $\vartheta$ from $p(\tilde{f}|\mathcal{D},\vartheta,\tilde{\bm{x}})$.

The posterior joint predictive distribution $p(\tilde{\bm{y}}|\mathcal{D},\theta,\phi,\tilde{\bm{x}})$ requires integration over $p(\tilde{f}|\mathcal{D},\vartheta,\tilde{\bm{x}})$. (Usually not used.)

The marginal predictive distribution for an individual $\tilde{y}_i$ is

$p(\tilde{y}_i|\mathcal{D},\tilde{\bm{x}}_i,\theta,\phi) = \int p(\tilde{y}_i|\tilde{f}_i,\phi) p(\tilde{f}_i|\mathcal{D},\tilde{\bm{x}}_i,\theta,\phi) \, d\tilde{f}_i \tag{10}$
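
A small numerical check of Eq. (10), assuming a Gaussian observation model and assuming the latent predictive $p(\tilde{f}_i|\mathcal{D},\tilde{\bm{x}}_i,\theta,\phi)$ is Gaussian with mean `m` and variance `v`. In that special case the integral has the closed form $N(\tilde{y}_i|m, v+\sigma^2)$, so a simple trapezoidal rule can be compared against it. The function names and the numbers are made up for illustration.

```python
import math

def normal_pdf(x, mean, var):
    """Density of N(mean, var) evaluated at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def predictive_density(y_new, m, v, sigma2, n_grid=4001, width=10.0):
    """Trapezoidal approximation of Eq. (10) for a Gaussian observation model.

    m and v are the mean and variance of the latent predictive distribution.
    """
    sd = math.sqrt(v)
    lo, hi = m - width * sd, m + width * sd
    h = (hi - lo) / (n_grid - 1)
    total = 0.0
    for i in range(n_grid):
        f = lo + i * h
        w = 0.5 if i in (0, n_grid - 1) else 1.0  # trapezoid end-point weights
        total += w * normal_pdf(y_new, f, sigma2) * normal_pdf(f, m, v)
    return total * h

m, v, sigma2 = 0.4, 0.25, 0.1
numeric = predictive_density(0.9, m, v, sigma2)
closed_form = normal_pdf(0.9, m, v + sigma2)
```

For non-Gaussian observation models (for example Student-t) there is no such closed form, which is why a one-dimensional numerical integral of this kind is needed per test point.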

If the parameters are considered fixed, then by the GP's marginalization and conditioning properties (everything stays Gaussian) we can evaluate the posterior predictive mean $m_p(\tilde{f}|\mathcal{D},\theta,\phi)$ from the conditional mean $\color{green}\mathbb{E}_{\tilde{\bm{f}}|\bm{f},\theta,\phi}[\tilde{\bm{f}}]$ (where $\tilde{\bm{f}} \triangleq f(\tilde{\bm{x}})$; derived above) by marginalizing out the latent variables $\bm{f}$:

$m_p(\tilde{f}|\mathcal{D},\theta,\phi) = \int \textcolor{green}{\mathbb{E}_{\tilde{\bm{f}}|\bm{f},\theta,\phi}[\tilde{\bm{f}}]}\, p(\bm{f}|\mathcal{D},\theta,\phi) d\bm{f} \xlongequal{\text{\color{red}substitute and keep only the f-dependent terms}} k(\tilde{\bm{x}},\bm{X}|\theta) \bm{K}_{f,f}^{-1} \mathbb{E}_{\bm{f}|\mathcal{D},\theta,\phi}[\bm{f}] \tag{11}$

The posterior predictive covariance between any set of latent variables $\tilde{\bm{f}}$ is (this step uses the law of total covariance; see Wikipedia: Law of total covariance)

$\begin{aligned} \text{Cov}_{\tilde{\bm{f}}|\mathcal{D},\theta,\phi} [\tilde{\bm{f}}] &= \mathbb{E}_{\bm{f}|\mathcal{D},\theta,\phi}\left[ \textcolor{green}{\text{Cov}_{\tilde{\bm{f}}|\bm{f},\theta,\phi}[\tilde{\bm{f}}]} \right] + \text{Cov}_{\bm{f}|\mathcal{D},\theta,\phi}\left[ \textcolor{green}{\mathbb{E}_{\tilde{\bm{f}}|\bm{f},\theta,\phi}[\tilde{\bm{f}}]} \right] \\ &= \overbrace{\textcolor{green}{\text{Cov}_{\tilde{\bm{f}}|\bm{f},\theta,\phi}[\tilde{\bm{f}}]}}^{\text{independent of } \bm{f}} + \underbrace{\text{Cov}_{\bm{f}|\mathcal{D},\theta,\phi}\left[ k(\tilde{\bm{x}},\bm{X}|\theta) \bm{K}_{f,f}^{-1} \bm{f} \right]}_{k(\tilde{\bm{x}},\bm{X})\bm{K}_{f,f}^{-1}\text{Cov}_{\bm{f}|\mathcal{D},\theta,\phi}[\bm{f}]\bm{K}_{f,f}^{-1}k(\bm{X},\tilde{\bm{x}}')} \end{aligned}$

Then, the posterior predictive covariance function $k_p(\tilde{\bm{x}},\tilde{\bm{x}}'|\mathcal{D},\theta,\phi)$ is

$k_p(\tilde{\bm{x}},\tilde{\bm{x}}'|\mathcal{D},\theta,\phi) = k(\tilde{\bm{x}},\tilde{\bm{x}}'|\theta) - k(\tilde{\bm{x}},\bm{X}|\theta)\left( \bm{K}_{f,f}^{-1}-\bm{K}_{f,f}^{-1}\text{Cov}_{\bm{f}|\mathcal{D},\theta,\phi}[\bm{f}]\bm{K}_{f,f}^{-1} \right) k(\bm{X},\tilde{\bm{x}}'|\theta) \tag{13}$
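
Spelling out the intermediate step (my own regrouping, using only quantities defined above): the first term of the total covariance is the GP prior conditional covariance, the second is the covariance of the conditional mean, and factoring out $k(\tilde{\bm{x}},\bm{X}|\theta)$ on the left and $k(\bm{X},\tilde{\bm{x}}'|\theta)$ on the right gives Eq. (13).

$\begin{aligned} k_p(\tilde{\bm{x}},\tilde{\bm{x}}'|\mathcal{D},\theta,\phi) &= \underbrace{k(\tilde{\bm{x}},\tilde{\bm{x}}'|\theta) - k(\tilde{\bm{x}},\bm{X}|\theta)\bm{K}_{f,f}^{-1}k(\bm{X},\tilde{\bm{x}}'|\theta)}_{\text{conditional covariance, independent of } \bm{f}} + \underbrace{k(\tilde{\bm{x}},\bm{X}|\theta)\bm{K}_{f,f}^{-1}\text{Cov}_{\bm{f}|\mathcal{D},\theta,\phi}[\bm{f}]\bm{K}_{f,f}^{-1}k(\bm{X},\tilde{\bm{x}}'|\theta)}_{\text{covariance of the conditional mean}} \\ &= k(\tilde{\bm{x}},\tilde{\bm{x}}'|\theta) - k(\tilde{\bm{x}},\bm{X}|\theta)\left( \bm{K}_{f,f}^{-1} - \bm{K}_{f,f}^{-1}\text{Cov}_{\bm{f}|\mathcal{D},\theta,\phi}[\bm{f}]\bm{K}_{f,f}^{-1} \right)k(\bm{X},\tilde{\bm{x}}'|\theta) \end{aligned}$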

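So, even if the exact posterior $p(\tilde{f}|\mathcal{D},\theta,\phi)$ is not available in closed form, its mean and covariance functions can still be approximated from Eqs. (11) and (13), given approximations of $\mathbb{E}_{\bm{f}|\mathcal{D},\theta,\phi}[\bm{f}]$ and $\text{Cov}_{\bm{f}|\mathcal{D},\theta,\phi}[\bm{f}]$.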

## From latents $\bm{f}$ to observations $\bm{y}$

Gaussian observation model: $y_i \sim N(f_i,\sigma^2)$

$p(y_i|f_i,\theta,\overbrace{\phi}^{\text{includes } \sigma}) = \dots$

Marginal likelihood $p(\bm{y}|\bm{X},\theta,\sigma^2)$ is

$p(\bm{y}|\bm{X},\theta,\sigma^2) = N(\bm{y}|\bm{0},\bm{K}_{f,f}+\sigma^2\bm{I})$
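
A short NumPy sketch of evaluating the log of this marginal likelihood through a Cholesky factor, which avoids forming an explicit inverse. The function name, the squared exponential kernel with unit hyperparameters, and the toy data are assumptions for illustration only; this is not how GPstuff organizes the computation.

```python
import numpy as np

def log_marginal_likelihood(K_ff, y, sigma2):
    """log N(y | 0, K_ff + sigma2 * I), computed through a Cholesky factor."""
    n = len(y)
    C = K_ff + sigma2 * np.eye(n)
    L = np.linalg.cholesky(C)                   # C = L @ L.T
    alpha = np.linalg.solve(L, y)               # alpha @ alpha = y^T C^{-1} y
    log_det = 2.0 * np.sum(np.log(np.diag(L)))  # log|C| from the diagonal of L
    return -0.5 * (alpha @ alpha) - 0.5 * log_det - 0.5 * n * np.log(2.0 * np.pi)

x = np.linspace(0.0, 3.0, 6)
K_ff = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2)  # squared exponential, unit hyperparameters
y = np.sin(x)
lml = log_marginal_likelihood(K_ff, y, sigma2=0.1)
```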

The conditional posterior of the latent variables $\bm{f}$ now has an analytical solution (it should follow from completing the square; Bishop's book or the GPML book should have the details):

$\bm{f}|\mathcal{D},\theta,\phi \sim N( \bm{K}_{f,f}(\bm{K}_{f,f}+\sigma^2\bm{I})^{-1}\bm{y},\quad \bm{K}_{f,f}-\bm{K}_{f,f}(\bm{K}_{f,f}+\sigma^2\bm{I})^{-1}\bm{K}_{f,f} ) \tag{15}$

Since the conditional posterior of $\bm{f}$ is Gaussian, the posterior process is still a GP, whose mean and covariance functions are obtained from Eqs. (11) and (13).

$\tilde{f}|\mathcal{D},\theta,\phi \sim \mathcal{GP}(m_p(\tilde{\bm{x}}),\quad k_p(\tilde{\bm{x}},\tilde{\bm{x}}')) \tag{16}$
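
As a numerical check of how Eqs. (11), (13) and (15) fit together, the sketch below plugs the Gaussian posterior of $\bm{f}$ into the general mean and covariance formulas and compares the result with the textbook GP regression equations $k(\tilde{\bm{x}},\bm{X})(\bm{K}_{f,f}+\sigma^2\bm{I})^{-1}\bm{y}$ and $k(\tilde{\bm{x}},\tilde{\bm{x}}') - k(\tilde{\bm{x}},\bm{X})(\bm{K}_{f,f}+\sigma^2\bm{I})^{-1}k(\bm{X},\tilde{\bm{x}}')$. The kernel, data and variable names are made up, and explicit inverses are used only for readability.

```python
import numpy as np

def sexp(a, b):
    """Squared exponential covariance with unit magnitude and length scale."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2)

x = np.array([-2.0, -0.5, 0.3, 1.7])
y = np.array([0.1, -0.4, 0.6, 1.2])
x_new = np.array([-1.0, 1.0])
sigma2 = 0.2

K_ff = sexp(x, x)
K_nf = sexp(x_new, x)
K_nn = sexp(x_new, x_new)
C_inv = np.linalg.inv(K_ff + sigma2 * np.eye(len(x)))
K_inv = np.linalg.inv(K_ff)

# Eq. (15): conditional posterior of the latent values f.
E_f = K_ff @ C_inv @ y
Cov_f = K_ff - K_ff @ C_inv @ K_ff

# Eqs. (11) and (13): push the posterior of f through the GP prior conditional.
m_p = K_nf @ K_inv @ E_f
k_p = K_nn - K_nf @ (K_inv - K_inv @ Cov_f @ K_inv) @ K_nf.T

# Textbook GP regression predictive equations, for comparison.
m_ref = K_nf @ C_inv @ y
k_ref = K_nn - K_nf @ C_inv @ K_nf.T
```

The two routes agree because $\bm{K}_{f,f}^{-1} - \bm{K}_{f,f}^{-1}\text{Cov}[\bm{f}]\bm{K}_{f,f}^{-1}$ collapses to $(\bm{K}_{f,f}+\sigma^2\bm{I})^{-1}$ when $\text{Cov}[\bm{f}]$ is taken from Eq. (15).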

# Digging into Demos

## demo_inputdependentnoise.m

All 'type','FULL'.

lik_inputdependentnoise + gpcf_sexp + gpcf_exp (prior_t for lengthScale_prior and magnSigma2_prior) + 'latent_method', 'Laplace' + gp_optim

lik_t + gpcf_sexp + 'latent_method', 'Laplace' + gp_optim

1D and 2D data

(line 241) If flat priors are used, gp.latent_opt.maxiter may need to be increased for the Laplace algorithm to converge properly: gp.latent_opt.maxiter=1e6; (?? This may be the reason my own runs fail.)

## demo_regression_robust.m

All 'type','FULL'.

lik_gaussian (prior_logunif) + gpcf_sexp (prior_t and prior_sqrtunif) + gp_optim

lik_t (prior_loglogunif, prior_logunif) + gpcf_sexp + 'latent_method', 'EP' + gp_optim

lik_t (prior_logunif) + gpcf_sexp + 'latent_method', 'MCMC' + gp_mc

# Studying the code implementation

Note! If the prior is ‘prior_fixed’ then the parameter in question is considered fixed and it is not handled in optimization, grid integration, MCMC etc.

## Setting up the GP structure: gp_set

type - Type of Gaussian process

• ‘FULL’ full GP (default)
• ‘FIC’ fully independent conditional sparse approximation (requires inducing inputs X_u)
• ‘PIC’ partially independent conditional sparse approximation
• ‘CS+FIC’ compact support + FIC model sparse approximation
• ‘DTC’ deterministic training conditional sparse approximation
• ‘SOR’ subset of regressors sparse approximation
• ‘VAR’ variational sparse approximation

infer_params - String defining which parameters are inferred. The default is covariance+likelihood.

• ‘covariance’ = infer parameters of the covariance functions
• ‘likelihood’ = infer parameters of the likelihood
• ‘inducing’ = infer inducing inputs (in sparse approximations): W = gp.X_u(:)
• ‘covariance+likelihood’ = infer covariance function and likelihood parameters (what exactly is the difference? Not clear to me)
• ‘covariance+inducing’ = infer covariance function parameters and inducing inputs
• ‘covariance+likelihood+inducing’

The additional fields when the likelihood is not Gaussian (lik is not lik_gaussian or lik_gaussiansmt) are:

### latent_method and latent_opt

latent_method - Method for marginalizing over latent values. Possible methods are ‘Laplace’ (default), ‘EP’ and ‘MCMC’. (What does this mean? Computing the predictive distribution from the likelihood requires marginalizing over $f^*$, i.e. integrating over the latent values. See GPstuff Doc Eq. (10).)
latent_opt - Additional option structure for the chosen latent method. See default values for options below.

• ‘MCMC’
  • method - Function handle to the function that samples the latent values: @esls (default), @scaled_mh or @scaled_hmc
  • f - 1xn vector of latent values. The default is [].
• ‘Laplace’
  • optim_method - Method to find the posterior mode: ‘newton’ (default except for lik_t), ‘stabilized-newton’, ‘fminuc_large’, or ‘lik_specific’ (applicable and default for lik_t)
  • tol
• ‘EP’
• ‘robust-EP’

The additional fields needed in sparse approximations are:

X_u - Inducing inputs, no default, has to be set when FIC, PIC, PIC_BLOCK, VAR, DTC, or SOR is used.

Xu_prior - Prior for inducing inputs. The default is prior_unif.

## gp_mc

• hmc_opt - Options structure for HMC sampler (see hmc2_opt). When this is given the covariance function and likelihood parameters are sampled with hmc2 (respecting infer_params option).

• sls_opt - Options structure for slice sampler (see sls_opt). When this is given the covariance function and likelihood parameters are sampled with sls (respecting infer_params option).

• latent_opt - Options structure for latent variable sampler. When this is given the latent variables are sampled with function stored in the gp.fh.mc field in the GP structure. See gp_set. (The 'latent_method','MCMC','latent_opt',struct('method',@scaled_mh) set in gp_set is NOT the same as the latent_opt here!! In this example, the latent_opt here actually sets the options of scaled_mh. Easy to confuse!)

• lik_hmc_opt - Options structure for HMC sampler (see hmc2_opt). When this is given the parameters of the likelihood are sampled with hmc2. This can be used to have different hmc options for covariance and likelihood parameters.

• lik_sls_opt - Options structure for slice sampler (see sls_opt). When this is given the parameters of the likelihood are sampled with sls. This can be used to have different sls options for covariance and likelihood parameters.

• lik_gibbs_opt - Options structure for Gibbs sampler. Some likelihood function parameters need to be sampled with Gibbs sampling (such as lik_smt). The Gibbs sampler is implemented in the respective lik_* file.

## *_pak and *_unpak

• *_pak: Combine * parameters into one vector.
• *_unpak: Extract * parameters from the vector.

For lik_*_pak and lik_*_unpak, this is a mandatory subfunction used for example in energy and gradient computations (calculated by gp_eg through calling gp_e and gp_g).
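
A toy Python sketch of the pak/unpak idea, assuming positive parameters are stored on the log scale so that an optimizer or sampler can work on an unconstrained vector. The dictionary layout and the function names `pak` and `unpak` are hypothetical; the real GPstuff subfunctions also handle priors and fixed parameters, which this sketch ignores.

```python
import math

def pak(params):
    """Pack positive hyperparameters into one unconstrained vector (log scale)."""
    return [math.log(params["magnSigma2"])] + [math.log(l) for l in params["lengthScale"]]

def unpak(params, w):
    """Write the vector w back into a copy of params, undoing the log transform."""
    n = len(params["lengthScale"])
    out = dict(params)
    out["magnSigma2"] = math.exp(w[0])
    out["lengthScale"] = [math.exp(v) for v in w[1:1 + n]]
    return out, w[1 + n:]  # leftover entries belong to the next component

params = {"magnSigma2": 0.1, "lengthScale": [1.0, 2.5]}
w = pak(params)
restored, rest = unpak(params, w + [0.7])  # 0.7 stands in for another component's parameter
```

Returning the unused tail of the vector lets several components (covariance functions, likelihood) consume one long parameter vector in turn.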

# Likelihood functions I will use

## lik_gaussian Create a Gaussian likelihood structure

• sigma2 - variance [0.1]
• sigma2_prior - prior for sigma2 [prior_logunif] (this amounts to assuming by default that $\log(\sigma^2)$ is uniform, so it is not actually uniform on $[0,+\infty)$?)
• n - number of observations per input (See using average observations below) (do not use this parameter; it is meant for averaging sigma2.)
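
On the question about prior_logunif above: if $\log(\sigma^2)$ is uniform, then by the change of variables the density on the original scale is proportional to $1/\sigma^2$, so it is indeed not uniform in $\sigma^2$; instead every decade gets equal prior mass. A small sketch, assuming the prior is truncated to a finite support so that it can be normalized (the truncation and the function names are my assumptions, made only to get concrete numbers):

```python
import math

def logunif_mass(lo, hi, support_lo, support_hi):
    """Probability of sigma2 in [lo, hi] when log(sigma2) is uniform on the support."""
    return (math.log(hi) - math.log(lo)) / (math.log(support_hi) - math.log(support_lo))

def logunif_density(sigma2, support_lo, support_hi):
    """Density on the original scale: proportional to 1 / sigma2."""
    return 1.0 / (sigma2 * (math.log(support_hi) - math.log(support_lo)))

# Equal mass per decade on the support [1e-3, 1e3].
small = logunif_mass(1e-3, 1e-2, 1e-3, 1e3)
large = logunif_mass(1e1, 1e2, 1e-3, 1e3)

# The density itself is far from flat in sigma2.
ratio = logunif_density(0.01, 1e-3, 1e3) / logunif_density(1.0, 1e-3, 1e3)
```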

## lik_t Create a Student-t likelihood structure

Parameters for Student-t likelihood [default]

• sigma2 - scale squared [1]
• nu - degrees of freedom [4] (the degrees of freedom are usually kept fixed)
• sigma2_prior - prior for sigma2 [prior_logunif] (why logunif?)
• nu_prior - prior for nu [prior_fixed]

Can be inferred by:

• Laplace approximation (need lik_t_ll, lik_t_llg, lik_t_llg2, lik_t_llg3)
• MCMC (need lik_t_ll, lik_t_llg)
• EP (need lik_t_llg2, lik_t_tiltedMoments, lik_t_siteDeriv)
• robust-EP (need lik_t_tiltedMoments2, lik_t_siteDeriv2)

## lik_gaussiansmt Create a Gaussian scale mixture likelihood structure with priors producing approximation of the Student’s t

The parameters of this likelihood can be inferred only by Gibbs sampling by calling GP_MC.

# Covariance functions I will use

## gpcf_sexp Create a squared exponential (exponentiated quadratic) covariance function

• magnSigma2 - magnitude (squared) [0.1]
• lengthScale - length scale for each input. [1] This can be either a scalar, corresponding to an isotropic function, or a vector defining a separate length scale for each input direction. (A different length scale per input gives automatic selection of inputs.)
• magnSigma2_prior - prior for magnSigma2 [prior_logunif] (why logunif? To keep it positive?)
• lengthScale_prior - prior for lengthScale [prior_t] (why prior_t? No need to keep it positive?)
• metric - metric structure used by the covariance function [] (I do not understand this)
• selectedVariables - vector defining which inputs are used [all] selectedVariables is short hand for using metric_euclidean with corresponding components
• kalman_deg - Degree of approximation in type ‘KALMAN’ [6] (I do not understand this)

### Subfunctions

gpcf_sexp_lp: Evaluate the log prior of covariance function parameters, returns $\log p(\theta)$

# Priors I will use

## prior_t Student-t prior structure

Parameters for Student-t prior [default]

• mu - location [0]
• s2 - scale [1]
• nu - degrees of freedom [4]
• mu_prior - prior for mu [prior_fixed] (surprisingly, this is fixed; why? Is that reasonable?)
• s2_prior - prior for s2 [prior_fixed] (surprisingly, this is fixed; why? Is that reasonable?)
• nu_prior - prior for nu [prior_fixed] (surprisingly, this is fixed; why? Is that reasonable?)

prior_t_pak applies a log transform to s2 and nu.

• /Volumes/ExternalDisk/git-collections/gpstuff/gp/demo_hierprior.m pl=prior_t('mu_prior',prior_t); (not read yet)

# Other hidden functions

## gp_eg calls gp_e, gp_g

• GP_EG: Evaluate the energy function (un-normalized negative marginal log posterior) and its gradient
• GP_E: Evaluate the energy function (un-normalized negative log marginal posterior)
• GP_G: Evaluate the gradient of energy (GP_E) for Gaussian Process

The energy is the negative log posterior, used as the cost function:

$E = EDATA + EPRIOR = - \log p(\bm{Y}|\bm{X},\theta) - \log p(\theta)$

where $\theta$ represents the parameters (lengthScale, magnSigma2…), $\bm{X}$ is inputs and $\bm{Y}$ is observations (regression) or latent values (non-Gaussian likelihood).
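
A rough NumPy sketch of such an energy function for the Gaussian-likelihood case, with the parameters optimized on the log scale. The standard normal prior on the log-parameters is a hypothetical placeholder (not a GPstuff default), and the finite-difference gradient only stands in for the analytic gradient that gp_g computes.

```python
import numpy as np

def energy(w, x, y):
    """E(w) = -log p(y | X, theta) - log p(theta), with w = log(theta).

    theta = (lengthScale, magnSigma2, sigma2). The prior is a hypothetical
    standard normal on each entry of w, chosen only to keep the sketch short.
    """
    length_scale, magn_sigma2, sigma2 = np.exp(w)
    d = x[:, None] - x[None, :]
    C = magn_sigma2 * np.exp(-0.5 * (d / length_scale) ** 2) + sigma2 * np.eye(len(x))
    _, log_det = np.linalg.slogdet(C)
    e_data = 0.5 * y @ np.linalg.solve(C, y) + 0.5 * log_det + 0.5 * len(x) * np.log(2 * np.pi)
    e_prior = 0.5 * np.sum(w ** 2) + 0.5 * len(w) * np.log(2 * np.pi)
    return e_data + e_prior

def energy_gradient(w, x, y, eps=1e-6):
    """Central finite-difference gradient, standing in for an analytic gradient."""
    grad = np.zeros_like(w)
    for i in range(len(w)):
        step = np.zeros_like(w)
        step[i] = eps
        grad[i] = (energy(w + step, x, y) - energy(w - step, x, y)) / (2 * eps)
    return grad

x = np.linspace(0.0, 4.0, 8)
y = np.sin(x)
w0 = np.log(np.array([1.0, 1.0, 0.1]))
e0, g0 = energy(w0, x, y), energy_gradient(w0, x, y)
```

Minimizing this energy over `w` gives the MAP estimate of the parameters, which is the role gp_optim plays with gp_eg as the objective.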

# Conclusions and problems from the experiments so far

• Do not use lik_gaussiansmt (not sure why).
• lik_inputdependentnoise() only supports 'type','FULL'; see the switch at gpla_e.m line 162, which has no corresponding branch for FIC.
• 'latent_method','MCMC' cannot be combined with gp_optim (100% sure):
  • With FIC, line 556 in gp_g (called from gp_eg) only computes the gradient w.r.t. the Gaussian likelihood function parameters; the non-Gaussian + MCMC case is not handled.
  • PIC, line 755: same, not handled.
  • CS+FIC, line 996: same, not handled.
  • DTC, VAR, SOR, line 1179: same, not handled.
  • KALMAN: not sure.
• derivative observations have not been implemented for sparse GPs !!! (see gp_trcov.m line 54)
• gp_set/latent_method sets the algorithm used to sample the latent variables. (see gp_mc line 341)
• When using lik_t + gp_mc, latent_opt and lik_hmc_opt must be set explicitly; otherwise gp_mc may end up not sampling one of the two. (Not sure.) (Also not sure about other likelihoods.)

## Lessons learned

• This error is usually raised because gp_g does not handle the corresponding gradient; take care to choose a matching latent_method.

## Experiment log

1. FULL + lik_gaussian + gpcf-sexp + MAP: succeeded
2. FULL + lik_gaussian + gpcf-sexp + MCMC: succeeded
3. FIC + lik_gaussian + gpcf-sexp + MAP: succeeded
4. FIC + lik_gaussian + gpcf-sexp + MCMC: succeeded
5. FULL + lik_t + jitter-1e-3 + ARD + Laplace + train-only-around-15-steps: looks perfect, the 1-, 2-, 3-SD coverage of the output is always 100% (why is the overfitting so severe???)
6. FULL + lik_t + jitter-1e-3 + Laplace + train-only-9-steps: the jitter needs to be reduced; it works, but not well.
7. FIC-25 + lik_t + jitter-1e-3 + ARD + Laplace + train-only-around-15-steps:
8. FULL + lik_t + jitter-1e-6 + ARD + Laplace + train-200 + sample-100-no-thinning: very good result, PML=1.29%, QML123=100%, about 1056 seconds
9. FIC-25 + covariance+likelihood+inducing + lik_t + jitter-1e-6 + ARD + Laplace + train-200 + sample-100-thin-60-1: very good result, PML=3.97%, QML123=100%, takes about 4 hours to compute

1. FIC-25 + lik_t + MCMC + gp_mc + GPz-init: the optimized lengthScale values all come out the same
2. FULL + lik_t + Laplace + gpcf-sexp + MAP
3. FULL + lik_t + Laplace + gpcf-sexp + MCMC
4. FULL + lik_t + EP + gpcf-sexp + MAP
5. FULL + lik_t + EP + gpcf-sexp + MCMC
6. FULL + lik_t + MCMC + gpcf-sexp + MAP
7. FULL + lik_t + MCMC + gpcf-sexp + MAP
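
For the 1-, 2-, 3-SD coverage numbers recorded in the log above, a small sketch of how such a coverage could be computed and what a well-calibrated Gaussian predictive distribution would give (about 68.3%, 95.4% and 99.7%). I am assuming coverage means the fraction of targets falling inside the predictive mean plus or minus n standard deviations; the function names and data are made up. Under that reading, 100% coverage already at 1 SD may point to predictive variances that are too large rather than to overfitting.

```python
import math

def sd_coverage(y_true, pred_mean, pred_var, n_sd):
    """Fraction of targets inside pred_mean +/- n_sd * sqrt(pred_var)."""
    inside = sum(
        abs(y - m) <= n_sd * math.sqrt(v)
        for y, m, v in zip(y_true, pred_mean, pred_var)
    )
    return inside / len(y_true)

def nominal_coverage(n_sd):
    """Coverage expected from a well-calibrated Gaussian predictive distribution."""
    return math.erf(n_sd / math.sqrt(2.0))

y_true = [0.0, 1.0, 2.0, 3.0]
pred_mean = [0.1, 1.0, 2.5, 2.0]
pred_var = [0.04, 0.04, 0.04, 0.04]   # predictive SD of 0.2 everywhere

observed = [sd_coverage(y_true, pred_mean, pred_var, k) for k in (1, 2, 3)]
expected = [nominal_coverage(k) for k in (1, 2, 3)]
```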