Reading notes: Heteroscedastic Gaussian processes for uncertain and incomplete data

almosallam_heteroscedastic_2017

Heteroscedastic Gaussian processes for uncertain and incomplete data

Ibrahim Almosallam

PhD Thesis, University of Oxford,
https://ora.ox.ac.uk/objects/uuid:6a3b600d-5759-456a-b785-5f89cf4ede6d

Some useful identities:

$$p(x) = \int p(x,y)\,{\rm d}y = \int p(x|y)\,p(y)\,{\rm d}y \tag{A.6}$$

$$\mathcal{N}(\bm{x}|\bm{a},\bm{A})\,\mathcal{N}(\bm{x}|\bm{b},\bm{B}) = \mathcal{N}(\bm{a}|\bm{b},\bm{A}+\bm{B})\,\mathcal{N}(\bm{x}|\bm{c},\bm{C}) \tag{A.14}$$

$$\bm{c}=\bm{C}(\bm{A}^{-1}\bm{a}+\bm{B}^{-1}\bm{b}),\quad \bm{C}=(\bm{A}^{-1}+\bm{B}^{-1})^{-1}$$
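A quick numerical sanity check of (A.14), comparing both sides at a random point; the diagonal covariances and random numbers are made up for illustration:

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn

rng = np.random.default_rng(0)
d = 3
a, b, x = rng.normal(size=(3, d))
A = np.diag(rng.uniform(0.5, 2.0, d))   # any SPD matrices would do
B = np.diag(rng.uniform(0.5, 2.0, d))

# c and C from the stated formulas
C = np.linalg.inv(np.linalg.inv(A) + np.linalg.inv(B))
c = C @ (np.linalg.inv(A) @ a + np.linalg.inv(B) @ b)

lhs = mvn.pdf(x, a, A) * mvn.pdf(x, b, B)          # N(x|a,A) N(x|b,B)
rhs = mvn.pdf(a, b, A + B) * mvn.pdf(x, c, C)      # N(a|b,A+B) N(x|c,C)
print(np.allclose(lhs, rhs))                       # True
```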

The conditional of a partitioned Gaussian $\mathcal{N}\!\left(\left[\begin{matrix}\bm{x}_1\\\bm{x}_2\end{matrix}\right]\middle|\left[\begin{matrix}\bm{a}\\\bm{b}\end{matrix}\right],\left[\begin{matrix}\bm{A}&\bm{C}\\\bm{C}^T&\bm{B}\end{matrix}\right]\right)$ (used below for Eq. 2.10):

$$p(\bm{x}_1|\bm{x}_2) = \mathcal{N}\!\left(\bm{x}_1\,\middle|\,\bm{a}+\bm{C}\bm{B}^{-1}(\bm{x}_2-\bm{b}),\ \bm{A}-\bm{C}\bm{B}^{-1}\bm{C}^T\right) \tag{A.11}$$

2 Gaussian Processes for Regression

2.6 Summary

We have shown in this chapter that basis function models are sparse Gaussian processes of type SoR, but with a different prior on the latent variables.

The prior on the latent variables differs between a BFM and SoR; everything else is the same.

A main distinction between the way ANNs are traditionally optimised and how BFMs are trained is that the former performs regularised maximum likelihood estimation, or maximum a posteriori (MAP) estimation, while the latter maximises the marginal likelihood. Maximising the marginal likelihood is less prone to overfitting, due to the additional cost term that minimises the variance on the weight vector w, which we have shown is related to the sparsity constraint in sparse auto-encoders. In the next chapter, we show how we can use this view of sparse GPs to extend their modelling capabilities.

?? Are maximum likelihood estimation and MAP the same concept? (To verify.) They are not identical in general: MLE maximises $p(\bm{y}|\bm{\theta})$ alone, while MAP maximises $p(\bm{\theta}|\bm{y}) \propto p(\bm{y}|\bm{\theta})p(\bm{\theta})$; regularised ML with an L2 penalty coincides with MAP under a Gaussian prior.

So the fundamental reason a BFM is less prone to overfitting is that it maximises the MARGINAL LIKELIHOOD.

That is also what gives it some connection to ANNs and (sparse) auto-encoders.

2.1 Full Gaussian Processes

The dataset is $\mathcal{D}=\{\bm{X},\bm{y}\}$, consisting of inputs $\bm{X}=\{\bm{x}_i\}_{i=1}^n \in \mathbb{R}^{d\times n}$ and target outputs $\bm{y}=\{y_i\}_{i=1}^n \in \mathbb{R}^n$.

$d$ is the dimensionality of the input $\bm{x}_i$

$n$ is the number of data points

ASSUMPTION: each observed target $y_i$ is generated by a function of $\bm{x}_i$ plus additive noise $\varepsilon_i$:

$$y_i = f(\bm{x}_i) + \varepsilon_i \tag{2.1}$$

$$\varepsilon_i \sim \mathcal{N}(0,\sigma^2),\quad \forall i\in\{1,\dots,n\}$$

LIKELIHOOD: the probability of observing the targets,

$$p(\bm{y} | \bm{f}_\bm{x},\sigma^2) = \mathcal{N}(\bm{y} | \bm{f}_\bm{x}, \sigma^2 I_n) \tag{2.2}$$

$$\bm{f}_\bm{x} = [f(\bm{x}_1),\cdots,f(\bm{x}_n)]^T$$

POSTERIOR (?): distribution of $\bm{f}_\bm{x}$ given the observations,

$$p(\bm{f}_\bm{x} | \bm{y},\bm{X},\sigma^2) = \frac{p(\bm{y}|\bm{f}_\bm{x},\sigma^2) \cdot p(\bm{f}_\bm{x}|\bm{X})} {p(\bm{y}|\bm{X},\sigma^2)} \tag{2.3}$$

PRIOR: over the space of functions,

$$p(\bm{f}_\bm{x}|\bm{X}) = \mathcal{N}(0,\bm{K})$$

$\bm{K} = \kappa(\bm{X},\bm{X})$, symmetric and positive semi-definite

$\kappa(\bm{x}_i,\bm{x}_j)$ is the covariance (kernel) function; any Mercer kernel is a valid choice.

MARGINAL LIKELIHOOD, computed from Eq.(A.6) and (A.14):

$$p(\bm{y}|\bm{X},\bm{\theta}) = \int p(\bm{y}|\bm{f}_\bm{x},\bm{\theta}) \cdot p(\bm{f}_\bm{x}|\bm{X},\bm{\theta}) \, {\rm d} \bm{f}_\bm{x} = \mathcal{N}(\bm{y}|0,\bm{K}+\sigma^2 I_n) \tag{2.7}$$

$\bm{\theta} = \{\bm{\varphi},\sigma^2\}$, hyper-parameters of the GP model

$\bm{\varphi}$, parameters of the kernel

$\sigma^2$, the noise variance

ASSUMPTION: the joint distribution is a multivariate Gaussian,

$$ p(\bm{y},\bm{f}_*|\bm{X},\bm{X}_*,\bm{\theta}) = \mathcal{N} \left( \left.\left[\begin{matrix}\bm{y}\\\bm{f}_*\end{matrix}\right]\right| \bm{0}, \left[\begin{matrix}\bm{K}_{\bm{xx}}+\sigma^2 I_n & \bm{K}_{\bm{x}*}\\\bm{K}_{*\bm{x}} & \bm{K}_{**}\end{matrix}\right] \right) \tag{2.9}$$

CONDITIONAL PROBABILITY of the function values $\bm{f}_*$, computed from Eq. (A.11),

$$p(\bm{f}_*|\bm{X}_*,\mathcal{D},\bm{\theta}) = \mathcal{N} (\bm{f}_* | \mathbb{E}[\bm{f}_*],\mathbb{V}[\bm{f}_*]) \tag{2.10}$$

$$\mathbb{E}[\bm{f}_*] = \bm{K}_{*\bm{x}}(\bm{K}_{\bm{xx}}+\sigma^2 I_n)^{-1}\bm{y}$$

$$\mathbb{V}[\bm{f}_*] = \bm{K}_{**}-\bm{K}_{*\bm{x}}(\bm{K}_{\bm{xx}}+\sigma^2 I_n)^{-1}\bm{K}_{\bm{x}*}$$

PREDICTIVE DISTRIBUTION of future observations $\bm{y}_*$ given the dataset $\mathcal{D}$, computed from Eq. (A.4) and (A.16),

$$p(\bm{y}_*|\bm{X}_*,\mathcal{D},\bm{\theta}) = \int p(\bm{y}_*|\bm{f}_*,\bm{\theta}) \cdot p(\bm{f}_*|\bm{X}_*,\mathcal{D},\bm{\theta}) \, {\rm d}\bm{f}_* = \mathcal{N}(\bm{y}_* | \mathbb{E}[\bm{y}_*],\mathbb{V}[\bm{y}_*]) \tag{2.13}$$

$$\mathbb{E}[\bm{y}_*] = \mathbb{E}[\bm{f}_*]$$

$$\mathbb{V}[\bm{y}_*] = \mathbb{V}[\bm{f}_*] + \sigma^2 I_{n_*}$$

The hyperparameters of the GP model are optimized by maximizing the log-marginal likelihood, Eq.(2.7):

$$\ln p(\bm{y}|\bm{X},\bm{\theta}) = -\frac{1}{2} \bm{y}^T(\bm{K}+\sigma^2 I_n)^{-1}\bm{y} - \frac{1}{2} \ln |\bm{K}+\sigma^2 I_n| - \frac{n}{2}\ln(2\pi) \tag{2.17}$$

The description and notation are almost the same as in the GPML book (Rasmussen and Nickisch, 2010).
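A minimal numpy sketch of Eqs. (2.10), (2.13) and (2.17) on toy 1-D data; the RBF kernel, the Cholesky-based solves and the toy data are my own choices for illustration, not from the thesis:

```python
import numpy as np

def rbf(A, B, ell=1.0, sf2=1.0):
    """Squared-exponential kernel k(a,b) = sf2 * exp(-|a-b|^2 / (2 ell^2))."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return sf2 * np.exp(-0.5 * d2 / ell**2)

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, (50, 1)); y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=50)
Xs = np.linspace(-3, 3, 5)[:, None]
sigma2 = 0.1**2

K = rbf(X, X); Ks = rbf(Xs, X); Kss = rbf(Xs, Xs)
Kn = K + sigma2 * np.eye(len(X))
L = np.linalg.cholesky(Kn)                          # for numerically stable solves
alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))

mean = Ks @ alpha                                   # Eq. (2.10): E[f*]
V = np.linalg.solve(L, Ks.T)
var_f = Kss - V.T @ V                               # Eq. (2.10): V[f*]
var_y = var_f + sigma2 * np.eye(len(Xs))            # Eq. (2.13)

# Eq. (2.17): log marginal likelihood
logml = -0.5 * y @ alpha - np.log(np.diag(L)).sum() - 0.5 * len(X) * np.log(2 * np.pi)
print(mean.round(2), np.diag(var_y).round(3), logml.round(2))
```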

2.2 Numerical Approximation

Low Rank Approximation

$$\Sigma_p = \sigma^{-2}K_{px}K_{xp} + K_{pp} \tag{2.23}$$

Conjugate Gradient Method

2.3 Sparse Gaussian Processes

2.3.1 Subset of Regressors (SoR)

INDUCING / LATENT variables $\bm{f}_p$: $p(\bm{f}_p)=\mathcal{N}(\bm{f}_p|0,K_{pp})$.

JOINT distribution:

$$p(\bm{f}_x,\bm{f}_*) = \int p(\bm{f}_x,\bm{f}_*|\bm{f}_p)\, p(\bm{f}_p) \, {\rm d}\bm{f}_p \tag{2.24}$$

ASSUMPTION: the conditional distributions of $\bm{f}_x$ and $\bm{f}_*$ are independent given $\bm{f}_p$.

$$p(\bm{f}_x,\bm{f}_*) \approx q(\bm{f}_x,\bm{f}_*) = \int q(\bm{f}_x|\bm{f}_p)\, q(\bm{f}_*|\bm{f}_p)\, p(\bm{f}_p) \, {\rm d}\bm{f}_p \tag{2.25}$$

The exact expressions for the training conditional and the test conditional can be derived from a noise-free GP.

Under the ASSUMPTION that both the training and the test conditionals are deterministic:

$$q(\bm{f}_x|\bm{f}_p)=\mathcal{N}(K_{xp}K_{pp}^{-1}\bm{f}_p,0) \tag{2.28}$$

$$q(\bm{f}_*|\bm{f}_p)=\mathcal{N}(K_{*p}K_{pp}^{-1}\bm{f}_p,0) \tag{2.29}$$

Effective prior:

$$q(\bm{f}_x,\bm{f}_*) = \mathcal{N}\!\left(\left[\begin{matrix}\bm{f}_x\\\bm{f}_*\end{matrix}\right]\middle|\,0,\left[\begin{matrix}Q_{xx} & Q_{x*}\\ Q_{*x} & Q_{**}\end{matrix}\right]\right) \tag{2.30}$$

where $Q_{ab}=K_{ap}K_{pp}^{-1}K_{pb}$ is the equivalent covariance function.
(?? Presumably this step comes from evaluating Eq. 2.25 by completing the square.)

$$\mathbb{E}(\bm{f}_*) = \sigma^{-2}K_{*p}\Sigma_p^{-1}K_{px}\bm{y} \tag{2.34}$$

$$\mathbb{V}(\bm{f}_*) = K_{*p}\Sigma_p^{-1}K_{p*} \tag{2.35}$$

This is referred to as the subset of regressors (SoR) sparse Gaussian process
(Candela and Rasmussen, 2005), which is identical to the low rank approximation
method except that the set of inducing points are now optimised as part of the hyperparameter
set. Thus, the low rank approximation method is a sparse GP where both
the training and test conditionals are deterministic and the set of inducing points are
held fixed. (p.20)
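A sketch of the SoR predictive equations (2.23), (2.34), (2.35); the RBF kernel and the fixed grid of inducing inputs are assumptions for illustration:

```python
import numpy as np

def rbf(A, B, ell=1.0):
    return np.exp(-0.5 * ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1) / ell**2)

rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, (200, 1)); y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)
P = np.linspace(-3, 3, 15)[:, None]          # inducing inputs (held fixed here)
Xs = np.linspace(-3, 3, 5)[:, None]
sigma2 = 0.1**2

Kpx = rbf(P, X); Kpp = rbf(P, P); Ksp = rbf(Xs, P)
Sigma_p = Kpx @ Kpx.T / sigma2 + Kpp         # Eq. (2.23)
mean = Ksp @ np.linalg.solve(Sigma_p, Kpx @ y) / sigma2   # Eq. (2.34)
var = Ksp @ np.linalg.solve(Sigma_p, Ksp.T)               # Eq. (2.35)
print(mean.round(2), np.diag(var).round(4))
```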

2.3.2 FITC (Fully Independent Training Conditional) / SPGP (Sparse Pseudo-input Gaussian Processes)

ASSUMPTION on the training and test conditionals (recall Eq. 2.28 and 2.29):

$$q(\bm{f}_x|\bm{f}_p) = \mathcal{N}(K_{xp}K_{pp}^{-1}\bm{f}_p,\text{diag}[K_{xx}-K_{xp}K_{pp}^{-1}K_{px}]) \tag{2.36}$$

$$q(\bm{f}_*|\bm{f}_p) = p(\bm{f}_*|\bm{f}_p) \tag{2.37}$$

$$p(\bm{y}|\bm{f}_x) \approx q(\bm{y}|\bm{f}_x) = \mathcal{N}(\bm{f}_x,\Lambda) \tag{2.38}$$

where $\Lambda = \text{diag}[K_{xx}-Q_{xx}]+\sigma^2I_n$.
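For comparison, a small sketch of the FITC effective noise covariance $\Lambda$ from Eq. (2.38), again with an assumed RBF kernel:

```python
import numpy as np

def rbf(A, B, ell=1.0):
    return np.exp(-0.5 * ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1) / ell**2)

rng = np.random.default_rng(3)
X = rng.uniform(-3, 3, (200, 1))
P = np.linspace(-3, 3, 15)[:, None]
sigma2 = 0.1**2

Kxx_diag = np.ones(len(X))                       # k(x,x) = 1 for this RBF kernel
Kxp = rbf(X, P); Kpp = rbf(P, P)
# diag(Q_xx) with Q_xx = K_xp K_pp^{-1} K_px
Qxx_diag = np.einsum('ij,ij->i', Kxp, np.linalg.solve(Kpp, Kxp.T).T)
Lambda = np.diag(Kxx_diag - Qxx_diag) + sigma2 * np.eye(len(X))   # Eq. (2.38)
print(Lambda[:3, :3].round(4))
```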

2.4 Basis Function Models

2020-10-02, reorganising the line of reasoning:
Define the BFM model
–> obtain the likelihood of y under the BFM, $p(\bm{y}|X,\bm{\theta},\bm{w})$ (note that it is conditioned on $\bm{w}$)
–> define the prior on $\bm{w}$ in the BFM, $p(\bm{w}|\alpha)$ (the GP prior itself was already defined for the full GP)
–> compute the evidence (marginal likelihood) $p(\bm{y}|X,\bm{\theta})$: integrate $\bm{w}$ out of $p(\bm{f}_x|X,\bm{\theta},\bm{w})$, then integrate $\bm{f}_x$ out of $p(\bm{y}|\bm{f}_x,X,\bm{\theta})$
–> use Bayes' theorem to compute the posterior of $\bm{w}$, $p(\bm{w}|\bm{y},X,\bm{\theta},\alpha)=\frac{p(\bm{y}|\bm{w},X,\bm{\theta},\alpha)\, p(\bm{w}|\alpha)}{p(\bm{y}|X,\bm{\theta},\alpha)}$ (every term on the right is now available, so the posterior can be evaluated)
–> Bayes' theorem can also be used the other way round to compute the evidence $p(\bm{y}|X,\bm{\theta})$, i.e. by factorising $p(\bm{y},\bm{w}|X,\bm{\theta},\alpha)$
–> this yields a simpler form of the marginal likelihood function

BFM: a linear combination of $m$ non-linear basis functions plus additive noise

$$y_i = \bm{\phi}(\bm{x}_i)^T\bm{w} + \varepsilon_i$$

$$\bm{\phi}(\bm{x}_i) = [\phi_1(\bm{x}_i),\cdots,\phi_m(\bm{x}_i)]^T \in \mathbb{R}^m, \quad m\ll n$$

$$\varepsilon_i \sim \mathcal{N}(0,\sigma^2)$$
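For concreteness, one possible choice of basis functions; the RBF form and the grid of centres are assumptions here (the thesis treats the basis functions and their parameters generically as part of $\bm{\theta}$):

```python
import numpy as np

def phi(x, P, ell=1.0):
    """m RBF basis functions centred at the rows of P, evaluated at a single input x."""
    return np.exp(-0.5 * ((P - x) ** 2).sum(axis=1) / ell**2)   # shape (m,)

rng = np.random.default_rng(9)
P = np.linspace(-3, 3, 10)[:, None]        # m = 10 basis-function centres
w = rng.normal(size=10)
x = np.array([0.5])
y = phi(x, P) @ w + 0.1 * rng.normal()     # y_i = phi(x_i)^T w + eps_i
print(round(float(y), 3))
```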

LIKELIHOOD: (presumably obtained by completing the square)

$$p(\bm{y}|\bm{X},\bm{\theta},\bm{w}) = \int p(\bm{y}|\bm{f}_\bm{x},\bm{\theta}) \cdot p(\bm{f}_\bm{x}|\bm{X},\bm{\theta},\bm{w}) \, {\rm d}\bm{f}_\bm{x}$$

$$= \int \mathcal{N}(\bm{y}|\bm{f}_\bm{x},\sigma^2 I_n) \cdot \mathcal{N}(\bm{f}_\bm{x}|\bm{\Phi}_\bm{x}^T\bm{w},0) \, {\rm d}\bm{f}_\bm{x}$$

$$= \mathcal{N}(\bm{y}|\bm{\Phi}_\bm{x}^T\bm{w},\sigma^2 I_n) \tag{2.47}$$

$$\bm{\Phi}_\bm{x} = \left[ \begin{matrix} \phi_1(\bm{x}_1) & \cdots & \phi_1(\bm{x}_n) \\ \vdots & \ddots & \vdots \\ \phi_m(\bm{x}_1) & \cdots & \phi_m(\bm{x}_n) \end{matrix} \right]$$

PRIOR: (assumed to be smooth)

$$p(\bm{w}|\alpha) = \mathcal{N}(\bm{w}|0,\bm{A}^{-1})$$

$$\bm{A}=\alpha I_m$$

$$p(\bm{f}_x|X,\bm{\theta}) = \int p(\bm{f}_x|X,\bm{\theta},\bm{w})\, p(\bm{w}|\alpha) \, {\rm d}\bm{w} \tag{2.51, modified}$$

$$= \mathcal{N}(\bm{f}_x|0,\Phi_x^TA^{-1}\Phi_x) \tag{2.54, modified}$$

MARGINAL LIKELIHOOD (from Wiki: In the context of Bayesian statistics, it may also be referred to as the evidence or model evidence.):

$$p(\bm{y}|X,\bm{\theta}) = \int p(\bm{y}|\bm{f}_x,\bm{\theta})\,p(\bm{f}_x|X,\bm{\theta}) \, {\rm d}\bm{f}_x \tag{2.55}$$

$$= \mathcal{N}(\bm{y}|0,\Phi_x^TA^{-1}\Phi_x+\sigma^2I_n) \tag{2.56}$$

[This marginal likelihood function should be maximized in order to get all hyperparameters.]

[An alternative way to derive the marginal likelihood expression follows. I did not fully follow it; it feels like it might be circular.]

[The idea is to simplify Eq. 2.56 directly by substituting a particular $\bm{w}$.]

POSTERIOR distribution of $\bm{w}$: (using $\beta=\sigma^{-2}$)

$$p(\bm{w}|\bm{y},X,\bm{\theta}) = \frac{p(\bm{y}|X,\bm{\theta},\bm{w})\,p(\bm{w}|\alpha)}{p(\bm{y}|X,\bm{\theta})} = \mathcal{N}(\bm{w}|\bm{\bar{w}},\Sigma_w^{-1}) \tag{2.57/58}$$

$$\bm{\bar{w}} = \beta\Sigma_w^{-1}\Phi_x\bm{y} \tag{2.59}$$

$$\Sigma_w = \beta\Phi_x\Phi_x^T+A \tag{2.60}$$
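A sketch of the posterior over $\bm{w}$, Eqs. (2.59)-(2.60); the RBF feature map and its centres are assumed for illustration:

```python
import numpy as np

def features(X, P, ell=1.0):
    """Phi_x: m x n matrix of RBF basis functions centred at P (an assumption)."""
    return np.exp(-0.5 * ((P[:, None, :] - X[None, :, :]) ** 2).sum(-1) / ell**2)

rng = np.random.default_rng(4)
X = rng.uniform(-3, 3, (200, 1)); y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)
P = np.linspace(-3, 3, 10)[:, None]
alpha, beta = 1.0, 1.0 / 0.1**2
m = len(P)

Phi = features(X, P)                               # m x n
Sigma_w = beta * Phi @ Phi.T + alpha * np.eye(m)   # Eq. (2.60), a precision matrix
w_bar = beta * np.linalg.solve(Sigma_w, Phi @ y)   # Eq. (2.59)
print(w_bar.round(2))
```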

MARGINAL LIKELIHOOD:

$$p(\bm{y}|X,\bm{\theta}) = \frac{p(\bm{y}|X,\bm{\theta},\bm{w})\,p(\bm{w}|\alpha)}{p(\bm{w}|\bm{y},X,\bm{\theta})} \tag{2.61}$$

$$= \frac{\mathcal{N}(\bm{y}|\Phi_x^T\bm{w},\beta^{-1}I_n)\,\mathcal{N}(\bm{w}|0,A^{-1})}{\mathcal{N}(\bm{w}|\bm{\bar{w}},\Sigma_w^{-1})} \tag{2.62}$$

Substituting $\bm{w} = \bm{\bar{w}}$ [this is effectively the $\bm{w}$ that maximises the posterior; see the discussion around Eq. 3.49 in bishop-pattern-2006. Is this related to $\bm{w}_\text{MAP}$? To be investigated further.] [Eq. 2.62 holds for any $\bm{w}$, so a particular value is substituted purely to simplify it], we get

$$p(\bm{y}|X,\bm{\theta}) = \mathcal{N}(\bm{y}|\Phi_x^T\bar{\bm{w}},\beta^{-1}I_n)\, \mathcal{N}(\bm{\bar{w}}|0,A^{-1})\,(2\pi)^{m/2}|\Sigma_w|^{-1/2} \tag{2.63}$$


Alternatively, the value of Eq. 2.62 can be computed directly, without substituting $\bar{\bm{w}}$:

$$\text{Exponential terms of Eq. 2.62} = \exp\left[ -\frac{1}{2} \left[ (\bm{y}-\Phi_x^T\bm{w})^T\beta I_n(\bm{y}-\Phi_x^T\bm{w}) + \bm{w}^TA\bm{w} - (\bm{w}-\bar{\bm w})^T\Sigma_w(\bm{w}-\bar{\bm w}) \right] \right]$$

$$-2 \cdot \text{(inside of exp)} = \beta \bm{y}^T\bm{y} - 2\beta\bm{w}^T\Phi_x\bm{y} + \beta\bm{w}^T\Phi_x\Phi_x^T\bm{w}+\bm{w}^TA\bm{w} - \bm{w}^T\Sigma_w\bm{w}+ 2\bm{w}^T\Sigma_w\bar{\bm{w}} - \bar{\bm{w}}^T\Sigma_w\bar{\bm{w}} \tag{expand}$$

$$= \beta \bm{y}^T\bm{y} - 2\beta\bm{w}^T\Phi_x\bm{y} + \bm{w}^T(\beta\Phi_x\Phi_x^T+A)\bm{w} - \bm{w}^T\Sigma_w\bm{w}+ 2\bm{w}^T\Sigma_w\bar{\bm{w}} - \bar{\bm{w}}^T\Sigma_w\bar{\bm{w}} \tag{collect.cancel}$$

$$= \beta \bm{y}^T\bm{y} - 2\beta\bm{w}^T\Phi_x\bm{y} + 2\bm{w}^T\Sigma_w(\beta\Sigma_w^{-1}\Phi_x\bm{y}) - \bar{\bm{w}}^T\Sigma_w\bar{\bm{w}} \tag{collect.cancel}$$

$$= \beta \bm{y}^T\bm{y} - \bar{\bm{w}}^T(\beta\Phi_x\Phi_x^T+A)\bar{\bm{w}}$$

$$= (\bm{y}-\Phi_x^T\bm{\bar{w}})^T\beta I_n(\bm{y}-\Phi_x^T\bm{\bar{w}}) + 2\beta\bar{\bm{w}}^T\Phi_x\bm{y} - 2\beta\bar{\bm{w}}^T\Phi_x\Phi_x^T\bar{\bm{w}} - \bar{\bm{w}}^TA\bar{\bm{w}} \tag{complete.square}$$

$$= (\bm{y}-\Phi_x^T\bm{\bar{w}})^T\beta I_n(\bm{y}-\Phi_x^T\bm{\bar{w}}) + 2\beta\bar{\bm{w}}^T\Phi_x\bm{y} - 2\bar{\bm{w}}^T\Sigma_w\beta\Sigma_w^{-1}\Phi_x\bm{y} + \bar{\bm{w}}^TA\bar{\bm{w}} \tag{substitute.cancel}$$

$$= (\bm{y}-\Phi_x^T\bm{\bar{w}})^T\beta I_n(\bm{y}-\Phi_x^T\bm{\bar{w}}) + \bar{\bm{w}}^TA\bar{\bm{w}} \tag{another.square}$$

(Note: the "complete.square" step needs the factor $\beta$ on the $-2\bar{\bm{w}}^T\Phi_x\Phi_x^T\bar{\bm{w}}$ term, which was missing in my earlier transcription.)

[Question: substituting $\bar{\bm{w}}$ directly gives the same result as the direct calculation above; what is the mathematical justification for the substitution? (To look up.) After discussing with 超哥: it is because Eq. 2.62 holds for any $\bm{w}$, so a particular value can be substituted purely to simplify the expression.]


LOG MARGINAL LIKELIHOOD:

$$\ln p(\bm{y}|X,\bm{\theta}) = -\frac{\beta}{2}\|\Phi_x^T\bm{\bar{w}}-\bm{y}\|^2 + \frac{n}{2}\ln\beta -\frac{n}{2}\ln2\pi - \frac{\alpha}{2}\|\bm{\bar{w}}\|^2 + \frac{m}{2}\ln\alpha - \frac{1}{2}\ln|\Sigma_w| \tag{2.64}$$

[My corrections to Eq. 2.64: $\Phi_x\bar{\bm{w}}\rightarrow\Phi_x^T\bar{\bm{w}}$, and the penalty term should be $\|\bar{\bm{w}}\|^2$ rather than $\|\bar{\bm{w}}\|$.]


Verifying that Eq. 2.64 is equivalent to the log of Eq. 2.56

r.h.s of Eq.2.64:

$$-2\cdot \text{term1} -2\cdot \text{term4} = \beta (\Phi_x^T\bar{\bm{w}}-\bm{y})^T(\Phi_x^T\bar{\bm{w}}-\bm{y}) + \alpha \bar{\bm{w}}^T\bar{\bm{w}}$$

$$= \beta\bm{y}^T\bm{y} + \beta( \bar{\bm{w}}^T\Phi_x\Phi_x^T\bar{\bm{w}} - 2 \bar{\bm{w}}^T\Phi_x\bm{y} ) + \alpha\bar{\bm{w}}^T\bar{\bm{w}} \tag{expand}$$

$$= \beta\bm{y}^T\bm{y} + \beta \bm{y}^T[\beta^2\Phi_x^T\Sigma_w^{-T}\Phi_x\Phi_x^T\Sigma_w^{-1}\Phi_x]\bm{y} - 2\beta \bm{y}^T[\beta\Phi_x^T\Sigma_w^{-T}\Phi_x]\bm{y}+ \alpha \bm{y}^T[\beta^2\Phi_x^T\Sigma_w^{-T}\Sigma_w^{-1}\Phi_x]\bm{y} \tag{expand}$$

$$= \beta\bm{y}^T\bm{y} + \beta^2\bm{y}^T [\beta\Phi_x^T\Sigma_w^{-T}\Phi_x\Phi_x^T\Sigma_w^{-1}\Phi_x-2\Phi_x^T\Sigma_w^{-T}\Phi_x+\alpha\Phi_x^T\Sigma_w^{-T}\Sigma_w^{-1}\Phi_x] \bm{y} \tag{collect}$$

$$= \beta\bm{y}^T\bm{y} + \beta^2\bm{y}^T \left[ \Phi_x^T\Sigma_w^{-T}( \beta\Phi_x\Phi_x^T+ A )\Sigma_w^{-1}\Phi_x -2\Phi_x^T\Sigma_w^{-T}\Phi_x \right] \bm{y} \tag{collect.inside}$$

$$= \beta\bm{y}^T\bm{y} + \beta^2\bm{y}^T \left[ \Phi_x^T\Sigma_w^{-T}\Sigma_w\Sigma_w^{-1}\Phi_x -2\Phi_x^T\Sigma_w^{-T}\Phi_x \right] \bm{y} \tag{cancel.inverse}$$

$$= \beta\bm{y}^T\bm{y} - \beta^2\bm{y}^T ( \Phi_x^T\Sigma_w^{-1}\Phi_x ) \bm{y} \tag{proof}$$

ln of Eq.2.56

$$p(\bm{y}|X,\bm{\theta})=\frac{1}{(2\pi)^{n/2}} \frac{1}{|\Phi_x^TA^{-1}\Phi_x+\sigma^2I_n|^{1/2}} \exp \left[ -\frac{1}{2} \bm{y}^T (\Phi_x^TA^{-1}\Phi_x+\sigma^2I_n)^{-1}\bm{y} \right] \tag{2.56.expansion}$$

$$\ln p(\bm{y}|X,\bm{\theta}) = -\frac{n}{2}\ln2\pi - \frac{1}{2}\ln|\Phi_x^{T}A^{-1}\Phi_x+\sigma^2I_n| -\frac{1}{2} \bm{y}^T (\Phi_x^TA^{-1}\Phi_x+\sigma^2I_n)^{-1}\bm{y} \tag{2.56.log}$$

r.h.s of Eq.2.56.log

$$-2\cdot \text{term2} = \ln|\sigma^2I_n| + \ln|A^{-1}| +\ln |A+\Phi_x\sigma^{-2}I_n\Phi_x^T|$$

$$= -n\ln\beta - m\ln\alpha + \ln|\Sigma_w|$$

$$-2\cdot \text{term3} = \bm{y}^T (\sigma^{-2}I_n - \sigma^{-2}I_n\Phi_x^T(A+\Phi_x\sigma^{-2}I_n\Phi_x^T)^{-1}\Phi_x\sigma^{-2}I_n) \bm{y}$$

$$= \bm{y}^T (\beta I_n - \beta^2 \Phi_x^T(A+\beta\Phi_x\Phi_x^T)^{-1}\Phi_x) \bm{y}$$

$$= \beta\bm{y}^T\bm{y} - \beta^2\bm{y}^T\Phi_x^T\Sigma_w^{-1}\Phi_x\bm{y} \tag{same.to.proof}$$

Other terms can be easily matched. So, they are equivalent.
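The equivalence can also be checked numerically. The sketch below evaluates Eq. (2.64) (with the $\|\bar{\bm{w}}\|^2$ term) and the log of Eq. (2.56) on random data and compares them; the random $\Phi_x$ is just a stand-in for a real feature matrix:

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn

rng = np.random.default_rng(5)
n, m = 40, 7
Phi = rng.normal(size=(m, n))                    # stand-in for Phi_x (m x n)
y = rng.normal(size=n)
alpha, beta = 2.0, 4.0

Sigma_w = beta * Phi @ Phi.T + alpha * np.eye(m)          # Eq. (2.60)
w_bar = beta * np.linalg.solve(Sigma_w, Phi @ y)          # Eq. (2.59)

# Eq. (2.64)
lml_264 = (-0.5 * beta * np.sum((Phi.T @ w_bar - y) ** 2) + 0.5 * n * np.log(beta)
           - 0.5 * n * np.log(2 * np.pi) - 0.5 * alpha * np.sum(w_bar ** 2)
           + 0.5 * m * np.log(alpha) - 0.5 * np.linalg.slogdet(Sigma_w)[1])

# log of Eq. (2.56): y ~ N(0, Phi^T A^{-1} Phi + sigma^2 I)
cov = Phi.T @ Phi / alpha + np.eye(n) / beta
lml_256 = mvn.logpdf(y, np.zeros(n), cov)

print(np.isclose(lml_264, lml_256))   # True
```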


PREDICTIVE DISTRIBUTION:

$$p(\bm{y}_*|X_*,X,\bm{y},\bm{\theta}) = \mathcal{N}(\bm{y}_*|\mathbb{E}(\bm{y}_*),\mathbb{V}(\bm{y}_*)) \tag{2.70}$$

$$\mathbb{E}(\bm{y}_*) = \Phi_*^T \bar{\bm{w}} \tag{2.71}$$

$$\mathbb{V}(\bm{y}_*) = \Phi_*^T\Sigma_w^{-1}\Phi_* + \sigma^2 I_{n_*} \tag{2.72}$$
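A sketch of the predictive equations (2.70)-(2.72); the RBF feature map is the same kind of assumption as above, and with $\Phi_x \in \mathbb{R}^{m\times n}$ the predictive mean is $\Phi_*^T\bar{\bm{w}}$:

```python
import numpy as np

def features(X, P, ell=1.0):
    return np.exp(-0.5 * ((P[:, None, :] - X[None, :, :]) ** 2).sum(-1) / ell**2)

rng = np.random.default_rng(6)
X = rng.uniform(-3, 3, (200, 1)); y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)
Xs = np.linspace(-3, 3, 5)[:, None]
P = np.linspace(-3, 3, 10)[:, None]
alpha, beta = 1.0, 1.0 / 0.1**2

Phi = features(X, P); Phis = features(Xs, P)              # m x n, m x n_*
Sigma_w = beta * Phi @ Phi.T + alpha * np.eye(len(P))
w_bar = beta * np.linalg.solve(Sigma_w, Phi @ y)

mean = Phis.T @ w_bar                                     # Eq. (2.71)
var = Phis.T @ np.linalg.solve(Sigma_w, Phis) + np.eye(len(Xs)) / beta   # Eq. (2.72)
print(mean.round(2), np.diag(var).round(4))
```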

Equivalent to SoR with a different prior on the latent variables: (I do not fully understand this part)

$$q(\bm{f}_x|\bm{f}_p,X,\bm{\theta}) = \mathcal{N}(\bm{f}_x|\Phi_x^TA^{-1}\bm{f}_p,0) \tag{2.73}$$

$$q(\bm{f}_*|\bm{f}_p,X_*,\bm{\theta}) = \mathcal{N}(\bm{f}_*|\Phi_*^TA^{-1}\bm{f}_p,0) \tag{2.74}$$

$$q(\bm{f}_p|\bm{\theta}) = \mathcal{N}(\bm{f}_p|0,A) \tag{2.75}$$

Similarly, SoR is a BFM with a particular prior on the weights: (I do not fully understand this part)

$$p(\bm{w}|\bm{\theta}) = \mathcal{N}(\bm{w}|0,K_{pp}^{-1})$$

2.5 Relations to Other Methods (skipped for now)

2.5.1 ANN

2.5.2 Relations to Stacked Auto-encoders

3 Extending Sparse Gaussian Processes

The remaining sections, unifying ARD with the other features as a single sparse GP framework, are new contributions by this thesis. (p.31)

3.1 Automatic Relevance Determination

PRECISION matrix: (a precision parameter per weight)

$$A = \text{diag}[\bm{\alpha}],\quad \bm{\alpha}=[\alpha_1,\dots,\alpha_m]^T$$

New LOG MARGINAL LIKELIHOOD function: (the old one is Eq. 2.64; replace $\alpha I_m$ with $A$)

$$\ln p(\bm{y}|X,\bm{\theta}) = -\frac{\beta}{2}\|\Phi_x^T\bm{\bar{w}}-\bm{y}\|^2 + \frac{n}{2}\ln\beta -\frac{n}{2}\ln2\pi - \frac{1}{2}\bm{\bar{w}}^TA\bm{\bar{w}} + \frac{1}{2}\ln|A| - \frac{1}{2}\ln|\Sigma_w| \tag{3.1}$$

$$\ln p(\bm{y}|X,\bm{\theta}) = -\frac{\beta}{2}\|\Phi_x^T\bm{\bar{w}}-\bm{y}\|^2 + \frac{n}{2}\ln\beta -\frac{n}{2}\ln2\pi - \frac{\alpha}{2}\|\bm{\bar{w}}\|^2 + \frac{m}{2}\ln\alpha - \frac{1}{2}\ln|\Sigma_w| \tag{2.64.repeat.for.compare}$$

(Discussion of the RVM; for more background see Tipping's papers or bishop-pattern-2006.) A similar approach was proposed by Tipping (2001), coined as the relevance vector machine (RVM), where the set of basis function locations P was set equal to the locations of the training samples X and held fixed. Only the precision parameters and the weight values are optimised to determine the relevant set of vectors from the training set. This approach is still computationally expensive, and the work discussed in Tipping (2001) proposed an iterative workaround to add and remove vectors incrementally.

3.2 Heteroscedastic Noise

$$\mathbb{V}(\bm{y}_*) = \Phi_*^T\Sigma_w^{-1}\Phi_* + \sigma^2 I_{n_*} \tag{2.72.repeat}$$

The predictive variance in Equation (2.72) has two components:

  • The first term, $\mathbb{V}(\bm{f}_*)=\Phi_*^T\Sigma_w^{-1}\Phi_*$, is the model variance.
    The model variance depends on the density of the training data around $\bm{x}_*$. Theoretically, this component of the variance goes to zero as the size of the data set increases.
    This term hence models our underlying uncertainty about the mean function. The model becomes very confident about the posterior mean when presented with a large number of samples at $\bm{x}_*$, at which point the predictive variance reduces to the intrinsic noise variance.
  • The second term, $\sigma^2 I_{n_*}$, is the noise uncertainty.
    At this point, it is assumed to be white Gaussian noise with a fixed precision $\beta = \sigma^{-2}$.

Model the noise precision $\beta(\bm{x})$ as (the exponential of) a linear combination of basis functions; the exponential form ensures positivity:

$$\beta(\bm{x}) = \exp \left[ \bm{\phi}^T(\bm{x})\bm{v} + b \right] \tag{p.33}$$

New LIKELIHOOD function: (the old one is Eq.2.47)

$$p(\bm{y}|X,\bm{\theta},\bm{w}) = \mathcal{N}(\bm{y}|\Phi_x^T\bm{w},B^{-1}) \tag{3.3}$$

$$p(\bm{y}|X,\bm{\theta},\bm{w}) = \mathcal{N}(\bm{y}|\bm{\Phi}_\bm{x}^T\bm{w},\sigma^2 I_n),\quad\sigma^2I_n = \beta^{-1}I_n \tag{2.47.repeat.for.compare}$$

where $B=\text{diag}[\beta_1, \dots, \beta_n]$ is the precision matrix.

New POSTERIOR distribution of $\bm{w}$: (the old ones are Eq. 2.58–60; replace $\beta I_n$ with $B$.)

$$p(\bm{w}|\bm{y},X,\bm{\theta}) = \mathcal{N}(\bm{w}|\bm{\bar{w}},\Sigma_w^{-1}) \tag{3.4}$$

$$\bm{\bar{w}} = \Sigma_w^{-1}\Phi_xB\bm{y} \tag{3.5}$$

$$\Sigma_w = \Phi_x B\Phi_x^T + A \tag{3.6}$$

New LOG MARGINAL LIKELIHOOD function: (replace $\beta I_n$ with $B$)

$$\ln p(\bm{y}|X,\bm{\theta}) = -\frac{1}{2}\bm{\delta}^TB\bm{\delta} + \frac{1}{2}\ln|B| -\frac{n}{2}\ln2\pi - \frac{1}{2}\bm{\bar{w}}^TA\bm{\bar{w}} + \frac{1}{2}\ln|A| - \frac{1}{2}\ln|\Sigma_w| \tag{3.7}$$

$$\ln p(\bm{y}|X,\bm{\theta}) = -\frac{\beta}{2}\|\Phi_x^T\bm{\bar{w}}-\bm{y}\|^2 + \frac{n}{2}\ln\beta -\frac{n}{2}\ln2\pi - \frac{1}{2}\bm{\bar{w}}^TA\bm{\bar{w}} + \frac{1}{2}\ln|A| - \frac{1}{2}\ln|\Sigma_w| \tag{3.1.repeat.for.compare}$$

$$\ln p(\bm{y}|X,\bm{\theta}) = -\frac{\beta}{2}\|\Phi_x^T\bm{\bar{w}}-\bm{y}\|^2 + \frac{n}{2}\ln\beta -\frac{n}{2}\ln2\pi - \frac{\alpha}{2}\|\bm{\bar{w}}\|^2 + \frac{m}{2}\ln\alpha - \frac{1}{2}\ln|\Sigma_w| \tag{2.64.repeat.for.compare}$$

where $\bm{\delta} = \bm{y} - \Phi_x^T\bm{\bar{w}}$.

A prior is placed on the weights $\bm{v}$ of Eq. (p.33) to favour the simplest precision function $\beta(\bm{x})$, namely that $\bm{v}$ is normally distributed with mean 0 and diagonal precision matrix $T = \text{diag}[\tau_1, \dots, \tau_m]$:

$$p(\bm{v}|\bm{\tau}) = \mathcal{N}(\bm{v}|0,T^{-1}) \tag{p.35}$$

Since

$$\ln p(\bm{y}|X,\bm{\theta}) = \ln \left( p(\bm{y}|X,\bm{\theta},\bm{v})\,p(\bm{v}|\bm{\tau}) \right)$$

the final new log marginal likelihood function is:

$$\mathcal{L}(\mathcal{D},\bm{\theta}) = \ln p(\bm{y}|X,\bm{\theta}) = -\frac{1}{2}\bm{\delta}^TB\bm{\delta} + \frac{1}{2}\ln|B| -\frac{n}{2}\ln2\pi$$

$$- \frac{1}{2}\bm{\bar{w}}^TA\bm{\bar{w}} + \frac{1}{2}\ln|A| - \frac{1}{2}\ln|\Sigma_w|$$

$$-\frac{1}{2}\bm{v}^TT\bm{v} + \frac{1}{2}\ln|T| - \frac{m}{2}\ln2\pi \tag{3.9}$$

where $\bm{\tau}$ acts as an automatic relevance determination cost for the noise process, allowing the objective to dynamically select different sets of relevant basis functions for both the posterior mean and variance estimation.
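A sketch that evaluates the objective of Eq. (3.9) for a fixed set of hyperparameters; the RBF features shared between the mean and the noise model, and the particular values of $\bm{\alpha}$, $\bm{v}$, $b$, $\bm{\tau}$, are assumptions (in practice they would be optimised, e.g. with L-BFGS):

```python
import numpy as np

def features(X, P, ell=1.0):
    return np.exp(-0.5 * ((P[:, None, :] - X[None, :, :]) ** 2).sum(-1) / ell**2)

rng = np.random.default_rng(7)
n, m = 200, 10
X = rng.uniform(-3, 3, (n, 1))
y = np.sin(X[:, 0]) + np.abs(X[:, 0]) * 0.1 * rng.normal(size=n)   # heteroscedastic noise
P = np.linspace(-3, 3, m)[:, None]

# assumed hyperparameter values: ARD precisions, noise weights/bias, prior on v
alpha = np.ones(m); v = np.zeros(m); b = np.log(1.0 / 0.1**2); tau = np.ones(m)

Phi = features(X, P)                                   # m x n
beta_x = np.exp(Phi.T @ v + b)                         # precision function, Eq. (p.33)
B = np.diag(beta_x)
A = np.diag(alpha)
Sigma_w = Phi @ B @ Phi.T + A                          # Eq. (3.6)
w_bar = np.linalg.solve(Sigma_w, Phi @ B @ y)          # Eq. (3.5)
delta = y - Phi.T @ w_bar

logml = (-0.5 * delta @ B @ delta + 0.5 * np.log(beta_x).sum() - 0.5 * n * np.log(2 * np.pi)
         - 0.5 * w_bar @ A @ w_bar + 0.5 * np.log(alpha).sum()
         - 0.5 * np.linalg.slogdet(Sigma_w)[1]
         - 0.5 * v @ np.diag(tau) @ v + 0.5 * np.log(tau).sum() - 0.5 * m * np.log(2 * np.pi))
print(round(float(logml), 2))                          # Eq. (3.9)
```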

All hyperparameters are

$$\bm{\theta} = \left[ \bm{\varphi}, \bm{\alpha}, \bm{v}, b \right]$$

New predictive distribution of one test sample $\bm{x}_*$,

$$p(y_*|\bm{x}_*,\mathcal{D},\bm{\theta}) = \mathcal{N}(y_*|\mathbb{E}(y_*),\mathbb{V}(y_*)) \tag{3.10}$$

$$\mathbb{E}(y_*) = \bm{\phi}(\bm{x}_*)^T \bm{\bar{w}}, \quad\bm{\bar{w}}\text{ is obtained through training} \tag{3.11}$$

$$\mathbb{V}(y_*) = \bm{\phi}(\bm{x}_*)^T \Sigma_w^{-1} \bm{\phi}(\bm{x}_*) + \beta^{-1}(\bm{x}_*), \quad\beta\text{ is given by Eq. (p.33)} \tag{3.12}$$
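A sketch of the heteroscedastic predictions (3.11)-(3.12) for a batch of test points; the training step and the RBF features are the same kind of assumption as in the earlier sketches:

```python
import numpy as np

def features(X, P, ell=1.0):
    return np.exp(-0.5 * ((P[:, None, :] - X[None, :, :]) ** 2).sum(-1) / ell**2)

rng = np.random.default_rng(8)
n, m = 200, 10
X = rng.uniform(-3, 3, (n, 1))
y = np.sin(X[:, 0]) + np.abs(X[:, 0]) * 0.1 * rng.normal(size=n)
P = np.linspace(-3, 3, m)[:, None]
alpha = np.ones(m); v = np.zeros(m); b = np.log(1.0 / 0.1**2)

# "training": posterior over w under the heteroscedastic likelihood (Eqs. 3.5-3.6)
Phi = features(X, P)
B = np.diag(np.exp(Phi.T @ v + b))
Sigma_w = Phi @ B @ Phi.T + np.diag(alpha)
w_bar = np.linalg.solve(Sigma_w, Phi @ B @ y)

# prediction at new inputs (Eqs. 3.11-3.12)
Xs = np.linspace(-3, 3, 5)[:, None]
Phis = features(Xs, P)                                   # m x n_*
mean = Phis.T @ w_bar                                    # Eq. (3.11)
model_var = np.einsum('ij,ij->j', Phis, np.linalg.solve(Sigma_w, Phis))
noise_var = 1.0 / np.exp(Phis.T @ v + b)                 # 1 / beta(x_*), from (p.33)
var = model_var + noise_var                              # Eq. (3.12)
print(mean.round(2), var.round(4))
```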

3.3 Strictly Positive Outputs (not read yet)

3.4 Cost-sensitive Learning (not read yet)

3.5 Prior Mean Function

$$\mathcal{L}(\mathcal{D},\hat{\bm{\theta}}) = -\frac{1}{2} \hat{\bm{\delta}}^T\Omega B\hat{\bm{\delta}} + \frac{1}{2} {\rm tr}(\Omega\odot\ln B) - \frac{1}{2}{\rm tr}(\Omega\odot\ln2\pi)$$

$$- \frac{1}{2} (\hat{\bar{\bm{w}}}-\hat{\bm{a}})^T\hat{A}(\hat{\bar{\bm{w}}}-\hat{\bm{a}}) + \frac{1}{2}\ln|\hat{A}| - \frac{1}{2}\ln|\hat{\Sigma}_w|$$

$$- \frac{1}{2}\bm{v}^TT\bm{v} + \frac{1}{2}\ln|T| - \frac{m}{2}\ln2\pi \tag{3.36}$$

4 Optimisation

This chapter mainly gives the gradient formulas needed for training. The difficult part is deriving and keeping track of the matrix-calculus expressions.

4.5 Summary

Describes analytic methods for computing the gradients for the sigmoidal and RBF basis functions (not yet mastered).

4.1 Gradient-based Optimisation

  • BFGS
  • L-BFGS

4.2 The General Case

Using the properties of matrix derivatives (Petersen and Pedersen, 2012)
(study the matrix-differentiation identities from this reference)

4.3 The Sigmoidal Function

4.4 Radial Basis Functions

TIME COMPLEXITY:

Model | Time complexity

  • ANN ($l$ layers) | $O(nmd+(l-1)(nm^2))$
  • FITC | $O(nmd+nm^2)$
  • GPVL | $O(nmd+nm^2)$
  • GPVD | $O(nmd+nm^2)$
  • GPVC | $O(nmd^2+nm^2)$
  • Full GP | $O(n^3)$

where $d$ is the dimension of the input $\bm{x}$, $n$ is the number of training data points, and $m$ is the number of basis functions.

5 Noisy or Missing Variables

This chapter is important: it explains how to train and predict when the inputs $\bm{x}$ are noisy.

The derivations are complicated and tedious, and I do not need them for now, so I am skipping this chapter for the time being. Many of the derivations here omit the detailed intermediate steps.

The $\Psi$ in the code appears in this chapter, in the derivation that accounts for uncertainty in the input variables $\bm{x}$ themselves.

(p.63) We will first address how to predict with known input noise in Section 5.1.1, since this can be addressed independently of the training data.

(p.64) In Section 5.2.1, we consider the case in which training data inputs are uncertain by approximating the expected value of the marginal likelihood.

Prediction is studied first, because it is simpler and does not depend on the training data. Questions:

  • Do we need to know the distribution of the inputs beforehand?
  • Why is prediction treated separately from training?

(p.70) It requires us to have a probabilistic model for each possible set of missing variables, which grows exponentially large as the dimensionality of the input increases.

6 Photometric Redshift Estimation

(p.126) The former, the model uncertainty, helps in identifying regions where more data is needed, whereas the latter, the intrinsic noise uncertainty, helps in identifying regions where better or more precise features are needed.

(p.133) Finally, we introduced the capability of handling missing photometry, not present in any other method apart from random forests. This is particularly useful when using a model trained on one survey to predict the photometric redshift on another survey that does not share the same photometry but overlaps.

[Done with this thesis for now; pausing here and moving on to studying the GPz code.]