All about Gaussian Processes

In this post I summarize everything about GPs. A separate post will be organized whenever I feel I have gathered enough materials, insights, and thoughts on a topic.

At the end, I list all my other posts that are mainly about GPs or that mention GPs.

Summary of GPstuff document v4.6

From prior to posterior predictive

Bayes

Bayesian inference

Observation model: $\bm{y}|\bm{f},\phi \sim \prod_{i=1}^n p(y_i|f_i,\phi)$

GP prior: $f(\bm{x})|\theta \sim \mathcal{GP}\left(m(\bm{x}),\, k(\bm{x},\bm{x}'|\theta)\right)$

Hyperprior: $\vartheta \triangleq [\theta,\phi] \sim p(\theta)p(\phi)$

The latent function value $f(\bm{x})$ at fixed $\bm{x}$ is called a latent variable.

Any set of function values $\bm{f} \triangleq [f_1,f_2,\dots]^T$ has a multivariate Gaussian distribution

$$p(\bm{f}|\bm{X},\theta) = N(\bm{f}|\bm{0},\bm{K}_{f,f})$$
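
For intuition, here is a minimal Python/NumPy sketch of what this prior means in practice: it draws sample vectors $\bm{f}$ from $N(\bm{0},\bm{K}_{f,f})$. The squared-exponential kernel, its hyperparameter values, and the jitter term are illustrative assumptions of mine, not something fixed by the GPstuff document.

```python
import numpy as np

def sq_exp_kernel(X1, X2, magn_sigma2=1.0, lengthscale=1.0):
    """Squared-exponential covariance k(x, x') = s^2 * exp(-||x - x'||^2 / (2 l^2))."""
    d2 = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2 * X1 @ X2.T
    return magn_sigma2 * np.exp(-0.5 * d2 / lengthscale**2)

# Inputs X (n x d); the prior over f = f(X) is N(0, K_ff).
X = np.linspace(-3, 3, 50)[:, None]
K_ff = sq_exp_kernel(X, X)

# Draw a few latent functions from the prior; a small jitter keeps the Cholesky stable.
rng = np.random.default_rng(0)
L = np.linalg.cholesky(K_ff + 1e-9 * np.eye(len(X)))
prior_samples = L @ rng.standard_normal((len(X), 3))   # each column is one sample of f
```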

To predict the values $\tilde{\bm{f}}$ at new input locations $\tilde{\bm{X}}$, write the joint distribution

$$\begin{bmatrix} \bm{f}\\ \tilde{\bm{f}} \end{bmatrix} \Big|\, \bm{X}, \tilde{\bm{X}}, \theta \sim N\left(\bm{0}, \begin{bmatrix} \bm{K}_{f,f} & \bm{K}_{f,\tilde{f}}\\ \bm{K}_{\tilde{f},f} & \bm{K}_{\tilde{f},\tilde{f}} \end{bmatrix}\right)$$

The conditional distribution of $\tilde{\bm{f}}$ given $\bm{f}$ is

$$\tilde{\bm{f}} \,|\, \bm{f},\bm{X},\tilde{\bm{X}}, \theta \sim N\left(\bm{K}_{\tilde{f},f}\bm{K}_{f,f}^{-1}\bm{f},\; \bm{K}_{\tilde{f},\tilde{f}}-\bm{K}_{\tilde{f},f}\bm{K}_{f,f}^{-1}\bm{K}_{f,\tilde{f}}\right)$$

So the conditional distribution of the latent function $f(\tilde{\bm{x}})$ is also a GP (a code sketch of this conditioning step follows the list below), with

  • conditional mean function: $\textcolor{green}{\mathbb{E}_{\tilde{\bm{f}}|\bm{f},\theta,\phi}[f(\tilde{\bm{x}})]} = k(\tilde{\bm{x}},\bm{X}|\theta)\, \bm{K}_{f,f}^{-1}\, \bm{f}$
  • conditional covariance function: $\textcolor{green}{\text{Cov}_{\tilde{\bm{f}}|\bm{f},\theta,\phi}[f(\tilde{\bm{x}})]} = k(\tilde{\bm{x}},\tilde{\bm{x}}'|\theta) - k(\tilde{\bm{x}},\bm{X}|\theta)\,\bm{K}_{f,f}^{-1}\,k(\bm{X},\tilde{\bm{x}}'|\theta)$ (not sure whether the notation on the left-hand side of the equals sign is correct)
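
A minimal sketch of this noise-free conditioning step, using the same illustrative squared-exponential kernel as above (again an assumption of mine, with jitter added only for numerical stability):

```python
import numpy as np

def sq_exp_kernel(X1, X2, magn_sigma2=1.0, lengthscale=1.0):
    d2 = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2 * X1 @ X2.T
    return magn_sigma2 * np.exp(-0.5 * d2 / lengthscale**2)

def condition_noise_free(X, f, X_new):
    """Mean and covariance of f_tilde | f, X, X_new (no observation noise)."""
    K_ff = sq_exp_kernel(X, X) + 1e-9 * np.eye(len(X))   # K_{f,f} with jitter
    K_sf = sq_exp_kernel(X_new, X)                        # K_{f~,f}
    K_ss = sq_exp_kernel(X_new, X_new)                    # K_{f~,f~}
    A = np.linalg.solve(K_ff, K_sf.T)                     # K_{f,f}^{-1} K_{f,f~}
    mean = K_sf @ np.linalg.solve(K_ff, f)                # K_{f~,f} K_{f,f}^{-1} f
    cov = K_ss - K_sf @ A                                 # K_{f~,f~} - K_{f~,f} K_{f,f}^{-1} K_{f,f~}
    return mean, cov
```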

Everything above is a purely theoretical derivation, carried out before any observations are available, so $\bm{y}$ does not appear yet; only the latent function $f(\bm{x})$ and the latent variables $\bm{f}$ are involved.
From here on, we consider inference after the observations have been obtained.

The first inference step is to form the conditional posterior of the latent variables $\bm{f}$ given the parameters $\vartheta$ and the data $\mathcal{D}\triangleq\{\bm{X},\bm{y}\}$. (For now, assume this posterior can be obtained; it depends on the choice or design of the observation model. How to compute it is discussed later; in practice, everything except the classical GP with a Gaussian observation model requires approximation methods.)

$$p(\bm{f}|\mathcal{D},\theta,\phi) = \frac{ \overbrace{p(\bm{y}|\bm{f},\phi)}^\text{observation model}\; \overbrace{p(\bm{f}|\bm{X},\theta)}^\text{GP prior} }{ \underbrace{\int p(\bm{y}|\bm{f},\phi)\, p(\bm{f}|\bm{X},\theta)\, d\bm{f}}_{\textcolor{green}{\triangleq\, p(\bm{y}|\bm{X},\vartheta)}} } \tag{8}$$

After this, we can marginalize over the parameters $\vartheta$ to obtain the marginal posterior distribution of the latent variables $\bm{f}$:

$$p(\bm{f}|\mathcal{D}) = \int \overbrace{p(\bm{f}|\mathcal{D},\theta,\phi)}^\text{see above}\; \overbrace{p(\theta,\phi|\mathcal{D})}^\text{parameter posterior}\, d\theta\, d\phi$$

The conditional posterior predictive distribution $p(\tilde{f}|\mathcal{D},\vartheta,\tilde{\bm{x}})$ can be evaluated exactly or approximated. (Again, how to compute it is discussed later; for now assume it is available.)

$$\begin{aligned} p(\tilde{f}|\mathcal{D},\vartheta,\tilde{\bm{x}}) &= \int p(\tilde{f},\bm{f}|\mathcal{D},\vartheta,\tilde{\bm{x}})\, d\bm{f} \\ &= \int \overbrace{ p(\tilde{f}|\bm{f},\bm{X},\vartheta,\tilde{\bm{x}}) }^\text{from the GP prior} \cdot \overbrace{ p(\bm{f}|\mathcal{D},\vartheta) }^\text{from Bayes' theorem, Eq. (8)}\, d\bm{f} \end{aligned}$$

(This decomposition is valid; note that $p(\bm{f}|\mathcal{D},\vartheta,\tilde{\bm{x}}) = p(\bm{f}|\mathcal{D},\vartheta)$, since the training latents $\bm{f}$ do not depend on the new input $\tilde{\bm{x}}$.)

The posterior predictive distribution $p(\tilde{f}|\mathcal{D},\tilde{\bm{x}})$ is obtained by marginalizing out the parameters $\vartheta$ from $p(\tilde{f}|\mathcal{D},\vartheta,\tilde{\bm{x}})$.

The posterior joint predictive distribution $p(\tilde{\bm{y}}|\mathcal{D},\theta,\phi,\tilde{\bm{x}})$ requires integration over $p(\tilde{f}|\mathcal{D},\vartheta,\tilde{\bm{x}})$. (Usually not used.)

The marginal predictive distribution for an individual $\tilde{y}_i$ is

$$p(\tilde{y}_i|\mathcal{D},\tilde{\bm{x}}_i,\theta,\phi) = \int p(\tilde{y}_i|\tilde{f}_i,\phi)\, p(\tilde{f}_i|\mathcal{D},\tilde{\bm{x}}_i,\theta,\phi)\, d\tilde{f}_i$$
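
For example, with the Gaussian observation model introduced later in this post ($\tilde{y}_i|\tilde{f}_i \sim N(\tilde{f}_i,\sigma^2)$), this integral is analytic: the predictive distribution of $\tilde{y}_i$ is just the latent predictive distribution with the noise variance added,

$$p(\tilde{y}_i|\mathcal{D},\tilde{\bm{x}}_i,\theta,\sigma^2) = N\!\left(\tilde{y}_i \,\middle|\, m_p(\tilde{\bm{x}}_i),\; k_p(\tilde{\bm{x}}_i,\tilde{\bm{x}}_i) + \sigma^2\right),$$

where $m_p$ and $k_p$ are the posterior predictive mean and covariance functions derived next in Eqs. (11) and (13).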

If the parameters are considered fixed, then using the GP's marginalization and conditioning properties (everything remains Gaussian) we can evaluate the posterior predictive mean $m_p(\tilde{f}|\mathcal{D},\theta,\phi)$ from the conditional mean $\textcolor{green}{\mathbb{E}_{\tilde{\bm{f}}|\bm{f},\theta,\phi}[\tilde{\bm{f}}]}$ (where $\tilde{\bm{f}} \triangleq f(\tilde{\bm{x}})$, derived above), by marginalizing out the latent variables $\bm{f}$,

$$m_p(\tilde{f}|\mathcal{D},\theta,\phi) = \int \textcolor{green}{\mathbb{E}_{\tilde{\bm{f}}|\bm{f},\theta,\phi}[\tilde{\bm{f}}]}\; p(\bm{f}|\mathcal{D},\theta,\phi)\, d\bm{f} \xlongequal{\text{substitute and keep only the }\bm{f}\text{-dependent part}} k(\tilde{\bm{x}},\bm{X}|\theta)\, \bm{K}_{f,f}^{-1}\, \mathbb{E}_{\bm{f}|\mathcal{D},\theta,\phi}[\bm{f}] \tag{11}$$

The posterior predictive covariance between any set of latent variables $\tilde{\bm{f}}$ is obtained using the law of total covariance (see Wikipedia: Law of total covariance, quoted below):

$$\begin{aligned} \text{Cov}_{\tilde{\bm{f}}|\mathcal{D},\theta,\phi} [\tilde{\bm{f}}] &= \mathbb{E}_{\bm{f}|\mathcal{D},\theta,\phi}\left[ \textcolor{green}{\text{Cov}_{\tilde{\bm{f}}|\bm{f},\theta,\phi}[\tilde{\bm{f}}]} \right] + \text{Cov}_{\bm{f}|\mathcal{D},\theta,\phi}\left[ \textcolor{green}{\mathbb{E}_{\tilde{\bm{f}}|\bm{f},\theta,\phi}[\tilde{\bm{f}}]} \right] \\ &= \overbrace{\textcolor{green}{\text{Cov}_{\tilde{\bm{f}}|\bm{f},\theta,\phi}[\tilde{\bm{f}}]}}^{\text{independent of } \bm{f}} + \underbrace{\text{Cov}_{\bm{f}|\mathcal{D},\theta,\phi}\left[ k(\tilde{\bm{x}},\bm{X}|\theta)\, \bm{K}_{f,f}^{-1}\, \bm{f} \right]}_{=\, k(\tilde{\bm{x}},\bm{X}|\theta)\,\bm{K}_{f,f}^{-1}\,\text{Cov}_{\bm{f}|\mathcal{D},\theta,\phi}[\bm{f}]\,\bm{K}_{f,f}^{-1}\,k(\bm{X},\tilde{\bm{x}}'|\theta)} \end{aligned}$$

Reference, Wikipedia: Law of total covariance: $\text{Cov}[X,Y] = \mathbb{E}\left[\text{Cov}[X,Y|Z]\right] + \text{Cov}\left[\mathbb{E}[X|Z],\, \mathbb{E}[Y|Z]\right]$

Then the posterior predictive covariance function $k_p(\tilde{\bm{x}},\tilde{\bm{x}}'|\mathcal{D},\theta,\phi)$ is

$$k_p(\tilde{\bm{x}},\tilde{\bm{x}}'|\mathcal{D},\theta,\phi) = k(\tilde{\bm{x}},\tilde{\bm{x}}'|\theta) - k(\tilde{\bm{x}},\bm{X}|\theta)\left( \bm{K}_{f,f}^{-1}-\bm{K}_{f,f}^{-1}\,\text{Cov}_{\bm{f}|\mathcal{D},\theta,\phi}[\bm{f}]\,\bm{K}_{f,f}^{-1} \right) k(\bm{X},\tilde{\bm{x}}'|\theta) \tag{13}$$
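
To see how (13) follows from the previous display, substitute the conditional covariance function from the bullet list above for the first (green) term and factor $k(\tilde{\bm{x}},\bm{X}|\theta)$ and $k(\bm{X},\tilde{\bm{x}}'|\theta)$ out of both terms:

$$\begin{aligned} k_p(\tilde{\bm{x}},\tilde{\bm{x}}'|\mathcal{D},\theta,\phi) &= \underbrace{k(\tilde{\bm{x}},\tilde{\bm{x}}'|\theta) - k(\tilde{\bm{x}},\bm{X}|\theta)\bm{K}_{f,f}^{-1}k(\bm{X},\tilde{\bm{x}}'|\theta)}_{\text{prior conditional covariance}} + \underbrace{k(\tilde{\bm{x}},\bm{X}|\theta)\bm{K}_{f,f}^{-1}\text{Cov}_{\bm{f}|\mathcal{D},\theta,\phi}[\bm{f}]\bm{K}_{f,f}^{-1}k(\bm{X},\tilde{\bm{x}}'|\theta)}_{\text{posterior uncertainty of } \bm{f} \text{ propagated to } \tilde{\bm{f}}} \\ &= k(\tilde{\bm{x}},\tilde{\bm{x}}'|\theta) - k(\tilde{\bm{x}},\bm{X}|\theta)\left(\bm{K}_{f,f}^{-1} - \bm{K}_{f,f}^{-1}\text{Cov}_{\bm{f}|\mathcal{D},\theta,\phi}[\bm{f}]\bm{K}_{f,f}^{-1}\right)k(\bm{X},\tilde{\bm{x}}'|\theta). \end{aligned}$$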

So, even if the exact posterior predictive distribution $p(\tilde{f}|\mathcal{D},\theta,\phi)$ is not available in closed form, its mean and covariance can still be evaluated from (11) and (13), as long as the mean $\mathbb{E}_{\bm{f}|\mathcal{D},\theta,\phi}[\bm{f}]$ and covariance $\text{Cov}_{\bm{f}|\mathcal{D},\theta,\phi}[\bm{f}]$ of the conditional posterior of the latent variables are available (exactly or approximately).

From latents $\bm{f}$ to observations $\bm{y}$

Bayes

Gaussian observation model: $y_i \sim N(f_i,\sigma^2)$

$$p(y_i|f_i,\theta,\overbrace{\phi}^{\text{includes }\sigma}) = \dots$$

The marginal likelihood $p(\bm{y}|\bm{X},\theta,\sigma^2)$ is

$$p(\bm{y}|\bm{X},\theta,\sigma^2) = N(\bm{y}|\bm{0},\bm{K}_{f,f}+\sigma^2\bm{I})$$
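
In log form this is the familiar objective used for optimizing the hyperparameters (the standard log-density of a multivariate Gaussian, stated here for reference):

$$\log p(\bm{y}|\bm{X},\theta,\sigma^2) = -\tfrac{1}{2}\bm{y}^T(\bm{K}_{f,f}+\sigma^2\bm{I})^{-1}\bm{y} - \tfrac{1}{2}\log\left|\bm{K}_{f,f}+\sigma^2\bm{I}\right| - \tfrac{n}{2}\log 2\pi$$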

The conditional posterior of the latent variables $\bm{f}$ now has an analytical solution (derivable by completing the square; Bishop's book or the GPML book has the details):

$$\bm{f}|\mathcal{D},\theta,\phi \sim N\left( \bm{K}_{f,f}(\bm{K}_{f,f}+\sigma^2\bm{I})^{-1}\bm{y},\quad \bm{K}_{f,f}-\bm{K}_{f,f}(\bm{K}_{f,f}+\sigma^2\bm{I})^{-1}\bm{K}_{f,f} \right) \tag{15}$$

Since the conditional posterior of $\bm{f}$ is Gaussian, the posterior process is still a GP, whose mean and covariance functions are obtained from Eqs. (11) and (13).

$$\tilde{f}|\mathcal{D},\theta,\phi \sim \mathcal{GP}\left(m_p(\tilde{\bm{x}}),\quad k_p(\tilde{\bm{x}},\tilde{\bm{x}}')\right) \tag{16}$$

Equation (15) above directly gives $\mathbb{E}_{\bm{f}|\mathcal{D},\theta,\phi}[\bm{f}]$ and $\text{Cov}_{\bm{f}|\mathcal{D},\theta,\phi}[\bm{f}]$; substituting them into (11) and (13) yields $m_p$ and $k_p$.
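
Putting the Gaussian-observation case together, here is a minimal NumPy sketch of exact GP regression, again assuming a squared-exponential kernel with illustrative hyperparameter values (my assumption, not from the GPstuff document). Substituting (15) into (11) and (13) reduces the predictive distribution (16) to the standard forms $m_p(\tilde{\bm{x}}) = k(\tilde{\bm{x}},\bm{X})(\bm{K}_{f,f}+\sigma^2\bm{I})^{-1}\bm{y}$ and $k_p(\tilde{\bm{x}},\tilde{\bm{x}}') = k(\tilde{\bm{x}},\tilde{\bm{x}}') - k(\tilde{\bm{x}},\bm{X})(\bm{K}_{f,f}+\sigma^2\bm{I})^{-1}k(\bm{X},\tilde{\bm{x}}')$, which is what the code computes.

```python
import numpy as np

def sq_exp_kernel(X1, X2, magn_sigma2=1.0, lengthscale=1.0):
    d2 = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2 * X1 @ X2.T
    return magn_sigma2 * np.exp(-0.5 * d2 / lengthscale**2)

def gp_regression(X, y, X_new, sigma2=0.1, **kernel_args):
    """Exact GP regression with a Gaussian observation model.

    Returns the posterior predictive mean and covariance of Eq. (16), i.e.
    Eqs. (11) and (13) with the Gaussian posterior (15) substituted in,
    plus the log marginal likelihood log p(y | X, theta, sigma2).
    """
    n = len(X)
    K_ff = sq_exp_kernel(X, X, **kernel_args)
    K_sf = sq_exp_kernel(X_new, X, **kernel_args)
    K_ss = sq_exp_kernel(X_new, X_new, **kernel_args)

    # Cholesky of (K_ff + sigma^2 I) for numerically stable solves.
    L = np.linalg.cholesky(K_ff + sigma2 * np.eye(n))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))   # (K_ff + sigma^2 I)^{-1} y
    V = np.linalg.solve(L, K_sf.T)                        # L^{-1} K_{f,f~}

    mean = K_sf @ alpha                                   # predictive mean m_p
    cov = K_ss - V.T @ V                                  # predictive covariance k_p
    log_ml = (-0.5 * y @ alpha
              - np.sum(np.log(np.diag(L)))
              - 0.5 * n * np.log(2 * np.pi))              # log marginal likelihood
    return mean, cov, log_ml

# Toy usage: noisy observations of a sine function.
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(-3, 3, 20))[:, None]
y = np.sin(X[:, 0]) + 0.3 * rng.standard_normal(20)
X_new = np.linspace(-4, 4, 100)[:, None]
mean, cov, log_ml = gp_regression(X, y, X_new, sigma2=0.09, lengthscale=1.0)
```

The Cholesky-based solves avoid forming $(\bm{K}_{f,f}+\sigma^2\bm{I})^{-1}$ explicitly, which is the usual way these formulas are implemented in practice.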

Learning Materials

A Practical Guide to Gaussian Processes

  • Explains the principles incisively and concisely
  • Gives practical usage advice
    • e.g., recommendations on initialization
  • Its real value only becomes apparent after substantial hands-on use

A Visual Exploration of Gaussian Processes

  • Instant, interactive visualizations
  • The explanations are not as good as the previous link

Zoubin Ghahramani, “A Tutorial on Gaussian Processes (or Why I Don’t Use SVMs)”, 2011.

Relationships


My other individual posts

Mainly about GP

GP Mentioned

Reading Notes | PRML (Bishop 2006)