Exact GPR Method

An instance of response y from a Gaussian process regression (GPR) model can be modeled as

$P (y_{i} | f (x_{i}), x_{i}) \sim N (y_{i} | h {(x_{i})}^{T} β + f (x_{i}), σ^{2})$

Hence, making predictions for new data from a GPR model requires:

Knowledge of the coefficient vector, $β$, of fixed basis functions.
Ability to evaluate the covariance function $k (x, x^{'} | θ)$ for arbitrary $x$ and $x^{'}$, given the kernel parameters or hyperparameters, $θ$.
Knowledge of the noise variance $σ^{2}$ that appears in the density $P (y_{i} | f (x_{i}), x_{i})$.

That is, one must first estimate $β$, $θ$, and $σ^{2}$ from the data $(X, y)$.
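As an illustration of the model above, the following minimal sketch draws one set of responses from a GPR model with known parameters. The squared exponential kernel, the linear basis $h (x) = {(1, x)}^{T}$, and all numeric values are assumptions chosen for the example; the helper name `sq_exp_kernel` is hypothetical.

```python
import numpy as np

def sq_exp_kernel(A, B, sigma_f=1.0, length=1.0):
    """Squared exponential kernel k(x, x' | theta) with theta = (sigma_f, length).
    This kernel choice is an assumption made for illustration."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return sigma_f**2 * np.exp(-0.5 * d2 / length**2)

rng = np.random.default_rng(0)
n = 50
X = np.linspace(0.0, 5.0, n).reshape(-1, 1)   # n observations, one feature
H = np.hstack([np.ones((n, 1)), X])           # row i is h(x_i)^T = (1, x_i)
beta = np.array([0.5, -0.2])                  # assumed basis coefficients
sigma = 0.1                                   # assumed noise standard deviation

K = sq_exp_kernel(X, X)                       # K(X, X | theta)
# Latent values f ~ N(0, K); a tiny jitter keeps the covariance numerically valid.
f = rng.multivariate_normal(np.zeros(n), K + 1e-10 * np.eye(n))
# Each response: y_i ~ N(h(x_i)^T beta + f(x_i), sigma^2)
y = H @ beta + f + sigma * rng.standard_normal(n)
```

Estimation then runs in the opposite direction: given only `X` and `y`, recover $β$, $θ$, and $σ^{2}$.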

Parameter Estimation

One approach to estimating the parameters $β$, $θ$, and $σ^{2}$ of a GPR model is to maximize the likelihood $P (y | X)$ as a function of $β$, $θ$, and $σ^{2}$ [1]. That is, if $\hat{β}$, $\hat{θ}$, and ${\hat{σ}}^{2}$ are the estimates of $β$, $θ$, and $σ^{2}$, respectively, then:

$\hat{β}, \hat{θ}, {\hat{σ}}^{2} = \underset{β, θ, σ^{2}}{\arg\max} \log P (y | X, β, θ, σ^{2}) .$

Because

$P (y | X) = P (y | X, β, θ, σ^{2}) = N (y | H β, K (X, X | θ) + σ^{2} I_{n}),$

the marginal log likelihood function is as follows:

$\begin{array}{rl} \log P (y | X, β, θ, σ^{2}) = & - \frac{1}{2} {(y - H β)}^{T} {[K (X, X | θ) + σ^{2} I_{n}]}^{- 1} (y - H β) \\ & - \frac{n}{2} \log 2 π - \frac{1}{2} \log | K (X, X | θ) + σ^{2} I_{n} | , \end{array}$

where $H$ is the matrix of explicit basis functions, whose $i$th row is $h {(x_{i})}^{T}$, and $K (X, X | θ)$ is the covariance function matrix (for more information, see Gaussian Process Regression Models).
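The marginal log likelihood can be evaluated numerically as in the sketch below. It uses a Cholesky factorization instead of an explicit inverse, which is a common implementation choice and not necessarily what any particular software does. The function name, the squared exponential kernel, and the toy data are assumptions for illustration.

```python
import numpy as np

def sq_exp_kernel(A, B, sigma_f=1.0, length=1.0):
    """Squared exponential kernel (assumed for illustration)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return sigma_f**2 * np.exp(-0.5 * d2 / length**2)

def log_marginal_likelihood(y, H, K, beta, sigma2):
    """log N(y | H beta, K + sigma2 * I_n), evaluated via a Cholesky factor."""
    n = y.shape[0]
    C = K + sigma2 * np.eye(n)
    L = np.linalg.cholesky(C)                    # C = L L^T
    r = y - H @ beta                             # residual y - H beta
    z = np.linalg.solve(L, r)                    # z^T z = r^T C^{-1} r
    log_det = 2.0 * np.sum(np.log(np.diag(L)))   # log |C|
    return -0.5 * z @ z - 0.5 * n * np.log(2.0 * np.pi) - 0.5 * log_det

# Toy data (assumed values)
X = np.linspace(0.0, 4.0, 8).reshape(-1, 1)
H = np.ones((8, 1))                              # constant basis h(x) = 1
y = np.sin(X[:, 0]) + 0.5
beta = np.array([0.5])
sigma2 = 0.1
K = sq_exp_kernel(X, X)
ll = log_marginal_likelihood(y, H, K, beta, sigma2)
```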

To estimate the parameters, the software first computes $\hat{β} (θ, σ^{2})$, which maximizes the log likelihood function with respect to $β$ for given $θ$ and $σ^{2}$. It then uses this estimate to compute the $β$-profiled likelihood:

$\log {P (y | X, \hat{β} (θ, σ^{2}), θ, σ^{2})} .$

The estimate of $β$ for given $θ$ and $σ^{2}$ is

$\hat{β} (θ, σ^{2}) = {[H^{T} {[K (X, X | θ) + σ^{2} I_{n}]}^{- 1} H]}^{- 1} H^{T} {[K (X, X | θ) + σ^{2} I_{n}]}^{- 1} y .$
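The closed form above is a generalized least-squares estimate. A minimal sketch, assuming a squared exponential kernel and made-up data, computes it with linear solves in place of explicit inverses; the function name `beta_hat` is hypothetical.

```python
import numpy as np

def sq_exp_kernel(A, B, sigma_f=1.0, length=1.0):
    """Squared exponential kernel (assumed for illustration)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return sigma_f**2 * np.exp(-0.5 * d2 / length**2)

def beta_hat(y, H, K, sigma2):
    """[H^T C^{-1} H]^{-1} H^T C^{-1} y with C = K + sigma2 * I_n."""
    n = y.shape[0]
    C = K + sigma2 * np.eye(n)
    Cinv_H = np.linalg.solve(C, H)               # C^{-1} H
    Cinv_y = np.linalg.solve(C, y)               # C^{-1} y
    return np.linalg.solve(H.T @ Cinv_H, H.T @ Cinv_y)

# Toy data (assumed values) with basis h(x) = (1, x)^T
X = np.linspace(0.0, 5.0, 12).reshape(-1, 1)
H = np.hstack([np.ones((12, 1)), X])
y = 1.0 + 0.5 * X[:, 0] + np.sin(3.0 * X[:, 0])
sigma2 = 0.2
K = sq_exp_kernel(X, X)
b = beta_hat(y, H, K, sigma2)
```

At the estimate, the gradient of the log likelihood with respect to $β$ vanishes, that is, $H^{T} {[K (X, X | θ) + σ^{2} I_{n}]}^{- 1} (y - H \hat{β}) = 0$.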

Then, the $β$ -profiled log likelihood is given by

$\begin{array}{rl} \log P (y | X, \hat{β} (θ, σ^{2}), θ, σ^{2}) = & - \frac{1}{2} {(y - H \hat{β} (θ, σ^{2}))}^{T} {[K (X, X | θ) + σ^{2} I_{n}]}^{- 1} (y - H \hat{β} (θ, σ^{2})) \\ & - \frac{n}{2} \log 2 π - \frac{1}{2} \log | K (X, X | θ) + σ^{2} I_{n} | . \end{array}$

The software then maximizes the $β$-profiled log likelihood over $θ$ and $σ^{2}$ to find their estimates.
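The two-stage procedure can be sketched as follows. A coarse grid search stands in for the numerical optimizer purely for illustration; the kernel, the grid values, the data, and the function names are all assumptions, and a real implementation would use a proper optimization routine.

```python
import numpy as np

def sq_exp_kernel(A, B, sigma_f, length):
    """Squared exponential kernel with theta = (sigma_f, length) (assumed)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return sigma_f**2 * np.exp(-0.5 * d2 / length**2)

def profiled_log_likelihood(y, H, X, sigma_f, length, sigma2):
    """Return the beta-profiled log likelihood and beta_hat(theta, sigma2)."""
    n = y.shape[0]
    C = sq_exp_kernel(X, X, sigma_f, length) + sigma2 * np.eye(n)
    L = np.linalg.cholesky(C)                    # C = L L^T
    Ly = np.linalg.solve(L, y)                   # L^{-1} y
    LH = np.linalg.solve(L, H)                   # L^{-1} H
    # LH^T LH = H^T C^{-1} H and LH^T Ly = H^T C^{-1} y
    beta = np.linalg.solve(LH.T @ LH, LH.T @ Ly)
    z = Ly - LH @ beta                           # L^{-1} (y - H beta_hat)
    log_det = 2.0 * np.sum(np.log(np.diag(L)))
    value = -0.5 * z @ z - 0.5 * n * np.log(2.0 * np.pi) - 0.5 * log_det
    return value, beta

# Toy data (assumed values), constant basis h(x) = 1
rng = np.random.default_rng(1)
n = 40
X = np.linspace(0.0, 5.0, n).reshape(-1, 1)
H = np.ones((n, 1))
y = 1.0 + np.sin(2.0 * X[:, 0]) + 0.2 * rng.standard_normal(n)

# Grid search over (theta, sigma2); a stand-in for a numerical optimizer.
best = None
for sigma_f in (0.5, 1.0, 2.0):
    for length in (0.3, 1.0, 3.0):
        for sigma2 in (0.01, 0.04, 0.25):
            value, beta = profiled_log_likelihood(y, H, X, sigma_f, length, sigma2)
            if best is None or value > best[0]:
                best = (value, sigma_f, length, sigma2, beta)
```

Profiling out $β$ reduces the search to $θ$ and $σ^{2}$ only, because $\hat{β}$ is available in closed form at every candidate point.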

Prediction

Making probabilistic predictions from a GPR model with known parameters requires the density $P (y_{n e w} | y, X, x_{n e w})$ . Using the definition of conditional probabilities, one can write:

$P (y_{n e w} | y, X, x_{n e w}) = \frac{P (y_{n e w}, y | X, x_{n e w})}{P (y | X, x_{n e w})} .$

To find the joint density in the numerator, it is necessary to introduce the latent variables $f_{n e w}$ and $f$ corresponding to $y_{n e w}$ and $y$, respectively. Then, it is possible to use the joint distribution for $y_{n e w}$, $y$, $f_{n e w}$, and $f$ to compute $P (y_{n e w}, y | X, x_{n e w})$:

$\begin{array}{rl} P (y_{n e w}, y | X, x_{n e w}) = & \int \int P (y_{n e w}, y, f_{n e w}, f | X, x_{n e w}) d f d f_{n e w} \\ = & \int \int P (y_{n e w}, y | f_{n e w}, f, X, x_{n e w}) P (f_{n e w}, f | X, x_{n e w}) d f d f_{n e w} . \end{array}$

Gaussian process models assume that each response $y_{i}$ depends only on the corresponding latent variable $f_{i}$ and the feature vector $x_{i}$. Writing $P (y_{n e w}, y | f_{n e w}, f, X, x_{n e w})$ as a product of conditional densities and applying this assumption gives:

$P (y_{n e w}, y | f_{n e w}, f, X, x_{n e w}) = P (y_{n e w} | f_{n e w}, x_{n e w}) \prod_{i = 1}^{n} P (y_{i} | f (x_{i}), x_{i}) .$

After integrating with respect to $y_{n e w}$, the result depends only on $f$ and $X$:

$P (y | f, X) = \prod_{i = 1}^{n} P (y_{i} | f_{i}, x_{i}) = \prod_{i = 1}^{n} N (y_{i} | h {(x_{i})}^{T} β + f_{i}, σ^{2}) .$

Hence,

$P (y_{n e w}, y | f_{n e w}, f, X, x_{n e w}) = P (y_{n e w} | f_{n e w}, x_{n e w}) P (y | f, X) .$

Again using the definition of conditional probabilities,

$P (f_{n e w}, f | X, x_{n e w}) = P (f_{n e w} | f, X, x_{n e w}) P (f | X, x_{n e w}),$

it is possible to write $P (y_{n e w}, y | X, x_{n e w})$ as follows:

$P (y_{n e w}, y | X, x_{n e w}) = \int \int P (y_{n e w} | f_{n e w}, x_{n e w}) P (y | f, X) P (f_{n e w} | f, X, x_{n e w}) P (f | X, x_{n e w}) d f d f_{n e w} .$

Using the facts that

$P (f | X, x_{n e w}) = P (f | X)$

and

$P (y | f, X) P (f | X) = P (y, f | X) = P (f | y, X) P (y | X),$

one can rewrite $P (y_{n e w}, y | X, x_{n e w})$ as follows:

$P (y_{n e w}, y | X, x_{n e w}) = P (y | X) \int \int P (y_{n e w} | f_{n e w}, x_{n e w}) P (f | y, X) P (f_{n e w} | f, X, x_{n e w}) d f d f_{n e w} .$

It is also possible to show that

$P (y | X, x_{n e w}) = P (y | X) .$

Hence, the required density $P (y_{n e w} | y, X, x_{n e w})$ is:

$\begin{array}{rl} P (y_{n e w} | y, X, x_{n e w}) = & \frac{P (y_{n e w}, y | X, x_{n e w})}{P (y | X, x_{n e w})} = \frac{P (y_{n e w}, y | X, x_{n e w})}{P (y | X)} \\ = & \int \int \underbrace{P (y_{n e w} | f_{n e w}, x_{n e w})}_{(1)} \underbrace{P (f | y, X)}_{(2)} \underbrace{P (f_{n e w} | f, X, x_{n e w})}_{(3)} d f d f_{n e w} . \end{array}$

It can be shown that

$(1) P (y_{n e w} | f_{n e w}, x_{n e w}) = N (y_{n e w} | h {(x_{n e w})}^{T} β + f_{n e w}, σ_{n e w}^{2})$

$(2) P (f | y, X) = N (f | \frac{1}{σ^{2}} {(\frac{I_{n}}{σ^{2}} + K {(X, X)}^{- 1})}^{- 1} (y - H β), {(\frac{I_{n}}{σ^{2}} + K {(X, X)}^{- 1})}^{- 1})$

$(3) P (f_{n e w} | f, X, x_{n e w}) = N (f_{n e w} | K (x_{n e w}^{T}, X) K {(X, X)}^{- 1} f, Δ),$

where $Δ = k (x_{n e w}, x_{n e w}) - K (x_{n e w}^{T}, X) K {(X, X)}^{- 1} K (X, x_{n e w}^{T})$.

After the integration and the required algebra, the density of the new response $y_{n e w}$ at a new point $x_{n e w}$, given $y$ and $X$, is

$P (y_{n e w} | y, X, x_{n e w}) = N (y_{n e w} | h {(x_{n e w})}^{T} β + μ, σ_{n e w}^{2} + Σ),$

where

$μ = K (x_{n e w}^{T}, X) \underbrace{{(K (X, X) + σ^{2} I_{n})}^{- 1} (y - H β)}_{α}$

and

$Σ = k (x_{n e w}, x_{n e w}) - K (x_{n e w}^{T}, X) {(K (X, X) + σ^{2} I_{n})}^{- 1} K (X, x_{n e w}^{T}) .$
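The algebra behind $μ$ and $Σ$ can be outlined as follows. This is a worked sketch, not a full proof; the shorthand $K = K (X, X)$, $K_{*} = K (x_{n e w}^{T}, X)$, and $m$, $S$ for the mean and covariance in (2) is introduced here only for brevity.

First, rewrite the moments in (2):

$S = {(\frac{I_{n}}{σ^{2}} + K^{- 1})}^{- 1} = σ^{2} K {(K + σ^{2} I_{n})}^{- 1}, \quad m = \frac{1}{σ^{2}} S (y - H β) = K {(K + σ^{2} I_{n})}^{- 1} (y - H β) .$

Integrating the product of (2) and (3) over $f$ gives a Gaussian in $f_{n e w}$ with

$E (f_{n e w} | y) = K_{*} K^{- 1} m = K_{*} {(K + σ^{2} I_{n})}^{- 1} (y - H β) = μ$

and

$Var (f_{n e w} | y) = Δ + K_{*} K^{- 1} S K^{- 1} K_{*}^{T} .$

Because $K^{- 1} S K^{- 1} = σ^{2} {(K + σ^{2} I_{n})}^{- 1} K^{- 1} = K^{- 1} - {(K + σ^{2} I_{n})}^{- 1}$, the $K^{- 1}$ terms cancel:

$Var (f_{n e w} | y) = k (x_{n e w}, x_{n e w}) - K_{*} {(K + σ^{2} I_{n})}^{- 1} K_{*}^{T} = Σ .$

Finally, integrating (1) over $f_{n e w}$ adds the basis term $h {(x_{n e w})}^{T} β$ to the mean and the noise variance $σ_{n e w}^{2}$ to the variance.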

The expected value of the prediction $y_{n e w}$ at a new point $x_{n e w}$, given $y$, $X$, and parameters $β$, $θ$, and $σ^{2}$, is

$\begin{array}{rl} E (y_{n e w} | y, X, x_{n e w}, β, θ, σ^{2}) = & h {(x_{n e w})}^{T} β + K (x_{n e w}^{T}, X | θ) α \\ = & h {(x_{n e w})}^{T} β + \sum_{i = 1}^{n} α_{i} k (x_{n e w}, x_{i} | θ), \end{array}$

where

$α = {(K (X, X | θ) + σ^{2} I_{n})}^{- 1} (y - H β) .$
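A minimal sketch of the predictive mean and variance follows. It assumes a squared exponential kernel, for which $k (x, x) = σ_{f}^{2}$, and takes $σ_{n e w}^{2} = σ^{2}$; both are assumptions for the example, as are the function name `gpr_predict` and the data.

```python
import numpy as np

def sq_exp_kernel(A, B, sigma_f, length):
    """Squared exponential kernel with theta = (sigma_f, length) (assumed)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return sigma_f**2 * np.exp(-0.5 * d2 / length**2)

def gpr_predict(X, y, H, beta, X_new, H_new, sigma_f, length, sigma2):
    """Return predictive means, predictive variances, and alpha."""
    n = y.shape[0]
    C = sq_exp_kernel(X, X, sigma_f, length) + sigma2 * np.eye(n)
    alpha = np.linalg.solve(C, y - H @ beta)             # alpha = C^{-1} (y - H beta)
    K_star = sq_exp_kernel(X_new, X, sigma_f, length)    # row j is K(x_new_j^T, X)
    mean = H_new @ beta + K_star @ alpha                 # h(x_new)^T beta + mu
    k_diag = np.full(X_new.shape[0], sigma_f**2)         # k(x_new, x_new)
    # Sigma_j = k(x_j, x_j) - K_star[j] C^{-1} K_star[j]^T
    Sigma = k_diag - np.sum(K_star * np.linalg.solve(C, K_star.T).T, axis=1)
    return mean, sigma2 + Sigma, alpha                   # assumes sigma_new^2 = sigma^2

# Toy data (assumed values), constant basis h(x) = 1
X = np.arange(5.0).reshape(-1, 1)
y = np.array([0.1, 0.9, 0.8, 0.2, -0.7])
H = np.ones((5, 1))
beta = np.array([y.mean()])
X_new = np.array([[1.5], [2.0]])
H_new = np.ones((2, 1))
sigma_f, length, sigma2 = 1.0, 1.0, 0.01
mean, var, alpha = gpr_predict(X, y, H, beta, X_new, H_new, sigma_f, length, sigma2)
```

Once `alpha` is known, the mean at each new point is a weighted sum of $n$ kernel evaluations, matching the summation form above.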

Computational Complexity of Exact Parameter Estimation and Prediction

Training a GPR model with the exact method (when FitMethod is 'Exact') requires the inversion of an n-by-n kernel matrix $K (X, X)$ . The memory requirement for this step scales as O(n²) since $K (X, X)$ must be stored in memory. One evaluation of $\log P (y | X)$ scales as O(n³). Therefore, the computational complexity is O(kn³), where k is the number of function evaluations needed for maximization and n is the number of observations.

Making predictions on new data involves the computation of $\hat{α}$ . If prediction intervals are desired, this step could also involve the computation and storage of the Cholesky factor of $(K (X, X) + σ^{2} I_{n})$ for later use. The computational complexity of this step using the direct computation of $\hat{α}$ is O(n³) and the memory requirement is O(n²).
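The following sketch shows how a stored Cholesky factor can serve both $\hat{α}$ and the predictive variance. The kernel, the data, the fixed parameter values, and the normal-quantile 95% interval are assumptions for illustration. `np.linalg.solve` is used for simplicity; a dedicated triangular solver such as `scipy.linalg.solve_triangular` would exploit the structure of the factor.

```python
import numpy as np

def sq_exp_kernel(A, B, sigma_f=1.0, length=1.0):
    """Squared exponential kernel (assumed for illustration)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return sigma_f**2 * np.exp(-0.5 * d2 / length**2)

# Toy data (assumed values), constant basis h(x) = 1
rng = np.random.default_rng(3)
n = 30
X = np.sort(rng.uniform(0.0, 5.0, size=(n, 1)), axis=0)
H = np.ones((n, 1))
beta = np.array([0.3])
sigma2 = 0.05
y = 0.3 + np.cos(X[:, 0]) + np.sqrt(sigma2) * rng.standard_normal(n)

# One-time cost: O(n^3) time for the factorization, O(n^2) memory to keep it.
C = sq_exp_kernel(X, X) + sigma2 * np.eye(n)
L = np.linalg.cholesky(C)                                # C = L L^T
r = y - H @ beta
alpha = np.linalg.solve(L.T, np.linalg.solve(L, r))      # alpha = C^{-1} r

# Per new point: the mean reuses alpha; the variance reuses L.
x_new = np.array([[2.5]])
k_star = sq_exp_kernel(x_new, X)[0]                      # K(x_new^T, X), shape (n,)
mean = beta[0] + k_star @ alpha
v = np.linalg.solve(L, k_star)                           # v^T v = k_star C^{-1} k_star^T
Sigma = sq_exp_kernel(x_new, x_new)[0, 0] - v @ v
sd = np.sqrt(sigma2 + Sigma)                             # assumes sigma_new^2 = sigma^2
interval = (mean - 1.96 * sd, mean + 1.96 * sd)          # approximate 95% interval
```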

Hence, for large n, estimation of parameters or computing predictions can be very expensive. The approximation methods usually involve rearranging the computation so as to avoid the inversion of an n-by-n matrix. For the available approximation methods, see the related links at the bottom of the page.

References

[1] Rasmussen, C. E. and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press. Cambridge, Massachusetts, 2006.
