Subset of Regressors Approximation for GPR Models

The subset of regressors (SR) approximation method replaces the kernel function $k (x, x_{r} | θ)$ in the exact GPR method with its approximation ${\hat{k}}_{S R} (x, x_{r} | θ, A)$, given the active set $A \subset N = \{1, 2, \dots, n\}$. To use the SR method for parameter estimation, specify the 'FitMethod','sr' name-value pair argument in the call to fitrgp. To use SR for prediction, specify the 'PredictMethod','sr' name-value pair argument in the call to fitrgp.

Approximating the Kernel Function

In the exact GPR model, the expected prediction depends on the set of $n$ functions $S_{N} = \{k (x, x_{i} | θ), i = 1, 2, \dots, n\}$, where $N = \{1, 2, \dots, n\}$ is the set of indices of all observations, and $n$ is the total number of observations. The idea is to approximate the span of these functions by a smaller set of functions, $S_{A}$, where $A \subset N$ is the subset of indices of points selected to be in the active set. Consider $S_{A} = \{k (x, x_{j} | θ), j \in A\}$. The aim is to approximate the elements of $S_{N}$ as linear combinations of the elements of $S_{A}$.

Suppose the approximation to $k (x, x_{r} | θ)$ using the functions in $S_{A}$ is as follows:

$\hat{k} (x, x_{r} | θ) = \sum_{j \in A} c_{j r} k (x, x_{j} | θ),$

where $c_{j r} \in ℝ$ are the coefficients of the linear combination for approximating $k (x, x_{r} | θ)$. Suppose $C$ is the matrix that contains all the coefficients $c_{j r}$. Then $C$ is a $| A | \times n$ matrix such that $C (j, r) = c_{j r}$. The software finds the best approximation to the elements of $S_{N}$ using the active set $A \subset N = \{1, 2, \dots, n\}$ by minimizing the error function

$E (A, C) = \sum_{r = 1}^{n} {‖ k (x, x_{r} | θ) - \hat{k} (x, x_{r} | θ) ‖}_{ℋ}^{2},$

where $ℋ$ is the reproducing kernel Hilbert space (RKHS) associated with the kernel function $k$ [1], [2].

The coefficient matrix that minimizes $E (A, C)$ is

${\hat{C}}_{A} = K {(X_{A}, X_{A} | θ)}^{- 1} K (X_{A}, X | θ),$

and an approximation to the kernel function using the elements in the active set $A \subset N = \{1, 2, \dots, n\}$ is

$\hat{k} (x, x_{r} | θ) = \sum_{j \in A} c_{j r} k (x, x_{j} | θ) = K (x^{T}, X_{A} | θ) C (:, r) .$
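The expression for ${\hat{C}}_{A}$ can be recovered with a short calculation, sketched here under the assumption that $K (X_{A}, X_{A} | θ)$ is invertible. Write $c_{r} = C (:, r)$ (a shorthand introduced only for this sketch). By the reproducing property ${\langle k (x, x_{i} | θ), k (x, x_{j} | θ) \rangle}_{ℋ} = k (x_{i}, x_{j} | θ)$, the $r$th term of $E (A, C)$ expands to

${‖ k (x, x_{r} | θ) - \sum_{j \in A} c_{j r} k (x, x_{j} | θ) ‖}_{ℋ}^{2} = k (x_{r}, x_{r} | θ) - 2 c_{r}^{T} K (X_{A}, x_{r}^{T} | θ) + c_{r}^{T} K (X_{A}, X_{A} | θ) c_{r} .$

Each term depends on only one column of $C$, so the columns can be minimized separately. Setting the gradient with respect to $c_{r}$ to zero gives

$K (X_{A}, X_{A} | θ) c_{r} = K (X_{A}, x_{r}^{T} | θ) \quad \Rightarrow \quad c_{r} = K {(X_{A}, X_{A} | θ)}^{- 1} K (X_{A}, x_{r}^{T} | θ),$

and collecting the columns $c_{r}$ for $r = 1, 2, \dots, n$ yields ${\hat{C}}_{A}$.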

The SR approximation to the kernel function using the active set $A \subset N = \{1, 2, \dots, n\}$ is defined as:

${\hat{k}}_{S R} (x, x_{r} | θ, A) = K (x^{T}, X_{A} | θ) {\hat{C}}_{A} (:, r) = K (x^{T}, X_{A} | θ) K {(X_{A}, X_{A} | θ)}^{- 1} K (X_{A}, x_{r}^{T} | θ)$

and the SR approximation to $K (X, X | θ)$ is:

${\hat{K}}_{S R} (X, X | θ, A) = K (X, X_{A} | θ) K {(X_{A}, X_{A} | θ)}^{- 1} K (X_{A}, X | θ) .$
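As a rough numerical illustration of ${\hat{K}}_{S R}$, the following NumPy sketch builds the approximation for a small one-dimensional data set. It is not how fitrgp is implemented. The squared exponential kernel, the evenly spaced inputs, the choice of every fifth point as the active set, and the function names `sq_exp_kernel` and `sr_kernel` are all assumptions made for this example.

```python
import numpy as np

def sq_exp_kernel(X1, X2, length=1.0):
    """Squared exponential kernel matrix between the rows of X1 and X2."""
    d2 = np.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * d2 / length**2)

def sr_kernel(X1, X2, XA, length=1.0):
    """SR approximation K(X1, X_A) K(X_A, X_A)^{-1} K(X_A, X2)."""
    KAA = sq_exp_kernel(XA, XA, length)
    # Solve a linear system instead of forming the explicit inverse.
    return sq_exp_kernel(X1, XA, length) @ np.linalg.solve(
        KAA, sq_exp_kernel(XA, X2, length))

X = np.linspace(-3.0, 3.0, 50)[:, None]   # n = 50 training inputs
active = np.arange(0, 50, 5)              # hypothetical active set, |A| = 10
K_exact = sq_exp_kernel(X, X)
K_sr = sr_kernel(X, X, X[active])

block = np.ix_(active, active)
err_active = np.max(np.abs(K_sr[block] - K_exact[block]))
err_all = np.max(np.abs(K_sr - K_exact))
print("max error on the active-set block:", err_active)
print("max error over all pairs:         ", err_all)
```

On the rows and columns indexed by the active set, the approximation reproduces the exact kernel up to rounding error, because $K (X_{A}, X_{A} | θ) K {(X_{A}, X_{A} | θ)}^{- 1} K (X_{A}, X_{A} | θ) = K (X_{A}, X_{A} | θ)$. Elsewhere it is only an approximation.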

Parameter Estimation

Replacing $K (X, X | θ)$ by ${\hat{K}}_{S R} (X, X | θ, A)$ in the marginal log likelihood function produces its SR approximation:

$\begin{array}{ll} \log P_{S R} (y | X, β, θ, σ^{2}, A) = & - \frac{1}{2} {(y - H β)}^{T} {[{\hat{K}}_{S R} (X, X | θ, A) + σ^{2} I_{n}]}^{- 1} (y - H β) \\ & - \frac{n}{2} \log 2 π - \frac{1}{2} \log | {\hat{K}}_{S R} (X, X | θ, A) + σ^{2} I_{n} | . \end{array}$

As in the exact method, the software estimates the parameters by first computing $\hat{β} (θ, σ^{2})$, the optimal estimate of $β$ given $θ$ and $σ^{2}$. Then it estimates $θ$ and $σ^{2}$ using the $β$-profiled marginal log likelihood. The SR estimate of $β$ for given $θ$ and $σ^{2}$ is:

${\hat{β}}_{S R} (θ, σ^{2}, A) = {[\underbrace{H^{T} {[{\hat{K}}_{S R} (X, X | θ, A) + σ^{2} I_{n}]}^{- 1} H}_{*}]}^{- 1} \underbrace{H^{T} {[{\hat{K}}_{S R} (X, X | θ, A) + σ^{2} I_{n}]}^{- 1} y}_{* *},$

where

$\begin{array}{l} {[{\hat{K}}_{S R} (X, X | θ, A) + σ^{2} I_{n}]}^{- 1} = \frac{I_{n}}{σ^{2}} - \frac{K (X, X_{A} | θ)}{σ^{2}} A_{A}^{- 1} \frac{K (X_{A}, X | θ)}{σ^{2}}, \\ A_{A} = K (X_{A}, X_{A} | θ) + \frac{K (X_{A}, X | θ) K (X, X_{A} | θ)}{σ^{2}}, \\ * = \frac{H^{T} H}{σ^{2}} - \frac{H^{T} K (X, X_{A} | θ)}{σ^{2}} A_{A}^{- 1} \frac{K (X_{A}, X | θ) H}{σ^{2}}, \\ * * = \frac{H^{T} y}{σ^{2}} - \frac{H^{T} K (X, X_{A} | θ)}{σ^{2}} A_{A}^{- 1} \frac{K (X_{A}, X | θ) y}{σ^{2}} . \end{array}$
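The practical point of these expressions is that the only matrix that has to be factored is $A_{A}$, which is $| A | \times | A |$ rather than $n \times n$. The NumPy sketch below evaluates ${\hat{β}}_{S R}$ from the $*$ and $* *$ terms and compares it with a direct computation that forms the $n \times n$ matrix. It is an illustration under assumed inputs, not the fitrgp implementation: the squared exponential kernel, the synthetic data, the linear basis $H$, the noise variance, and the names `sq_exp_kernel` and `sr_beta` are all hypothetical.

```python
import numpy as np

def sq_exp_kernel(X1, X2, length=1.0):
    """Squared exponential kernel matrix between the rows of X1 and X2."""
    d2 = np.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * d2 / length**2)

def sr_beta(H, y, KXA, KAA, sigma2):
    """beta_hat_SR from the * and ** expressions; no n-by-n solve is needed."""
    AA = KAA + KXA.T @ KXA / sigma2                  # A_A, size |A|-by-|A|
    HtK = H.T @ KXA / sigma2                         # H^T K(X, X_A) / sigma^2
    star = H.T @ H / sigma2 - HtK @ np.linalg.solve(AA, HtK.T)
    star2 = H.T @ y / sigma2 - HtK @ np.linalg.solve(AA, KXA.T @ y / sigma2)
    return np.linalg.solve(star, star2)

# Synthetic data (assumed for illustration only).
rng = np.random.default_rng(0)
x = np.linspace(-3.0, 3.0, 40)
X = x[:, None]
H = np.column_stack([np.ones_like(x), x])            # intercept and slope basis
y = 1.0 + 0.5 * x + np.sin(2.0 * x) + 0.3 * rng.standard_normal(x.size)
sigma2 = 0.1
active = np.arange(0, 40, 4)                         # hypothetical active set
KXA = sq_exp_kernel(X, X[active])
KAA = sq_exp_kernel(X[active], X[active])

beta_sr = sr_beta(H, y, KXA, KAA, sigma2)

# Reference computation that forms K_SR + sigma^2 I_n explicitly.
V = KXA @ np.linalg.solve(KAA, KXA.T) + sigma2 * np.eye(x.size)
beta_direct = np.linalg.solve(H.T @ np.linalg.solve(V, H),
                              H.T @ np.linalg.solve(V, y))
print("beta (low-rank form):", beta_sr)
print("beta (direct form):  ", beta_direct)
```

The two estimates should agree up to rounding error. The low-rank form costs roughly $O (n {| A |}^{2})$ operations instead of $O (n^{3})$.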

And the SR approximation to the $β$ -profiled marginal log likelihood is:

$\begin{array}{l} \log P_{S R} (y | X, {\hat{β}}_{S R} (θ, σ^{2}, A), θ, σ^{2}, A) = \\ \begin{array}{l} - \frac{1}{2} {(y - H {\hat{β}}_{S R} (θ, σ^{2}, A))}^{T} {[{\hat{K}}_{S R} (X, X | θ, A) + σ^{2} I_{n}]}^{- 1} (y - H {\hat{β}}_{S R} (θ, σ^{2}, A)) \\ - \frac{n}{2} \log 2 π - \frac{1}{2} \log | {\hat{K}}_{S R} (X, X | θ, A) + σ^{2} I_{n} | . \end{array} \end{array}$
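The profiled log likelihood can also be evaluated without any $n \times n$ factorization. The sketch below uses the inverse formula given earlier together with the matrix determinant lemma, $\log | {\hat{K}}_{S R} + σ^{2} I_{n} | = n \log σ^{2} + \log | A_{A} | - \log | K (X_{A}, X_{A} | θ) |$. The determinant identity is a standard result that this page does not state, and whether fitrgp uses it is not known here. The kernel, data, and function names are assumptions for illustration.

```python
import numpy as np

def sq_exp_kernel(X1, X2, length=1.0):
    """Squared exponential kernel matrix between the rows of X1 and X2."""
    d2 = np.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * d2 / length**2)

def sr_profiled_loglik(X, y, H, active, sigma2, length=1.0):
    """Beta-profiled SR marginal log likelihood using only |A|-by-|A| solves."""
    n = y.size
    XA = X[active]
    KAA = sq_exp_kernel(XA, XA, length)
    KXA = sq_exp_kernel(X, XA, length)
    AA = KAA + KXA.T @ KXA / sigma2                  # A_A

    def apply_Vinv(M):
        # [K_SR + sigma^2 I_n]^{-1} M via the low-rank inverse formula
        return M / sigma2 - KXA @ np.linalg.solve(AA, KXA.T @ M) / sigma2**2

    beta = np.linalg.solve(H.T @ apply_Vinv(H), H.T @ apply_Vinv(y))
    r = y - H @ beta
    # Matrix determinant lemma (an addition, not stated in the text above)
    logdet = (n * np.log(sigma2)
              + np.linalg.slogdet(AA)[1] - np.linalg.slogdet(KAA)[1])
    loglik = -0.5 * r @ apply_Vinv(r) - 0.5 * n * np.log(2 * np.pi) - 0.5 * logdet
    return loglik, beta

# Synthetic data (assumed for illustration only).
rng = np.random.default_rng(0)
x = np.linspace(-3.0, 3.0, 40)
X = x[:, None]
H = np.column_stack([np.ones_like(x), x])
y = 1.0 + 0.5 * x + np.sin(2.0 * x) + 0.3 * rng.standard_normal(x.size)
sigma2 = 0.1
active = np.arange(0, 40, 4)                         # hypothetical active set

ll_sr, beta_sr = sr_profiled_loglik(X, y, H, active, sigma2)

# Reference value from the explicit n-by-n matrix.
KXA = sq_exp_kernel(X, X[active])
KAA = sq_exp_kernel(X[active], X[active])
V = KXA @ np.linalg.solve(KAA, KXA.T) + sigma2 * np.eye(x.size)
r = y - H @ beta_sr
ll_direct = (-0.5 * r @ np.linalg.solve(V, r)
             - 0.5 * x.size * np.log(2 * np.pi) - 0.5 * np.linalg.slogdet(V)[1])
print("log likelihood (low-rank form):", ll_sr)
print("log likelihood (direct form):  ", ll_direct)
```

In an actual fit, a function like this would be the objective passed to an optimizer over $θ$ and $σ^{2}$. Here it is evaluated at a single fixed setting only.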

Prediction

The SR approximation to the distribution of $y_{n e w}$ given $y$ , $X$ , $x_{n e w}$ is

$P (y_{n e w} | y, X, x_{n e w}) = N (y_{n e w} | h {(x_{n e w})}^{T} β + μ_{S R}, σ_{n e w}^{2} + Σ_{S R}),$

where $μ_{S R}$ and $Σ_{S R}$ are the SR approximations to $μ$ and $Σ$ shown in prediction using the exact GPR method.

$μ_{S R}$ and $Σ_{S R}$ are obtained by replacing $k (x, x_{r} | θ)$ by its SR approximation ${\hat{k}}_{S R} (x, x_{r} | θ, A)$ in $μ$ and $Σ$ , respectively.

That is,

$μ_{S R} = \underbrace{{\hat{K}}_{S R} (x_{n e w}^{T}, X | θ, A)}_{(1)} \underbrace{{({\hat{K}}_{S R} (X, X | θ, A) + σ^{2} I_{n})}^{- 1}}_{(2)} (y - H β) .$

Since

$\begin{array}{l} (1) = K (x_{n e w}^{T}, X_{A} | θ) K {(X_{A}, X_{A} | θ)}^{- 1} K (X_{A}, X | θ) \end{array},$

$(2) = \frac{I_{n}}{σ^{2}} - \frac{K (X, X_{A} | θ)}{σ^{2}} {[K (X_{A}, X_{A} | θ) + \frac{K (X_{A}, X | θ) K (X, X_{A} | θ)}{σ^{2}}]}^{- 1} \frac{K (X_{A}, X | θ)}{σ^{2}},$

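and from the identity $I - B {(A + B)}^{- 1} = A {(A + B)}^{- 1}$, applied with the matrices $A = K (X_{A}, X_{A} | θ)$ and $B = K (X_{A}, X | θ) K (X, X_{A} | θ) / σ^{2}$ (in this identity only, $A$ and $B$ denote $| A | \times | A |$ matrices rather than the active set), $μ_{S R}$ can be written as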

$μ_{S R} = K (x_{n e w}^{T}, X_{A} | θ) {[K (X_{A}, X_{A} | θ) + \frac{K (X_{A}, X | θ) K (X, X_{A} | θ)}{σ^{2}}]}^{- 1} \frac{K (X_{A}, X | θ)}{σ^{2}} (y - H β) .$

Similarly, $Σ_{S R}$ is derived as follows:

$Σ_{S R} = \underbrace{{\hat{k}}_{S R} (x_{n e w}, x_{n e w} | θ, A)}_{*} - \underbrace{{\hat{K}}_{S R} (x_{n e w}^{T}, X | θ, A)}_{* *} \underbrace{{({\hat{K}}_{S R} (X, X | θ, A) + σ^{2} I_{n})}^{- 1}}_{* * *} \underbrace{{\hat{K}}_{S R} (X, x_{n e w}^{T} | θ, A)}_{* * * *} .$

Because

$* = K (x_{n e w}^{T}, X_{A} | θ) K {(X_{A}, X_{A} | θ)}^{- 1} K (X_{A}, x_{n e w}^{T} | θ),$

$\begin{array}{l} * * = K (x_{n e w}^{T}, X_{A} | θ) K {(X_{A}, X_{A} | θ)}^{- 1} K (X_{A}, X | θ), \\ * * * = (2) \text{ in the equation of } μ_{S R}, \end{array}$

$* * * * = K (X, X_{A} | θ) K {(X_{A}, X_{A} | θ)}^{- 1} K (X_{A}, x_{n e w}^{T} | θ),$

$Σ_{S R}$ is found as follows:

$Σ_{S R} = K (x_{n e w}^{T}, X_{A} | θ) {[K (X_{A}, X_{A} | θ) + \frac{K (X_{A}, X | θ) K (X, X_{A} | θ)}{σ^{2}}]}^{- 1} K (X_{A}, x_{n e w}^{T} | θ) .$
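The compact expressions for $μ_{S R}$ and $Σ_{S R}$ can be checked numerically against the longer forms they were derived from. The NumPy sketch below does this for a few new points. Everything specific in it is an assumption for illustration: the squared exponential kernel, the synthetic data, the fixed value used for $β$, and the function names. It is not the fitrgp implementation.

```python
import numpy as np

def sq_exp_kernel(X1, X2, length=1.0):
    """Squared exponential kernel matrix between the rows of X1 and X2."""
    d2 = np.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * d2 / length**2)

def sr_predict(X_new, X, resid, active, sigma2, length=1.0):
    """Compact SR forms of mu_SR and Sigma_SR; resid is y - H beta."""
    XA = X[active]
    KAA = sq_exp_kernel(XA, XA, length)
    KAX = sq_exp_kernel(XA, X, length)
    KnA = sq_exp_kernel(X_new, XA, length)
    AA = KAA + KAX @ KAX.T / sigma2                  # |A|-by-|A| matrix
    mu = KnA @ np.linalg.solve(AA, KAX @ resid / sigma2)
    Sigma = KnA @ np.linalg.solve(AA, KnA.T)
    return mu, Sigma

# Synthetic data (assumed for illustration only).
rng = np.random.default_rng(0)
x = np.linspace(-3.0, 3.0, 40)
X = x[:, None]
H = np.column_stack([np.ones_like(x), x])
y = 1.0 + 0.5 * x + np.sin(2.0 * x) + 0.3 * rng.standard_normal(x.size)
beta = np.array([1.0, 0.5])                          # hypothetical fixed beta
sigma2 = 0.1
active = np.arange(0, 40, 4)                         # hypothetical active set
X_new = np.array([[-1.3], [0.2], [2.5]])

mu_sr, Sigma_sr = sr_predict(X_new, X, y - H @ beta, active, sigma2)

# Long forms: substitute the SR kernel into the exact GPR expressions.
XA = X[active]
KAA = sq_exp_kernel(XA, XA)
KAX = sq_exp_kernel(XA, X)
KnA = sq_exp_kernel(X_new, XA)
Kn_hat = KnA @ np.linalg.solve(KAA, KAX)             # K_SR(x_new, X)
V = KAX.T @ np.linalg.solve(KAA, KAX) + sigma2 * np.eye(x.size)
mu_long = Kn_hat @ np.linalg.solve(V, y - H @ beta)
Sigma_long = (KnA @ np.linalg.solve(KAA, KnA.T)
              - Kn_hat @ np.linalg.solve(V, Kn_hat.T))
print("max |mu difference|:   ", np.max(np.abs(mu_sr - mu_long)))
print("max |Sigma difference|:", np.max(np.abs(Sigma_sr - Sigma_long)))
```

When several new points are passed at once, `Sigma_sr` is the full matrix of SR posterior covariances between them, and its diagonal holds the per-point values of $Σ_{S R}$.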

Predictive Variance Problem

One of the disadvantages of the SR method is that it can give unreasonably small predictive variances when making predictions in a region far away from the chosen active set $A \subset N = \{1, 2, \dots, n\}$. Consider making a prediction at a new point $x_{n e w}$ that is far away from the training set $X$. In other words, assume that $K (x_{n e w}^{T}, X | θ) \approx 0$.

For exact GPR, the posterior distribution of $f_{n e w}$ given $y$, $X$, and $x_{n e w}$ would be Normal with mean $μ = 0$ and variance $Σ = k (x_{n e w}, x_{n e w} | θ)$. This value is correct in the sense that, if $x_{n e w}$ is far from $X$, then the data $(X, y)$ does not supply any new information about $f_{n e w}$, and so the posterior distribution of $f_{n e w}$ given $y$, $X$, and $x_{n e w}$ should reduce to the prior distribution of $f_{n e w}$ given $x_{n e w}$, which is a Normal distribution with mean $0$ and variance $k (x_{n e w}, x_{n e w} | θ)$.

For the SR approximation, if $x_{n e w}$ is far away from $X$ (and hence also far away from $X_{A}$), then $K (x_{n e w}^{T}, X_{A} | θ) \approx 0$. This factor multiplies every term in the expressions for both $μ_{S R}$ and $Σ_{S R}$, so $μ_{S R} = 0$ and $Σ_{S R} = 0$. Thus in this extreme case, $μ_{S R}$ agrees with $μ$ from exact GPR, but $Σ_{S R}$ is unreasonably small compared to $Σ$ from exact GPR.

The fully independent conditional approximation method can help avoid this problem.
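The variance collapse is easy to reproduce numerically. The sketch below compares the exact posterior variance of $f_{n e w}$ with $Σ_{S R}$ at one point inside the training range and one point far outside it. The squared exponential kernel with unit prior variance, the inputs, the active set, and the function names are assumptions chosen for illustration.

```python
import numpy as np

def sq_exp_kernel(X1, X2, length=1.0):
    """Squared exponential kernel with unit prior variance, k(x, x) = 1."""
    d2 = np.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * d2 / length**2)

def exact_variance(x_new, X, sigma2):
    """Exact GPR posterior variance of f_new at a single point x_new."""
    KnX = sq_exp_kernel(x_new, X)
    V = sq_exp_kernel(X, X) + sigma2 * np.eye(len(X))
    Sigma = sq_exp_kernel(x_new, x_new) - KnX @ np.linalg.solve(V, KnX.T)
    return float(Sigma[0, 0])

def sr_variance(x_new, X, active, sigma2):
    """Compact SR expression for Sigma_SR at a single point x_new."""
    XA = X[active]
    KAX = sq_exp_kernel(XA, X)
    KnA = sq_exp_kernel(x_new, XA)
    AA = sq_exp_kernel(XA, XA) + KAX @ KAX.T / sigma2
    return float((KnA @ np.linalg.solve(AA, KnA.T))[0, 0])

X = np.linspace(-3.0, 3.0, 40)[:, None]
active = np.arange(0, 40, 4)            # hypothetical active set
sigma2 = 0.1

x_near = np.array([[0.05]])             # inside the training range
x_far = np.array([[50.0]])              # far from every training point

var_exact_near = exact_variance(x_near, X, sigma2)
var_sr_near = sr_variance(x_near, X, active, sigma2)
var_exact_far = exact_variance(x_far, X, sigma2)
var_sr_far = sr_variance(x_far, X, active, sigma2)

print("near: exact", var_exact_near, " SR", var_sr_near)
print("far:  exact", var_exact_far, " SR", var_sr_far)
```

At the far point the exact variance returns to the prior value $k (x_{n e w}, x_{n e w} | θ) = 1$, while the SR value falls to zero, which is the overconfidence described above.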

References

[1] Rasmussen, C. E. and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press. Cambridge, Massachusetts, 2006.

[2] Smola, A. J. and B. Schölkopf. "Sparse greedy matrix approximation for machine learning." In Proceedings of the Seventeenth International Conference on Machine Learning, 2000.