Neighborhood component analysis (NCA) is a non-parametric, embedded method for selecting features with the goal of maximizing prediction accuracy of regression and classification algorithms. The Statistics and Machine Learning Toolbox™ functions `fscnca` and `fsrnca` perform NCA feature selection with regularization to learn feature weights that minimize an objective function measuring the average leave-one-out classification or regression loss over the training data.
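
For orientation, here is a minimal sketch of a typical call, assuming a numeric predictor matrix `X` and a label vector `y` (placeholder names for your own data); the fitted model stores the learned weights in its `FeatureWeights` property.

```matlab
% Minimal sketch: X is an n-by-p predictor matrix and y an n-by-1 label
% vector (placeholder names for your own data).
mdl = fscnca(X, y);             % learn NCA feature weights
w = mdl.FeatureWeights;         % p-by-1 vector of learned weights
selected = find(w > 0.01);      % illustrative threshold on the weights
```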

Consider a multi-class classification problem with a training
set containing *n* observations:

$$S=\left\{\left({x}_{i},{y}_{i}\right),\;i=1,2,\dots ,n\right\},$$

where $${x}_{i}\in {\mathbb{R}}^{p}$$ are
the feature vectors, $${y}_{i}\in \{1,2,\dots ,c\}$$ are
the class labels, and *c* is the number of classes.
The aim is to learn a classifier $$f:{\mathbb{R}}^{p}\to \{1,2,\dots ,c\}$$ that
accepts a feature vector and makes a prediction $$f\left(x\right)$$ for
the true label $$y$$ of $$x$$.

Consider a randomized classifier that:

- Randomly picks a point, $$\text{Ref}\left(x\right)$$, from $$S$$ as the ‘reference point’ for $$x$$.
- Labels $$x$$ using the label of the reference point $$\text{Ref}\left(x\right)$$.

This scheme is similar to that of a 1-NN classifier, where the reference point is chosen to be the nearest neighbor of the new point $$x$$. In NCA, the reference point is chosen randomly and all points in $$S$$ have some probability of being selected as the reference point. The probability $$P\left(\text{Ref}\left(x\right)={x}_{j}|S\right)$$ that point $${x}_{j}$$ is picked from $$S$$ as the reference point for $$x$$ is higher if $${x}_{j}$$ is closer to $$x$$ as measured by the distance function $${d}_{w}$$, where

$${d}_{w}({x}_{i},{x}_{j})={\displaystyle \sum _{r=1}^{p}{w}_{r}^{2}}\left|{x}_{ir}-{x}_{jr}\right|,$$

and $${w}_{r}$$ are the feature weights.
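In code, this distance transcribes directly; a minimal sketch, assuming illustrative 1-by-p row vectors `xi`, `xj`, and `w` (not toolbox API):

```matlab
% d_w(xi, xj): weighted L1 distance with squared per-feature weights.
% xi, xj, and w are illustrative 1-by-p row vectors.
dw = sum(w.^2 .* abs(xi - xj));
```

Assume that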

$$P\left(\text{Ref}\left(x\right)={x}_{j}|S\right)\propto k\left({d}_{w}\left(x,{x}_{j}\right)\right),$$

where $$k$$ is a kernel or similarity function that takes large values when $${d}_{w}\left(x,{x}_{j}\right)$$ is small. Suppose the kernel is

$$k\left(z\right)=\mathrm{exp}\left(-\frac{z}{\sigma}\right),$$

as suggested in [1]. The reference point for $$x$$ is chosen from $$S$$, so the sum of $$P\left(\text{Ref}\left(x\right)={x}_{j}|S\right)$$ over all *j* must equal 1. Therefore, you can write

$$P\left(\text{Ref}\left(x\right)={x}_{j}|S\right)=\frac{k\left({d}_{w}\left(x,{x}_{j}\right)\right)}{\sum _{m=1}^{n}k\left({d}_{w}\left(x,{x}_{m}\right)\right)}.$$

Now consider the leave-one-out application of this randomized classifier, that is, predicting the label of $${x}_{i}$$ using the data in $${S}^{-i}$$, the training set $$S$$ excluding the point $$\left({x}_{i},{y}_{i}\right)$$. The probability that point $${x}_{j}$$ is picked as the reference point for $${x}_{i}$$ is

$${p}_{ij}=P\left(\text{Ref}\left({x}_{i}\right)={x}_{j}|{S}^{-i}\right)=\frac{k\left({d}_{w}\left({x}_{i},{x}_{j}\right)\right)}{\sum _{m=1,m\ne i}^{n}k\left({d}_{w}\left({x}_{i},{x}_{m}\right)\right)}.$$
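
As a sketch (under the assumption $$\sigma =1$$, and with `X` and `w` as illustrative names), you can compute the whole matrix of leave-one-out probabilities at once. Because the squared weights are nonnegative, scaling each column of `X` by $${w}_{r}^{2}$$ turns $${d}_{w}$$ into an ordinary city block distance:

```matlab
% Leave-one-out reference probabilities P(i,j) = p_ij, with sigma = 1.
% X (n-by-p) and w (1-by-p) are illustrative names, not toolbox API.
n = size(X, 1);
Xw = X .* (w.^2);                        % column scaling makes d_w a cityblock distance
D = squareform(pdist(Xw, 'cityblock'));  % n-by-n matrix of d_w(x_i, x_j)
K = exp(-D);                             % kernel values k(z) = exp(-z/sigma)
K(1:n+1:end) = 0;                        % exclude j = i (leave-one-out)
P = K ./ sum(K, 2);                      % row-normalize so row i sums to 1
```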

The leave-one-out probability of correct classification is the probability $${p}_{i}$$ that the randomized classifier correctly classifies observation *i* using $${S}^{-i}$$:

$${p}_{i}=\sum _{j=1,j\ne i}^{n}P\left(\text{Ref}\left({x}_{i}\right)={x}_{j}|{S}^{-i}\right)I\left({y}_{i}={y}_{j}\right)=\sum _{j=1,j\ne i}^{n}{p}_{ij}{y}_{ij},$$

where

$${y}_{ij}=I\left({y}_{i}={y}_{j}\right)=\begin{cases}1 & \text{if } {y}_{i}={y}_{j},\\ 0 & \text{otherwise.}\end{cases}$$

The average leave-one-out probability of correct classification using the randomized classifier can be written as

$$F\left(w\right)=\frac{1}{n}{\displaystyle \sum _{i=1}^{n}{p}_{i}}.$$
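
Continuing the sketch above, $$F\left(w\right)$$ follows by masking `P` with the label-agreement indicator (assuming labels in a numeric vector `y`):

```matlab
% F(w): average leave-one-out probability of correct classification.
% Reuses P from the previous sketch; y is an n-by-1 numeric label vector.
Y = double(y == y.');        % Y(i,j) = y_ij, the indicator I(y_i == y_j)
p = sum(P .* Y, 2);          % p_i for each observation i
F = mean(p);                 % F(w) = (1/n) * sum_i p_i
```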

The right-hand side of $$F\left(w\right)$$ depends on the weight vector $$w$$. The goal of neighborhood component analysis is to maximize $$F\left(w\right)$$ with respect to $$w$$. `fscnca` uses the regularized objective function introduced in [1]:

$$\begin{aligned}F\left(w\right)&=\frac{1}{n}\sum _{i=1}^{n}{p}_{i}-\lambda \sum _{r=1}^{p}{w}_{r}^{2}\\ &=\frac{1}{n}\sum _{i=1}^{n}\underset{{F}_{i}\left(w\right)}{\underbrace{\left[\sum _{j=1,j\ne i}^{n}{p}_{ij}{y}_{ij}-\lambda \sum _{r=1}^{p}{w}_{r}^{2}\right]}}\\ &=\frac{1}{n}\sum _{i=1}^{n}{F}_{i}\left(w\right),\end{aligned}$$

where $$\lambda $$ is the regularization parameter. The regularization term drives many of the weights in $$w$$ to 0.

After setting the kernel parameter $$\sigma $$ in $${p}_{ij}$$ to 1, you can express finding the weight vector $$w$$ as the following minimization problem for a given $$\lambda $$:

$$\widehat{w}=\underset{w}{\text{argmin}}f\left(w\right)=\underset{w}{\text{argmin}}\frac{1}{n}{\displaystyle \sum _{i=1}^{n}{f}_{i}\left(w\right)},$$

where $$f\left(w\right)=-F\left(w\right)$$ and $${f}_{i}\left(w\right)=-{F}_{i}\left(w\right)$$.

Note that

$$\frac{1}{n}{\displaystyle \sum _{i=1}^{n}{\displaystyle \sum _{j=1,j\ne i}^{n}{p}_{ij}}}=1,$$

and the argument of the minimum does not change if you add a constant to an objective function. Therefore, you can rewrite the objective function by adding the constant 1.

$$\begin{aligned}\widehat{w}&=\underset{w}{\text{argmin}}\left\{1+f\left(w\right)\right\}\\ &=\underset{w}{\text{argmin}}\left\{\frac{1}{n}\sum _{i=1}^{n}\sum _{j=1,j\ne i}^{n}{p}_{ij}-\frac{1}{n}\sum _{i=1}^{n}\sum _{j=1,j\ne i}^{n}{p}_{ij}{y}_{ij}+\lambda \sum _{r=1}^{p}{w}_{r}^{2}\right\}\\ &=\underset{w}{\text{argmin}}\left\{\frac{1}{n}\sum _{i=1}^{n}\sum _{j=1,j\ne i}^{n}{p}_{ij}\left(1-{y}_{ij}\right)+\lambda \sum _{r=1}^{p}{w}_{r}^{2}\right\}\\ &=\underset{w}{\text{argmin}}\left\{\frac{1}{n}\sum _{i=1}^{n}\sum _{j=1,j\ne i}^{n}{p}_{ij}\,l\left({y}_{i},{y}_{j}\right)+\lambda \sum _{r=1}^{p}{w}_{r}^{2}\right\},\end{aligned}$$

where the loss function is defined as

$$l\left({y}_{i},{y}_{j}\right)=\begin{cases}1 & \text{if } {y}_{i}\ne {y}_{j},\\ 0 & \text{otherwise.}\end{cases}$$
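
In the same sketch notation, the final form of the objective evaluates as follows (with `lambda` a given scalar):

```matlab
% f(w) in its final form: expected leave-one-out misclassification
% probability plus the regularization penalty. P, w, y as in earlier sketches.
L = double(y ~= y.');                           % l(y_i, y_j), 0-1 loss
f = mean(sum(P .* L, 2)) + lambda*sum(w.^2);    % objective to minimize
```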

The argument of the minimum is the weight vector that minimizes the classification error. You can specify a custom loss function using the `LossFunction` name-value pair argument in the call to `fscnca`.
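
For example, a hedged sketch of passing a custom loss handle; the exact signature that `fscnca` expects is not spelled out in this section, so treat the two-argument, elementwise form as an assumption and check the `fscnca` reference page:

```matlab
% Hypothetical custom loss that penalizes disagreement twice as heavily.
% The two-argument elementwise signature is an assumption; see the
% fscnca documentation for the required form of a custom loss handle.
customLoss = @(yi, yj) 2*double(yi ~= yj);
mdl = fscnca(X, y, 'LossFunction', customLoss);
```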

The `fsrnca` function performs NCA feature selection modified for regression. Given *n* observations

$$S=\left\{\left({x}_{i},{y}_{i}\right),\;i=1,2,\dots ,n\right\},$$

the only difference from the classification problem is that the response values $${y}_{i}\in \mathbb{R}$$ are continuous. In this case, the aim is to predict the response $$y$$ given the training set $$S$$.

Consider a randomized regression model that:

- Randomly picks a point, $$\text{Ref}\left(x\right)$$, from $$S$$ as the ‘reference point’ for $$x$$.
- Sets the response value at $$x$$ equal to the response value of the reference point $$\text{Ref}\left(x\right)$$.

Again, the probability $$P\left(\text{Ref}\left(x\right)={x}_{j}|S\right)$$ that point $${x}_{j}$$ is picked from $$S$$ as the reference point for $$x$$ is

$$P\left(\text{Ref}\left(x\right)={x}_{j}|S\right)=\frac{k\left({d}_{w}\left(x,{x}_{j}\right)\right)}{\sum _{m=1}^{n}k\left({d}_{w}\left(x,{x}_{m}\right)\right)}.$$

Now consider the leave-one-out application of this randomized regression model, that is, predicting the response for $${x}_{i}$$ using the data in $${S}^{-i}$$, the training set $$S$$ excluding the point $$\left({x}_{i},{y}_{i}\right)$$. The probability that point $${x}_{j}$$ is picked as the reference point for $${x}_{i}$$ is

$${p}_{ij}=P\left(\text{Ref}\left({x}_{i}\right)={x}_{j}|{S}^{-i}\right)=\frac{k\left({d}_{w}\left({x}_{i},{x}_{j}\right)\right)}{\sum _{m=1,m\ne i}^{n}k\left({d}_{w}\left({x}_{i},{x}_{m}\right)\right)}.$$

Let $${\widehat{y}}_{i}$$ be the response value that the randomized regression model predicts for $${x}_{i}$$, and let $${y}_{i}$$ be the actual response. Let $$l:{\mathbb{R}}^{2}\to \mathbb{R}$$ be a loss function that measures the disagreement between $${\widehat{y}}_{i}$$ and $${y}_{i}$$. Then the expected value of $$l\left({y}_{i},{\widehat{y}}_{i}\right)$$ given $${S}^{-i}$$ is

$${l}_{i}=E\left(l\left({y}_{i},{\widehat{y}}_{i}\right)|{S}^{-i}\right)={\displaystyle \sum _{j=1,j\ne i}^{n}{p}_{ij}l\left({y}_{i},{y}_{j}\right).}$$

After adding the regularization term, the objective function for minimization is:

$$f\left(w\right)=\frac{1}{n}{\displaystyle \sum _{i=1}^{n}{l}_{i}+\lambda {\displaystyle \sum _{r=1}^{p}{w}_{r}^{2}}}.$$
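
In the sketch notation used earlier (with `P` the leave-one-out probability matrix, `y` now a numeric response vector, and mean absolute deviation as the loss), the regression objective evaluates as:

```matlab
% Regression objective f(w): expected leave-one-out loss plus penalty.
% P, w, and lambda as in the earlier sketches; y is an n-by-1 response.
L = abs(y - y.');                               % l(y_i, y_j) = |y_i - y_j|
f = mean(sum(P .* L, 2)) + lambda*sum(w.^2);    % objective to minimize
```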

The default loss function $$l\left({y}_{i},{y}_{j}\right)$$ for NCA for regression is mean absolute deviation, but you can specify other loss functions, including a custom one, using the `LossFunction` name-value pair argument in the call to `fsrnca`.
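
For example, a minimal sketch choosing one of the built-in losses (the `Epsilon` value here is an illustrative choice):

```matlab
% Fit NCA for regression with epsilon-insensitive loss instead of the
% default mean absolute deviation; X and y are placeholder data names.
mdl = fsrnca(X, y, 'LossFunction', 'epsiloninsensitive', 'Epsilon', 0.8);
```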

The regularization term drives the weights of irrelevant predictors to zero. In the objective functions for NCA for classification and regression, there is only one regularization parameter $$\lambda $$ for all weights. This fact requires the magnitudes of the weights to be comparable to each other. When the feature vectors $${x}_{i}$$ in $$S$$ are on different scales, the resulting weights can also be on different scales and therefore not meaningful. To avoid this situation, standardize the predictors to have zero mean and unit standard deviation before applying NCA. You can standardize the predictors using the `'Standardize',true` name-value pair argument in the call to `fscnca` or `fsrnca`.
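
For example, a one-line sketch with placeholder data names:

```matlab
% Standardize predictors to zero mean and unit standard deviation
% as part of the fit; X and y are placeholder data names.
mdl = fsrnca(X, y, 'Standardize', true);
```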

It is usually necessary to select a value of the regularization parameter by calculating the accuracy of the randomized NCA classifier or regression model on an independent test set. If you use cross-validation instead of a single test set, select the $$\lambda $$ value that minimizes the average loss across the cross-validation folds. For examples, see Tune Regularization Parameter to Detect Features Using NCA for Classification and Tune Regularization Parameter in NCA for Regression.
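
A hedged sketch of the cross-validation approach for the classification case follows; the $$\lambda $$ grid, the fold count, and the use of the model's `loss` method with its default misclassification loss are illustrative choices:

```matlab
% Tune lambda by 5-fold cross-validation (classification case).
% X and y are placeholder data names; the lambda grid is illustrative.
n = size(X, 1);
cvp = cvpartition(y, 'KFold', 5);
lambdavals = linspace(0, 2, 20) / n;
lossvals = zeros(numel(lambdavals), cvp.NumTestSets);
for i = 1:numel(lambdavals)
    for k = 1:cvp.NumTestSets
        Xtrain = X(cvp.training(k), :);  ytrain = y(cvp.training(k));
        Xtest  = X(cvp.test(k), :);      ytest  = y(cvp.test(k));
        nca = fscnca(Xtrain, ytrain, 'Lambda', lambdavals(i), ...
            'Standardize', true);
        lossvals(i, k) = loss(nca, Xtest, ytest);  % misclassification loss
    end
end
meanloss = mean(lossvals, 2);            % average loss per lambda value
[~, idx] = min(meanloss);
bestlambda = lambdavals(idx);            % lambda minimizing the CV loss
```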

[1] Yang, W., K. Wang, and W. Zuo. "Neighborhood Component Feature Selection for High-Dimensional Data." *Journal of Computers*. Vol. 7, Number 1, January 2012.