Neighborhood Component Analysis (NCA) Feature Selection

Neighborhood component analysis (NCA) is a non-parametric and embedded method for selecting features with the goal of maximizing prediction accuracy of regression and classification algorithms. The Statistics and Machine Learning Toolbox™ functions fscnca and fsrnca perform NCA feature selection with regularization to learn feature weights for minimization of an objective function that measures the average leave-one-out classification or regression loss over the training data.

NCA Feature Selection for Classification

Consider a multi-class classification problem with a training set containing n observations:

S={(xi,yi),i=1,2,,n},

where xip are the feature vectors, yi{1,2,,c} are the class labels, and c is the number of classes. The aim is to learn a classifier f:p{1,2,,c} that accepts a feature vector and makes a prediction f(x) for the true label y of x.

Consider a randomized classifier that:

  • Randomly picks a point, Ref(x), from S as the ‘reference point’ for x

  • Labels x using the label of the reference point Ref(x).

This scheme is similar to that of a 1-NN classifier where the reference point is chosen to be the nearest neighbor of the new point x. In NCA, the reference point is chosen randomly and all points in S have some probability of being selected as the reference point. The probability P(Ref(x)= xj|S) that point xj is picked from S as the reference point for x is higher if xj is closer to x as measured by the distance function dw, where

dw(xi,xj)=r=1pwr2|xirxjr|,

and wr are the feature weights. Assume that

P(Ref(x)= xj|S)k(dw(x,xj)),

where k is some kernel or a similarity function that assumes large values when dw(x,xj) is small. Suppose it is

k(z)=exp(zσ),

as suggested in [1]. The reference point for x is chosen from S, so sum of P(Ref(x)= xj|S) for all j must be equal to 1. Therefore, it is possible to write

P(Ref(x)= xj|S)=k(dw(x,xj))j=1nk(dw(x,xj)).

Now consider the leave-one-out application of this randomized classifier, that is, predicting the label of xi using the data in Si, the training set S excluding the point (xi,yi). The probability that point xj is picked as the reference point for xi is

pij=P(Ref(xi)= xj|Si)=k(dw(xi,xj))j=1,jink(dw(xi,xj)).

The average leave-one-out probability of correct classification is the probability pi that the randomized classifier correctly classifies observation i using Si.

pi=j=1,jinP(Ref(xi)=xj|Si)I(yi=yj)=j=1,jinpijyij,

where

yij=I(yi=yj)={1ifyi=yj,0otherwise.

The average leave-one-out probability of correct classification using the randomized classifier can be written as

F(w)=1ni=1npi.

The right hand side of F(w) depends on the weight vector w. The goal of neighborhood component analysis is to maximize F(w) with respect to w. fscnca uses the regularized objective function as introduced in [1].

F(w)=1ni=1npiλr=1pwr2=1ni=1n[j=1,jinpijyijλr=1pwr2]Fi(w)=1ni=1nFi(w),

where λ is the regularization parameter. The regularization term drives many of the weights in w to 0.

After choosing the kernel parameter σ in pij as 1, finding the weight vector w can be expressed as the following minimization problem for given λ.

w^=argminwf(w)=argminw1ni=1nfi(w),

where f(w) = -F(w) and fi(w) = -Fi(w).

Note that

1ni=1nj=1,jinpij=1,

and the argument of the minimum does not change if you add a constant to an objective function. Therefore, you can rewrite the objective function by adding the constant 1.

w^=argminw{1+f(w)}=argminw{1ni=1nj=1,jinpij1ni=1nj=1,jinpijyij+λr=1pwr2}=argminw{1ni=1nj=1,jinpij(1yij)+λr=1pwr2}=argminw{1ni=1nj=1,jinpijl(yi,yj)+λr=1pwr2},

where the loss function is defined as

l(yi,yj)={1ifyiyj,0otherwise.

The argument of the minimum is the weight vector that minimizes the classification error. You can specify a custom loss function using the LossFunction name-value pair argument in the call to fscnca.

NCA Feature Selection for Regression

The fsrnca function performs NCA feature selection modified for regression. Given n observations

S={(xi,yi),i=1,2,,n},

the only difference from the classification problem is that the response values yi are continuous. In this case, the aim is to predict the response y given the training set S.

Consider a randomized regression model that:

  • Randomly picks a point (Ref(x)) from Sas the ‘reference point’ for x

  • Sets the response value at x equal to the response value of the reference point Ref(x).

Again, the probability P(Ref(x)= xj|S) that point xj is picked from S as the reference point for x is

P(Ref(x)= xj|S)=k(dw(x,xj))j=1nk(dw(x,xj)).

Now consider the leave-one-out application of this randomized regression model, that is, predicting the response for xi using the data in Si, the training set S excluding the point (xi,yi). The probability that point xj is picked as the reference point for xi is

pij=P(Ref(xi)= xj|Si)=k(dw(xi,xj))j=1,jink(dw(xi,xj)).

Let y^i be the response value the randomized regression model predicts and yi be the actual response for xi. And let l:2 be a loss function that measures the disagreement between y^i and yi. Then, the average value of l(yi,y^i) is

li=E(l(yi,y^i)|Si)=j=1,jinpijl(yi,yj).

After adding the regularization term, the objective function for minimization is:

f(w)=1ni=1nli+λr=1pwr2.

The default loss function l(yi,yj) for NCA for regression is mean absolute deviation, but you can specify other loss functions, including a custom one, using the LossFunction name-value pair argument in the call to fsrnca.

Impact of Standardization

The regularization term derives the weights of irrelevant predictors to zero. In the objective functions for NCA for classification or regression, there is only one regularization parameter λ for all weights. This fact requires the magnitudes of the weights to be comparable to each other. When the feature vectors xi in S are in different scales, this might result in weights that are in different scales and not meaningful. To avoid this situation, standardize the predictors to have zero mean and unit standard deviation before applying NCA. You can standardize the predictors using the 'Standardize',true name-value pair argument in the call to fscnca or fsrnca.

Choosing the Regularization Parameter Value

It is usually necessary to select a value of the regularization parameter by calculating the accuracy of the randomized NCA classifier or regression model on an independent test set. If you use cross-validation instead of a single test set, select the λ value that minimizes the average loss across the cross-validation folds. For examples, see Tune Regularization Parameter to Detect Features Using NCA for Classification and Tune Regularization Parameter in NCA for Regression.

References

[1] Yang, W., K. Wang, W. Zuo. "Neighborhood Component Feature Selection for High-Dimensional Data." Journal of Computers. Vol. 7, Number 1, January, 2012.

See Also

|