## Feature Extraction

### What Is Feature Extraction?

Feature extraction is a set of methods that map input features to new output features. Many feature extraction methods use unsupervised learning to extract features. Unlike some feature extraction methods, such as PCA and NNMF, the methods described in this section can increase dimensionality as well as decrease it. Internally, the methods involve optimizing nonlinear objective functions. For details, see Sparse Filtering Algorithm or Reconstruction ICA Algorithm.

One typical use of feature extraction is finding features in images. Using these features can lead to improved classification accuracy. For an example, see Feature Extraction Workflow. Another typical use is extracting individual signals from superpositions, which is often termed blind source separation. For an example, see Extract Mixed Signals.

There are two feature extraction functions: `rica` and `sparsefilt`. Associated with these functions are the objects that they create: `ReconstructionICA` and `SparseFiltering`.

### Sparse Filtering Algorithm

The sparse filtering algorithm begins with a data matrix `X` that has `n` rows and `p` columns. Each row represents one observation and each column represents one measurement. The columns are also called the features or predictors. The algorithm then takes either an initial random `p`-by-`q` weight matrix `W` or uses the weight matrix passed in the `InitialTransformWeights` name-value pair. `q` is the requested number of features that `sparsefilt` computes.

The algorithm attempts to minimize the Sparse Filtering Objective Function by using a standard limited memory Broyden-Fletcher-Goldfarb-Shanno (LBFGS) quasi-Newton optimizer. See Nocedal and Wright [2]. This optimizer takes up to `IterationLimit` iterations. It stops iterating earlier when it takes a step whose norm is less than `StepTolerance`, or when it computes that the norm of the gradient at the current point is less than `GradientTolerance` times a scalar τ, where

`$\tau = \max\left(1, \min\left(|f|, \|g_0\|_\infty\right)\right).$`

Here, |f| is the absolute value of the objective function, and $\|g_0\|_\infty$ is the infinity norm of the initial gradient.
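The stopping test above is easy to sketch in a few lines of NumPy. This is purely illustrative; the function name and the choice of the infinity norm for the current gradient are assumptions, not the internals of `sparsefilt` or `rica`:

```python
import numpy as np

def gradient_converged(grad, f, g0, gradient_tolerance):
    """Documented gradient stopping test (illustrative sketch).

    grad : gradient at the current point
    f    : current objective function value
    g0   : gradient at the initial point
    """
    # tau = max(1, min(|f|, ||g0||_inf))
    tau = max(1.0, min(abs(f), np.linalg.norm(g0, np.inf)))
    # Infinity norm assumed for the current gradient as well
    return np.linalg.norm(grad, np.inf) < gradient_tolerance * tau
```

For example, with `g0 = [2, -3, 1]` and `f = 5`, the scalar τ is max(1, min(5, 3)) = 3, so the iteration stops once the current gradient's infinity norm drops below 3 times `GradientTolerance`.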

The objective function attempts to simultaneously obtain few nonzero features for each data point, and for each resulting feature to have nearly equal weight. To understand how the objective function attempts to achieve these goals, see Ngiam, Koh, Chen, Bhaskar, and Ng [1].

Frequently, you obtain good features by setting a relatively small value of `IterationLimit`, from as low as 5 to a few hundred. Allowing the optimizer to continue can result in overtraining, where the extracted features do not generalize well to new data.

After constructing a `SparseFiltering` object, use the `transform` method to map input data to the new output features.

#### Sparse Filtering Objective Function

To compute an objective function, the sparse filtering algorithm uses the following steps. The objective function depends on the `n`-by-`p` data matrix `X` and a weight matrix `W` that the optimizer varies. The weight matrix `W` has dimensions `p`-by-`q`, where `p` is the number of original features and `q` is the number of requested features.

1. Compute the `n`-by-`q` matrix `X*W`. Apply the approximate absolute value function $\varphi(u) = \sqrt{u^2 + 10^{-8}}$ to each element of `X*W` to obtain the matrix `F`. $\varphi$ is a smooth, nonnegative, symmetric function that closely approximates the absolute value function.

2. Normalize the columns of `F` by the approximate L2 norm. In other words, define the normalized matrix $\tilde{F}(i,j)$ by

`$\|F(j)\| = \sqrt{\sum_{i=1}^{n} F(i,j)^2 + 10^{-8}}, \qquad \tilde{F}(i,j) = F(i,j) / \|F(j)\|.$`
3. Normalize the rows of $\tilde{F}(i,j)$ by the approximate L2 norm. In other words, define the normalized matrix $\hat{F}(i,j)$ by

`$\|\tilde{F}(i)\| = \sqrt{\sum_{j=1}^{q} \tilde{F}(i,j)^2 + 10^{-8}}, \qquad \hat{F}(i,j) = \tilde{F}(i,j) / \|\tilde{F}(i)\|.$`

The matrix $\hat{F}$ is the matrix of converted features in `X`. After `sparsefilt` finds the weights `W` that minimize the objective function h (see below), it stores the weights in the output object `Mdl` in the `Mdl.TransformWeights` property. The `transform` function can then follow the same transformation steps to convert new data to output features.

4. Compute the objective function h(`W`) as the 1-norm of the matrix $\hat{F}(i,j)$, meaning the sum of all the elements in the matrix (which are nonnegative by construction):

`$h(W) = \sum_{j=1}^{q} \sum_{i=1}^{n} \hat{F}(i,j).$`
5. If you set the `Lambda` name-value pair to a strictly positive value, `sparsefilt` uses the following modified objective function:

`$h(W) = \sum_{j=1}^{q} \sum_{i=1}^{n} \hat{F}(i,j) + \lambda \sum_{j=1}^{q} w_j^T w_j.$`

Here, $w_j$ is the jth column of the matrix `W`, and λ is the value of `Lambda`. The effect of this term is to shrink the weights `W`. If you plot the columns of `W` as images, with positive `Lambda` these images appear smooth compared to the same images with zero `Lambda`.
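The five steps above can be sketched directly in NumPy. The following is an illustrative reimplementation of the documented formulas, not the internals of `sparsefilt`; the function name is an assumption, and the 1e-8 smoothing constant follows the text:

```python
import numpy as np

def sparse_filtering_objective(X, W, lam=0.0, eps=1e-8):
    """Sparse filtering objective h(W) for n-by-p data X and p-by-q weights W."""
    # Step 1: soft absolute value of X*W
    F = np.sqrt((X @ W) ** 2 + eps)
    # Step 2: normalize each column by its approximate L2 norm
    F = F / np.sqrt(np.sum(F ** 2, axis=0) + eps)
    # Step 3: normalize each row by its approximate L2 norm
    F = F / np.sqrt(np.sum(F ** 2, axis=1, keepdims=True) + eps)
    # Step 4: sum of all (nonnegative) elements of F-hat
    h = F.sum()
    # Step 5: optional penalty lambda * sum_j w_j' w_j
    if lam > 0:
        h += lam * np.sum(W ** 2)
    return h
```

Because each row of $\hat{F}$ has (approximately) unit L2 norm and nonnegative entries, each row contributes between 1 and $\sqrt{q}$ to h, so h lies between roughly `n` and `n`·$\sqrt{q}$; the optimizer pushes each row toward the lower end, that is, toward having few nonzero features per observation.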

### Reconstruction ICA Algorithm

The Reconstruction Independent Component Analysis (RICA) algorithm is based on minimizing an objective function. The algorithm maps input data to output features.

The ICA source model is the following. Each observation x is generated by a random vector s according to

`$x = \mu + A s.$`
• x is a column vector of length `p`.

• μ is a column vector of length `p` representing a constant term.

• s is a column vector of length `q` whose elements are zero mean, unit variance random variables that are statistically independent of each other.

• A is a mixing matrix of size `p`-by-`q`.

You can use this model in `rica` to estimate A from observations of x. See Extract Mixed Signals.
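As a concrete illustration of the source model, the following NumPy sketch generates observations x = μ + As from independent, zero-mean, unit-variance sources. The uniform source distribution and the specific `A` and `mu` values are arbitrary choices for the example, not defaults of `rica`:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, q = 1000, 3, 2

# q independent sources with zero mean and unit variance
# (uniform on [-sqrt(3), sqrt(3)] has variance exactly 1)
S = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(n, q))

A = np.array([[1.0, 0.5],
              [0.2, 1.0],
              [0.7, 0.3]])        # p-by-q mixing matrix
mu = np.array([1.0, -2.0, 0.5])   # constant offset

# each row of X is one observation x^T = (mu + A s)^T
X = mu + S @ A.T
```

Given only `X`, `rica` attempts to recover a transformation that undoes the mixing by `A`, up to the usual ICA ambiguities of sign, scale, and source ordering.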

The RICA algorithm begins with a data matrix `X` that has `n` rows and `p` columns consisting of the observations xi:

`$X = \begin{bmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_n^T \end{bmatrix}.$`

Each row represents one observation and each column represents one measurement. The columns are also called the features or predictors. The algorithm then takes either an initial random `p`-by-`q` weight matrix `W` or uses the weight matrix passed in the `InitialTransformWeights` name-value pair. `q` is the requested number of features that `rica` computes. The weight matrix `W` is composed of columns wi of size `p`-by-1:

`$W = \begin{bmatrix} w_1 & w_2 & \dots & w_q \end{bmatrix}.$`

The algorithm attempts to minimize the Reconstruction ICA Objective Function by using a standard limited memory Broyden-Fletcher-Goldfarb-Shanno (LBFGS) quasi-Newton optimizer. See Nocedal and Wright [2]. This optimizer takes up to `IterationLimit` iterations. It stops iterating earlier when it takes a step whose norm is less than `StepTolerance`, or when it computes that the norm of the gradient at the current point is less than `GradientTolerance` times a scalar τ, where

`$\tau = \max\left(1, \min\left(|f|, \|g_0\|_\infty\right)\right).$`

Here, |f| is the absolute value of the objective function, and $\|g_0\|_\infty$ is the infinity norm of the initial gradient.

The objective function attempts to obtain a nearly orthonormal weight matrix that minimizes the sum of elements of g(`XW`), where g is a function (described below) that is applied elementwise to `XW`. To understand how the objective function attempts to achieve these goals, see Le, Karpenko, Ngiam, and Ng [3].

After constructing a `ReconstructionICA` object, use the `transform` method to map input data to the new output features.

#### Reconstruction ICA Objective Function

The objective function uses a contrast function, which you specify by using the `ContrastFcn` name-value pair. The contrast function is a smooth convex function that is similar to an absolute value. By default, the contrast function is $g(x) = \tfrac{1}{2}\log\left(\cosh(2x)\right)$. For other available contrast functions, see `ContrastFcn`.
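The default contrast function is easy to inspect numerically. The NumPy sketch below (illustrative only, not part of the `rica` interface) shows that g is smooth at 0 and approaches |x| − ½·log 2 for large |x|, so it behaves like a smoothed absolute value:

```python
import numpy as np

def g(x):
    """Default RICA contrast function: (1/2) log(cosh(2x))."""
    # Written as log(cosh(u)) = |u| + log1p(exp(-2|u|)) - log(2)
    # to avoid overflow in cosh for large |u|
    u = 2.0 * np.abs(x)
    return 0.5 * (u + np.log1p(np.exp(-2.0 * u)) - np.log(2.0))
```

Evaluating at a few points: g(0) = 0, g is symmetric, and already at x = 5 the value is within about 10⁻⁸ of |x| − ½·log 2.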

For an `n`-by-`p` data matrix `X` and `q` output features, with a regularization parameter λ as the value of the `Lambda` name-value pair, the objective function in terms of the `p`-by-`q` matrix `W` is

`$h = \frac{\lambda}{n} \sum_{i=1}^{n} \left\| W W^T x_i - x_i \right\|_2^2 + \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{q} \sigma_j\, g\!\left(w_j^T x_i\right).$`

The $\sigma_j$ are known constants that are ±1. When $\sigma_j = +1$, minimizing the objective function h encourages the histogram of $w_j^T x_i$ to be sharply peaked at 0 (super-Gaussian). When $\sigma_j = -1$, minimizing the objective function h encourages the histogram of $w_j^T x_i$ to be flatter near 0 (sub-Gaussian). Specify the $\sigma_j$ values using the `rica` `NonGaussianityIndicator` name-value pair.

The objective function h can have a spurious minimum of zero when λ is zero. Therefore, `rica` minimizes h over matrices `W` whose columns are normalized to approximately unit length. In other words, each column $w_j$ of `W` is defined in terms of a column vector $v_j$ by

`$w_j = \frac{v_j}{\sqrt{v_j^T v_j + 10^{-8}}}.$`

`rica` minimizes over the $v_j$. The resulting minimal matrix `W` provides the transformation from input data `X` to output features `XW`.
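Putting the pieces together, the RICA objective can be sketched in NumPy from the formulas above. This is a naive reimplementation for illustration, not the `rica` internals; it uses the default contrast function, and the parameter names mirror `Lambda` and `NonGaussianityIndicator`:

```python
import numpy as np

def rica_objective(X, V, sigma, lam, eps=1e-8):
    """RICA objective h for n-by-p data X and unnormalized p-by-q weights V.

    sigma : length-q vector of +/-1 (the NonGaussianityIndicator values)
    lam   : regularization parameter (the Lambda value)
    """
    n = X.shape[0]
    # Normalize each column: w_j = v_j / sqrt(v_j' v_j + eps)
    W = V / np.sqrt(np.sum(V ** 2, axis=0) + eps)
    # Reconstruction penalty: (lambda/n) * sum_i ||W W' x_i - x_i||^2
    # (rows of R are the transposed residuals W W' x_i - x_i)
    R = X @ W @ W.T - X
    recon = lam / n * np.sum(R ** 2)
    # Contrast term: (1/n) * sum_i sum_j sigma_j * g(w_j' x_i),
    # with the default contrast g(x) = (1/2) log(cosh(2x)),
    # written in an overflow-safe form
    u = 2.0 * np.abs(X @ W)
    gvals = 0.5 * (u + np.log1p(np.exp(-2.0 * u)) - np.log(2.0))
    contrast = np.sum(sigma * gvals) / n
    return recon + contrast
```

With λ = 0 the reconstruction term vanishes and only the contrast term remains, which is exactly why the column normalization is needed to rule out the spurious zero minimum described above.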

## References

[1] Ngiam, Jiquan, Zhenghao Chen, Sonia A. Bhaskar, Pang W. Koh, and Andrew Y. Ng. “Sparse Filtering.” Advances in Neural Information Processing Systems. Vol. 24, 2011, pp. 1125–1133. `https://papers.nips.cc/paper/4334-sparse-filtering.pdf`.

[2] Nocedal, J. and S. J. Wright. Numerical Optimization, Second Edition. Springer Series in Operations Research, Springer Verlag, 2006.

[3] Le, Quoc V., Alexandre Karpenko, Jiquan Ngiam, and Andrew Y. Ng. “ICA with Reconstruction Cost for Efficient Overcomplete Feature Learning.” Advances in Neural Information Processing Systems. Vol. 24, 2011, pp. 1017–1025. `https://papers.nips.cc/paper/4467-ica-with-reconstruction-cost-for-efficient-overcomplete-feature-learning.pdf`.