## Feature Extraction

### What Is Feature Extraction?

Feature extraction is a set of methods that map input features to new output features. Many feature extraction methods use unsupervised learning to extract features. Unlike some feature extraction methods, such as PCA and NNMF, the methods described in this section can increase dimensionality as well as decrease it. Internally, the methods involve optimizing nonlinear objective functions. For details, see Sparse Filtering Algorithm or Reconstruction ICA Algorithm.

One typical use of feature extraction is finding features in images. Using these features can lead to improved classification accuracy. For an example, see Feature Extraction Workflow. Another typical use is extracting individual signals from superpositions, which is often termed blind source separation. For an example, see Extract Mixed Signals.

There are two feature extraction functions: `rica` and `sparsefilt`. Associated with these functions are the objects that they create: `ReconstructionICA` and `SparseFiltering`.

### Sparse Filtering Algorithm

The sparse filtering algorithm begins with a data matrix `X` that has `n` rows and `p` columns. Each row represents one observation and each column represents one measurement. The columns are also called the features or predictors. The algorithm then takes either an initial random `p`-by-`q` weight matrix `W` or uses the weight matrix passed in the `InitialTransformWeights` name-value pair. `q` is the requested number of features that `sparsefilt` computes.

The algorithm attempts to minimize the Sparse Filtering Objective Function by using a standard limited memory Broyden-Fletcher-Goldfarb-Shanno (LBFGS) quasi-Newton optimizer. See Nocedal and Wright [2]. This optimizer takes up to `IterationLimit` iterations. It stops iterating earlier when it takes a step whose norm is less than `StepTolerance`, or when it computes that the norm of the gradient at the current point is less than `GradientTolerance` times a scalar τ, where

`$\tau = \max\left(1, \min\left(|f|, \|g_0\|_\infty\right)\right).$`

Here, |f| is the absolute value of the objective function, and $\|g_0\|_\infty$ is the infinity norm of the initial gradient.
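The stopping test above is easy to sketch in a few lines of NumPy. This is purely illustrative; the function name and the choice of the infinity norm for the current gradient are assumptions, not the internals of `sparsefilt` or `rica`:

```python
import numpy as np

def gradient_converged(grad, f, g0, gradient_tolerance):
    """Documented gradient stopping test (illustrative sketch).

    grad : gradient at the current point
    f    : current objective function value
    g0   : gradient at the initial point
    """
    # tau = max(1, min(|f|, ||g0||_inf))
    tau = max(1.0, min(abs(f), np.linalg.norm(g0, np.inf)))
    # Infinity norm assumed for the current gradient as well
    return np.linalg.norm(grad, np.inf) < gradient_tolerance * tau
```

For example, with `g0 = [2, -3, 1]` and `f = 5`, the scalar τ is max(1, min(5, 3)) = 3, so the iteration stops once the current gradient's infinity norm drops below 3 times `GradientTolerance`.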

The objective function attempts to simultaneously obtain few nonzero features for each data point, and for each resulting feature to have nearly equal weight. To understand how the objective function attempts to achieve these goals, see Ngiam, Koh, Chen, Bhaskar, and Ng [1].

Frequently, you obtain good features by setting a relatively small value of `IterationLimit`, from as low as 5 to a few hundred. Allowing the optimizer to continue can result in overtraining, where the extracted features do not generalize well to new data.

After constructing a `SparseFiltering` object, use the `transform` method to map input data to the new output features.

#### Sparse Filtering Objective Function

To compute an objective function, the sparse filtering algorithm uses the following steps. The objective function depends on the `n`-by-`p` data matrix `X` and a weight matrix `W` that the optimizer varies. The weight matrix `W` has dimensions `p`-by-`q`, where `p` is the number of original features and `q` is the number of requested features.

1. Compute the `n`-by-`q` matrix `X*W`. Apply the approximate absolute value function $\varphi(u) = \sqrt{u^2 + 10^{-8}}$ to each element of `X*W` to obtain the matrix `F`. $\varphi$ is a smooth, nonnegative, symmetric function that closely approximates the absolute value function.

2. Normalize the columns of `F` by the approximate L2 norm. In other words, define the normalized matrix $\tilde{F}(i,j)$ by

`$\|F(j)\| = \sqrt{\sum_{i=1}^{n} F(i,j)^2 + 10^{-8}}, \qquad \tilde{F}(i,j) = F(i,j) / \|F(j)\|.$`
3. Normalize the rows of $\tilde{F}(i,j)$ by the approximate L2 norm. In other words, define the normalized matrix $\hat{F}(i,j)$ by

`$\|\tilde{F}(i)\| = \sqrt{\sum_{j=1}^{q} \tilde{F}(i,j)^2 + 10^{-8}}, \qquad \hat{F}(i,j) = \tilde{F}(i,j) / \|\tilde{F}(i)\|.$`

The matrix $\hat{F}$ is the matrix of converted features in `X`. After `sparsefilt` finds the weights `W` that minimize the objective function h (see below), it stores the weights in the output object `Mdl` in the `Mdl.TransformWeights` property. The `transform` function can then follow the same transformation steps to convert new data to output features.

4. Compute the objective function h(`W`) as the 1-norm of the matrix $\hat{F}(i,j)$, meaning the sum of all the elements in the matrix (which are nonnegative by construction):

`$h(W) = \sum_{j=1}^{q} \sum_{i=1}^{n} \hat{F}(i,j).$`
5. If you set the `Lambda` name-value pair to a strictly positive value, `sparsefilt` uses the following modified objective function:

`$h(W) = \sum_{j=1}^{q} \sum_{i=1}^{n} \hat{F}(i,j) + \lambda \sum_{j=1}^{q} w_j^T w_j.$`

Here, $w_j$ is the jth column of the matrix `W`, and λ is the value of `Lambda`. The effect of this term is to shrink the weights `W`. If you plot the columns of `W` as images, with positive `Lambda` these images appear smooth compared to the same images with zero `Lambda`.
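The five steps above can be sketched directly in NumPy. The following is an illustrative reimplementation of the documented formulas, not the internals of `sparsefilt`; the function name is an assumption, and the 1e-8 smoothing constant follows the text:

```python
import numpy as np

def sparse_filtering_objective(X, W, lam=0.0, eps=1e-8):
    """Sparse filtering objective h(W) for n-by-p data X and p-by-q weights W."""
    # Step 1: soft absolute value of X*W
    F = np.sqrt((X @ W) ** 2 + eps)
    # Step 2: normalize each column by its approximate L2 norm
    F = F / np.sqrt(np.sum(F ** 2, axis=0) + eps)
    # Step 3: normalize each row by its approximate L2 norm
    F = F / np.sqrt(np.sum(F ** 2, axis=1, keepdims=True) + eps)
    # Step 4: sum of all (nonnegative) elements of F-hat
    h = F.sum()
    # Step 5: optional penalty lambda * sum_j w_j' w_j
    if lam > 0:
        h += lam * np.sum(W ** 2)
    return h
```

Because each row of $\hat{F}$ has (approximately) unit L2 norm and nonnegative entries, each row contributes between 1 and $\sqrt{q}$ to h, so h lies between roughly `n` and `n`·$\sqrt{q}$; the optimizer pushes each row toward the lower end, that is, toward having few nonzero features per observation.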

### Reconstruction ICA Algorithm

The Reconstruction Independent Component Analysis (RICA) algorithm is based on minimizing an objective function. The algorithm maps input data to output features.

The ICA source model is the following. Each observation x is generated by a random vector s according to

`$x = \mu + A s.$`
• x is a column vector of length `p`.

• μ is a column vector of length `p` representing a constant term.

• s is a column vector of length `q` whose elements are zero mean, unit variance random variables that are statistically independent of each other.

• A is a mixing matrix of size `p`-by-`q`.

You can use this model in `rica` to estimate A from observations of x. See Extract Mixed Signals.
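As a concrete illustration of the source model, the following NumPy sketch generates observations x = μ + As from independent, zero-mean, unit-variance sources. The uniform source distribution and the specific `A` and `mu` values are arbitrary choices for the example, not defaults of `rica`:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, q = 1000, 3, 2

# q independent sources with zero mean and unit variance
# (uniform on [-sqrt(3), sqrt(3)] has variance exactly 1)
S = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(n, q))

A = np.array([[1.0, 0.5],
              [0.2, 1.0],
              [0.7, 0.3]])        # p-by-q mixing matrix
mu = np.array([1.0, -2.0, 0.5])   # constant offset

# each row of X is one observation x^T = (mu + A s)^T
X = mu + S @ A.T
```

Given only `X`, `rica` attempts to recover a transformation that undoes the mixing by `A`, up to the usual ICA ambiguities of sign, scale, and source ordering.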

The RICA algorithm begins with a data matrix `X` that has `n` rows and `p` columns consisting of the observations xi:

`$X = \begin{bmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_n^T \end{bmatrix}.$`

Each row represents one observation and each column represents one measurement. The columns are also called the features or predictors. The algorithm then takes either an initial random `p`-by-`q` weight matrix `W` or uses the weight matrix passed in the `InitialTransformWeights` name-value pair. `q` is the requested number of features that `rica` computes. The weight matrix `W` is composed of columns wi of size `p`-by-1:

`$W = \begin{bmatrix} w_1 & w_2 & \dots & w_q \end{bmatrix}.$`

The algorithm attempts to minimize the Reconstruction ICA Objective Function by using a standard limited memory Broyden-Fletcher-Goldfarb-Shanno (LBFGS) quasi-Newton optimizer. See Nocedal and Wright [2]. This optimizer takes up to `IterationLimit` iterations. It stops iterating earlier when it takes a step whose norm is less than `StepTolerance`, or when it computes that the norm of the gradient at the current point is less than `GradientTolerance` times a scalar τ, where

`$\tau = \max\left(1, \min\left(|f|, \|g_0\|_\infty\right)\right).$`

Here, |f| is the absolute value of the objective function, and $\|g_0\|_\infty$ is the infinity norm of the initial gradient.

The objective function attempts to obtain a nearly orthonormal weight matrix that minimizes the sum of elements of g(`XW`), where g is a function (described below) that is applied elementwise to `XW`. To understand how the objective function attempts to achieve these goals, see Le, Karpenko, Ngiam, and Ng [3].

After constructing a `ReconstructionICA` object, use the `transform` method to map input data to the new output features.

#### Reconstruction ICA Objective Function

The objective function uses a contrast function, which you specify by using the `ContrastFcn` name-value pair. The contrast function is a smooth convex function that is similar to an absolute value. By default, the contrast function is $g(x) = \tfrac{1}{2}\log\left(\cosh(2x)\right)$. For other available contrast functions, see `ContrastFcn`.
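The default contrast function is easy to inspect numerically. The NumPy sketch below (illustrative only, not part of the `rica` interface) shows that g is smooth at 0 and approaches |x| − ½·log 2 for large |x|, so it behaves like a smoothed absolute value:

```python
import numpy as np

def g(x):
    """Default RICA contrast function: (1/2) log(cosh(2x))."""
    # Written as log(cosh(u)) = |u| + log1p(exp(-2|u|)) - log(2)
    # to avoid overflow in cosh for large |u|
    u = 2.0 * np.abs(x)
    return 0.5 * (u + np.log1p(np.exp(-2.0 * u)) - np.log(2.0))
```

Evaluating at a few points: g(0) = 0, g is symmetric, and already at x = 5 the value is within about 10⁻⁸ of |x| − ½·log 2.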

For an `n`-by-`p` data matrix `X` and `q` output features, with a regularization parameter λ as the value of the `Lambda` name-value pair, the objective function in terms of the `p`-by-`q` matrix `W` is

`$h = \frac{\lambda}{n} \sum_{i=1}^{n} \left\| W W^T x_i - x_i \right\|_2^2 + \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{q} \sigma_j\, g\!\left(w_j^T x_i\right).$`

The $\sigma_j$ are known constants that are ±1. When $\sigma_j = +1$, minimizing the objective function h encourages the histogram of $w_j^T x_i$ to be sharply peaked at 0 (super-Gaussian). When $\sigma_j = -1$, minimizing the objective function h encourages the histogram of $w_j^T x_i$ to be flatter near 0 (sub-Gaussian). Specify the $\sigma_j$ values using the `rica` `NonGaussianityIndicator` name-value pair.

The objective function h can have a spurious minimum of zero when λ is zero. Therefore, `rica` minimizes h over matrices `W` whose columns are normalized to approximately unit length. In other words, each column $w_j$ of `W` is defined in terms of a column vector $v_j$ by

`$w_j = \frac{v_j}{\sqrt{v_j^T v_j + 10^{-8}}}.$`

`rica` minimizes over the $v_j$. The resulting minimal matrix `W` provides the transformation from input data `X` to output features `XW`.
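Putting the pieces together, the RICA objective can be sketched in NumPy from the formulas above. This is a naive reimplementation for illustration, not the `rica` internals; it uses the default contrast function, and the parameter names mirror `Lambda` and `NonGaussianityIndicator`:

```python
import numpy as np

def rica_objective(X, V, sigma, lam, eps=1e-8):
    """RICA objective h for n-by-p data X and unnormalized p-by-q weights V.

    sigma : length-q vector of +/-1 (the NonGaussianityIndicator values)
    lam   : regularization parameter (the Lambda value)
    """
    n = X.shape[0]
    # Normalize each column: w_j = v_j / sqrt(v_j' v_j + eps)
    W = V / np.sqrt(np.sum(V ** 2, axis=0) + eps)
    # Reconstruction penalty: (lambda/n) * sum_i ||W W' x_i - x_i||^2
    # (rows of R are the transposed residuals W W' x_i - x_i)
    R = X @ W @ W.T - X
    recon = lam / n * np.sum(R ** 2)
    # Contrast term: (1/n) * sum_i sum_j sigma_j * g(w_j' x_i),
    # with the default contrast g(x) = (1/2) log(cosh(2x)),
    # written in an overflow-safe form
    u = 2.0 * np.abs(X @ W)
    gvals = 0.5 * (u + np.log1p(np.exp(-2.0 * u)) - np.log(2.0))
    contrast = np.sum(sigma * gvals) / n
    return recon + contrast
```

With λ = 0 the reconstruction term vanishes and only the contrast term remains, which is exactly why the column normalization is needed to rule out the spurious zero minimum described above.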

## References

[1] Ngiam, Jiquan, Zhenghao Chen, Sonia A. Bhaskar, Pang W. Koh, and Andrew Y. Ng. “Sparse Filtering.” Advances in Neural Information Processing Systems. Vol. 24, 2011, pp. 1125–1133. `https://papers.nips.cc/paper/4334-sparse-filtering.pdf`.

[2] Nocedal, J. and S. J. Wright. Numerical Optimization, Second Edition. Springer Series in Operations Research, Springer Verlag, 2006.

[3] Le, Quoc V., Alexandre Karpenko, Jiquan Ngiam, and Andrew Y. Ng. “ICA with Reconstruction Cost for Efficient Overcomplete Feature Learning.” Advances in Neural Information Processing Systems. Vol. 24, 2011, pp. 1017–1025. `https://papers.nips.cc/paper/4467-ica-with-reconstruction-cost-for-efficient-overcomplete-feature-learning.pdf`.