# fsrftest

Univariate feature ranking for regression using *F*-tests

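## Syntax

`idx = fsrftest(Tbl,ResponseVarName)`

`idx = fsrftest(Tbl,formula)`

`idx = fsrftest(Tbl,Y)`

`idx = fsrftest(X,Y)`

`idx = fsrftest(___,Name,Value)`

`[idx,scores] = fsrftest(___)`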

## Description

`idx = fsrftest(Tbl,ResponseVarName)` ranks features (predictors) using *F*-tests. The table `Tbl` contains predictor variables and a response variable, and `ResponseVarName` is the name of the response variable in `Tbl`. The function returns `idx`, which contains the indices of predictors ordered by predictor importance, meaning `idx(1)` is the index of the most important predictor. You can use `idx` to select important predictors for regression problems.

`idx = fsrftest(___,Name,Value)` specifies additional options using one or more name-value pair arguments in addition to any of the input argument combinations in the previous syntaxes. For example, you can specify categorical predictors and observation weights.

## Examples

### Rank Predictors in Matrix

Rank predictors in a numeric matrix and create a bar plot of predictor importance scores.

Load the sample data.

`load robotarm.mat`

The `robotarm` data set contains 7168 training observations (`Xtrain` and `ytrain`) and 1024 test observations (`Xtest` and `ytest`) with 32 features [1][2].

Rank the predictors using the training observations.

`[idx,scores] = fsrftest(Xtrain,ytrain);`

The values in `scores` are the negative logs of the *p*-values. If a *p*-value is smaller than `eps(0)`, then the corresponding score value is `Inf`. Before creating a bar plot, determine whether `scores` includes `Inf` values.

`find(isinf(scores))`

ans = 1x0 empty double row vector

`scores` does not include `Inf` values. If `scores` includes `Inf` values, you can replace `Inf` with a large numeric value before creating a bar plot for visualization purposes. For details, see Rank Predictors in Table.
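A minimal sketch of that replacement follows. The `scores` vector here is made up for illustration, and capping `Inf` at the largest finite score is one arbitrary choice, not something `fsrftest` prescribes.

```matlab
% Hypothetical scores for illustration only (not output of fsrftest).
scores = [12.5 Inf 3.2 Inf 40.1];

% Copy the scores and cap Inf entries at the largest finite score
% so that every predictor gets a visible bar.
plotScores = scores;
isInfScore = isinf(scores);
if any(~isInfScore)
    plotScores(isInfScore) = max(scores(~isInfScore));
end

bar(plotScores)
xlabel('Predictor')
ylabel('Predictor importance score (Inf capped)')
```

Keeping `isInfScore` around lets you label or color the capped bars separately, so the plot does not suggest that those scores are finite.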

Create a bar plot of the predictor importance scores.

`bar(scores(idx))`

`xlabel('Predictor rank')`

`ylabel('Predictor importance score')`

Select the top five most important predictors. Find the columns of these predictors in `Xtrain`.

`idx(1:5)`

`ans = `*1×5*
30 24 10 4 5

The 30th column of `Xtrain` is the most important predictor of `ytrain`.

### Rank Predictors in Table

Rank predictors in a table and create a bar plot of predictor importance scores.

If your data is in a table and `fsrftest` ranks a subset of the variables in the table, then the function indexes the variables using only the subset. Therefore, a good practice is to move the predictors that you do not want to rank to the end of the table. Move the response variable and observation weight vector as well. Then, the indices in the output arguments are consistent with the column indices of the table. You can move variables in a table using the `movevars` function.
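As a small sketch of that reordering, the table below is hypothetical (variable names `x1`, `x2`, `w`, and `y` are assumptions). The weight and response variables start in the middle of the table and are moved to the end, so that predictor *k* is also column *k*.

```matlab
% Hypothetical table: weights w and response y sit between the predictors.
Tbl = table([1;2;3;4],[0.5;0.5;1;1],[2;4;6;9],[7;1;3;5], ...
    'VariableNames',{'x1','w','y','x2'});

% Move the non-predictor variables after the last predictor.
Tbl = movevars(Tbl,{'w','y'},'After','x2');

% The predictors now occupy the leading columns.
varNames = Tbl.Properties.VariableNames;
disp(varNames)
```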

This example uses the Abalone data [3][4] from the UCI Machine Learning Repository [5].

Download the data and save it in your current folder with the name `'abalone.csv'`.

`url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data';`

`websave('abalone.csv',url);`

Read the data into a table.

`tbl = readtable('abalone.csv','Filetype','text','ReadVariableNames',false);`

`tbl.Properties.VariableNames = {'Sex','Length','Diameter','Height','WWeight','SWeight','VWeight','ShWeight','NoShellRings'};`

Preview the first few rows of the table.

`head(tbl)`

`ans=`*8×9 table*

| Sex | Length | Diameter | Height | WWeight | SWeight | VWeight | ShWeight | NoShellRings |
|---|---|---|---|---|---|---|---|---|
| {'M'} | 0.455 | 0.365 | 0.095 | 0.514 | 0.2245 | 0.101 | 0.15 | 15 |
| {'M'} | 0.35 | 0.265 | 0.09 | 0.2255 | 0.0995 | 0.0485 | 0.07 | 7 |
| {'F'} | 0.53 | 0.42 | 0.135 | 0.677 | 0.2565 | 0.1415 | 0.21 | 9 |
| {'M'} | 0.44 | 0.365 | 0.125 | 0.516 | 0.2155 | 0.114 | 0.155 | 10 |
| {'I'} | 0.33 | 0.255 | 0.08 | 0.205 | 0.0895 | 0.0395 | 0.055 | 7 |
| {'I'} | 0.425 | 0.3 | 0.095 | 0.3515 | 0.141 | 0.0775 | 0.12 | 8 |
| {'F'} | 0.53 | 0.415 | 0.15 | 0.7775 | 0.237 | 0.1415 | 0.33 | 20 |
| {'F'} | 0.545 | 0.425 | 0.125 | 0.768 | 0.294 | 0.1495 | 0.26 | 16 |

The last variable in the table is the response variable.

Rank the predictors in `tbl`. Specify the last column, `NoShellRings`, as the response variable.

`[idx,scores] = fsrftest(tbl,'NoShellRings')`

`idx = `*1×8*
3 4 5 7 8 2 6 1

`scores = `*1×8*
447.6891 736.9619 Inf Inf Inf 604.6692 Inf Inf

The values in `scores` are the negative logs of the *p*-values. If a *p*-value is smaller than `eps(0)`, then the corresponding score value is `Inf`. Before creating a bar plot, determine whether `scores` includes `Inf` values.

`idxInf = find(isinf(scores))`

`idxInf = `*1×5*
3 4 5 7 8

`scores` includes five `Inf` values.

Create a bar plot of predictor importance scores. Use the predictor names for the *x*-axis tick labels.

`bar(scores(idx))`

`xlabel('Predictor rank')`

`ylabel('Predictor importance score')`

`xticklabels(strrep(tbl.Properties.VariableNames(idx),'_','\_'))`

`xtickangle(45)`

The `bar` function does not plot any bars for the `Inf` values. For the `Inf` values, plot bars that have the same length as the largest finite score.

`hold on`

`bar(scores(idx(length(idxInf)+1))*ones(length(idxInf),1))`

`legend('Finite Scores','Inf Scores')`

`hold off`

The bar graph displays finite scores and Inf scores using different colors.

## Input Arguments

`Tbl`

— Sample data

table

Sample data, specified as a table. Multicolumn variables and cell arrays other than cell arrays of character vectors are not allowed.

Each row of `Tbl` corresponds to one observation, and each column corresponds to one predictor variable. Optionally, `Tbl` can contain additional columns for a response variable and observation weights. The response variable must be a numeric vector.

- If `Tbl` contains the response variable, and you want to use all remaining variables in `Tbl` as predictors, then specify the response variable by using `ResponseVarName`. If `Tbl` also contains the observation weights, then you can specify the weights by using `Weights`.
- If `Tbl` contains the response variable, and you want to use only a subset of the remaining variables in `Tbl` as predictors, then specify the subset of variables by using `formula`.
- If `Tbl` does not contain the response variable, then specify a response variable by using `Y`. The response variable and `Tbl` must have the same number of rows.
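The three cases above can be sketched as follows. The data is synthetic and the variable names (`x1`, `x2`, `x3`, `y`) are assumptions made for the illustration. Only `x2` influences the response, so it should rank first in each call.

```matlab
% Synthetic data (hypothetical): only x2 drives the response.
rng(1)                                   % for reproducibility
n  = 100;
x1 = randn(n,1);
x2 = randn(n,1);
x3 = randn(n,1);
y  = 2*x2 + 0.1*randn(n,1);
Tbl = table(x1,x2,x3,y);

% Case 1: response in Tbl, all remaining variables are predictors.
idxAll = fsrftest(Tbl,"y");

% Case 2: response in Tbl, only x2 and x3 are predictors.
% The returned indices refer to positions within this subset.
idxSub = fsrftest(Tbl,"y ~ x2 + x3");

% Case 3: response passed separately from a predictor-only table.
idxY = fsrftest(Tbl(:,1:3),y);
```

In the second call, index 1 refers to `x2` because the indices count only the predictors named in the formula.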

If `fsrftest` uses a subset of variables in `Tbl` as predictors, then the function indexes the predictors using only the subset. The values in the `CategoricalPredictors` name-value argument and the output argument `idx` do not count the predictors that the function does not rank.

If `Tbl` contains a response variable, then `fsrftest` considers `NaN` values in the response variable to be missing values. `fsrftest` does not use observations with missing values in the response variable.

**Data Types:** `table`

`ResponseVarName`

— Response variable name

character vector or string scalar containing name of variable in
`Tbl`

Response variable name, specified as a character vector or string scalar containing the name of a variable in `Tbl`.

For example, if a response variable is the column `Y` of `Tbl` (`Tbl.Y`), then specify `ResponseVarName` as `"Y"`.

**Data Types:** `char` | `string`

`formula`

— Explanatory model of response variable and subset of predictor variables

character vector | string scalar

Explanatory model of the response variable and a subset of the predictor variables, specified as a character vector or string scalar in the form `"Y ~ x1 + x2 + x3"`. In this form, `Y` represents the response variable, and `x1`, `x2`, and `x3` represent the predictor variables.

To specify a subset of variables in `Tbl` as predictors, use a formula. If you specify a formula, then `fsrftest` does not rank any variables in `Tbl` that do not appear in `formula`.

The variable names in the formula must be both variable names in `Tbl` (`Tbl.Properties.VariableNames`) and valid MATLAB® identifiers. You can verify the variable names in `Tbl` by using the `isvarname` function. If the variable names are not valid, then you can convert them by using the `matlab.lang.makeValidName` function.

**Data Types:** `char` | `string`
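A short sketch of the name check described above. The names in `names` are hypothetical and are not taken from any data set in this page.

```matlab
% Hypothetical variable names; two of them are not valid identifiers.
names = {'Length','Shell Weight','2ndMeasure'};

% isvarname works on one name at a time, so apply it to each cell.
isValid = cellfun(@isvarname,names);

% Convert every name to a valid MATLAB identifier.
validNames = matlab.lang.makeValidName(names);

disp(isValid)
disp(validNames)
```

If the names come from a table, you can assign the converted names back to `Tbl.Properties.VariableNames` before building the formula.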

`Y`

— Response variable

numeric vector

`X`

— Predictor data

numeric matrix

Predictor data, specified as a numeric matrix. Each row of `X` corresponds to one observation, and each column corresponds to one predictor variable.

**Data Types:** `single` | `double`

### Name-Value Arguments

Specify optional pairs of arguments as `Name1=Value1,...,NameN=ValueN`, where `Name` is the argument name and `Value` is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

*Before R2021a, use commas to separate each name and value, and enclose* `Name` *in quotes.*

**Example:** `'NumBins',20,'UseMissing',true` sets the number of bins as 20 and specifies to use missing values in predictors for ranking.

`CategoricalPredictors`

— List of categorical predictors

vector of positive integers | logical vector | character matrix | string array | cell array of character vectors | `"all"`

List of categorical predictors, specified as one of the values in this table.

| Value | Description |
|---|---|
| Vector of positive integers | Each entry in the vector is an index value indicating that the corresponding predictor is categorical. The index values are between 1 and *p*, where *p* is the number of predictors that `fsrftest` ranks. If `fsrftest` uses a subset of the variables in `Tbl` as predictors, then the index values count only the predictors in that subset. |
| Logical vector | A `true` entry means that the corresponding predictor is categorical. The length of the vector is *p*. |
| Character matrix | Each row of the matrix is the name of a predictor variable. The names must match the names in `Tbl`. Pad the names with extra blanks so each row of the character matrix has the same length. |
| String array or cell array of character vectors | Each element in the array is the name of a predictor variable. The names must match the names in `Tbl`. |
| `"all"` | All predictors are categorical. |

By default, if the predictor data is in a table (`Tbl`), `fsrftest` assumes that a variable is categorical if it is a logical vector, unordered categorical vector, character array, string array, or cell array of character vectors. If the predictor data is a matrix (`X`), `fsrftest` assumes that all predictors are continuous. To identify any other predictors as categorical predictors, specify them by using the `CategoricalPredictors` name-value argument.

**Example:** `"CategoricalPredictors","all"`

**Example:** `CategoricalPredictors=[1 5 6 8]`

**Data Types:** `single` | `double` | `logical` | `char` | `string` | `cell`
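As an illustration, the sketch below marks an integer-coded column of a numeric matrix as categorical. The data is synthetic, and the names `group` and `x2` are assumptions made for the example.

```matlab
% Synthetic data (hypothetical): column 1 holds group codes 1..3,
% column 2 is continuous noise. Only group 2 shifts the response.
rng(2)                                   % for reproducibility
n     = 300;
group = randi(3,n,1);
x2    = randn(n,1);
y     = 5*(group == 2) + 0.1*randn(n,1);
X     = [group x2];

% Matrix predictors are continuous by default, so flag column 1.
[idx,scores] = fsrftest(X,y,'CategoricalPredictors',1);
```

With the flag set, the codes in column 1 are treated as category labels instead of being binned as a continuous variable.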

`NumBins`

— Number of bins for binning continuous predictors

10 (default) | positive integer scalar

Number of bins for binning continuous predictors, specified as the comma-separated pair consisting of `'NumBins'` and a positive integer scalar.

**Example:** `'NumBins',50`

**Data Types:** `single` | `double`

`UseMissing`

— Indicator for whether to use or discard missing values in predictors

`false` (default) | `true`

Indicator for whether to use or discard missing values in predictors, specified as the comma-separated pair consisting of `'UseMissing'` and either `true` to use or `false` to discard missing values in predictors for ranking.

`fsrftest` considers `NaN`, `''` (empty character vector), `""` (empty string), `<missing>`, and `<undefined>` values to be missing values.

If you specify `'UseMissing',true`, then `fsrftest` uses missing values for ranking. For a categorical variable, `fsrftest` treats missing values as an extra category. For a continuous variable, `fsrftest` places `NaN` values in a separate bin for binning.

If you specify `'UseMissing',false`, then `fsrftest` does not use missing values for ranking. Because `fsrftest` computes importance scores individually for each predictor, the function does not discard an entire row when values in the row are partially missing. For each variable, `fsrftest` uses all values that are not missing.

**Example:** `'UseMissing',true`

**Data Types:** `logical`
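The two settings can be compared on synthetic data, as in this sketch. The data and the choice of which entries to set to `NaN` are made up for illustration.

```matlab
% Synthetic data (hypothetical): one continuous predictor with some NaNs.
rng(3)                                   % for reproducibility
n = 200;
x = randn(n,1);
y = x + 0.5*randn(n,1);
x(1:20) = NaN;                           % simulate missing predictor values

% Default ('UseMissing',false): the NaN entries of x are left out.
[~,scoreDiscard] = fsrftest(x,y);

% 'UseMissing',true: the NaN entries go into a separate bin.
[~,scoreUse] = fsrftest(x,y,'UseMissing',true);
```

The two scores generally differ, because the second test includes one more group of observations.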

`Weights`

— Observation weights

`ones(size(X,1),1)` (default) | vector of scalar values | name of variable in `Tbl`

Observation weights, specified as the comma-separated pair consisting of `'Weights'` and a vector of scalar values or the name of a variable in `Tbl`. The function weights the observations in each row of `X` or `Tbl` with the corresponding value in `Weights`. The size of `Weights` must equal the number of rows in `X` or `Tbl`.

If you specify the input data as a table `Tbl`, then `Weights` can be the name of a variable in `Tbl` that contains a numeric vector. In this case, you must specify `Weights` as a character vector or string scalar. For example, if the weight vector is the column `W` of `Tbl` (`Tbl.W`), then specify `'Weights','W'`.

`fsrftest` normalizes the weights to add up to one.

**Data Types:** `single` | `double` | `char` | `string`
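Both ways of passing weights are sketched below on synthetic data. The weight values and the variable names (`x1`, `x2`, `y`, `W`) are assumptions for the example.

```matlab
% Synthetic data (hypothetical): x1 drives the response, x2 is noise.
rng(4)                                   % for reproducibility
n  = 150;
x1 = randn(n,1);
x2 = randn(n,1);
y  = x1 + 0.2*randn(n,1);
W  = rand(n,1) + 0.5;                    % positive observation weights
Tbl = table(x1,x2,y,W);

% Weights as a numeric vector with matrix input.
idxVec = fsrftest([x1 x2],y,'Weights',W);

% Weights as the name of a variable in the table.
idxName = fsrftest(Tbl,"y",'Weights','W');
```

Because the weights are normalized to add up to one, only their relative sizes matter.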

## Output Arguments

`idx`

— Indices of predictors ordered by predictor importance

numeric vector

Indices of predictors in `X` or `Tbl` ordered by predictor importance, returned as a 1-by-*r* numeric vector, where *r* is the number of ranked predictors.

If `fsrftest` uses a subset of variables in `Tbl` as predictors, then the function indexes the predictors using only the subset. For example, suppose `Tbl` includes 10 columns and you specify the last five columns of `Tbl` as the predictor variables by using `formula`. If `idx(3)` is `5`, then the third most important predictor is the 10th column in `Tbl`, which is the fifth predictor in the subset.
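When only a subset is ranked, the subset indices can be translated back to table columns by name, as in this sketch. The table, the variable names, and the helper variable `predictorNames` are hypothetical. The predictors are listed in the same order as they appear in the table, so that the subset numbering is unambiguous.

```matlab
% Synthetic table (hypothetical): only column d drives the response.
rng(5)                                   % for reproducibility
n   = 200;
Tbl = array2table(randn(n,4),'VariableNames',{'a','b','c','d'});
Tbl.y = 4*Tbl.d + 0.1*randn(n,1);

% Rank only c and d, listed in table order.
predictorNames = {'c','d'};
formula = "y ~ " + strjoin(predictorNames,' + ');
idx = fsrftest(Tbl,formula);

% idx counts within the subset; map it back to names and table columns.
rankedNames  = predictorNames(idx);
[~,colInTbl] = ismember(rankedNames,Tbl.Properties.VariableNames);
```

Here `colInTbl(1)` gives the table column of the most important predictor, while `idx(1)` gives its position within the ranked subset.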

`scores`

— Predictor scores

numeric vector

Predictor scores, returned as a 1-by-*r* numeric vector, where *r* is the number of ranked predictors.

A large score value indicates that the corresponding predictor is important.

For example, suppose `Tbl` includes 10 columns and you specify the last five columns of `Tbl` as the predictor variables by using `formula`. Then, `scores(3)` contains the score value of the 8th column in `Tbl`, which is the third predictor in the subset.

## Algorithms

### Univariate Feature Ranking Using *F*-Tests

- `fsrftest` examines the importance of each predictor individually using an *F*-test. Each *F*-test tests the hypothesis that the response values grouped by predictor variable values are drawn from populations with the same mean against the alternative hypothesis that the population means are not all the same. A small *p*-value of the test statistic indicates that the corresponding predictor is important.
- The output `scores` is -log(*p*). Therefore, a large score value indicates that the corresponding predictor is important. If a *p*-value is smaller than `eps(0)`, then the output is `Inf`.
- `fsrftest` examines a continuous variable after binning, or discretizing, the variable. You can specify the number of bins using the `'NumBins'` name-value pair argument.
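The idea can be illustrated for a single continuous predictor with a one-way ANOVA *F*-test on binned values. This is only a sketch: the equal-width binning is an assumption, since this page does not state how `fsrftest` places its bin edges, so the resulting value need not match the score that `fsrftest` returns.

```matlab
% Synthetic data (hypothetical): a nonlinear dependence of y on x.
rng(6)                                   % for reproducibility
n = 500;
x = rand(n,1);
y = sin(2*pi*x) + 0.3*randn(n,1);

% Bin the continuous predictor (equal-width bins are an assumption).
numBins = 10;
edges   = linspace(min(x),max(x),numBins+1);
bin     = discretize(x,edges);

% One-way ANOVA F-test: do the bin means of y differ?
p = anova1(y,bin,'off');                 % 'off' suppresses the figures

% Importance score as the negative log of the p-value.
score = -log(p);
```

Because the test compares group means across bins, it can pick up nonlinear relationships such as the sine curve here, which a plain correlation with `x` could miss.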

## References

[1] Rasmussen, C. E., R. M. Neal, G. E. Hinton, D. van Camp, M. Revow, Z. Ghahramani, R. Kustra, and R. Tibshirani. The DELVE Manual, 1996.

[2] University of Toronto, Computer Science Department. Delve Datasets.

[3] Nash, W.J., T. L. Sellers, S. R. Talbot, A. J. Cawthorn, and W. B. Ford. "The Population Biology of Abalone (*Haliotis* species) in Tasmania. I. Blacklip Abalone (*H. rubra*) from the North Coast and Islands of Bass Strait." Sea Fisheries Division, Technical Report No. 48, 1994.

[4] Waugh, S. "Extending and Benchmarking Cascade-Correlation: Extensions to the Cascade-Correlation Architecture and Benchmarking of Feed-forward Supervised Artificial Neural Networks." *University of Tasmania Department of Computer Science thesis*, 1995.

[5] Lichman, M. UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science, 2013. http://archive.ics.uci.edu/ml.

## Version History

**Introduced in R2020a**
