screenpredictors
Screen credit scorecard predictors for predictive value
Description
metric_table = screenpredictors(data) returns the output variable, metric_table, a MATLAB® table containing the calculated values for several measures of predictive power for each predictor variable in the data.
Use the screenpredictors function as a preprocessing step in the Credit Scorecard Modeling Workflow to reduce the number of predictor variables before you create the credit scorecard using the creditscorecard function from Financial Toolbox™. In addition, you can use Threshold Predictors from Risk Management Toolbox™ to interactively set credit scorecard predictor thresholds using the output from screenpredictors before you create the credit scorecard using creditscorecard.
metric_table = screenpredictors(___,Name,Value) specifies options using one or more name-value pair arguments in addition to the input arguments in the previous syntax.
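As a brief illustration of the second syntax, the following sketch passes the 'NumBins' and 'FrequencyShift' options described under Name-Value Arguments. It assumes that Risk Management Toolbox and Financial Toolbox are installed and that CreditCardData.mat (the file used in the example on this page) is on the path. The option values are arbitrary choices for illustration, not recommendations.

```matlab
% Load the sample data set; it provides the table named data.
load CreditCardData.mat

% Screen predictors with 10 equal-frequency bins instead of the default 20.
% The value 10 is illustrative only; 0.5 is the documented default shift.
metric_table = screenpredictors(data, ...
    'IDVar','CustID', ...
    'ResponseVar','status', ...
    'NumBins',10, ...
    'FrequencyShift',0.5);
```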
Examples
Screen Predictors for a creditscorecard Object
Reduce the number of predictor variables by screening predictors before you create a credit scorecard.
Use the CreditCardData.mat file to load the data (using a dataset from Refaat 2011).
load CreditCardData.mat
Define 'IDVar' and 'ResponseVar'.
idvar = 'CustID';
responsevar = 'status';
Use screenpredictors to calculate the predictor screening metrics. The function returns a table containing the metrics values. Each table row corresponds to a predictor from the input table data.
metric_table = screenpredictors(data,'IDVar', idvar,'ResponseVar', responsevar)
metric_table=9×7 table
               InfoValue   AccuracyRatio   AUROC     Entropy   Gini      Chi2PValue   PercentMissing
               _________   _____________   _______   _______   _______   __________   ______________
CustAge        0.18863     0.17095         0.58547   0.88729   0.42626   0.00074524   0
TmWBank        0.15719     0.13612         0.56806   0.89167   0.42864   0.0054591    0
CustIncome     0.15572     0.17758         0.58879   0.891     0.42731   0.0018428    0
TmAtAddress    0.094574    0.010421        0.50521   0.90089   0.43377   0.182        0
UtilRate       0.075086    0.035914        0.51796   0.90405   0.43575   0.45546      0
AMBalance      0.07159     0.087142        0.54357   0.90446   0.43592   0.48528      0
EmpStatus      0.048038    0.10886         0.55443   0.90814   0.4381    0.00037823   0
OtherCC        0.014301    0.044459        0.52223   0.91347   0.44132   0.047616     0
ResStatus      0.0097738   0.05039         0.5252    0.91422   0.44182   0.27875      0
metric_table = sortrows(metric_table,'AccuracyRatio','descend')
metric_table=9×7 table
               InfoValue   AccuracyRatio   AUROC     Entropy   Gini      Chi2PValue   PercentMissing
               _________   _____________   _______   _______   _______   __________   ______________
CustIncome     0.15572     0.17758         0.58879   0.891     0.42731   0.0018428    0
CustAge        0.18863     0.17095         0.58547   0.88729   0.42626   0.00074524   0
TmWBank        0.15719     0.13612         0.56806   0.89167   0.42864   0.0054591    0
EmpStatus      0.048038    0.10886         0.55443   0.90814   0.4381    0.00037823   0
AMBalance      0.07159     0.087142        0.54357   0.90446   0.43592   0.48528      0
ResStatus      0.0097738   0.05039         0.5252    0.91422   0.44182   0.27875      0
OtherCC        0.014301    0.044459        0.52223   0.91347   0.44132   0.047616     0
UtilRate       0.075086    0.035914        0.51796   0.90405   0.43575   0.45546      0
TmAtAddress    0.094574    0.010421        0.50521   0.90089   0.43377   0.182        0
Based on the AccuracyRatio metric, select the top predictors to use when you create the creditscorecard object.
varlist = metric_table.Row(metric_table.AccuracyRatio > 0.09)
varlist = 4x1 cell
{'CustIncome'}
{'CustAge' }
{'TmWBank' }
{'EmpStatus' }
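A single metric is not the only possible screening rule. As a minimal sketch, the following code combines two columns of the metric table with a logical AND. To stay runnable without calling screenpredictors, it rebuilds the two columns from the values printed above. The table name T, the variable name varlist2, and the cutoffs 0.1 and 0.05 are hypothetical choices for illustration.

```matlab
% Rebuild two columns of the metric table from the values printed above so
% that this sketch runs without calling screenpredictors.
rowNames   = {'CustAge';'TmWBank';'CustIncome';'TmAtAddress';'UtilRate'; ...
              'AMBalance';'EmpStatus';'OtherCC';'ResStatus'};
InfoValue  = [0.18863; 0.15719; 0.15572; 0.094574; 0.075086; ...
              0.07159; 0.048038; 0.014301; 0.0097738];
Chi2PValue = [0.00074524; 0.0054591; 0.0018428; 0.182; 0.45546; ...
              0.48528; 0.00037823; 0.047616; 0.27875];
T = table(InfoValue, Chi2PValue, 'RowNames', rowNames);

% Hypothetical cutoffs: keep predictors with an information value above 0.1
% and a chi-square p-value below 0.05.
keep = T.InfoValue > 0.1 & T.Chi2PValue < 0.05;
varlist2 = T.Row(keep)
```

With these cutoffs, the sketch keeps CustAge, TmWBank, and CustIncome.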
Use creditscorecard to create a creditscorecard object based on only the "screened" predictors.
sc = creditscorecard(data,'IDVar', idvar,'ResponseVar', responsevar, 'PredictorVars', varlist)
sc = 
  creditscorecard with properties:

                GoodLabel: 0
              ResponseVar: 'status'
               WeightsVar: ''
                 VarNames: {'CustID' 'CustAge' 'TmAtAddress' 'ResStatus' 'EmpStatus' 'CustIncome' 'TmWBank' 'OtherCC' 'AMBalance' 'UtilRate' 'status'}
        NumericPredictors: {'CustAge' 'CustIncome' 'TmWBank'}
    CategoricalPredictors: {'EmpStatus'}
           BinMissingData: 0
                    IDVar: 'CustID'
            PredictorVars: {'CustAge' 'EmpStatus' 'CustIncome' 'TmWBank'}
                     Data: [1200x11 table]
Input Arguments
data — Data for creditscorecard object
table | tall table | tall timetable
Data for the creditscorecard object, specified as a MATLAB table, tall table, or tall timetable, where each column of data can be any one of the following data types:
Numeric
Logical
Cell array of character vectors
Character array
Categorical
String
Data Types: table
Name-Value Arguments
Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.
Before R2021a, use commas to separate each name and value, and enclose Name in quotes.
Example: metric_table = screenpredictors(data,'IDVar','CustID','ResponseVar','status','PredictorVars',{'CustAge','CustIncome'})
IDVar — Name of identifier variable
'' (default) | character vector
Name of identifier variable, specified as the comma-separated pair consisting of 'IDVar' and a case-sensitive character vector. The 'IDVar' data can be ordinal numbers or Social Security numbers. Specifying 'IDVar' makes it easy to exclude the identifier variable from the predictor variables.
Data Types: char
ResponseVar — Response variable name
last column of the data input (default) | character vector
Response variable name, specified as the comma-separated pair consisting of 'ResponseVar' and a case-sensitive character vector. The response variable data must be binary, indicating "Good" or "Bad".
If you do not specify 'ResponseVar', the function uses the last column of the input data.
Data Types: char
PredictorVars — Names of predictor variables
set difference between VarNames and {IDVar,ResponseVar} (default) | cell array of character vectors | string array
Names of predictor variables, specified as the comma-separated pair consisting of 'PredictorVars' and a case-sensitive cell array of character vectors or string array. By default, when you create a creditscorecard object, all variables are predictors except for IDVar and ResponseVar. Any name you specify using 'PredictorVars' must differ from the IDVar and ResponseVar names.
Data Types: cell | string
WeightsVar — Name of weights variable
'' (default) | character vector
Name of weights variable, specified as the comma-separated pair consisting of 'WeightsVar' and a case-sensitive character vector to indicate which column name in the data table contains the row weights.
If you do not specify 'WeightsVar' when you create a creditscorecard object, then the function uses the unit weights as the observation weights.
Data Types: char
NumBins — Number of (equal frequency) bins for numeric predictors
20 (default) | scalar numeric
Number of (equal frequency) bins for numeric predictors, specified as the comma-separated pair consisting of 'NumBins' and a scalar numeric.
Data Types: double
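To make "equal frequency" concrete, here is a minimal sketch that assigns observations to bins by rank so that each bin receives about the same number of observations. This rank-based assignment is one simple way to form such bins and is an assumption for illustration; the source does not describe how screenpredictors forms its bins or handles ties. The data vector x and the bin count of 4 are hypothetical.

```matlab
% Hypothetical numeric predictor with 12 distinct observations.
x = [23 45 31 52 38 27 61 49 35 44 29 57]';
numBins = 4;                      % illustrative; the documented default is 20

% Rank each observation (1 = smallest), then map ranks to bin indices so
% that each bin receives roughly n/numBins observations.
n = numel(x);
[~, order] = sort(x);
ranks = zeros(n,1);
ranks(order) = (1:n)';
binIdx = ceil(ranks * numBins / n);

% Count the observations that fall in each bin.
binCounts = accumarray(binIdx, 1)'
```

Because 12 observations divide evenly into 4 bins, every bin in this sketch holds 3 observations.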
FrequencyShift — Indicates small shift in frequency tables that contain zero entries
0.5 (default) | scalar numeric between 0 and 1
Small shift in frequency tables that contain zero entries, specified as the comma-separated pair consisting of 'FrequencyShift' and a scalar numeric with a value between 0 and 1.
If the frequency table of a predictor contains any "pure" bins (containing all goods or all bads) after you bin the data using autobinning, then the function adds the 'FrequencyShift' value to all bins in the table. To avoid any perturbation, set 'FrequencyShift' to 0.
Data Types: double
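The following sketch shows why a shift matters when a bin is pure. It uses the textbook weight-of-evidence and information value formulas on a hypothetical frequency table; without the shift, the pure bin would produce an infinite weight of evidence. The table values are invented, and applying the shift to every cell before recomputing the totals is an assumption about the details, not a confirmed description of the function's internals.

```matlab
% Hypothetical frequency table: rows are bins, columns are [Good Bad].
% The last bin is "pure" because it contains no bads.
freq = [40 10;
        30 15;
        20  0];
shift = 0.5;                      % documented default for 'FrequencyShift'

% Assumption: when any bin is pure, add the shift to every cell.
if any(freq(:) == 0)
    freq = freq + shift;
end

% Textbook weight of evidence (WOE) per bin and information value.
distGood = freq(:,1) / sum(freq(:,1));
distBad  = freq(:,2) / sum(freq(:,2));
woe = log(distGood ./ distBad);
infoValue = sum((distGood - distBad) .* woe)
```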
Output Arguments
metric_table — Calculated values for predictor screening metrics
table
Calculated values for the predictor screening metrics, returned as a table. Each table row corresponds to a predictor from the input table data. The table columns contain calculated values for the following metrics:
'InfoValue' — Information value. This metric measures the strength of a predictor in the fitting model by determining the deviation between the distributions of "Goods" and "Bads".
'AccuracyRatio' — Accuracy ratio.
'AUROC' — Area under the ROC curve.
'Entropy' — Entropy. This metric measures the level of unpredictability in the bins. You can use the entropy metric to validate a risk model.
'Gini' — Gini. This metric measures the statistical dispersion or inequality within a sample of data.
'Chi2PValue' — Chi-square p-value. This metric is computed from the chi-square metric and is a measure of the statistical difference and independence between groups.
'PercentMissing' — Percentage of missing values in the predictor. This metric is expressed in decimal form.
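The sketch below illustrates, under stated assumptions, how some of these metrics can be computed. The first part applies the standard identity AccuracyRatio = 2*AUROC - 1 to three AUROC values printed in the example on this page. The second part computes entropy and Gini as bin-weighted averages for a hypothetical frequency table. These are textbook definitions, and the source does not confirm that screenpredictors uses exactly these formulas.

```matlab
% Part 1: AUROC values printed in the example for CustAge, TmWBank, and
% CustIncome, converted with the standard identity AR = 2*AUROC - 1.
auroc = [0.58547; 0.56806; 0.58879];
accuracyRatio = 2*auroc - 1

% Part 2: hypothetical frequency table, rows are bins, columns [Good Bad].
% No cell is zero, so log2 is finite for every bin.
freq = [40 10;
        30 15;
        20  5];
binTotal = sum(freq, 2);
p = freq(:,2) ./ binTotal;        % bad rate in each bin
w = binTotal / sum(binTotal);     % share of observations in each bin

% Textbook per-bin impurity measures for a binary response.
binGini    = 1 - p.^2 - (1 - p).^2;
binEntropy = -(p .* log2(p) + (1 - p) .* log2(1 - p));

% Assumption: the predictor-level value is the bin-weighted average.
giniValue    = sum(w .* binGini)
entropyValue = sum(w .* binEntropy)
```

The computed accuracy ratios agree with the AccuracyRatio column of the example table to within rounding.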
Extended Capabilities
Tall Arrays
Calculate with arrays that have more rows than fit in memory.
This function supports input data that is specified as a tall column vector, a tall table, or a tall timetable. Note that the output for numeric predictors might be slightly different when using a tall array. Categorical predictors return the same results for tables and tall arrays. For more information, see tall and Tall Arrays.
Version History
Introduced in R2019a