Why is SVM performance with small random datasets so high?
To better understand how SVMs work, I am training a binary SVM with the function fitcsvm, using a sample dataset of completely random numbers and cross-validating the classifier with 10-fold cross-validation.
Since the dataset consists of random numbers, I would expect the classification accuracy of the trained cross-validated SVM to be around 50%.
However, with small datasets, for example 2 predictors and 12 observations (6 per class), I get very high classification accuracy, up to about 75%. The accuracy approaches 50% as the dataset grows, for example with 2 predictors and 60 observations, or with 40 predictors and 12 observations. Why is the classification accuracy so high with small datasets?
I suspect that small datasets make over-fitting more likely. Is that what is happening here?
In any case, with cross-validation the SVM is repeatedly trained on nine partitions and tested on the tenth. Even with a small dataset, I would still expect an accuracy of around 50%, simply because the tenth partition consists of random numbers. Does the cross-validation perform some optimization of the model parameters?
The code I am using is something like the following, where I try 100 different combinations of KernelScale and BoxConstraint and then take the combination that yields the lowest cross-validated classification error:
for i = 1:numel(KSvals)          % KSvals, BCvals: grids of 10 candidate values each
    for j = 1:numel(BCvals)
        SVMModel = fitcsvm(cdata, label, 'KernelFunction','linear', 'Standardize',true, ...
            'KernelScale',KSvals(i), 'BoxConstraint',BCvals(j), ...
            'CrossVal','on', 'KFold',10);
        MisclassRate(i,j) = kfoldLoss(SVMModel);
    end
end
I would very much appreciate any clarification. Many thanks!
Accepted Answer
Ilya
on 31 Jan 2017
You have 12 observations. For each observation, the probability of correct classification is 0.5. What is the probability of classifying 9 or more observations correctly by chance? It's
>> p = binocdf(8,12,0.5,'upper')
p =
0.0730
And what is the probability of that chance event occurring at least once in 100 experiments? It's
>> binocdf(0,100,p,'upper')
ans =
0.9995
Since you take the most accurate model, you always get a highly optimistic estimate of accuracy; that's all.
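Ilya's two numbers can be reproduced outside MATLAB. Here is a small Python sketch (my addition, standard library only): the helper `binom_sf` plays the role of `binocdf(..., 'upper')`, i.e. the upper tail P(X > k) of a binomial distribution.

```python
from math import comb

def binom_sf(k, n, p):
    """P(X > k) for X ~ Binomial(n, p): the 'upper' tail of binocdf."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1, n + 1))

# Probability of classifying 9 or more of 12 random observations correctly by chance
p = binom_sf(8, 12, 0.5)
print(round(p, 4))                      # 0.073

# Probability that this chance event occurs at least once in 100 experiments
print(round(binom_sf(0, 100, p), 4))    # 0.9995
```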
5 Comments
Alessandro La Chioma
on 1 Feb 2017
Ilya
on 1 Feb 2017
Sorry, I did not understand what significance you are talking about.
There is nothing wrong with selecting the best model over many parameter values. What you should not do is quote the accuracy used to select the best set of parameters as the model accuracy. If you have two models with cross-validated accuracies a1 and a2, a1<a2, and you choose model 2 because a2 is larger, the generalization accuracy of that model is not a2. a2 is biased high because you have preferred the model with a higher estimate. You need to apply that model to a new dataset and quote accuracy obtained on that dataset for the model accuracy.
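The bias described here is easy to see concretely in a hypothetical Python simulation (my sketch, not code from the thread): each of 100 "models" classifies 12 random labels purely by chance, and we compare the honest average accuracy with the accuracy of the model we would select.

```python
import random

random.seed(0)
n_obs, n_models = 12, 100

# Accuracy of one chance-level model: each of the 12 predictions is a coin flip
accs = [sum(random.random() < 0.5 for _ in range(n_obs)) / n_obs
        for _ in range(n_models)]

print(sum(accs) / n_models)  # near 0.5: the honest chance-level accuracy
print(max(accs))             # well above 0.5: the 'selected best' is biased high
```

Quoting `max(accs)` as the model's accuracy is exactly the mistake Ilya warns against: the selection step, not the model, produces the high number.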
Alessandro La Chioma
on 2 Feb 2017
Ilya
on 3 Feb 2017
If I understand, you take the cross-validation accuracy for the best model (the same accuracy that was used to identify the best model) and then compare that accuracy with a distribution obtained for noise (randomly permuted labels). If that's what you do, your procedure is incorrect. It always produces an estimate of model performance (accuracy, significance, whatever you call it) that is optimistically biased.
You select the model with the highest accuracy, but you do not know if this value is high by chance or because the model is really good. Then you take that high value and compare it with a distribution of noise. If the accuracy value is high, it naturally gets into the tail of the noise distribution. But that does not prove that the model is really good. This only shows that the accuracy value is high, which is what you established in the first place.
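This circularity can also be illustrated with a hypothetical simulation (again my sketch, not from the thread): build a noise distribution of chance accuracies from many permuted-label runs, then note that the best-of-100 accuracy lands in its upper tail essentially by construction, even though every "model" is pure chance.

```python
import random

random.seed(1)

def chance_acc():
    """Accuracy of chance classification on 12 random labels."""
    return sum(random.random() < 0.5 for _ in range(12)) / 12

# Noise distribution: chance accuracy over many label permutations
noise = sorted(chance_acc() for _ in range(10000))
p95 = noise[int(0.95 * len(noise))]          # 95th percentile, ~0.75 for 12 observations

# Accuracy of the selected best of 100 chance models
best = max(chance_acc() for _ in range(100))
print(p95, best)  # 'best' typically exceeds p95, yet proves nothing about the model
```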
Dealing with small datasets is tough and usually requires domain knowledge. Maybe you can generate synthetic data by adding some noise to the predictors. Maybe, despite what you think, you can set aside a fraction of the dataset for testing.
Alessandro La Chioma
on 21 Feb 2017