Non-reproducible "fitcsvm" MATLAB output
load ionosphere
% run number 1
rng(1); % For reproducibility
SVMModel1 = fitcsvm(X,Y,'Standardize',true,'KernelFunction','linear', ...
    'CacheSize','maximal','Solver','L1QP','KernelScale','auto');
% run number 2
indperm = randperm(size(X,1))'; % shuffle the row order
X = X(indperm,:);
Y = Y(indperm);
SVMModel2 = fitcsvm(X,Y,'Standardize',true,'KernelFunction','linear', ...
    'CacheSize','maximal','Solver','L1QP','KernelScale','auto');
SVMModel1 and SVMModel2 come out different (their Bias and KernelScale values differ), even though the only change is the row order of the input data X and Y. Any idea what's going on?
Thanks for the help.
4 Comments
Rik
on 9 Jun 2023
I'm not sure you fully understand what rng(1) does (or I'm misunderstanding you).
What it does is set the state of the random number generator, making sure that the output of any random function is deterministic (though still random-looking). An example will help:
rng(1)
A = randi(20,1);
rng(1)
B = randi(20,1);
C = randi(20,1);
% A is now equal to B, but C (drawn after the state has advanced) will generally differ
A,B,C
So there are two reasons why the output is not the same despite calling rng: you have already called random functions by the time of the second fit (which advances the generator state), and you have changed the input (which can affect the result by itself).
For an example of the latter: I don't know how the fitcsvm internals work, but for the concept that doesn't matter anyway.
rng(1)
data = 5*rand(2000,1);
indperm = randperm(size(data,1))';
SuperFancyMachineLearningMean(data)-mean(data)
SuperFancyMachineLearningMean(data(indperm))-mean(data)
function output = SuperFancyMachineLearningMean(data)
% Calculate (well, approximate, actually) the mean of a vector.
% Split the data in N blocks.
N = min(numel(data),10);              % number of blocks
D1 = repmat(ceil(numel(data)/N),1,N); % rows per block
D2 = 1;                               % single column
D1(end) = numel(data)-sum(D1(1:(end-1))); % shrink the last block to fit the element count
d = mat2cell(reshape(data,[],1),D1,D2);
for n=1:numel(d)
d{n} = mean(d{n});
end
output = mean([d{:}]);
end
This is apparently not as bad an example as I thought (unless you're working with very small numbers), but the idea carries over.
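The reason the difference above is tiny: with 2000 elements all ten blocks are the same size, so the mean of the block means equals the overall mean up to rounding. A small tweak (my assumption here, not part of the original example: an element count that does not split evenly across the ten blocks) makes the order matter beyond floating point, because each block mean gets equal weight regardless of block size:
rng(1)
data = 5*rand(1995,1);            % ten blocks: nine of 200 values, one of 195
indperm = randperm(numel(data))';
% Which values land in the under-weighted short block now depends on order:
SuperFancyMachineLearningMean(data)-mean(data)
SuperFancyMachineLearningMean(data(indperm))-mean(data)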
Accepted Answer
Rik
on 9 Jun 2023
I'm not familiar with the exact internals of fitcsvm, but is this truly unexpected?
Since this is a form of fitting your data to a function, some variation is expected. For small fitting problems you can use the entire dataset in one go, in which case sorting may or may not affect the result, but for machine learning methods this is generally not feasible, so the order of your samples can affect the training result. Note also that the documentation for 'KernelScale','auto' says the scale factor is chosen by a heuristic that uses subsampling, which is why it recommends setting a seed with rng right before training if you want reproducible results.
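A minimal sketch of that (assuming the ionosphere data and options from the question): reset the generator immediately before each training call so the kernel-scale heuristic starts from the same state. Even then the two fits need not match exactly, since the heuristic subsamples rows and a permuted X can hand it different rows.
load ionosphere
rng(1); % reset right before the first fit
Mdl1 = fitcsvm(X,Y,'Standardize',true,'KernelFunction','linear', ...
    'CacheSize','maximal','Solver','L1QP','KernelScale','auto');
indperm = randperm(size(X,1))';
rng(1); % reset again right before the second fit
Mdl2 = fitcsvm(X(indperm,:),Y(indperm),'Standardize',true,'KernelFunction','linear', ...
    'CacheSize','maximal','Solver','L1QP','KernelScale','auto');
% Compare the fitted parameters the question mentions:
[Mdl1.KernelParameters.Scale, Mdl2.KernelParameters.Scale]
[Mdl1.Bias, Mdl2.Bias]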
2 Comments
Rik
on 9 Jun 2023
Would you still expect the code to sort the data internally in some way if we're talking about terabytes of data? Because that is essentially what you're asking. Note that I'm not defending the current implementation of this function; I'm merely explaining why I'm not surprised that there are functions in the Statistics and Machine Learning Toolbox for which this happens.
This is essentially the same problem as when you make splits for cross-validation: the splits may determine the outcome (I don't recall whether my colleague published this, so you will have to look for it yourself if you want to see a paper). While it is true that small changes in the data may explode when extrapolating, that is not unique to systems that depend on the input order of the data. Every extrapolation runs this risk.
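To illustrate the cross-validation point, a minimal sketch (assuming the ionosphere data from the question; the two seeds are arbitrary): the same model specification evaluated under two different 5-fold partitions can report different losses, purely because the splits differ.
load ionosphere
rng(1); cvp1 = cvpartition(Y,'KFold',5); % first random partition
rng(2); cvp2 = cvpartition(Y,'KFold',5); % a different random partition
CVMdl1 = fitcsvm(X,Y,'Standardize',true,'CVPartition',cvp1);
CVMdl2 = fitcsvm(X,Y,'Standardize',true,'CVPartition',cvp2);
[kfoldLoss(CVMdl1), kfoldLoss(CVMdl2)] % typically not identical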