Do a leave one out cross-validation in patternnet

Good Morning,
I have a particular dataset composed of data from inertial sensors, used to recognize a specific movement pattern.
This movement is made up of different actions, and the data consist of several repetitions of the same movement from different subjects.
I decided to use neural networks, and in particular patternnet, to solve the problem.
Now I decided to divide my dataset in this particular way:
  • 1 subject with all his exercises will be the test data, while the others with their respective exercises will be the training data.
  • 1 subject with all his exercises from the training data will be the validation set, while the others will be the training set.
This is done because a randomized division of the training data during the validation process would train the model on data that is in some way similar to the validation set (every subject performs the exercise multiple times). Hence I thought about using a validation set as distinct as possible from the actual training set.
Now I tried to use divideblock to define the indexes for all three sets, but then I asked myself how I can iterate the process in order to do a leave-one-(subject)-out cross-validation. Do I have to change the indexes at every loop? Is this process done automatically by the train function?
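To make the idea concrete, this is roughly the kind of loop I have in mind (only a sketch; subjID would be a hypothetical cell array with the subject label of every sample, testSubject the subject held out for test, valSubjects the remaining subjects, and N the number of samples):
for v = 1:numel(valSubjects)                          % leave subject v out as validation
    valInd  = find(strcmp(subjID, valSubjects{v}));   % all samples of the left-out subject
    testInd = find(strcmp(subjID, testSubject));      % fixed test subject
    trnInd  = setdiff(1:N, [valInd; testInd]);        % everything else is training data
    net = patternnet(10);
    net.divideFcn = 'divideind';                      % explicit index-based division
    net.divideParam.trainInd = trnInd;
    net.divideParam.valInd   = valInd;
    net.divideParam.testInd  = testInd;
    [net, tr] = train(net, x, t);
end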
Thanks in advance,

Accepted Answer

Greg Heath
Greg Heath on 17 Dec 2018
Edited: Greg Heath on 17 Dec 2018
Over the past decades I have tried every cute data division technique known to man and beast.
BOTTOM LINE: The easiest sufficient technique is to perform multiple random data divisions (a minimal sketch follows the list below).
There are several basic decisions:
1. The inherent dimensionality of the data
2. Dimensionality reduction ?
3. The relative size of the 3 subsets ( typically: 70/15/15)
4. The number of hidden layers ( typically: one)
5. The number of crossval folds ( typically: 0 OR 10 to 15 )
6. The number of repeat designs ( typically 0 to 5 )
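For instance, a minimal sketch of what I mean by repeated random divisions (assuming inputs x and targets t are already defined; the numbers are just the typical choices listed above):
Ntrials = 5                                % number of repeat designs
rng(0)                                     % for reproducibility
for i = 1:Ntrials
    net = patternnet(10);                  % one hidden layer, H = 10
    net.divideFcn = 'dividerand';          % random 70/15/15 division (the default)
    [net, tr] = train(net, x, t);
    valperf(i) = tr.best_vperf;            % validation performance of this division
end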
Thank you for formally accepting my answer
Greg
  2 Comments
Mirko Job
Mirko Job on 18 Dec 2018
Dear Sir,
First of all, thanks for your fast response and support. I apologize for taking so much time to answer your comment, but I had to search around ANSWERS and (comp.soft-sys.matlab) for your previous posts in order to get an idea about the optimization of NN design.
Relying on your answer and on examples found on the forums, I will explain my doubts and my understanding in the comments below:
[ x, t ] = simplefit_dataset;
% 1 - Talking about data dimensionality:
% My dataset is composed of
% x = [I N] = 252 18118
% y = [O N] = 5 18118
% 2 - Could PCA be used as a tool for reducing the input dimensionality I ?
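% A possible sketch of that idea (assumes the Statistics and Machine Learning
% Toolbox pca function; left commented out, it is not part of the original script):
% [coeff, score, ~, ~, explained] = pca(x');    % pca expects observations in rows
% nComp = find(cumsum(explained) >= 95, 1);     % keep ~95% of the variance
% xRed  = score(:, 1:nComp)';                   % reduced input, [nComp N]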
MSE00 = mean(var(t',1))
% MSE00 is the MSE of the naive constant (mean output) model, used as the
% reference to normalize performance when choosing the hyperparameter combination
Hmin = 0
Hmax = 10
% For Hmax I can assume not to surpass Ntrneq = Ntrn*O. But it is not clear how
% I can calculate Hmin
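% (One common rule of thumb: keep the number of weights Nw = (I+1)*H + (H+1)*O
%  no larger than the number of training equations Ntrneq = Ntrn*O, which gives
%  Hmax <= (Ntrneq-O)/(I+O+1); Hmin = 0 is simply the linear-model baseline.)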
dH = 1
Ntrials = 10
% Ntrials is arbitrary
j = 0
for h = Hmin:dH:Hmax
% Cycle over the number of hidden neurons; basically I have to test the training several times
% with the same number of nodes, varying the weights and biases
j = j+1;
if h == 0
net = fitnet([]); % Linear Model
else
net = fitnet(h);
end
for i = 1: Ntrials
randstate(i,j)=rng;
% Is rng saved to keep the random state for weights and biases?
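% (rng called with no input returns the current generator state as a struct;
%  restoring it later with rng(randstate(i,j)) just before configure reproduces
%  the same initial weights and biases.)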
Hiddennodes = h
Trial = i
net = configure( net, x, t);
% Net initialization
net.divideFcn='dividerand';
% 3 - I understand the importance of dividing the data into random percentages,
% but for my specific case I have a dataset made in this way:
% SBJ1: EX1_Var1 EX1_Var2 ... EX1_VarN  ACTIVITY1
%       EX2_Var1 EX2_Var2 ... EX2_VarN  ACTIVITY1
%       EX3_Var1 EX3_Var2 ... EX3_VarN  ACTIVITY1
%       ...
%       EXN_Var1 EXN_Var2 ... EXN_VarN  ACTIVITY5
% SBJ2: EX1_Var1 EX1_Var2 ... EX1_VarN  ACTIVITY1
%       EX2_Var1 EX2_Var2 ... EX2_VarN  ACTIVITY1
%       EX3_Var1 EX3_Var2 ... EX3_VarN  ACTIVITY1
%       ...
%       EXN_Var1 EXN_Var2 ... EXN_VarN  ACTIVITY5
% So basically every subject repeats the same exercise multiple times.
% For this reason, if I divide the data randomly during the validation process,
% it is possible to validate on observations from subjects already seen in training.
% That is why I was thinking about LOOCV, taking out one subject with all his data.
[net tr y ] = train(net, x, t);
stopcrit{i,j} = tr.stop;
% Is this the point where I can define the criteria to stop the validation?
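% (Note: tr.stop only records why training stopped; the stopping criteria
%  themselves are set before calling train, e.g. net.trainParam.max_fail = 6
%  for validation failures or net.trainParam.epochs = 1000 for maximum epochs.)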
bestepoch(i,j) = tr.best_epoch;
NMSE = mse(t-y)/MSE00;% Normalization
R2(i,j) = 1-NMSE;
%Y{i,j}=y;
WB{i,j}=getwb(net);
% How can I choose the correct parameters after finishing all the loops?
% Should I look at R2 or bestepoch? (a possible selection sketch follows the loops below)
end
end
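One possibility I considered for the selection, just as a sketch (assuming R2 is the criterion):
[R2best, idx]  = max(R2(:));                 % best normalized performance over all trials
[ibest, jbest] = ind2sub(size(R2), idx);
Hbest = Hmin + (jbest-1)*dH                  % winning number of hidden nodes
% the corresponding weights and biases are then WB{ibest,jbest}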
Thanking you in advance for your precious feedback, and waiting for your kind response.
Kind Regards
Mirko Job
Mirko Job on 5 Feb 2019
Edited: Mirko Job on 8 Feb 2019
Dear Dr. Heath,
Before replying to your answer, I spent some time looking around for your previous posts about Neural Nets in Google Groups and MATLAB Answers. This allowed me to clarify some concepts about NN implementation and optimization, and these are the results I would like to present to you for your helpful feedback. First, a little reminder of my starting problem.
I have a dataset composed of XX subjects with multiple repetitions of the same movement. The movement is characterized by O sub-activities in a definite sequence. So each subject has N observations of O possible activities, relying on I features coming from accelerometer signals. My first observation was that if I use a standard randomized separation of validation, test and training sets (since every subject has multiple trials), inside the test and validation sets it is possible to find observations from a subject that is also present in the training set, basically biasing the results. So I came up with the solution of removing one subject for test and one for validation.
load(fullfile(cd,datasetfolder,'Table_Classifier.mat'));
dataI=table2array(DataSET(:,1:end-1)); % dataset without labels [N I]
dataO=table2array(DataSET(:,end)); % labels [N 1]
x=dataI'; %[I N] 252 18118
t=zeros(5,size(dataO,1)); %[O N] 5 18118
t(1,strcmp(dataO,'ACTIVITY1'))=1;t(2,strcmp(dataO,'ACTIVITY2'))=1;t(3,strcmp(dataO,'ACTIVITY3'))=1;
t(4,strcmp(dataO,'ACTIVITY4'))=1;t(5,strcmp(dataO,'ACTIVITY5'))=1;
Sbj=unique(DIVISOR); % DIVISOR contains all observations labeled with the subject ID
% DIMENSIONALITY OF THE DATASET
[I,N]=size(x); % 252 18118
[O,~]=size(t); % 5 18118
ITST= [find(contains(DIVISOR,Sbj{end}),1,'first');find(contains(DIVISOR,Sbj{end}),1,'last')]; % Test Indexes: the last subject with all his trials represents the test set
Now my idea is to use a trial-and-error iteration, using a different subject as the validation set each time and seeing which gives the best results. Here I have a doubt about the size of the validation and test sets compared to the entire dataset, so my first question: should I keep a proportion similar to the default properties (70-15-15)? If yes, another problem arises: subjects can have a different number of movement trials and, within these, a different number of observations for each class (since an observation is 100 ms of accelerometer data, slower people produce more data). Could a possible solution be repeating some rows of the dataset so that every subject has the same number of observations?
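Just to make the row-repetition idea concrete, this is the rough sketch I had in mind (untested; it resamples observations with replacement so that every subject contributes the same number of columns, assuming DIVISOR is a column cell array of subject IDs aligned with the columns of x and t):
counts = zeros(size(Sbj));
for s = 1:numel(Sbj)
    counts(s) = nnz(contains(DIVISOR, Sbj{s}));                % observations per subject
end
for s = 1:numel(Sbj)
    idx   = find(contains(DIVISOR, Sbj{s}));
    extra = idx(randi(numel(idx), max(counts)-numel(idx), 1)); % randomly repeated observations
    x = [x, x(:, extra)];   t = [t, t(:, extra)];              % append repeated columns
    DIVISOR = [DIVISOR; repmat(Sbj(s), numel(extra), 1)];      % keep subject labels aligned
end
% note: the appended columns are no longer contiguous per subject, so the indexes
% below would have to be collected with find(contains(...)) instead of first:last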
for v=1:size(Sbj,1)-1
IVAL=[find(contains(DIVISOR,Sbj{v}),1,'first');find(contains(DIVISOR,Sbj{v}),1,'last')]; % Validation Indexes: the v-th subject with all his trials is the validation set
ITRN=find(or(contains(DIVISOR,Sbj{end}),contains(DIVISOR,Sbj{v}))==0); % Training Indexes: the remaining subjects are the training set
MSE00a=mean(var(t(:,ITRN),0,2)); % 0.1210
Hdef=10; % Default number of hidden nodes
Ntrn=size(ITRN,1); % Training samples i.e. 16186
Hmax=floor((Ntrn-O)/(I+1+O)); % Upper bound for H, i.e. 62
rng(0); % Initialize random state
j=0;
% ITERATION FOR Number of nodes
for h=Hdef:Hmax
Ntrneq=Ntrn*O; % 80930
Nw=(I+1)*h+(h+1)*O; % 6197
Ndof=Ntrneq-Nw; % 74733
MSEgoal=0.01*(Ndof/Ntrneq)*MSE00a; % training goal 0.0011
j=j+1;
net=patternnet(h,'trainscg'); % Definition of the net
net.divideFcn='divideind'; % Dataset division over indexes
net.divideParam.trainInd=ITRN;
net.divideParam.valInd=IVAL(1):IVAL(end);
net.divideParam.testInd=ITST(1):ITST(end);
net.trainParam.goal=MSEgoal;
net.performFcn = 'mse';
for i=1:10 % 10 sets of random weights
net=configure(net,x,t);
[net,tr,y,e]=train(net,x,t);
eval([Sbj{v},'stopcrit{i,j}=tr.stop;'])
eval([Sbj{v},'bestepoch(i,j)=tr.best_epoch;'])
eval([Sbj{v},'bestperf(i,j)=tr.best_perf;'])
end
end
end
Now, thanks to the eval command, I have the results for every possible validation set used in training, for every number of hidden nodes and every set of random weights.
My second question is: after I have found the correct model for the NN, how can I obtain the specific weights and biases from rng(0)? Is it not the same as saving them inside a cell variable and then setting them as a classifier property?
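For example, this is what I meant by reusing the saved values (just a sketch; hbest, ibest and jbest would be the indexes of the winning combination, and WB would have to be filled with getwb during the loops, as in my previous comment):
net = patternnet(hbest, 'trainscg');
net = configure(net, x, t);                 % fix the input/output sizes
net = setwb(net, WB{ibest, jbest});         % restore the stored weights and biases
Restoring the saved rng state instead would only reproduce the same initial weights, so the net would still have to be retrained with the same data division to reach the same solution.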
After this, I'm quite confused about my next steps, and I would really appreciate your kind advice:
1 - Should I use the formula of LOOCV,
CV(K) = (1/K) * sum_{k=1..K} MSE_k,
using the 3 performance matrices saved for every subject used as the validation set, doing (like the title of the thread says) a sort of Leave-One-(Subject)-Out Cross-Validation? In this case, should I save the best performance or the separate errors on training, validation and test using the confusion matrices? What is the MSE referred to in the formula above: training, test or validation? (See also the small sketch after this list.)
2 - Should I simply use the best combination of test and validation sets as separate subjects?
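Regarding point 1, the computation I had in mind would reduce to something like this (a sketch; perfSubj is a hypothetical vector holding, for each left-out validation subject, the best validation MSE found in the loops above):
CVscore = mean(perfSubj)   % leave-one-(subject)-out estimate of the generalization error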
Any other feedback to improve the possible output of my NN is very welcome since, as is probably clear, I'm at the very basics.
Thanks in advance,


More Answers (0)
