Training a neural net on the entire dataset after model selection with K-fold cross-validation: how to avoid overfitting if I don't have a validation and test set?

Hi everyone,
I am working on artificial neural networks for an application in movement analysis. I started using neural networks this year and, following courses and posts on ANSWERS and the MATLAB community, I tried to implement a K-fold CV procedure to develop a model for movement classification.
SOME CONSIDERATIONS: My dataset is composed of 19 subjects, each repeating a movement pattern 20 times. The movement pattern consists of 5 sequential phases, which are divided into 100 ms observations from 6 sensors. To split the data into 3 independent TRAINING, VALIDATION and TEST sets, I have to keep all observations from a given subject inside the same set.
I implemented the overall procedure, which I include at the end of this post. But now I have 2 questions:
1- Looking at the useful examples from Prof. Greg Heath, I saw that R^2 is often used as a performance measure to evaluate a model. However, I also read that it is typically recommended for regression problems. Is it possible to use it for classification as well?
2- After I get the results from my 10x10 iterations over initial weights and numbers of hidden neurons, should I use the collected information to train the 'optimal' model on the entire dataset? Or should I simply take the best model found, even though it was trained without the Nval + Ntst samples? I ask this because I already tried to train the optimal model on all my data, but of course, if I don't specify a validation set, early stopping does not work and I end up overfitting.
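To make question 1 concrete, here is a minimal sketch that computes both the normalized-MSE R^2 (the same 1 - MSE/MSE00 convention used in the code further down) and plain classification accuracy on one-hot targets. The targets `t` and outputs `y` are made-up toy values, not data from this project.

```matlab
% Toy one-hot targets: 3 classes (rows), 6 observations (columns); made up for illustration
t = [1 1 0 0 0 0;
     0 0 1 1 0 0;
     0 0 0 0 1 1];
% Hypothetical network outputs (each column sums to 1, as with a softmax layer)
y = [0.8 0.6 0.2 0.1 0.3 0.1;
     0.1 0.3 0.7 0.5 0.5 0.2;
     0.1 0.1 0.1 0.4 0.2 0.7];

% Normalized MSE and R^2
MSE00 = mean(var(t', 1));          % MSE of the naive constant (class-mean) model
MSE   = mean((t(:) - y(:)).^2);    % MSE of the outputs over all classes and observations
R2    = 1 - MSE/MSE00;

% Classification accuracy: predicted class = row index of the largest output
[~, trueClass] = max(t, [], 1);
[~, predClass] = max(y, [], 1);
accuracy = mean(predClass == trueClass);
```

With these toy numbers R^2 is 0.5 while the accuracy is 5/6: R^2 measures how close the outputs are to the 0/1 targets, whereas accuracy only checks which output is largest, so the two can rank models differently.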
Thanks in advance for every possible help.
Mirko
%% MEMORY CLEAN
clear all; close all; clc
%% LOAD DATASET
datasetfolder='DATA/CLASSIFIER/Table_Classifier';
load(fullfile(cd,datasetfolder,'Table_Classifier.mat'));% ------------------- Load Dataset
x=table2array(DataSET(:,1:end-1))';% ---------------------------------------- Input [IxN°obs.] 252x42563
tc=table2array(DataSET(:,end));% -------------------------------------------- Label Cell Array [N°obs.x1]
targets=[strcmp(tc,'Phase1'),...% ------------------------------------------- Targets [OxN°obs.] 5x42563
strcmp(tc,'Phase2'),...
strcmp(tc,'Phase3'),...
strcmp(tc,'Phase4'),...
strcmp(tc,'Phase5')]';
%% DIMENSIONALITY OF THE DATASET
[I,N]=size(x);
[O,~]=size(targets);
%% DEFINITION OF FOLDS FOR XVALIDATION
% In my case each fold should include all observations from all exercises of a specific subject; DIVISOR is a
% label that indicates the subject each observation belongs to.
Sbj=unique(DIVISOR);
loop=0;
% Choose the type of validation
while loop==0
    flag=input(['Which validation scheme would you like to implement?\n',...
        ' 1 - 5 folds\n 2 - 10 folds\n 3 - LOSOCV\n\n']);
    switch flag
        case 1
            folds = 6;            % 5 CV folds + 1 test group
            loop = 1;
        case 2
            folds = 11;           % 10 CV folds + 1 test group
            loop = 1;
        case 3
            folds = length(Sbj);  % leave-one-subject-out (MATLAB is case-sensitive: Sbj, not SBJ)
            loop = 1;
        otherwise
            loop = 0;             % invalid choice: ask again
    end
end
Based on the number of folds defined above, I created a cell array 'subgroup' (1 x folds) containing the subject labels randomly assigned to 'folds' different groups. Note that if I choose 5-fold cross-validation, subgroup will have 5+1 elements (one element is used as the test set). For example:
  • Subgroup {1}: Sbj1, Sbj7, Sbj5
  • Subgroup {2}: Sbj2, Sbj4
  • Subgroup {3}: Sbj3, Sbj6
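The code that builds 'subgroup' is not shown in this post. One way it could be done is sketched below, assuming the subjects are shuffled and then dealt round-robin into the groups; the subject labels and the value of `folds` are hypothetical.

```matlab
% Hypothetical subject labels; in the post these come from unique(DIVISOR)
Sbj   = {'Sbj1','Sbj2','Sbj3','Sbj4','Sbj5','Sbj6','Sbj7'};
folds = 3;                                   % number of groups (k CV folds + 1 test group)

rng(0);                                      % reproducible shuffle
shuffled = Sbj(randperm(numel(Sbj)));        % random order of the subjects

% Deal the shuffled subjects round-robin into 'folds' groups
groupId  = mod(0:numel(Sbj)-1, folds) + 1;   % 1,2,3,1,2,3,1
subgroup = cell(1, folds);
for k = 1:folds
    subgroup{k} = shuffled(groupId == k);    % cell array of labels for group k
end
```

Dealing round-robin keeps the group sizes within one subject of each other, which matters with only 19 subjects.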
At this point, starting from the double-loop approach by Prof. Greg Heath, I implemented an expanded approach in which:
  1. each element of subgroup (i.e. each fold) is used in turn as the test set
  2. the remaining elements are used for k-fold cross-validation
  3. the validation loop is repeated for 10 random initializations of the weights and 10 candidate numbers of hidden neurons
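The three steps above can be sketched on toy data as follows. `DIVISOR` and `subgroup` here are small stand-ins, and `ismember` is used in place of the `cellfun`/`contains` construction from the full code. That swap is a deliberate simplification: `ismember` matches whole labels, while `contains` matches substrings, which would matter if the real labels look like 'Sbj1' and 'Sbj10' (an assumption about the label format).

```matlab
% Toy stand-ins (assumptions): one subject label per observation, and 3 groups
DIVISOR  = {'Sbj1';'Sbj1';'Sbj2';'Sbj3';'Sbj3';'Sbj4';'Sbj5';'Sbj6';'Sbj7'};
subgroup = {{'Sbj1','Sbj7','Sbj5'}, {'Sbj2','Sbj4'}, {'Sbj3','Sbj6'}};
folds    = numel(subgroup);

nSplits = 0;
for t = 1:folds                                   % outer loop: test group
    isTst = ismember(DIVISOR, subgroup{t});
    for v = setdiff(1:folds, t)                   % inner loop: validation group
        isVal = ismember(DIVISOR, subgroup{v});
        ITST = find(isTst);
        IVAL = find(isVal);
        ITRN = find(~isTst & ~isVal);             % everything else is training
        % The three index sets must partition the observations
        assert(numel(ITST) + numel(IVAL) + numel(ITRN) == numel(DIVISOR));
        nSplits = nSplits + 1;
    end
end
% Each split is then trained for 10 hidden-layer sizes x 10 weight initializations
nTrainings = nSplits * 10 * 10;
```

The count `nTrainings` grows as folds*(folds-1)*100, which is worth checking before choosing the leave-one-subject-out option.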
%% IDENTIFICATION OF THE AVERAGE NTRN
% Choosing different folds for test and validation implicitly changes the number of training samples,
% which is needed to calculate the N° of hidden neurons, so I evaluate an average N° of training samples over all possible selections.
Ntr_av=0;%------------------------------------------------------------------- Average N°trn
for t=1:folds%--------------------------------------------------------------- For each test choice
logicalindext=cellfun(@(x)contains(DIVISOR,x),...
subgroup{t},'un',0);
for v=1:folds%----------------------------------------------------------- For each validation choice
if t~=v
logicalindexv=cellfun(@(x)contains(DIVISOR,x),subgroup{v},'un',0);
TrainSET=find(~any([any(...%------------------------------------- Train indices
horzcat(logicalindext{:}),2),any(...
horzcat(logicalindexv{:}),2)],2)==1);
Ntr_av=Ntr_av+length(TrainSET);
end
end
end
Ntr_av=Ntr_av/((folds-1)*folds);%-------------------------------------------- Average N°trn
Hmin=10;%-------------------------------------------------------------------- Minimum Hidden nodes number
Hub_av=(Ntr_av*O-O)/(I+O+1);%------------------------------------------------ Upper limit for N° Hidden neuron
Hmax_av = round(Hub_av/10);%------------------------------------------------- Max N° hidden neurons (<<<Hub_av for robust training)
dn=floor((Hmax_av-Hmin)/9);%------------------------------------------------- Step dn
Neurons=(0:9).*dn+Hmin;%----------------------------------------------------- I define 10 candidate hidden-layer sizes, spaced by dn
MSE00 = mean(var(targets',1));%---------------------------------------------- Naive Constant model reference on all dataset
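As a worked example of the hidden-neuron bound above: the upper limit Hub is the number of hidden neurons at which the number of weights equals the number of training equations. The sketch below uses the I = 252 and O = 5 quoted in the comments, and an assumed training-set size `Ntrn = 30000` chosen only for illustration (the real average depends on the folds).

```matlab
I = 252;  O = 5;          % input and output dimensions quoted in the post
Ntrn = 30000;             % assumed average training-set size (illustrative only)

Ntrneq = Ntrn * O;                    % number of training equations
Hub    = (Ntrneq - O) / (I + O + 1);  % above this H, unknown weights exceed equations
Hmin   = 10;
Hmax   = round(Hub / 10);             % stay well below the upper bound
dn     = floor((Hmax - Hmin) / 9);
Neurons = (0:9) .* dn + Hmin;         % 10 candidate hidden-layer sizes

% Number of weights of an I-H-O network with biases, as a function of H
Nw = @(H) (I + 1) * H + (H + 1) * O;
```

With these assumed numbers Hub is about 581, Hmax is 58 and the candidates are 10, 15, ..., 55. `Nw(H)` can be used to confirm that every candidate stays far below `Ntrneq`.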
%% NEURAL NETWORK MODEL
for t=1:folds%--------------------------------------------------------------- For each fold t
logicalindext=cellfun(@(x)contains(DIVISOR,x),...%----------------------- I define the current fold as TEST SET, finding all the indices corresponding
subgroup{t},'un',0); % to the labels in subgroup{t}
ITST=find(any(horzcat(logicalindext{:}),2)==1);
MSE00tst = mean(var(targets(:,ITST)',1));%------------------------------- Naive constant model reference on the Test SET
IVAL=cell(1,folds-1);%--------------------------------------------------- Declaration of folds-1 pairs of possible training
ITRN=cell(1,folds-1);%--------------------------------------------------- and validation indices and the respective MSE00
MSE00val=zeros(1,folds-1);
MSE00trn=zeros(1,folds-1);
count=1;
for v=1:folds%----------------------------------------------------------- For each fold
if t~=v%------------------------------------------------------------- different from Test SET t
logicalindexv=cellfun(@(x)contains(DIVISOR,x),subgroup{v},'un',0);
IVAL{1,count}=find(any(...%-------------------------------------- I identify the indices of validation and training
horzcat(logicalindexv{:}),2)==1);
ITRN{1,count}=find(~any([any(...
horzcat(logicalindext{:}),2),any(...
horzcat(logicalindexv{:}),2)],2)==1);
MSE00val(1,count)=mean(var(targets(:,IVAL{1,count})',1));%------- And I calculate the MSE00 references (validation from IVAL,
MSE00trn(1,count)=mean(var(targets(:,ITRN{1,count})',1)); % training from ITRN)
count=count+1;
end
end
S=cell(1,10);%----------------------------------------------------------- Across each validation loop i have to use the same initial weight
rng(0);%----------------------------------------------------------------- Default random state
for s=1:10
S{s}=rng;%----------------------------------------------------------- I save 10 different random states to be restored across the 10
rand; % different validation loops (initial weight iterations)
end
rng(0);%----------------------------------------------------------------- Default random state
% Performance measures
perf_xentrval=zeros(10,10);
perf_xentrtrn=zeros(10,10);
perf_xentrtst=zeros(10,10);
perf_mseval=zeros(10,10);
perf_msetrn=zeros(10,10);
perf_msetst=zeros(10,10);
perf_R2=zeros(10,10);
perf_R2trn=zeros(10,10);
perf_R2tst=zeros(10,10);
perf_R2val=zeros(10,10);
for n=1:10%-------------------------------------------------------------- For each model of hidden neurons
H=Neurons(n);%------------------------------------------------------- I use the model defined previously
parfor i=1:10%------------------------------------------------------- For each iteration of initial random weight
fprintf(['Validation for Model with: ',num2str(H),' neurons and randomization ',num2str(i),'\n']);
tic
[val_xentrval,val_xentrtrn,val_xentrtst,val_mseval,val_msetrn,val_msetst,val_R2,val_R2trn,val_R2val,val_R2tst]=ValidationLoops...
(S{i},MSE00,MSE00trn,MSE00tst,MSE00val,folds,x,targets,H,ITRN,IVAL,ITST);
toc
The function ValidationLoops was created to work around parfor restrictions and errors with multiprocessing commands:
function [val_xentrval,val_xentrtrn,val_xentrtst,val_mseval,val_msetrn,val_msetst,val_R2,val_R2trn,val_R2val,val_R2tst]...
=ValidationLoops(S,MSE00,MSE00trn,MSE00tst,MSE00val,folds,x,targets,H,ITRN,IVAL,ITST)
% Validation performance Variables
val_xentrval = zeros(1,folds-1);
val_xentrtrn = zeros(1,folds-1);
val_xentrtst = zeros(1,folds-1);
val_mseval = zeros(1,folds-1);
val_msetrn = zeros(1,folds-1);
val_msetst = zeros(1,folds-1);
val_R2 = zeros(1,folds-1);
val_R2trn = zeros(1,folds-1);
val_R2val = zeros(1,folds-1);
val_R2tst = zeros(1,folds-1);
for v=1:folds-1%---------------------------------------------- For each validation fold
net=patternnet(H,'trainlm');%----------------------------- Define the net
net.performFcn = 'mse';%---------------------------------- Loss function
net.divideFcn='divideind';%------------------------------- Setting TRAINING TEST AND VALIDATION
net.divideParam.trainInd=ITRN{v}; % TrainingSET
net.divideParam.valInd=IVAL{v}; % ValidationSET
net.divideParam.testInd=ITST; % TestSET
rng(S); % Reset the random state: across validation loops I evaluate the SAME MODEL in terms
% of neurons and initial weights
net=configure(net,x,targets);
[net,tr,y,e]=train(net,x,targets);
% Save Performance variables
val_xentrval(v) = crossentropy(net,targets(:,IVAL{v}),...%------- Crossentropy
y(:,IVAL{v}));
val_xentrtrn(v) = crossentropy(net,targets(:,ITRN{v}),...
y(:,ITRN{v}));
val_xentrtst(v) = crossentropy(net,targets(:,ITST),...
y(:,ITST));
val_mseval(v) = tr.best_vperf;%---------------------------------- MSE
val_msetrn(v) = tr.best_perf;
val_msetst(v) = tr.best_tperf;
val_R2(v) = 1 - mse(e)/MSE00;%----------------------------------- R2
val_R2trn(v) = 1 - tr.best_perf/MSE00trn(v);
val_R2val(v) = 1 - tr.best_vperf/MSE00val(v);
val_R2tst(v) = 1 - tr.best_tperf/MSE00tst;
end
After the validation, I save the results for the model with n neurons and the i-th random initialization of the weights as the mean of the results obtained over the validation loops.
perf_xentrval(n,i)=...
mean(val_xentrval);
perf_xentrtrn(n,i)=...
mean(val_xentrtrn);
perf_xentrtst(n,i)=...
mean(val_xentrtst);
perf_mseval(n,i)=...
mean(val_mseval);
perf_msetrn(n,i)=...
mean(val_msetrn);
perf_msetst(n,i)=...
mean(val_msetst);
perf_R2(n,i)=...
mean(val_R2);
perf_R2trn(n,i)=...
mean(val_R2trn);
perf_R2val(n,i)=...
mean(val_R2val);
perf_R2tst(n,i)=...
mean(val_R2tst);
end
end
% This process is repeated for each choice of different Test Set
eval(['T',num2str(t),'Test_model.data.xentrval=perf_xentrval']);
eval(['T',num2str(t),'Test_model.data.xentrtrn=perf_xentrtrn']);
eval(['T',num2str(t),'Test_model.data.xentrtst=perf_xentrtst']);
eval(['T',num2str(t),'Test_model.data.mseval=perf_mseval']);
eval(['T',num2str(t),'Test_model.data.msetrn=perf_msetrn']);
eval(['T',num2str(t),'Test_model.data.msetst=perf_msetst']);
eval(['T',num2str(t),'Test_model.data.R2=perf_R2']);
eval(['T',num2str(t),'Test_model.data.R2val=perf_R2val']);
eval(['T',num2str(t),'Test_model.data.R2trn=perf_R2trn']);
eval(['T',num2str(t),'Test_model.data.R2tst=perf_R2tst']);
eval(['T',num2str(t),'Test_model.HiddenNeurons=Neurons']);
eval(['T',num2str(t),'Test_model.SET.Sbj=subgroup{t};']);
eval(['T',num2str(t),'Test_model.SET.Ind=ITST;']);
end
delete(gcp('nocreate'))
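For question 2, the 10x10 result matrices have to be reduced to one model choice before any retraining. A minimal sketch of two possible ways to do that is given below; `perf_mseval` is filled with random numbers here only so that the snippet runs on its own, and `Neurons` is a hypothetical list of candidate sizes.

```matlab
% Stand-in for one 10x10 result matrix (rows: hidden sizes, columns: weight inits);
% random numbers are used only so the sketch runs on its own
rng(0);
Neurons     = 10:5:55;                 % hypothetical candidate hidden-layer sizes
perf_mseval = rand(10, 10);

% Option A: single best (size, initialization) pair by mean validation MSE
[bestMSE, linIdx] = min(perf_mseval(:));
[nBest, iBest]    = ind2sub(size(perf_mseval), linIdx);
H_best = Neurons(nBest);

% Option B: best size on average over the 10 initializations
% (less sensitive to one lucky initialization)
[~, nRobust] = min(mean(perf_mseval, 2));
H_robust = Neurons(nRobust);
```

Whichever option is used, the selection should be based on the validation matrices only, so that the test-set matrices remain an unbiased estimate for the chosen model.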
