How to use kmeans function on data stored by datastore function?

2 views (last 30 days)
Ahmed Hamed
Ahmed Hamed on 29 Apr 2016
Edited: Josh Meyer on 17 Jul 2017
I'm trying to cluster big data using kmeans, i found a code that can do something similar here you are
Mu = bsxfun(@times,ones(20,30),(1:20)'); % Gaussian mixture mean
rn30 = randn(30,30);
Sigma = rn30'*rn30; % Symmetric and positive-definite covariance
Mdl = gmdistribution(Mu,Sigma);
rng(1); % For reproducibility
X = random(Mdl,10000);
pool = parpool; % Invokes workers
stream = RandStream('mlfg6331_64'); % Random number stream
options = statset('UseParallel',1,'UseSubstreams',1,...
tic; % Start stopwatch timer
[idx,C,sumd,D] = kmeans(X,20,'Options',options,'MaxIter',10000,...
toc % Terminate stopwatch timer
But as you can see, X is double.
My problem is that i have a file named HIS.csv and i used the datastore function to store it as follows
ds = datastore('HIS_all.csv', 'DatastoreType', 'tabulartext','TreatAsMissing', 'NA');
when i tried
[idx,C,sumd,D] = kmeans(ds,20,'Options',options,'MaxIter',10000, 'Display','final','Replicates',10);
i get the following error
Undefined function 'isnan' for input arguments of type ''.
Error in kmeans (line 158)
wasnan = any(isnan(X),2);
Any suggestions?

Answers (1)

Josh Meyer
Josh Meyer on 15 Jul 2017
Edited: Josh Meyer on 17 Jul 2017
Datastore is just a framework for loading small chunks of the data at a time, so you can't call generic functions directly on the datastore. Instead try converting the datastore into a tall array first:
T = tall(ds);
The kmeans function supports tall arrays, so once the data is in this format you can use the function. Note that there are some limitations to using kmeans on a tall array, so some of the NV pairs you specified might not work. The limitations are outlined here:

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!