kmeans with centroids from previous analysis
Hello everyone,
I wanted to confirm whether my approach is right. I have centroids from a previous kmeans analysis, and now I'd like to extract the membership indexes for these clusters from new data. Am I correct in using:
SubjectMembershipIndex = kmeans(Data, [], 'Distance','cityblock', 'Start', PreviousCentroids);
Thanks!
Best
Hans
Accepted Answer
I think some explanation is needed here:
The second time you call kmeans, it will run kmeans all over again, just starting with those centroids as seeds. So the returned centroids will be different the second time than the first time, though they should be close. However, if you use those same cluster centroids as starting seeds for a totally new set of data, the new centroids may or may not be close to those from the first run. The second set of centroids depends on what data you actually plug into kmeans the second time. See this demo where I ran kmeans twice with k=2 for 2 clusters. First I run it on the two widely separated clusters, then I run it on the two inner clusters using the centroids from the outer clusters.
% Initialization steps.
clc; % Clear the command window.
close all; % Close all figures (except those of imtool.)
clear; % Erase all existing variables. Or clearvars if you want.
workspace; % Make sure the workspace panel is showing.
format long g;
format compact;
fontSize = 8;
%------------------------------------------------------------------------------------------------
% FIRST CREATE SAMPLE DATA.
% Make up 4 clusters with 150 points each.
pointsPerCluster = 150;
spread = 0.03;
offsets = [0.3, 0.5, 0.7, 0.9];
% offsets = [0.62, 0.73, 0.84, 0.95];
xa = spread * randn(pointsPerCluster, 1) + offsets(1);
ya = spread * randn(pointsPerCluster, 1) + offsets(1);
xb = spread * randn(pointsPerCluster, 1) + offsets(2);
yb = spread * randn(pointsPerCluster, 1) + offsets(2);
xc = spread * randn(pointsPerCluster, 1) + offsets(3);
yc = spread * randn(pointsPerCluster, 1) + offsets(3);
xd = spread * randn(pointsPerCluster, 1) + offsets(4);
yd = spread * randn(pointsPerCluster, 1) + offsets(4);
%-------------------------------------------------------------------------------------------------------------------------------------------
% First let's run kmeans with 2 clusters a & d
x = [xa; xd];
y = [ya; yd];
xy = [x, y];
%-------------------------------------------------------------------------------------------------------------------------------------------
% K-MEANS CLUSTERING.
% Now do the initial kmeans clustering.
% Determine what the best k is:
% evaluationObject = evalclusters(xy, 'kmeans', 'DaviesBouldin', 'klist', [2:10])
% Do the kmeans with that k (evaluationObject.OptimalK should be 2).
evaluationObject.OptimalK = 2;
[assignedClass, clusterCenters] = kmeans(xy, evaluationObject.OptimalK);
clusterCenters % Echo to command window
% Do a scatter plot with the original class numbers assigned by kmeans.
hfig1 = figure;
subplot(1, 2, 1);
gscatter(x, y, assignedClass);
legend('FontSize', fontSize, 'Location', 'northwest');
grid on;
xlabel('x', 'fontSize', fontSize);
ylabel('y', 'fontSize', fontSize);
title('Original Class Numbers Assigned by kmeans()', 'fontSize', fontSize);
% Plot the class number labels on top of the cluster.
hold on;
for row = 1 : size(clusterCenters, 1)
text(clusterCenters(row, 1), clusterCenters(row, 2), num2str(row), 'FontSize', 25, 'FontWeight', 'bold', 'HorizontalAlignment', 'center', 'VerticalAlignment', 'middle');
end
hold off;
hfig1.WindowState = 'maximized'; % Maximize the figure window so that it takes up the full screen.
% IMPORTANT NOTE: BECAUSE OF RANDOMNESS, SOMETIMES THE LOWER LEFT CLUSTER
% IS LABELED 1 AND SOMETIMES IT'S LABELED 2.
%-------------------------------------------------------------------------------------------------------------------------------------------
% K-MEANS CLUSTERING.
% Now do the kmeans clustering again, using the same data and the centroids from before.
PreviousCentroids = clusterCenters;
[SubjectMembershipIndex, newCentroids1] = kmeans(xy, [], 'Distance','cityblock', 'Start', PreviousCentroids);
fprintf('Using the same data (will be close but not exact):\n')
newCentroids1
% Note the new centroids are close to, but not exactly the same as the previous centroids.
% Now do the kmeans clustering again, using the centroids from before,
% but with new data -- the b and c clusters instead of the a and d clusters.
x2 = [xb; xc];
y2 = [yb; yc];
xy2 = [x2, y2];
[SubjectMembershipIndex, newCentroids2] = kmeans(xy2, [], 'Distance','cityblock', 'Start', PreviousCentroids);
fprintf('Using different data (could be very different depending on the new data):\n')
newCentroids2
% Do a scatter plot with the original class numbers assigned by kmeans.
subplot(1, 2, 2);
gscatter(x, y, assignedClass);
hold on;
gscatter(x2, y2, SubjectMembershipIndex);
legend('FontSize', fontSize, 'Location', 'northwest');
grid on;
xlabel('x', 'fontSize', fontSize);
ylabel('y', 'fontSize', fontSize);
title('Class Numbers Assigned by kmeans()', 'fontSize', fontSize);
% Plot the class number labels on top of the cluster.
hold on;
for row = 1 : size(newCentroids2, 1)
text(newCentroids2(row, 1), newCentroids2(row, 2), num2str(row), 'FontSize', 25, 'FontWeight', 'bold', 'HorizontalAlignment', 'center', 'VerticalAlignment', 'middle');
end
hold off;
Things to note here: the cluster centroids the second time are very different from the first time, and are centered where the inner 2 clusters are, because that's the data I told kmeans to determine classes for. Note that for these very widely separated clusters the class labels came out the same (I ran it dozens of times to check). HOWEVER, for mixed/overlapping clusters I believe there might be some points in the "overlap" region that get assigned to class #1 during one run but to class #2 during the next. That can happen even if you don't re-use centroids: points in the overlap region may be assigned different class (cluster) numbers due to the randomness inherent in the algorithm.
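If that label swapping is a problem, one option (a sketch of my own, not something from the demo above) is to relabel a later run so its cluster numbers match an earlier one, by pairing each new centroid with its nearest old centroid. The variable values below are made up for illustration:

```matlab
% Sketch: make cluster numbers from a second kmeans run consistent with a
% first run by matching centroids. All values here are illustrative.
oldCentroids = [0.3, 0.3; 0.9, 0.9];    % centroids from the first run
newCentroids = [0.88, 0.91; 0.31, 0.29]; % second run; labels happen to be swapped
assignedClass = [1; 1; 2; 2];            % labels kmeans returned in the second run
% For each new centroid, find the index of the nearest old centroid.
[~, labelMap] = min(pdist2(newCentroids, oldCentroids), [], 2);
% Relabel the second run's assignments. (If clusters moved a lot, two new
% centroids could map to the same old one, so sanity-check labelMap.)
relabeled = labelMap(assignedClass);     % now consistent with the first run
```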
(Hope this wasn't too confusing - reread it several times if it is.)
5 Comments
Hans van der Horn
on 8 Dec 2023
Walter Roberson
on 8 Dec 2023
When you run kmeans, even with initial centroids, it will spend some time trying to optimize the cluster membership. It doesn't take much for points to wander between clusters, or for clusters to move significantly.
If you do not want the iteration to find new centroids, then do not use kmeans; just evaluate distances using pdist2.
Hans van der Horn
on 8 Dec 2023
Image Analyst
on 8 Dec 2023
Yep, I agree with your last paragraph and with Walter. If you just want to know which centroid is closest, then just compute the distance of your new points from those centroids rather than doing kmeans again. Like he said, you can use pdist2. For each point, whichever centroid has the smaller distance is the class that point should be assigned to.
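As a quick sketch of the pdist2 approach Walter and I are describing (variable names and data here are illustrative, not from Hans's actual analysis):

```matlab
% Assign new points to EXISTING centroids without re-running kmeans.
% PreviousCentroids is k-by-2, newData is N-by-2 (both made up here).
PreviousCentroids = [0.3, 0.3; 0.9, 0.9];
newData = rand(10, 2);
% City-block distance from every point to every centroid (N-by-k matrix),
% matching the 'cityblock' distance used in the original kmeans call.
D = pdist2(newData, PreviousCentroids, 'cityblock');
% For each point, the column index of the smallest distance is its class.
[~, SubjectMembershipIndex] = min(D, [], 2);
```

This keeps the centroids fixed, so the membership indexes are fully determined by the previous analysis.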
Hans van der Horn
on 9 Dec 2023