kmeans with centroids from previous analysis
Hello everyone,
I wanted to confirm whether my approach is right. I have centroids from a previous kmeans analysis, and now I'd like to extract the membership indexes for these clusters from new data. Am I correct in using:
SubjectMembershipIndex = kmeans(Data, [], 'Distance','cityblock', 'Start', PreviousCentroids);
Thanks!
Best
Hans
Accepted Answer
I think some explanation is needed here:
The second time you call kmeans, it will run kmeans all over again, just starting with those centroids as seeds. So the returned centroids will be different the second time than the first time, though they should be close. However, if you use those same cluster centroids as starting seeds for a totally new set of data, the new centroids may or may not be close to those from the first run. The second set of centroids depends on what data you actually plug into kmeans the second time. See this demo where I ran kmeans twice with k=2 for 2 clusters. First I run it on the two widely separated clusters, then I run it on the two inner clusters using the centroids from the outer clusters.
% Initialization steps.
clc; % Clear the command window.
close all; % Close all figures (except those of imtool.)
clear; % Erase all existing variables. Or clearvars if you want.
workspace; % Make sure the workspace panel is showing.
format long g;
format compact;
fontSize = 8;
%------------------------------------------------------------------------------------------------
% FIRST CREATE SAMPLE DATA.
% Make up 4 clusters with 150 points each.
pointsPerCluster = 150;
spread = 0.03;
offsets = [0.3, 0.5, 0.7, 0.9];
% offsets = [0.62, 0.73, 0.84, 0.95];
xa = spread * randn(pointsPerCluster, 1) + offsets(1);
ya = spread * randn(pointsPerCluster, 1) + offsets(1);
xb = spread * randn(pointsPerCluster, 1) + offsets(2);
yb = spread * randn(pointsPerCluster, 1) + offsets(2);
xc = spread * randn(pointsPerCluster, 1) + offsets(3);
yc = spread * randn(pointsPerCluster, 1) + offsets(3);
xd = spread * randn(pointsPerCluster, 1) + offsets(4);
yd = spread * randn(pointsPerCluster, 1) + offsets(4);
%-------------------------------------------------------------------------------------------------------------------------------------------
% First let's run kmeans with 2 clusters a & d
x = [xa; xd];
y = [ya; yd];
xy = [x, y];
%-------------------------------------------------------------------------------------------------------------------------------------------
% K-MEANS CLUSTERING.
% Now do the initial kmeans clustering.
% Determine what the best k is:
% evaluationObject = evalclusters(xy, 'kmeans', 'DaviesBouldin', 'klist', [2:10])
% Do the kmeans with that k (evaluationObject.OptimalK should be 2).
evaluationObject.OptimalK = 2;
[assignedClass, clusterCenters] = kmeans(xy, evaluationObject.OptimalK);
clusterCenters % Echo to command window
% Do a scatter plot with the original class numbers assigned by kmeans.
hfig1 = figure;
subplot(1, 2, 1);
gscatter(x, y, assignedClass);
legend('FontSize', fontSize, 'Location', 'northwest');
grid on;
xlabel('x', 'fontSize', fontSize);
ylabel('y', 'fontSize', fontSize);
title('Original Class Numbers Assigned by kmeans()', 'fontSize', fontSize);
% Plot the class number labels on top of the cluster.
hold on;
for row = 1 : size(clusterCenters, 1)
text(clusterCenters(row, 1), clusterCenters(row, 2), num2str(row), 'FontSize', 25, 'FontWeight', 'bold', 'HorizontalAlignment', 'center', 'VerticalAlignment', 'middle');
end
hold off;
hfig1.WindowState = 'maximized'; % Maximize the figure window so that it takes up the full screen.
% IMPORTANT NOTE: BECAUSE OF RANDOMNESS, SOMETIMES THE LOWER LEFT CLUSTER
% IS LABELED 1 AND SOMETIMES IT'S LABELED 2.
%-------------------------------------------------------------------------------------------------------------------------------------------
% K-MEANS CLUSTERING.
% Now do the kmeans clustering again, using the same data and the centroids from before.
PreviousCentroids = clusterCenters;
[SubjectMembershipIndex, newCentroids1] = kmeans(xy, [], 'Distance','cityblock', 'Start', PreviousCentroids);
fprintf('Using the same data (will be close but not exact):\n')
newCentroids1
% Note the new centroids are close to, but not exactly the same as the previous centroids.
% Now do the kmeans clustering again, using the centroids from before,
% but with new data -- the b and c clusters instead of the a and d clusters.
x2 = [xb; xc];
y2 = [yb; yc];
xy2 = [x2, y2];
[SubjectMembershipIndex, newCentroids2] = kmeans(xy2, [], 'Distance','cityblock', 'Start', PreviousCentroids);
fprintf('Using different data (could be very different depending on the new data):\n')
newCentroids2
% Do a scatter plot with the original class numbers assigned by kmeans.
subplot(1, 2, 2);
gscatter(x, y, assignedClass);
hold on;
gscatter(x2, y2, SubjectMembershipIndex);
legend('FontSize', fontSize, 'Location', 'northwest');
grid on;
xlabel('x', 'fontSize', fontSize);
ylabel('y', 'fontSize', fontSize);
title('Class Numbers Assigned by kmeans()', 'fontSize', fontSize);
% Plot the class number labels on top of the cluster.
hold on;
for row = 1 : size(newCentroids2, 1)
text(newCentroids2(row, 1), newCentroids2(row, 2), num2str(row), 'FontSize', 25, 'FontWeight', 'bold', 'HorizontalAlignment', 'center', 'VerticalAlignment', 'middle');
end
hold off;
Things to note here: the cluster centroids the second time are very different from the first time, and are centered where the inner 2 clusters are, because that's the data I told kmeans to determine classes for. Note that for these very widely separated clusters the class labels came out the same (I ran it dozens of times to check). HOWEVER, for mixed/overlapping clusters I believe there might be some points in the "overlap" region that get assigned to class #1 during one run but to class #2 during the next. That can happen even if you don't re-use centroids: points in the overlap region may be assigned different class (cluster) numbers due to the randomness inherent in the algorithm.
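If that label swapping is a problem, one option (a sketch of my own, not something from the demo above) is to relabel a later run so its cluster numbers match an earlier one, by pairing each new centroid with its nearest old centroid. The variable values below are made up for illustration:

```matlab
% Sketch: make cluster numbers from a second kmeans run consistent with a
% first run by matching centroids. All values here are illustrative.
oldCentroids = [0.3, 0.3; 0.9, 0.9];    % centroids from the first run
newCentroids = [0.88, 0.91; 0.31, 0.29]; % second run; labels happen to be swapped
assignedClass = [1; 1; 2; 2];            % labels kmeans returned in the second run
% For each new centroid, find the index of the nearest old centroid.
[~, labelMap] = min(pdist2(newCentroids, oldCentroids), [], 2);
% Relabel the second run's assignments. (If clusters moved a lot, two new
% centroids could map to the same old one, so sanity-check labelMap.)
relabeled = labelMap(assignedClass);     % now consistent with the first run
```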
(Hope this wasn't too confusing - reread it several times if it is.)
5 Comments
Hans van der Horn
on 8 Dec 2023
Walter Roberson
on 8 Dec 2023
When you run kmeans, even with initial centroids, it will spend some time trying to optimize the cluster membership. It doesn't take much for points to wander between clusters, or for clusters to move significantly.
If you do not want the iteration to find new centroids, then do not use kmeans; just evaluate distances using pdist2.
Hans van der Horn
on 8 Dec 2023
Image Analyst
on 8 Dec 2023
Yep, I agree with your last paragraph and with Walter. If you just want to know which centroid is closest, then just compute the distance of your new points from those centroids rather than doing kmeans again. Like he said, you can use pdist2. For each point, whichever centroid has the smaller distance is the class that point should be assigned to.
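As a quick sketch of the pdist2 approach Walter and I are describing (variable names and data here are illustrative, not from Hans's actual analysis):

```matlab
% Assign new points to EXISTING centroids without re-running kmeans.
% PreviousCentroids is k-by-2, newData is N-by-2 (both made up here).
PreviousCentroids = [0.3, 0.3; 0.9, 0.9];
newData = rand(10, 2);
% City-block distance from every point to every centroid (N-by-k matrix),
% matching the 'cityblock' distance used in the original kmeans call.
D = pdist2(newData, PreviousCentroids, 'cityblock');
% For each point, the column index of the smallest distance is its class.
[~, SubjectMembershipIndex] = min(D, [], 2);
```

This keeps the centroids fixed, so the membership indexes are fully determined by the previous analysis.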
Hans van der Horn
on 9 Dec 2023