How to Cluster Dataset and remove outlier in MATLAB
Show older comments
Hello, I have the following dataset, In which i have four features in each column.
I want to cluster Dataset. I have go through K-means it required Number of clusters as input.
2 Comments
Constantino Carlos Reyes-Aldasoro
on 16 Jan 2023
What have you tried?
Med Future
on 17 Jan 2023
Answers (2)
Sai Pavan
on 17 Apr 2024
Hello,
I understand that you want to cluster the 4-feature dataset and remove the outliers from the dataset. This task can be carried out using the following workflow:
- Determine the optimal number of clusters: The elbow method involves plotting the within-cluster sum of squares (WCSS) against the number of clusters and looking for the "elbow" point where the rate of decrease sharply changes. This point is often considered a good choice for the number of clusters.
- Perform K-means clustering: After determining the optimal number of clusters, perform k-means clustering.
- Removing outliers: Outliers can be detected and removed based on their distance from the centroid of their assigned cluster. A common approach is to remove points that are farthest from the centroid beyond a certain threshold.
Please refer to the below code snippet that illustrates the above workflow:
data = Dataset;
wcss = [];
for k = 1:10 % Test up to 10 clusters
[idx, C, sumd] = kmeans(data, k, 'Replicates', 10);
wcss(k) = sum(sumd);
end
plot(1:10, wcss);
xlabel('Number of clusters');
ylabel('WCSS');
title('Elbow Method');
optimalK = % the optimal number of clusters you determined
[idx, C, sumd] = kmeans(data, optimalK, 'Replicates', 10);
% Calculate distances of each point to its cluster centroid
distances = zeros(size(data, 1), 1);
for i = 1:optimalK
clusterPoints = data(idx == i, :);
centroid = C(i, :);
distances(idx == i) = sqrt(sum((clusterPoints - centroid).^2, 2));
end
threshold = prctile(distances, 95); % Define a threshold for outlier removal, e.g., 95th percentile of distances
outliers = distances > threshold; % Identify outliers
% Remove outliers
dataCleaned = data(~outliers, :);
idxCleaned = idx(~outliers);
Hope it helps!
2 Comments
Med Future
on 23 Apr 2024
Edited: Walter Roberson
on 24 Apr 2024
Med Future
on 24 Apr 2024
Walter Roberson
on 24 Apr 2024
Moved: Walter Roberson
on 24 Apr 2024
load Question
dataset1=data(:,[2 4]);
dataset1 is created from 2 columns of data
% Step 1: Identify and remove outliers
freq_outliers = isoutlier(dataset1(:, 1));
pw_outliers = isoutlier(dataset1(:, 2));
outliers = freq_outliers | pw_outliers;
% Step 2: Remove rows with outliers from all columns
dataset1_no_outliers = dataset1(~outliers, :);
pdw_no_outliers = data(~outliers, :);
% Now, continue with your existing code using 'dataset1_no_outliers'
eva = evalclusters(dataset1_no_outliers, 'kmeans', 'silhouette', 'KList', [1:8]);
%eva = evalclusters(dataset1, 'kmeans', 'silhouette', 'KList', [1:8]);
K = eva.OptimalK;
[idx,C,sumdist] = kmeans(dataset1,K);
C is created from dataset1 so it has two columns
dataset=data;
dataset_idx=zeros(length(dataset),5);
dataset_idx=dataset(:,1:5);
dataset_idx(:,6)=idx;
clusters = cell(K,1);
for i = 1:K
clusters{i} = dataset_idx(dataset_idx(:,6) == i,:);
end
cluster_assignments=idx;
optimalK=K
% Calculate distances of each point to its cluster centroid
distances = zeros(size(data, 1), 1);
for i = 1:optimalK
clusterPoints = data(idx == i, :);
data has 6 columns, so clusterPoints has 6 columns
centroid = C(i, :);
centroid is created from C so it has two columns
whos clusterPoints centroid
distances(idx == i) = sqrt(sum((clusterPoints - centroid).^2, 2));
You are trying to subtract something with 2 columns from something with 6 columns, which is an error
end
threshold = prctile(distances, 95); % Define a threshold for outlier removal, e.g., 95th percentile of distances
outliers = distances > threshold; % Identify outliers
% Remove outliers
dataCleaned = data(~outliers, :);
idxCleaned = idx(~outliers);
1 Comment
Med Future
on 25 Apr 2024
Categories
Find more on Statistics and Machine Learning Toolbox in Help Center and File Exchange
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!