Choose Cluster Analysis Method

This topic provides a brief overview of the available clustering methods in Statistics and Machine Learning Toolbox™.

Clustering Methods

Cluster analysis, also called segmentation analysis or taxonomy analysis, is a common unsupervised learning method. Unsupervised learning is used to draw inferences from data sets consisting of input data without labeled responses. For example, you can use cluster analysis for exploratory data analysis to find hidden patterns or groupings in unlabeled data.

Cluster analysis creates groups, or clusters, of data. Objects that belong to the same cluster are similar to one another and distinct from objects that belong to different clusters. To quantify "similar" and "distinct," you can use a dissimilarity measure (or distance metric) that is specific to the domain of your application and your data set. Also, depending on your application, you might consider scaling (or standardizing) the variables in your data to give them equal importance during clustering.
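As a minimal sketch of the scaling and distance-metric choices described above, the following example standardizes two variables with very different scales and then computes pairwise dissimilarities. The synthetic data and the variable names (X, Z, dRaw, dStd, dCity) are assumptions made for illustration only.

```matlab
rng(0);                                  % reproducible synthetic data (assumed for illustration)
X = [randn(20,1), 1000*randn(20,1)];     % second variable has a much larger scale

Z = zscore(X);                           % center each column and scale it to unit standard deviation

dRaw  = pdist(X);                        % Euclidean distances, dominated by the second variable
dStd  = pdist(Z);                        % Euclidean distances after standardizing
dCity = pdist(Z,'cityblock');            % same data, different distance metric
```

Without standardization, the large-scale variable largely determines dRaw. Whether standardizing is appropriate depends on your application.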

Statistics and Machine Learning Toolbox provides functionality for these clustering methods:

Hierarchical Clustering

Hierarchical clustering groups data over a variety of scales by creating a cluster tree, or dendrogram. The tree is not a single set of clusters, but rather a multilevel hierarchy, where clusters at one level combine to form clusters at the next level. This multilevel hierarchy allows you to choose the level, or scale, of clustering that is most appropriate for your application. Hierarchical clustering assigns every point in your data to a cluster.

Use clusterdata to perform hierarchical clustering on input data. clusterdata incorporates the pdist, linkage, and cluster functions, which you can use separately for more detailed analysis. The dendrogram function plots the cluster tree. For more information, see Introduction to Hierarchical Clustering.
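The following sketch shows the step-by-step workflow next to the one-call workflow. The synthetic data, the 'average' linkage, and the choice of two clusters are assumptions made for illustration, not recommendations.

```matlab
rng(1);                                  % reproducible synthetic data (assumed for illustration)
X = [randn(30,2); randn(30,2) + 6];      % two well-separated groups

% Step by step: pairwise distances, cluster tree, then a cut of the tree
D    = pdist(X);                         % Euclidean distances between all pairs of rows
tree = linkage(D,'average');             % agglomerative cluster tree
idx  = cluster(tree,'MaxClust',2);       % cut the tree into at most two clusters

dendrogram(tree);                        % plot the cluster tree

% One call: clusterdata combines pdist, linkage, and cluster
idx2 = clusterdata(X,'Linkage','average','MaxClust',2);
```

The separate functions are convenient when you want to reuse the same tree to compare several cut levels.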

k-Means and k-Medoids Clustering

k-means clustering and k-medoids clustering partition data into k mutually exclusive clusters. These clustering methods require that you specify the number of clusters k. Both k-means and k-medoids clustering assign every point in your data to a cluster; however, unlike hierarchical clustering, these methods operate on actual observations (rather than dissimilarity measures), and create a single level of clusters. Therefore, k-means or k-medoids clustering is often more suitable than hierarchical clustering for large amounts of data.

Use kmeans and kmedoids to implement k-means clustering and k-medoids clustering, respectively. For more information, see Introduction to k-Means Clustering and k-Medoids Clustering.
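As a brief illustration, this sketch partitions the same data with both functions. The synthetic data, the value k = 2, and the use of the 'Replicates' option are assumptions made for the example.

```matlab
rng(2);                                  % reproducible synthetic data (assumed for illustration)
X = [randn(40,2); randn(40,2) + 6];      % two well-separated groups
k = 2;                                   % number of clusters, which you must specify

[idxMeans, C]   = kmeans(X,k,'Replicates',5);   % C contains the k centroid locations
[idxMedoids, M] = kmedoids(X,k);                % M contains the k medoid locations
```

Each row of M is one of the observations in X, whereas the rows of C are cluster means and generally do not coincide with any observation.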

Density-Based Spatial Clustering of Applications with Noise (DBSCAN)

DBSCAN is a density-based algorithm that identifies arbitrarily shaped clusters and outliers (noise) in data. During clustering, DBSCAN identifies points that do not belong to any cluster, which makes this method useful for density-based outlier detection. Unlike k-means and k-medoids clustering, DBSCAN does not require prior knowledge of the number of clusters.

Use dbscan to perform clustering on an input data matrix or on pairwise distances between observations. For more information, see Introduction to DBSCAN.
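The following sketch runs dbscan on a data matrix and on precomputed pairwise distances. The synthetic data and the values of epsilon and minpts are assumptions chosen for illustration; suitable values depend on your data.

```matlab
rng(3);                                              % reproducible synthetic data (assumed for illustration)
X = [0.3*randn(50,2); 0.3*randn(50,2) + 4; 10 -10];  % two dense groups plus one isolated point

epsilon = 1;                             % neighborhood radius (assumed value)
minpts  = 5;                             % minimum neighbors for a core point (assumed value)

idx = dbscan(X,epsilon,minpts);          % points labeled -1 are noise (outliers)

% Same clustering from precomputed pairwise distances
D    = pdist2(X,X);
idxD = dbscan(D,epsilon,minpts,'Distance','precomputed');

isNoise = (idx == -1);                   % logical mask of detected outliers
```

Note that the number of clusters is not an input; it follows from epsilon and minpts.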

Gaussian Mixture Model

A Gaussian mixture model (GMM) forms clusters as a mixture of multivariate normal density components. For a given observation, the GMM assigns a posterior probability to each component density (or cluster), which indicates how likely the observation is to belong to that cluster. A GMM can perform hard clustering by assigning the observation to the component that maximizes the posterior probability. You can also use a GMM to perform soft, or fuzzy, clustering by assigning the observation to multiple clusters based on the scores or posterior probabilities of the observation for the clusters. A GMM can be a more appropriate method than k-means clustering when clusters have different sizes and different correlation structures within them.

Use fitgmdist to fit a gmdistribution object to your data. You can also use gmdistribution to create a GMM object by specifying the distribution parameters. When you have a fitted GMM, you can cluster query data by using the cluster function. For more information, see Cluster Using Gaussian Mixture Model.
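The sketch below illustrates hard and soft clustering with a fitted GMM, and creates a second GMM directly from parameters. The synthetic data, the number of components, and all parameter values are assumptions made for illustration.

```matlab
rng(4);                                  % reproducible synthetic data (assumed for illustration)
X = [randn(100,2); ...
     randn(100,2)*[2 0.8; 0 0.5] + [6 6]];   % second group has a different covariance structure

gm = fitgmdist(X,2,'Replicates',5);      % fit a two-component GMM

idxHard = cluster(gm,X);                 % hard clustering: component with the largest posterior
P       = posterior(gm,X);               % soft clustering: one posterior probability per component

% Create a GMM object by specifying the distribution parameters (assumed values)
mu     = [0 0; 6 6];
sigma  = cat(3, eye(2), eye(2));
gmSpec = gmdistribution(mu,sigma,[0.5 0.5]);
idxSpec = cluster(gmSpec,X);
```

Each row of P sums to 1, and idxHard is the column of P with the largest value in each row.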

k-Nearest Neighbor Search and Radius Search

k-nearest neighbor search finds the k closest points in your data to a query point or set of query points. In contrast, radius search finds all points in your data that are within a specified distance from a query point or set of query points. The results of these methods depend on the distance metric that you specify.

Use the knnsearch function to find the k-nearest neighbors of each query point, or the rangesearch function to find all neighbors within a specified distance of each query point. You can also create a searcher object using a training data set, and pass the object and query data sets to the object functions (knnsearch and rangesearch). For more information, see Classification Using Nearest Neighbors.
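As a short sketch, the following example runs both searches directly and through a searcher object. The synthetic data, the query points, the value k = 3, and the radius 0.1 are assumptions made for illustration.

```matlab
rng(5);                                  % reproducible synthetic data (assumed for illustration)
X = rand(100,2);                         % data to search
Y = [0.5 0.5; 0.1 0.9];                  % two query points

[nnIdx, nnDist] = knnsearch(X,Y,'K',3);  % indices and distances of the 3 nearest neighbors
[rIdx, rDist]   = rangesearch(X,Y,0.1);  % all neighbors within distance 0.1 (cell arrays)

% The result depends on the distance metric
nnCity = knnsearch(X,Y,'K',3,'Distance','cityblock');

% Searcher object built once from the training data, then reused
searcher = KDTreeSearcher(X);
nnIdx2   = knnsearch(searcher,Y,'K',3);
```

rangesearch returns cell arrays because the number of neighbors within the radius can differ for each query point.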

Spectral Clustering

Spectral clustering is a graph-based algorithm for finding k arbitrarily shaped clusters in data. The technique involves representing the data in a low-dimensional space. In this space, clusters in the data are more widely separated, enabling you to use algorithms such as k-means or k-medoids clustering. The low-dimensional representation is based on eigenvectors of a Laplacian matrix. A Laplacian matrix is one way of representing a similarity graph that models the local neighborhood relationships between data points as an undirected graph.

Use spectralcluster to perform spectral clustering on an input data matrix or on a similarity matrix of a similarity graph. spectralcluster requires that you specify the number of clusters. However, the algorithm for spectral clustering also provides a way to estimate the number of clusters in your data. For more information, see Partition Data Using Spectral Clustering.
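The following sketch applies spectralcluster to two concentric rings, a shape that centroid-based methods typically do not separate. The synthetic data and the choice k = 2 are assumptions made for illustration, and the example assumes a release that includes spectralcluster.

```matlab
rng(6);                                  % reproducible synthetic data (assumed for illustration)
theta = linspace(0,2*pi,200)';
inner = [cos(theta) sin(theta)]   + 0.05*randn(200,2);   % ring of radius 1
outer = 5*[cos(theta) sin(theta)] + 0.05*randn(200,2);   % ring of radius 5
X = [inner; outer];

k = 2;                                   % number of clusters, which you must specify
[idx, V, D] = spectralcluster(X,k);      % V: eigenvectors, D: eigenvalues of the Laplacian matrix

gscatter(X(:,1),X(:,2),idx);             % visualize the assignments
```

The eigenvalues in D are one input you can inspect when estimating the number of clusters, for example by requesting more eigenvalues with a larger k and counting those near zero.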

Comparison of Clustering Methods

This table compares the features of available clustering methods in Statistics and Machine Learning Toolbox.

Method	Basis of Algorithm	Input to Algorithm	Requires Specified Number of Clusters	Cluster Shapes Identified	Useful for Outlier Detection
Hierarchical Clustering	Distance between objects	Pairwise distances between observations	No	Arbitrarily shaped clusters, depending on the specified `'Linkage'` algorithm	No
k-Means Clustering and k-Medoids Clustering	Distance between objects and centroids	Actual observations	Yes	Spheroidal clusters with equal diagonal covariance	No
Density-Based Spatial Clustering of Applications with Noise (DBSCAN)	Density of regions in the data	Actual observations or pairwise distances between observations	No	Arbitrarily shaped clusters	Yes
Gaussian Mixture Models	Mixture of Gaussian distributions	Actual observations	Yes	Spheroidal clusters with different covariance structures	Yes
Nearest Neighbors	Distance between objects	Actual observations	No	Arbitrarily shaped clusters	Yes, depending on the specified number of neighbors
Spectral Clustering	Graph representing connections between data points	Actual observations or similarity matrix	Yes, but the algorithm also provides a way to estimate the number of clusters	Arbitrarily shaped clusters	No