Calinski-Harabasz criterion clustering evaluation object
CalinskiHarabaszEvaluation is an object consisting of sample data (X), clustering data (OptimalY), and
Calinski-Harabasz criterion values (CriterionValues) used to evaluate the optimal number of clusters
(OptimalK). The Calinski-Harabasz
criterion is sometimes called the variance ratio criterion (VRC). Well-defined clusters have a
large between-cluster variance and a small within-cluster variance. The optimal number of
clusters corresponds to the solution with the highest Calinski-Harabasz index value. For more
information, see Calinski-Harabasz Criterion.
Create a Calinski-Harabasz criterion clustering evaluation object by using the
evalclusters function and specifying the criterion as "CalinskiHarabasz".
You can then use compact to create a compact version of the
Calinski-Harabasz criterion clustering evaluation object. The function removes the contents of
stored-data properties such as X and OptimalY.
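The creation and compaction steps described above can be sketched as follows. This is a minimal illustration, assuming Statistics and Machine Learning Toolbox is installed; the fisheriris sample data, the KList range, and the variable name compactEvaluation are choices made for this example only.

```matlab
% Sketch: create an evaluation object, then a compact copy of it.
load fisheriris                         % provides meas (150-by-4 measurements)
rng("default")                          % for reproducibility

evaluation = evalclusters(meas,"kmeans","CalinskiHarabasz","KList",2:5);

% The compact copy keeps the evaluation results but drops stored data.
compactEvaluation = compact(evaluation);

% Results such as OptimalK remain available on the compact object.
disp(compactEvaluation.OptimalK)
```

A compact object is mainly useful when the sample data is large and only the criterion values and the suggested number of clusters are needed afterward.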
Clustering Evaluation Properties
CriterionName — Name of criterion
This property is read-only.
Name of the criterion used for clustering evaluation, returned as 'CalinskiHarabasz'.
Sample Data Properties
Evaluate Clustering Solution Using Calinski-Harabasz Criterion
Evaluate the optimal number of clusters using the Calinski-Harabasz clustering evaluation criterion.
Load the fisheriris data set. The data contains length and width measurements from the sepals and petals of three species of iris flowers.
    load fisheriris
Evaluate the optimal number of clusters using the Calinski-Harabasz criterion. Cluster the data using kmeans.
    rng("default") % For reproducibility
    evaluation = evalclusters(meas,"kmeans","CalinskiHarabasz","KList",1:6)
    evaluation =
      CalinskiHarabaszEvaluation with properties:

        NumObservations: 150
             InspectedK: [1 2 3 4 5 6]
        CriterionValues: [NaN 513.9245 561.6278 530.4871 456.1279 469.5068]
               OptimalK: 3
The OptimalK value indicates that, based on the Calinski-Harabasz criterion, the optimal number of clusters is three.
Plot the Calinski-Harabasz criterion values for each number of clusters tested.
The plot shows that the highest Calinski-Harabasz value occurs at three clusters, suggesting that the optimal number of clusters is three.
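The plotting step can be sketched as below. This is an illustrative, self-contained version that repeats the earlier setup; it assumes the plot function for clustering evaluation objects from Statistics and Machine Learning Toolbox.

```matlab
% Sketch: plot criterion values against the number of clusters tested.
load fisheriris                         % provides meas
rng("default")                          % for reproducibility
evaluation = evalclusters(meas,"kmeans","CalinskiHarabasz","KList",1:6);

plot(evaluation)                        % criterion value versus number of clusters

% The same values can be read directly from the object.
disp([evaluation.InspectedK; evaluation.CriterionValues])
```

The value for one cluster is NaN because the index is undefined for a single cluster.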
Create a grouped scatter plot to examine the relationship between petal length and width. Group the data by suggested clusters.
    PetalLength = meas(:,3);
    PetalWidth = meas(:,4);
    clusters = evaluation.OptimalY;
    gscatter(PetalLength,PetalWidth,clusters,[],"xod");
The plot shows cluster 3 in the lower-left corner, completely separated from the other two clusters. Cluster 3 contains flowers with the smallest petal widths and lengths. Cluster 1 is in the upper-right corner, and contains flowers with the largest petal widths and lengths. Cluster 2 is near the center of the plot, and contains flowers with measurements between these two extremes.
The Calinski-Harabasz criterion is sometimes called the variance ratio criterion (VRC). The Calinski-Harabasz index is defined as
$$VRC_k = \frac{SS_B}{SS_W} \times \frac{N - k}{k - 1},$$
where $SS_B$ is the overall between-cluster variance, $SS_W$ is the overall within-cluster variance, k is the number of clusters, and N is the number of observations.
The overall between-cluster variance $SS_B$ is defined as
$$SS_B = \sum_{i=1}^{k} n_i \lVert m_i - m \rVert^2,$$
where k is the number of clusters, $n_i$ is the number of observations in cluster i, $m_i$ is the centroid of cluster i, m is the overall mean of the sample data, and $\lVert m_i - m \rVert$ is the L2 norm (Euclidean distance) between the two vectors.
The overall within-cluster variance $SS_W$ is defined as
$$SS_W = \sum_{i=1}^{k} \sum_{x \in c_i} \lVert x - m_i \rVert^2,$$
where k is the number of clusters, x is a data point, $c_i$ is the ith cluster, $m_i$ is the centroid of cluster i, and $\lVert x - m_i \rVert$ is the L2 norm (Euclidean distance) between the two vectors.
Well-defined clusters have a large between-cluster variance ($SS_B$) and a small within-cluster variance ($SS_W$). The larger the $VRC_k$ ratio, the better the data partition. To determine the optimal number of clusters, maximize $VRC_k$ with respect to k. The optimal number of clusters corresponds to the solution with the highest Calinski-Harabasz index value.
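To make the definitions concrete, the index can be computed by hand for a single partition. The following is a minimal sketch, not the toolbox's internal implementation; it assumes the fisheriris data, a k-means partition with k = 3, and illustrative variable names (members, ni, VRC).

```matlab
% Sketch: compute the Calinski-Harabasz index directly from its definition.
load fisheriris                         % provides meas (150-by-4)
rng("default")                          % for reproducibility
k = 3;
[idx,C] = kmeans(meas,k);               % idx: cluster labels, C: k-by-4 centroids

N = size(meas,1);                       % number of observations
m = mean(meas,1);                       % overall mean of the sample data

SSB = 0;                                % between-cluster variance
SSW = 0;                                % within-cluster variance
for i = 1:k
    members = meas(idx == i,:);         % observations assigned to cluster i
    ni = size(members,1);
    SSB = SSB + ni * sum((C(i,:) - m).^2);
    SSW = SSW + sum(sum((members - C(i,:)).^2));
end

VRC = (SSB/SSW) * (N - k)/(k - 1);      % variance ratio criterion for this k
disp(VRC)
```

When the partition matches the one that evalclusters finds for the same k, the result should agree with the corresponding entry of CriterionValues; k-means can converge to different partitions from different starting points, so small discrepancies between runs are possible.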
The Calinski-Harabasz criterion is best suited for k-means clustering solutions with squared Euclidean distances.
 Calinski, T., and J. Harabasz. “A dendrite method for cluster analysis.” Communications in Statistics. Vol. 3, No. 1, 1974, pp. 1–27.
Introduced in R2013b