Customer Segmentation: Customers can be grouped based on different factors like demographics, regions, and preferences. Grouping helps businesses tailor and strategize marketing efforts that improve sales and retention.
Inventory Management: Efficient inventory management keeps the supply chain streamlined. Clustering can identify seasonal purchase trends and items that are in high demand, which helps manage stock levels better.
This section provides summary and statistical information that helps evaluate the quality of the clusters and how well the data points are grouped. The Clusters info option is enabled once clustering is applied.
This section provides the following details:
Note
This section provides the details specific to each cluster, such as the number of data points in the cluster and, for each factor, the average (centroid) of a numerical factor or the mode of a categorical factor.
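The per-cluster details described above can be reproduced with a short sketch. This is only an illustration, not the product's implementation: the function name `summarize_clusters`, the column names, and the sample rows are all hypothetical.

```python
from collections import Counter, defaultdict
from statistics import mean


def summarize_clusters(rows, labels, numeric_cols, categorical_cols):
    """Return {cluster_label: {"count": n, column: mean or mode, ...}}."""
    # Group the rows by the cluster label assigned to each of them.
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)

    summary = {}
    for label, members in grouped.items():
        info = {"count": len(members)}
        # Numerical factors are summarized by their average (centroid).
        for col in numeric_cols:
            info[col] = mean(r[col] for r in members)
        # Categorical factors are summarized by their most frequent value (mode).
        for col in categorical_cols:
            info[col] = Counter(r[col] for r in members).most_common(1)[0][0]
        summary[label] = info
    return summary


# Hypothetical sample data: three customers already assigned to two clusters.
rows = [
    {"age": 25, "region": "East"},
    {"age": 35, "region": "East"},
    {"age": 60, "region": "West"},
]
labels = [0, 0, 1]
summary = summarize_clusters(rows, labels, ["age"], ["region"])
print(summary)
```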
Analysis of variance (ANOVA) is calculated only for the K-means algorithm. It evaluates whether the centroids (means) of the clusters differ significantly from each other for each factor used for clustering. ANOVA is a statistical significance test: it checks whether the null hypothesis, that all cluster means are equal, can be rejected.
Within the Sum of Squares - It measures how much the individual data points within each cluster differ from the mean of that cluster. Dividing it by the within-clusters degrees of freedom gives the Mean Square Within the clusters (MSW).
Between the Sum of Squares - It measures how much the mean of each cluster differs from the overall mean. Dividing it by the between-clusters degrees of freedom gives the Mean Square Between the clusters (MSB).
F-Statistic Value
The F-Statistic is the ratio of the Mean Square Between (MSB) the clusters to the Mean Square Within (MSW) the clusters. If the F-Statistic is greater than the critical value, the cluster means differ significantly for that factor, which indicates that the data points are well clustered.
P-Value
The P-Value helps decide whether the differences between clusters are likely to have occurred by chance or are statistically significant. The smaller the P-Value, the stronger the evidence that the cluster means really differ.
| Factors | F-Statistic | Between the Sum of Squares | Degrees of Freedom (between clusters) | Within the Sum of Squares | Degrees of Freedom (within clusters) |
| --- | --- | --- | --- | --- | --- |
| Columns used for clustering | MSB/MSW, where MSB is the Mean Square between clusters and MSW is the Mean Square within clusters. | Measures the difference between the cluster means and the overall mean. A large value indicates that the data points are well clustered with little overlap. | k-1, where k is the number of clusters. The between-clusters degrees of freedom is calculated based on the number of clusters (groups) being compared. | Measures the difference between the data points and the mean of their own cluster. | N-k, where N is the total number of data points and k is the number of clusters. The within-clusters degrees of freedom is calculated based on the number of observations within each cluster and the number of clusters. |
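As a worked illustration of the quantities in the table, the sketch below computes them for a single factor. It is a minimal example under stated assumptions, not the product's implementation: the function name `anova_for_factor` and the sample values are hypothetical, and the code assumes at least two clusters, more data points than clusters, and some variation within the clusters (otherwise MSW is zero).

```python
def anova_for_factor(values, labels):
    """One-way ANOVA quantities for one factor, given each point's cluster label."""
    n = len(values)
    grand_mean = sum(values) / n

    # Collect the factor values belonging to each cluster.
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    k = len(groups)

    # Between sum of squares: cluster means versus the overall mean,
    # weighted by the number of points in each cluster.
    ss_between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values()
    )
    # Within sum of squares: each point versus the mean of its own cluster.
    ss_within = sum(
        (v - sum(g) / len(g)) ** 2 for g in groups.values() for v in g
    )

    df_between = k - 1
    df_within = n - k
    msb = ss_between / df_between
    msw = ss_within / df_within
    return {
        "ss_between": ss_between,
        "df_between": df_between,
        "ss_within": ss_within,
        "df_within": df_within,
        "f_statistic": msb / msw,
    }


# Hypothetical factor values for six data points in two clusters.
values = [1, 2, 3, 7, 8, 9]
labels = [0, 0, 0, 1, 1, 1]
result = anova_for_factor(values, labels)
print(result)
```

For this sample the cluster means are 2 and 8 and the overall mean is 5, so the between sum of squares is 54, the within sum of squares is 4, and the F-Statistic is 54 / 1 = 54. Obtaining the P-Value additionally requires the F-distribution with (k-1, N-k) degrees of freedom; if SciPy is available, `scipy.stats.f.sf(f_statistic, df_between, df_within)` is one way to compute it.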
The method used for clustering depends mainly on the data type of the columns by which the data points are grouped.
K-means
K-means is an ML algorithm for partitioning a dataset into a predefined number (k) of clusters. Each data point is assigned to the cluster whose centroid is nearest to it. The goal of this algorithm is to minimize the sum of the squared distances between the points and the centroids of their clusters. This method is best suited for grouping data points based on numerical factors. Refer to the section to learn about the working of K-means.
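The assign-and-update loop behind K-means can be sketched as follows. This is a minimal illustration of the standard iterative procedure (often called Lloyd's algorithm), assuming random initial centroids picked from the data; it is not necessarily how the product implements K-means, and the function names and sample points are hypothetical.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two numeric points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, iterations=100, seed=0):
    """Return (labels, centroids) after alternating assignment and update steps."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumption: start from k random data points
    labels = None
    for _ in range(iterations):
        # Assignment step: each point joins the cluster with the nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its assigned points.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels, centroids


# Hypothetical sample: two well-separated groups of 2-D points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, centroids = k_means(points, k=2)
print(labels, centroids)
```

Which cluster receives label 0 depends on the random starting centroids, so only the grouping itself is meaningful, not the label numbers.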
K-modes
K-modes aims to partition a dataset into K clusters, where each cluster contains similar data points. The centroid of each cluster is represented by the most frequent value (mode) for each categorical attribute within the cluster. The algorithm aims to minimize the total dissimilarity between data points and their respective centroids.
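The two building blocks of K-modes, the dissimilarity measure and the mode-based centroid, can be sketched as below. The sketch assumes simple matching dissimilarity (the count of attributes on which two records differ), which is a common choice for K-modes; the function names and sample records are hypothetical.

```python
from collections import Counter


def matching_dissimilarity(a, b):
    """Number of categorical attributes on which two records differ."""
    return sum(x != y for x, y in zip(a, b))


def mode_centroid(members):
    """Centroid of a cluster: the most frequent value of each attribute."""
    return tuple(Counter(column).most_common(1)[0][0] for column in zip(*members))


def assign(records, centroids):
    """Assign each record to the centroid with the lowest dissimilarity."""
    return [
        min(range(len(centroids)), key=lambda c: matching_dissimilarity(r, centroids[c]))
        for r in records
    ]


# Hypothetical records with two categorical attributes: color and size.
records = [("red", "S"), ("red", "M"), ("blue", "L"), ("blue", "L")]
centroids = [("red", "S"), ("blue", "L")]

labels = assign(records, centroids)
second_cluster = [r for r, label in zip(records, labels) if label == 1]
updated_centroid = mode_centroid(second_cluster)
print(labels, updated_centroid)
```

A full K-modes run repeats these two steps, assignment and centroid update, until the assignments stop changing, in the same way K-means does with means.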
K-prototype
K-prototype is used to partition data that contains both numerical and categorical attributes. It assigns each data point to a cluster using a combination of Euclidean distance for numerical features and a matching dissimilarity measure for categorical features. For example, segmenting customers based on their purchase pattern and age.
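The combined measure can be sketched as follows. This illustration assumes a common formulation in which the squared Euclidean distance over the numerical attributes is added to the number of mismatched categorical attributes multiplied by a weight. The weight `gamma`, the function name, and the sample customer are hypothetical and not taken from the product.

```python
def mixed_dissimilarity(record, prototype, numeric_idx, categorical_idx, gamma=1.0):
    """Numeric squared distance plus gamma times the count of categorical mismatches."""
    numeric_part = sum((record[i] - prototype[i]) ** 2 for i in numeric_idx)
    categorical_part = sum(record[i] != prototype[i] for i in categorical_idx)
    return numeric_part + gamma * categorical_part


# Hypothetical records: (age, purchase pattern).
customer = (33, "monthly")
prototype_a = (30, "weekly")
prototype_b = (55, "monthly")

# With a small gamma, age dominates and the customer is closer to prototype_a.
d_a = mixed_dissimilarity(customer, prototype_a, [0], [1], gamma=1.0)
d_b = mixed_dissimilarity(customer, prototype_b, [0], [1], gamma=1.0)

# With a large gamma, the purchase pattern dominates and prototype_b wins.
d_a_heavy = mixed_dissimilarity(customer, prototype_a, [0], [1], gamma=500.0)
d_b_heavy = mixed_dissimilarity(customer, prototype_b, [0], [1], gamma=500.0)
print(d_a, d_b, d_a_heavy, d_b_heavy)
```

The example shows why the balance between the two parts matters: the same customer lands in a different cluster depending on how heavily categorical mismatches are weighted. In practice, numerical attributes are usually scaled first so that no single column dominates the distance.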