Abstract
Background: Clustering is a data mining tool that partitions raw data into disjoint clusters. Researchers have developed many algorithms to cluster large data sets based on specific parameters.
Objective: This study centers on k-means, a popular partitioning-based technique. k-means requires the number of clusters as an input parameter, does not guarantee a globally optimal solution, and is sensitive to outliers and to the selection of initial seeds.
Methods: This paper discusses a threshold-based, single-pass clustering method that overcomes the above limitations but requires a threshold value as an input parameter. Related work on k-means published in patent form is also noteworthy and paves the way for further research.
Results: To assess clustering quality, numerous validity measures and indices were computed on the Iris dataset for both the k-means and the threshold-based clustering algorithms. The experiments show that the threshold-based method generates more separated and more compact clusters and yields a significant improvement in the validity indices.
Conclusion: Threshold-based clustering generates clusters automatically, is not sensitive to initial seed selection or outliers, and is more scalable. It will be an efficient partitioning-based clustering approach whenever the threshold value is selected carefully or new functions are proposed for determining it.
Keywords: Clustering, k-means, threshold-based clustering, validity indices, validity measures, partitioning-based technique.
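As an illustration of the single-pass idea named under Methods, the following is a minimal sketch of a generic threshold-based (leader-style) clustering routine. It is not taken from the paper: the function name `single_pass_cluster`, the use of Euclidean distance, and the running-mean centroid update are all assumptions made for this example, and the toy points are invented for demonstration only.

```python
import math


def single_pass_cluster(points, threshold):
    """Cluster points in one pass (hypothetical helper, not the paper's code).

    Each point joins the nearest existing cluster if its Euclidean distance
    to that cluster's centroid is at most `threshold`; otherwise it starts
    a new cluster. The number of clusters is therefore not fixed in advance.
    """
    centroids = []  # running mean of each cluster
    counts = []     # number of points assigned to each cluster
    labels = []     # cluster index for each input point

    for p in points:
        # Find the nearest existing centroid, if any.
        best, best_dist = -1, math.inf
        for i, c in enumerate(centroids):
            d = math.dist(p, c)
            if d < best_dist:
                best, best_dist = i, d

        if best >= 0 and best_dist <= threshold:
            # Assign to the nearest cluster and update its mean incrementally.
            counts[best] += 1
            n = counts[best]
            centroids[best] = [
                ci + (pi - ci) / n for ci, pi in zip(centroids[best], p)
            ]
            labels.append(best)
        else:
            # No cluster is close enough: this point seeds a new cluster.
            centroids.append(list(p))
            counts.append(1)
            labels.append(len(centroids) - 1)

    return labels, centroids


# Toy data (assumed for illustration): two tight groups and one isolated point.
points = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.0), (5.1, 4.9), (9.0, 1.0)]
labels, centroids = single_pass_cluster(points, threshold=1.0)
print(labels)  # [0, 0, 1, 1, 2]
```

In this sketch the isolated point forms its own cluster instead of distorting a neighbouring centroid, and the cluster count follows from the threshold alone, which is the trade-off the abstract describes: no preset number of clusters, but a threshold that must be chosen with care.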