Performance Evaluation of Threshold-Based and k-means Clustering Algorithms Using Iris Dataset

Mamta Mittal; Rajendra Kumar Sharma; Varinder Pal Singh

doi:10.2174/1872212112666180510153006

Abstract

Background: Clustering is a data mining technique that groups raw data into meaningful, disjoint clusters. Researchers have developed many algorithms to cluster large data sets based on specific parameters.

Objective: This study focuses on the popular partitioning-based technique k-means. k-means requires the number of clusters as an input parameter, does not guarantee a globally optimal solution, and is sensitive to outliers and to the selection of initial seeds.
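The seed sensitivity described above can be seen on a toy example. The following is a minimal sketch of the standard Lloyd-style k-means iteration, not the implementation evaluated in the paper; the four sample points, the function names, and both seed choices are assumptions made purely for illustration.

```python
# Minimal k-means sketch (standard library only). All names and data here
# are illustrative assumptions, not taken from the paper.

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))

def assign(points, centers):
    """Index of the nearest center for every point (ties go to the lower index)."""
    return [min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
            for p in points]

def kmeans(points, seeds, max_iter=100):
    """Run k-means from the given seeds; k is implied by len(seeds)."""
    centers = [tuple(s) for s in seeds]
    for _ in range(max_iter):
        labels = assign(points, centers)
        new_centers = []
        for j in range(len(centers)):
            members = [p for p, l in zip(points, labels) if l == j]
            # Keep the old center if a cluster ends up empty.
            new_centers.append(mean(members) if members else centers[j])
        if new_centers == centers:
            break
        centers = new_centers
    labels = assign(points, centers)
    sse = sum(sq_dist(p, centers[l]) for p, l in zip(points, labels))
    return labels, centers, sse

# Four corners of a 4 x 1 rectangle: the natural split is left vs. right.
points = [(0, 0), (0, 1), (4, 0), (4, 1)]

labels_good, _, sse_good = kmeans(points, seeds=[(0, 0), (4, 0)])
labels_bad, _, sse_bad = kmeans(points, seeds=[(0, 0), (0, 1)])

print(labels_good, sse_good)  # [0, 0, 1, 1] 1.0
print(labels_bad, sse_bad)    # [0, 1, 0, 1] 16.0
```

Both runs converge, but the second one stops at a local optimum whose sum of squared errors is sixteen times larger, which is the behaviour the objective refers to.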

Methods: This paper discusses a threshold-based clustering method, the single pass method, which overcomes the above limitations but requires a threshold value as an input parameter. Related work on k-means published in patent form is also noted as a useful guide for researchers.
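As a rough illustration of the single pass idea, the sketch below uses a common textbook formulation: each point joins the nearest existing cluster if its distance to that cluster's centroid is within the threshold, and otherwise starts a new cluster. The exact variant used in the paper is not given in the abstract, so the update rule, the function name `single_pass`, and the sample data are all assumptions.

```python
import math

def single_pass(points, threshold):
    """One pass over the data; the number of clusters is not specified in advance.

    Assumed rule (not confirmed by the paper): a point joins the nearest
    centroid if it lies within `threshold`, otherwise it opens a new cluster.
    """
    centroids, sizes, labels = [], [], []
    for p in points:
        best, best_d = None, math.inf
        for j, c in enumerate(centroids):
            d = math.dist(p, c)
            if d < best_d:
                best, best_d = j, d
        if best is not None and best_d <= threshold:
            # Incremental centroid update: c <- c + (x - c) / n
            n = sizes[best] + 1
            centroids[best] = tuple(c + (x - c) / n
                                    for c, x in zip(centroids[best], p))
            sizes[best] = n
            labels.append(best)
        else:
            centroids.append(tuple(float(x) for x in p))
            sizes.append(1)
            labels.append(len(centroids) - 1)
    return labels, centroids

points = [(0, 0), (0, 1), (4, 0), (4, 1), (10, 10)]

labels_tight, centroids_tight = single_pass(points, threshold=2.0)
labels_wide, centroids_wide = single_pass(points, threshold=20.0)

print(labels_tight)  # [0, 0, 1, 1, 2]
print(labels_wide)   # [0, 0, 0, 0, 0]
```

With a threshold of 2.0 the distant point (10, 10) is isolated in its own cluster instead of distorting another centroid, while a threshold of 20.0 merges everything, which shows why the choice of threshold matters.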

Results: To assess clustering quality, several validity measures and indices were computed on the Iris dataset for both the k-means and the threshold-based clustering algorithms. In the experiments, the threshold-based method generated more compact and better separated clusters and showed a significant improvement in the validity indices.
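The abstract does not list which validity indices were computed. As an assumption, the sketch below implements two widely used ones, the Dunn index (higher is better) and the Davies-Bouldin index (lower is better), to show how compactness and separation can be scored. The function names and toy data are illustrative only, and the code expects at least two clusters.

```python
import math
from collections import defaultdict

def group(points, labels):
    """Collect points into a list of clusters according to their labels."""
    clusters = defaultdict(list)
    for p, l in zip(points, labels):
        clusters[l].append(p)
    return list(clusters.values())

def centroid(members):
    return tuple(sum(c) / len(members) for c in zip(*members))

def dunn_index(points, labels):
    """Smallest between-cluster distance divided by the largest cluster diameter."""
    clusters = group(points, labels)
    min_sep = min(math.dist(p, q)
                  for i, a in enumerate(clusters)
                  for b in clusters[i + 1:]
                  for p in a for q in b)
    max_diam = max(math.dist(p, q) for c in clusters for p in c for q in c)
    # All-singleton clusterings have zero diameter; treat them as unbounded.
    return math.inf if max_diam == 0 else min_sep / max_diam

def davies_bouldin_index(points, labels):
    """Mean over clusters of the worst (scatter_i + scatter_j) / centroid distance."""
    clusters = group(points, labels)
    cents = [centroid(c) for c in clusters]
    scatter = [sum(math.dist(p, m) for p in c) / len(c)
               for c, m in zip(clusters, cents)]
    k = len(clusters)
    worst = [max((scatter[i] + scatter[j]) / math.dist(cents[i], cents[j])
                 for j in range(k) if j != i)
             for i in range(k)]
    return sum(worst) / k

points = [(0, 0), (0, 1), (4, 0), (4, 1)]
compact = [0, 0, 1, 1]    # left pair vs. right pair
stretched = [0, 1, 0, 1]  # bottom pair vs. top pair

dunn_compact = dunn_index(points, compact)
dunn_stretched = dunn_index(points, stretched)
db_compact = davies_bouldin_index(points, compact)
db_stretched = davies_bouldin_index(points, stretched)

print(dunn_compact, dunn_stretched)  # 4.0 0.25
print(db_compact, db_stretched)      # 0.25 4.0
```

Both indices rank the compact, well separated partition ahead of the stretched one, which is the kind of comparison the results summarize for the Iris dataset.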

Conclusion: Threshold-based clustering generates clusters automatically, is not sensitive to initial seed selection or outliers, and is more scalable. It is an efficient approach to partitioning-based clustering whenever the threshold value is selected carefully or new functions are proposed for deciding its value.

Keywords: Clustering, k-means, threshold-based clustering, validity indices, validity measures, partitioning-based technique.



DOI: https://dx.doi.org/10.2174/1872212112666180510153006
Print ISSN: 1872-2121
Online ISSN: 2212-4047
Publisher: Bentham Science Publisher
Journal: Recent Patents on Engineering
