Recent Patents on Engineering

Editor-in-Chief

ISSN (Print): 1872-2121
ISSN (Online): 2212-4047

Review Article

A Review of Clustering Algorithms: Comparison of DBSCAN and K-mean with Oversampling and t-SNE

Author(s): Eshan Bajal*, Vipin Katara, Madhulika Bhatia and Madhurima Hooda

Volume 16, Issue 2, 2022

Published on: 08 February, 2021

Article ID: e180122191239 Pages: 15

DOI: 10.2174/1872212115666210208222231

Abstract

The two most widely used and most easily implemented algorithms for clustering and classification-based analysis of data in the unsupervised learning domain are Density-Based Spatial Clustering of Applications with Noise (DBSCAN) and K-mean cluster analysis. These two techniques handle most cases effectively when the data contains substantial randomness and offers no clear parameter set, unlike the settings suited to linear or logistic regression. However, few papers pit these two algorithms against each other in a controlled environment to observe which one prevails and under what conditions. In this paper, a renal adenocarcinoma dataset is analyzed, and both DBSCAN and K-mean are applied to it, with subsequent examination of the results. The efficacy of the two techniques is compared, and their observed merits and demerits are enumerated. Further, the interaction of t-SNE with the generated clusters is explored.

Keywords: DBSCAN, K-mean, renal cancer, oversampling, t-SNE, clustering, scatter-plot.
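The workflow the abstract describes (cluster with K-mean and DBSCAN, then inspect the clusters through a t-SNE embedding) can be sketched as follows. This is a minimal illustration on synthetic data, assuming scikit-learn; it does not use the paper's renal adenocarcinoma dataset, the parameter values (eps, min_samples, perplexity, k) are illustrative rather than the authors' choices, and the oversampling step (e.g. SMOTE, via a library such as imbalanced-learn) is omitted.

```python
# Sketch: compare K-means and DBSCAN labellings, then embed with t-SNE.
# Assumes scikit-learn; synthetic data stands in for the paper's dataset.
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 300 two-dimensional points in 3 groups.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=0)
X = StandardScaler().fit_transform(X)

# K-means requires the cluster count k up front; DBSCAN instead takes a
# density scale (eps) and a minimum neighbourhood size (min_samples), and
# marks points in sparse regions as noise with the label -1.
km_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
db_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

# t-SNE embeds the points in 2-D so either labelling can be inspected
# visually (e.g. as a coloured scatter-plot).
embedding = TSNE(n_components=2, perplexity=30.0,
                 random_state=0).fit_transform(X)

print(len(set(km_labels)))      # number of K-means clusters
print(sorted(set(db_labels)))   # DBSCAN cluster ids; -1 would mean noise
print(embedding.shape)          # (300, 2)
```

In practice the two labellings are then compared visually on the t-SNE scatter-plot and quantitatively (e.g. by a validity index), which is the kind of controlled head-to-head comparison the abstract argues is missing from the literature.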

