Investigating Outlier Detection Techniques Based on Kernel Rough Clustering

Wang Meng; Cao Wenhang; Dui Hongyan

doi:10.2174/2666255816666230912153541

Abstract

Background: Data quality is crucial to the success of big data analytics. Outliers, however, degrade both data quality and the analyses built on it. Effective outlier detection techniques that eliminate dirty data can improve data quality and yield more accurate analytical insights. Data uncertainty remains a significant challenge for outlier detection methods, which therefore warrant further refinement in the era of big data.

Objective: Unsupervised outlier detection that integrates clustering with an outlier scoring scheme is a current research hotspot. However, hard clustering fails when abnormal patterns behave in uncertain and unexpected ways. Rough boundaries help identify more accurate cluster structures. Therefore, this article extends clustering with uncertainty-aware soft clustering based on rough set theory and designs appropriate scoring schemes to capture abnormal instances. The aim is to address outlier detection in uncertain, nonlinear, and complex data.
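To illustrate what a rough boundary means in clustering, the following is a minimal sketch of the assignment step used in rough k-means-style algorithms. It is not the authors' implementation: the function name `rough_assign`, the ratio threshold `epsilon`, and the use of Euclidean distance are assumptions made for illustration. A point whose nearest centroid is clearly closer than all others goes to that cluster's lower approximation; a point that is nearly equidistant from several centroids is placed in the boundary region of each of them.

```python
import math


def rough_assign(points, centroids, epsilon=1.2):
    """Split point indices into lower approximations and boundary regions.

    `epsilon` is a hypothetical ratio threshold: another centroid counts as
    "nearly as close" when its distance is at most epsilon times the
    distance to the nearest centroid.
    """
    lower = [[] for _ in centroids]
    boundary = [[] for _ in centroids]
    for idx, p in enumerate(points):
        dists = [math.dist(p, c) for c in centroids]
        nearest = min(range(len(centroids)), key=dists.__getitem__)
        d_min = dists[nearest]
        # Other clusters whose centroid is almost as close as the nearest one.
        close = [j for j, d in enumerate(dists)
                 if j != nearest and d <= epsilon * d_min]
        if close:
            # Uncertain membership: record the point in every candidate boundary.
            for j in [nearest] + close:
                boundary[j].append(idx)
        else:
            # Certain membership: the point belongs to exactly one cluster.
            lower[nearest].append(idx)
    return lower, boundary


points = [(0.0, 0.0), (10.0, 0.0), (5.0, 0.0)]
centroids = [(0.0, 0.0), (10.0, 0.0)]
lower, boundary = rough_assign(points, centroids)
```

In this toy input the midpoint `(5.0, 0.0)` is equidistant from both centroids, so it lands in both boundary regions rather than being forced into one cluster as hard clustering would do.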

Methods: This paper presents the workflow of an outlier detection algorithm based on Kernel Rough Clustering and compares its detection accuracy with that of five popular existing methods on synthetic and real-world datasets. The results show that the proposed method has higher detection accuracy.
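The abstract does not specify the kernel or the scoring scheme, so the following is only a sketch of the standard kernel trick that kernel-based clustering generally relies on: the squared distance between a point and a cluster mean in feature space can be computed from kernel evaluations alone. The Gaussian (RBF) kernel, the bandwidth `sigma`, and both function names are assumptions for illustration. Such a distance could serve as one ingredient of an outlier score, since points far from every cluster mean are candidates for abnormal instances.

```python
import math


def rbf_kernel(x, y, sigma=1.0):
    """Gaussian kernel; the choice of kernel and sigma is an assumption."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / (2.0 * sigma ** 2))


def kernel_sq_distance(x, members, kernel=rbf_kernel):
    """Squared feature-space distance from x to the mean of `members`.

    Expands ||phi(x) - mean(phi(m))||^2 into kernel evaluations:
    K(x, x) - (2/n) * sum_m K(x, m) + (1/n^2) * sum_{m, m'} K(m, m').
    """
    n = len(members)
    k_xx = kernel(x, x)
    k_xm = sum(kernel(x, m) for m in members) / n
    k_mm = sum(kernel(a, b) for a in members for b in members) / (n * n)
    return k_xx - 2.0 * k_xm + k_mm


cluster = [(0.0, 0.0), (0.2, 0.0), (0.0, 0.2)]
near = kernel_sq_distance((0.1, 0.1), cluster)
far = kernel_sq_distance((3.0, 3.0), cluster)
```

Here `far` is larger than `near`, reflecting that the point `(3.0, 3.0)` lies well outside the small cluster around the origin.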

Results: The proposed method improved detection precision and recall. Its detection accuracy exceeded that of the popular methods compared, indicating that it is effective at identifying outliers.
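For reference, precision and recall for an outlier detector are conventionally computed from the set of instances flagged as outliers and the set of true outliers. The sketch below uses the standard definitions; the function name and the toy index sets are hypothetical and do not come from the paper's experiments.

```python
def precision_recall(predicted, actual):
    """Return (precision, recall) for flagged vs. true outlier indices."""
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    # Precision: share of flagged instances that are true outliers.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: share of true outliers that were flagged.
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


# Toy example: four instances flagged, three true outliers, two in common.
precision, recall = precision_recall([1, 2, 3, 4], [2, 4, 5])
```

With two of four flagged instances correct and two of three true outliers found, precision is 0.5 and recall is 2/3.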

Conclusion: Compared with popular methods, the proposed method has a slight advantage in detection accuracy and is an effective option for outlier detection.



DOI: https://dx.doi.org/10.2174/2666255816666230912153541
Print ISSN: 2666-2558
Online ISSN: 2666-2566
Publisher Name: Bentham Science Publisher
Journal: Recent Advances in Computer Science and Communications
