Abstract
Background: Data quality is crucial to the success of big data analytics, but outliers degrade both data quality and the analyses built on it. Effective outlier detection techniques that remove dirty data can improve data quality and yield more accurate analytical insights. Data uncertainty remains a significant challenge for outlier detection methods, which warrant further refinement in the era of big data.
Objective: Unsupervised outlier detection that integrates clustering with an outlier scoring scheme is a current research hotspot. However, hard clustering fails when dealing with abnormal patterns whose behavior is uncertain and unexpected, whereas rough boundaries help identify more accurate cluster structures. This article therefore extends clustering with uncertainty-aware soft clustering based on rough set theory and designs appropriate scoring schemes to capture abnormal instances, with the aim of addressing outlier detection in uncertain and nonlinear complex data.
Methods: This paper proposes an outlier detection algorithm based on Kernel Rough Clustering and compares its detection accuracy with that of five existing popular methods on synthetic and real-world datasets.
Results: The proposed method improved detection precision and recall. Its detection accuracy was superior to that of the popular methods compared, indicating that it is effective at identifying outliers.
Conclusion: Compared with popular methods, the proposed method has a slight advantage in detection accuracy and is an effective option for outlier detection.