Abstract
Introduction: Incomplete data sets with missing attribute values are a prevailing problem in many research areas. Attribute values may be missing for several reasons: human error in tabulating or recording the data, machine failure, errors in data acquisition, or the refusal of a patient or customer to answer some questions in a questionnaire or survey. Clustering such data sets is therefore a challenge.
Objective: In this paper, we present a critical review of the methodologies proposed for handling missing data in clustering. The focus is a comparison of FCM clustering based on various imputation techniques with the four clustering strategies proposed by Hathway and Bezdek.
Methods: We handle the missing values in incomplete data sets with various imputation and non-imputation techniques to obtain complete data sets, and then apply the conventional fuzzy c-means (FCM) clustering algorithm to obtain the clustering results.
Results: Experiments are carried out on various synthetic data sets and on real data sets from the UCI repository. To evaluate the performance of the imputation-based and non-imputation-based FCM clustering algorithms, several performance criteria and statistical tests are considered. The experimental results show that linear interpolation based FCM clustering performs significantly better than the other imputation and non-imputation techniques.
Conclusion: Clustering performance is data specific: no clustering technique gives good results on all data sets. Performance depends on both the data type and the percentage of missing attributes in the data set. This study shows that the linear interpolation based FCM clustering algorithm can be used effectively for clustering incomplete data sets.
Keywords: FCM Clustering, incomplete data sets, imputation, missing data, regression, interpolation.
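The pipeline summarized under Methods (fill in the missing values, then run conventional FCM) can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `impute_linear` and `fcm`, the toy data, the fuzzifier `m = 2`, and the choice to interpolate each attribute along the row order of the data set are all assumptions made for illustration.

```python
import numpy as np

def impute_linear(X):
    """Fill NaNs in each column by linear interpolation over the row index.

    Assumption: rows are in a meaningful order. Gaps at either end are
    filled with the nearest known value, which is what np.interp does.
    """
    X = np.asarray(X, dtype=float).copy()
    idx = np.arange(X.shape[0])
    for j in range(X.shape[1]):
        col = X[:, j]                      # view into X, edited in place
        known = ~np.isnan(col)
        if known.any():
            col[~known] = np.interp(idx[~known], idx[known], col[known])
    return X

def fcm(X, c, m=2.0, n_iter=100, tol=1e-6, seed=0):
    """Conventional fuzzy c-means on complete data.

    Returns the membership matrix U (n x c) and the cluster centers V (c x d).
    """
    rng = np.random.default_rng(seed)
    U = rng.random((X.shape[0], c))
    U /= U.sum(axis=1, keepdims=True)      # each row of U sums to 1
    for _ in range(n_iter):
        W = U ** m
        V = (W.T @ X) / W.sum(axis=0)[:, None]          # weighted centers
        d = np.linalg.norm(X[:, None, :] - V[None, :, :], axis=2)
        d = np.fmax(d, 1e-12)                           # avoid division by zero
        inv = d ** (-2.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)    # membership update
        converged = np.abs(U_new - U).max() < tol
        U = U_new
        if converged:
            break
    return U, V

# Toy incomplete data set (hypothetical values): two groups, some NaNs.
X = np.array([
    [1.0, 2.0],
    [np.nan, 2.2],
    [1.4, np.nan],
    [8.0, 9.0],
    [8.2, np.nan],
    [np.nan, 9.4],
])

X_filled = impute_linear(X)
U, V = fcm(X_filled, c=2)
labels = U.argmax(axis=1)   # hard labels taken from the fuzzy memberships
print(X_filled)
print(labels)
```

Interpolating along the row index is only sensible when neighbouring rows are related (for example, time-ordered records); for unordered data a different imputation choice would be needed.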