Abstract
Background: Missing values in databases pose a serious threat to knowledge extraction. In large databases integrated from multiple sources in particular, the number of missing values may be high, which in turn may lead to biased inferences. Many methods have been proposed for handling ignorable missingness (Missing At Random and Missing Completely At Random) and non-ignorable missingness (Not Missing At Random). Still, gaps remain in (i) handling heterogeneous missing attributes, (ii) imputing missing values in large databases, and (iii) dealing with both ignorable and non-ignorable missingness.
Objective: This paper addresses all three issues by proposing a single algorithm, Repopulated Bayesian Ant Colony Optimization (RPBACO), which hybridizes Bayesian and Ant Colony Optimization (ACO) techniques.
Methodology: ACO chooses the covariate values required for optimal imputation of missing values based on the ants' selection probabilities and pheromone update values. Bayesian principles are used to evaluate the fitness of solutions in the ACO process, which uses local beam search to repopulate successive generations. RPBACO is applied to large real datasets to impute heterogeneous (discrete and continuous) missing values with both ignorable and non-ignorable patterns.
Results: At missing rates from 5% to 50%, RPBACO compares favorably with existing standard techniques in terms of both imputation accuracy and computational time. The statistical tests conducted to validate the experimental results also support the superiority of RPBACO on all the datasets considered.
Conclusion: RPBACO can be successfully used for handling both ignorable and non-ignorable missing values in heterogeneous attributes in large datasets with better imputation accuracy.
Keywords: Ant colony optimization, ignorable missingness, non-ignorable missingness, repopulated Bayesian ant colony optimization, Bayesian principles, local beam search.