Feature Clustering and Ensemble Learning Based Approach for Software Defect Prediction

R. Srivastava; Aman Kumar Jain

doi:10.2174/2666255813999201109201259

Abstract

Objective: Defects in delivered software products not only have financial implications but also affect the reputation of the organisation and lead to wastage of time and human resources. This paper aims to detect defects in software modules.

Methods: Our approach sequentially combines the SMOTE algorithm with the K-means clustering algorithm to deal with the class imbalance problem and to obtain a set of key features based on the inter-class and intra-class coefficients of correlation, and then applies ensemble modeling to predict defects in software modules. After careful examination, an ensemble framework of XGBoost, Decision Tree, and Random Forest is used for the prediction of software defects, owing to the numerous merits of the ensembling approach.
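The keywords list hard voting as the combination rule for the ensemble. The following is a minimal sketch of majority (hard) voting over the label predictions of three base classifiers, written in plain Python. The function name `hard_vote` and the prediction lists are hypothetical and stand in for the outputs of trained XGBoost, Decision Tree, and Random Forest models; the sketch is not the authors' implementation.

```python
from collections import Counter


def hard_vote(predictions):
    """Return the majority label for each sample.

    `predictions` is a list of per-classifier label lists of equal length.
    """
    if not predictions:
        raise ValueError("need at least one classifier's predictions")
    n_samples = len(predictions[0])
    if any(len(p) != n_samples for p in predictions):
        raise ValueError("all classifiers must predict the same number of samples")

    voted = []
    for sample_votes in zip(*predictions):
        # With three classifiers and binary labels there is always a strict
        # majority, so most_common(1) is unambiguous here.
        voted.append(Counter(sample_votes).most_common(1)[0][0])
    return voted


# Hypothetical outputs of the three base models (1 = defective, 0 = clean).
xgb_pred = [1, 0, 1, 0, 1]
tree_pred = [1, 1, 0, 0, 1]
forest_pred = [0, 0, 1, 0, 1]

ensemble_pred = hard_vote([xgb_pred, tree_pred, forest_pred])
print(ensemble_pred)
```

With an odd number of binary classifiers no tie-breaking rule is needed; an even number of voters or more than two classes would require one.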

Results: We used five open-source datasets from the NASA PROMISE repository for software engineering. The results obtained from our approach were compared with those of the individual algorithms used in the ensemble. A confidence interval for the performance of our approach with respect to the evaluation metrics, namely accuracy, precision, recall, F1 score and AUC score, was also constructed at a significance level of 0.01.
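A significance level of 0.01 corresponds to a 99% confidence interval. The abstract does not state how the interval was constructed, so the sketch below assumes a normal-approximation interval for a proportion-type metric such as accuracy. The function name, the accuracy value, and the test-set size are hypothetical illustrations, not results from the paper.

```python
from math import sqrt
from statistics import NormalDist


def proportion_confidence_interval(p_hat, n, alpha=0.01):
    """Normal-approximation confidence interval for a proportion.

    p_hat: observed proportion (e.g. accuracy) in [0, 1]
    n:     number of test samples the proportion was computed on
    alpha: significance level (0.01 gives a 99% interval)
    """
    if not 0.0 <= p_hat <= 1.0:
        raise ValueError("p_hat must lie in [0, 1]")
    if n <= 0:
        raise ValueError("n must be positive")

    # Two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z * sqrt(p_hat * (1 - p_hat) / n)
    # Clip to the valid range of a proportion.
    return max(0.0, p_hat - half_width), min(1.0, p_hat + half_width)


# Hypothetical example: 90% accuracy measured on 500 test modules.
low, high = proportion_confidence_interval(0.90, 500)
print(f"99% CI for accuracy: [{low:.3f}, {high:.3f}]")
```

The normal approximation is reasonable for large test sets and proportions away from 0 and 1; other constructions (for example a t-interval over repeated runs) would give different bounds.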

Conclusion: Results are presented graphically.

Keywords: Software defects, feature selection, class imbalance, ensemble modelling, hard voting, confidence interval

DOI: https://dx.doi.org/10.2174/2666255813999201109201259
Print ISSN: 2666-2558
Online ISSN: 2666-2566
Publisher Name: Bentham Science Publisher

Recent Advances in Computer Science and Communications
