An Integrated Approach of Proposed Pruning Based Feature Selection Technique (PBFST) for Phishing E-mail Detection

Hari Shanker Hota; Dinesh Sharma; Akhilesh Shrivas

doi:10.2174/2666255814666210322162129

Abstract

Background: Communication worldwide is shifting towards e-mail as a fast and secure channel. Millions of users, including organizations and governments, rely on e-mail services. This growing user base is increasingly exposed to phishing, and recognizing a phishing e-mail is a challenging task, especially for non-IT users. Automatic phishing e-mail detection therefore needs to be deployed alongside e-mail software. Various authors have worked on phishing e-mail classification using different feature selection and optimization techniques to improve performance.

Objectives: This paper builds a model for detecting phishing e-mail using data mining techniques. Its main contribution is the development and application of a Feature Selection Technique (FST) that reduces the number of features in a benchmark phishing e-mail data set.

Methods: The proposed Pruning Based Feature Selection Technique (PBFST) ranks each feature according to the level of the decision tree at which the feature appears. The algorithm is integrated with the previously developed Bucket Based Feature Selection Technique (BBFST), which is used internally to rank the features that fall within the same level of the tree.
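
The level-based ranking idea can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the details of PBFST or BBFST, so the toy tree, the feature names, and the within-level scores below are all hypothetical, and a simple score lookup stands in for BBFST. The sketch assumes that a feature's level is the shallowest depth at which it is tested, and that features absent from the tree receive no rank.

```python
from collections import deque

# Hypothetical toy tree: each internal node names the feature it tests and
# lists its child nodes; leaves are plain class-label strings.
toy_tree = {
    "feature": "has_ip_url",
    "children": [
        {"feature": "num_links", "children": ["phishing", "ham"]},
        {
            "feature": "html_form",
            "children": [
                {"feature": "num_links", "children": ["ham", "phishing"]},
                "phishing",
            ],
        },
    ],
}

# Hypothetical within-level scores (for example a gain ratio per feature).
# This lookup is only a stand-in for BBFST, whose details are not in the abstract.
within_level_score = {"has_ip_url": 0.61, "num_links": 0.35, "html_form": 0.42}


def shallowest_levels(tree):
    """Return {feature: shallowest level at which it is tested}, with root = 0."""
    levels = {}
    queue = deque([(tree, 0)])
    while queue:
        node, depth = queue.popleft()
        if not isinstance(node, dict):  # leaf (class label): nothing to record
            continue
        # Breadth-first order guarantees the first visit is the shallowest one.
        levels.setdefault(node["feature"], depth)
        for child in node["children"]:
            queue.append((child, depth + 1))
    return levels


def rank_features(tree, scores):
    """Rank features by level (shallower first), then by descending score."""
    levels = shallowest_levels(tree)
    return sorted(levels, key=lambda f: (levels[f], -scores.get(f, 0.0)))


ranking = rank_features(toy_tree, within_level_score)
print(ranking)  # ['has_ip_url', 'html_form', 'num_links']
```

In this toy case the root feature ranks first, and the two features at level 1 are ordered by their assumed scores.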

Results: Experiments were carried out in the open-source WEKA data mining software using 10-fold cross-validation. The proposed FST was compared with other ranking-based FSTs by evaluating the performance of the C4.5 classifier on the phishing e-mail data set.
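
For readers unfamiliar with the evaluation protocol, the following is a minimal pure-Python sketch of 10-fold cross-validation. It does not reproduce the WEKA setup: the data are synthetic, the function names are hypothetical, and a one-feature threshold rule is used as a placeholder learner instead of C4.5.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle sample indices and deal them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(X, y, fit, predict, k=10, seed=0):
    """Mean accuracy over k folds; each fold serves as the test set once.

    Assumes len(X) >= k so that no fold is empty.
    """
    accuracies = []
    for test_idx in k_fold_indices(len(X), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        correct = sum(predict(model, X[i]) == y[i] for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / k


# Placeholder learner (NOT C4.5): threshold on feature 0, halfway between
# the two class means. Assumes class 1 has the larger mean.
def fit_stump(X_train, y_train):
    mean0 = sum(x[0] for x, t in zip(X_train, y_train) if t == 0) / y_train.count(0)
    mean1 = sum(x[0] for x, t in zip(X_train, y_train) if t == 1) / y_train.count(1)
    return (mean0 + mean1) / 2


def predict_stump(threshold, x):
    return 1 if x[0] > threshold else 0


# Synthetic, well-separated data: class 0 near 0..9, class 1 near 100..109.
X = [[v] for v in range(10)] + [[100 + v] for v in range(10)]
y = [0] * 10 + [1] * 10

mean_accuracy = cross_validate(X, y, fit_stump, predict_stump, k=10)
print(mean_accuracy)
```

Averaging over folds means every sample is used for testing exactly once, which is why the protocol is a common choice for comparing feature subsets on a fixed data set.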

Conclusion: The proposed FST eliminates 33 of the 47 features in the phishing e-mail data set. The C4.5 algorithm achieves an accuracy of 99.06% with only 11 features, which is better than the accuracy obtained with the other existing FSTs.

Keywords: Phishing e-mail detection, Pruning Based Feature Selection Technique (PBFST), classification, Decision Tree (DT), gain ratio, data mining.


DOI: https://dx.doi.org/10.2174/2666255814666210322162129
Print ISSN: 2666-2558
Online ISSN: 2666-2566
Publisher Name: Bentham Science Publisher

Recent Advances in Computer Science and Communications
