Abstract
The massive number of sensors deployed in the Internet of Things (IoT) generates enormous
volumes of data for a broad range of applications such as smart homes, smart healthcare,
smart manufacturing, smart transportation, smart grids, and smart agriculture. Analyzing
such data to support better decision making and to increase productivity and accuracy
is a critical process for businesses and for life-improving paradigms. Machine
learning can play a vital role in creating smarter techniques for detecting intruders
from such data. It has shown remarkable results in many fields, including network
security, image recognition, information retrieval, speech recognition, natural language
processing, indoor localization, and physiological and psychological state detection. In
this regard, intrusion detection is becoming a research focus in the field of information
security. In our experiments, we used the CICIDS-2017 dataset to detect network
intruders. The Canadian Institute for Cybersecurity released the CICIDS-2017 dataset,
which consists of eight separate files and includes five days’ worth of normal and
abnormal network packet data. The goal of this research is to identify the relevant and
significant features of large network packet data in order to increase attack
detection accuracy and reduce execution time. We select important and meaningful
features from the CICIDS-2017 dataset by applying Information Gain, ranking the features
and grouping them by minimum weight value, and then apply the Random Forest (RF), Random
Tree (RT), Naive Bayes (NB), Bayes Net (BN), and J48 classifier algorithms. The
findings of the experiment reveal that the number of relevant and significant features
selected by Information Gain has a substantial impact on detection accuracy and
execution time. The Random Forest method, for example, achieves high accuracy, with
0.14% negative results, when using 22 selected features, whereas the Random Tree
classifier achieves slightly higher accuracy, with 0.13% negative results, when using
52 selected features, but takes longer to execute.
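
The feature-selection step described above can be illustrated with a minimal sketch. It is not the implementation used in the experiments. It assumes that feature values have already been discretized, and the function names, the toy data, and the 0.5 threshold are hypothetical choices made only for illustration. Information Gain is computed as the reduction in label entropy obtained by conditioning on a feature; features are then ranked by this weight and kept if they meet a minimum value.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(values, labels):
    """IG(Y; X) = H(Y) - sum_v p(X=v) * H(Y | X=v) for one discrete feature."""
    by_value = defaultdict(list)
    for value, label in zip(values, labels):
        by_value[value].append(label)
    total = len(labels)
    conditional = sum(len(ys) / total * entropy(ys) for ys in by_value.values())
    return entropy(labels) - conditional


def rank_features(columns, labels):
    """Return (feature_name, gain) pairs sorted from highest to lowest gain."""
    gains = {name: information_gain(vals, labels) for name, vals in columns.items()}
    return sorted(gains.items(), key=lambda item: item[1], reverse=True)


def select_features(ranked, min_weight):
    """Keep the names of features whose gain is at least min_weight."""
    return [name for name, gain in ranked if gain >= min_weight]


# Toy data invented for illustration; not taken from CICIDS-2017.
labels = ["BENIGN", "BENIGN", "ATTACK", "ATTACK"]
columns = {
    "flag_count": [0, 0, 1, 1],   # separates the two classes perfectly
    "port_bucket": [0, 1, 0, 1],  # carries no information about the class
}

ranked = rank_features(columns, labels)
selected = select_features(ranked, min_weight=0.5)  # hypothetical threshold
print(ranked)
print(selected)
```

Varying `min_weight` yields feature groups of different sizes, which could then be passed to each classifier to compare accuracy against execution time.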