Abstract
Malicious software (malware) becomes a serious threat to system security the
moment any electronic gadget or ‘Thing’ is connected to the World Wide Web
(WWW). Malware is stealthy software that is used to collect sensitive information,
gain access to private systems, and disrupt device operation. It thus acts
against the user's requirements and threatens all operating systems (OS), particularly
Windows and Android, because those are the most widely used. Malware
developers try to invade the system by means of viruses, adware, spyware,
ransomware, botware, Trojans, etc. They also employ various anti-forensic techniques so
that the malware cannot be detected or investigated, in effect playing
‘peekaboo’ with malware investigators. As a result, investigating such attacks
becomes more complex and often fails because of immature forensic
methodology or a lack of appropriate tools. This chapter is a first step towards
analysing malware. The process started with collecting a malware dataset and
understanding it. Machine learning (ML) has two basic blocks: feature extraction and
classification. In supervised learning, the features play a significant role, so
understanding the features and their effect on classification was a
major task. Two separate experimental processes were explored. The first one involved
extracting n-grams from the binary files using the kfNgram tool, and the second one
used a shell script to parse the assembly files for method calls to external API libraries.
Several supervised machine learning classifiers, such as Decision Trees, SVM, and Naive
Bayes, were used to classify the malware family based on the extracted features. We
proposed a method to classify malware into the nine families defined in the Kaggle dataset. It
analyses the n-grams of a malware file to generate the feature vector. The value
of ‘n’ is selectable; at present, it is four. The objective was to extract highly
probable n-grams from the binary files after pre-processing, i.e., calculating the
information gain (IG) of each n-gram. The present threshold for selecting n-grams from
the top of the ranked list is five
hundred. SVM and Decision Trees were observed to provide an accuracy of
about 98%. Nevertheless, there is room for improvement, because the sequential
selection of n-grams may admit irrelevant ones. This method
is considered a starting point for malware classification.
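
The n-gram feature pipeline described above can be illustrated with a minimal sketch. It is not the chapter's kfNgram-based implementation. It assumes that each sample is a raw byte string, that an n-gram is a feature that is either present in or absent from a sample, and that IG is computed as the reduction in Shannon entropy of the family labels. All function names (byte_ngrams, information_gain, top_ngrams, feature_vector) are hypothetical and introduced only for illustration; the defaults n = 4 and k = 500 mirror the values stated above.

```python
import math
from collections import Counter


def byte_ngrams(data: bytes, n: int = 4) -> set:
    """Return the set of distinct n-byte sequences that occur in data."""
    return {data[i:i + n] for i in range(len(data) - n + 1)}


def entropy(labels) -> float:
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels).values()
    return -sum((c / total) * math.log2(c / total) for c in counts)


def information_gain(present, labels) -> float:
    """IG of a binary feature (n-gram present or absent) w.r.t. the labels."""
    with_gram = [y for p, y in zip(present, labels) if p]
    without_gram = [y for p, y in zip(present, labels) if not p]
    total = len(labels)
    conditional = (
        (len(with_gram) / total) * entropy(with_gram)
        + (len(without_gram) / total) * entropy(without_gram)
    )
    return entropy(labels) - conditional


def top_ngrams(samples, labels, n: int = 4, k: int = 500) -> list:
    """Rank every n-gram seen in samples by IG and keep the k best."""
    grams_per_sample = [byte_ngrams(s, n) for s in samples]
    vocabulary = set().union(*grams_per_sample)
    scored = [
        (information_gain([g in grams for grams in grams_per_sample], labels), g)
        for g in vocabulary
    ]
    # Highest IG first; ties are broken by byte order so the result is stable.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [g for _, g in scored[:k]]


def feature_vector(sample: bytes, selected, n: int = 4) -> list:
    """Binary vector: 1 if the selected n-gram occurs in sample, else 0."""
    grams = byte_ngrams(sample, n)
    return [1 if g in selected_gram_set(grams, g) else 0 for g in selected]


def selected_gram_set(grams: set, gram: bytes) -> set:
    """Hypothetical helper: return grams unchanged (kept for readability)."""
    return grams
```

The resulting vectors could then be passed to any supervised classifier. The sketch scores each n-gram independently, so it does not address the limitation noted above that some selected n-grams may be irrelevant or redundant.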