GENERATING GRAYSCALE AND RGB IMAGES DATASET FOR WINDOWS PE MALWARE USING GIST FEATURES EXTRACTION METHOD
Main Author: | |
---|---|
Format: | Final Project |
Language: | Indonesia |
Online Access: | https://digilib.itb.ac.id/gdl/view/67890 |
Institution: | Institut Teknologi Bandung |
Summary: | Advances in technology are accompanied by growing vulnerabilities and
threats. Security weaknesses can originate in software, systems, or networks, in
computer hardware, and in the people who use the systems. Attackers exploit these
weaknesses to penetrate systems for profit. Malware is one of the means that
cybercriminals use to carry out piracy, fraud, destruction, theft, surveillance,
and the acquisition of confidential information.
Malware, or malicious software, is software used to infiltrate an operating system.
Its types include worms, viruses, trojans, spyware, rootkits, ransomware, and
others. Beyond its many types and mutations, malware has also grown in number from
year to year. Many large-scale, highly damaging malware attacks have targeted
institutions, banks, governments, companies, and even the general public. Major
cases include WannaCry in 2017, with losses exceeding 4 billion USD; ILOVEYOU in
2000, a worm that infected 45 million people and caused losses of 15 billion USD;
Petya in 2016, which attacked banks and airports with losses reaching 10 billion
USD; and, most recently, CovidLock in 2020, along with many other cases.[1]
Malware attacks are countered by developing anti-malware that detects and
identifies threats and recovers systems from them more quickly. Malware detection
with machine learning is a topic of current interest in the cyber security
community.[2] This work requires suitable datasets. One form of dataset currently
in demand is the image dataset, which visualizes samples as images that can be
inspected by eye and is supported by many varied models and tools that achieved
good accuracy in previous studies.[3][4]
One malware image dataset still in use today is the Malimg dataset, which has 9339
grayscale image samples divided into 25 family classes.[5] To give researchers
additional insight and an updated dataset, this final project created a malware
image dataset from 1907 samples, grouped into malicious and benign, with each
sample visualized as two images, RGB and grayscale.
In this final project, the dataset of malicious and benign samples in RGB and
grayscale image format is created with four sub-systems. The first sub-system
collects Windows PE file samples by downloading and cloning from public
repositories. The second sub-system hashes each binary to obtain its SHA-256 hash.
The third sub-system submits each hash to VirusTotal to determine whether the
sample is malicious or benign and then groups the samples according to the scan
results. The fourth sub-system converts the malicious and benign samples to RGB
and grayscale images. The resulting dataset contains 858 benign and 1,025
malicious RGB images, and 872 benign and 1,033 malicious grayscale images. The
dataset was tested with a simple Convolutional Neural Network (CNN) to confirm
that it is feasible to use. |
---|
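
The second and fourth sub-systems described in the summary can be illustrated with a minimal Python sketch. The summary does not state how bytes are mapped to pixels, so the mapping here is an assumption: one byte per grayscale pixel and three consecutive bytes per RGB pixel, with zero padding and a fixed row width. The function names `sha256_hex`, `bytes_to_grayscale`, and `bytes_to_rgb` and the `width` parameter are hypothetical and not taken from the project.

```python
import hashlib
import math


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of a sample's raw bytes as a hex string."""
    return hashlib.sha256(data).hexdigest()


def bytes_to_grayscale(data: bytes, width: int = 256) -> list:
    """Assumed mapping: each byte becomes one pixel intensity (0-255).

    Pixels are laid out row by row; the last row is zero-padded.
    """
    if width <= 0:
        raise ValueError("width must be positive")
    height = max(1, math.ceil(len(data) / width))
    padded = data.ljust(width * height, b"\x00")
    return [list(padded[r * width:(r + 1) * width]) for r in range(height)]


def bytes_to_rgb(data: bytes, width: int = 256) -> list:
    """Assumed mapping: each run of three bytes becomes one (R, G, B) pixel.

    Pixels are laid out row by row; missing bytes are zero-padded.
    """
    if width <= 0:
        raise ValueError("width must be positive")
    n_pixels = max(1, math.ceil(len(data) / 3))
    height = math.ceil(n_pixels / width)
    padded = data.ljust(width * height * 3, b"\x00")
    pixels = [tuple(padded[i:i + 3]) for i in range(0, len(padded), 3)]
    return [pixels[r * width:(r + 1) * width] for r in range(height)]


if __name__ == "__main__":
    sample = bytes(range(10))  # stand-in for the bytes of a PE file
    print(sha256_hex(sample))
    print(bytes_to_grayscale(sample, width=4))
    print(bytes_to_rgb(sample, width=2))
```

The nested lists can be passed to an imaging library to be saved as image files. The hash returned by `sha256_hex` is the kind of identifier that the third sub-system would submit to VirusTotal.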