GENERATING GRAYSCALE AND RGB IMAGES DATASET FOR WINDOWS PE MALWARE USING GIST FEATURES EXTRACTION METHOD

Bibliographic Details
Main Author: Amalia Septiyani, Debi
Format: Final Project
Language: Indonesia
Online Access: https://digilib.itb.ac.id/gdl/view/67890
Institution: Institut Teknologi Bandung
Description
Summary: Advances in technology are accompanied by growing vulnerabilities and threats. Security weaknesses arise from flaws in software, systems, or networks, from computer hardware, and from the people who use the systems. Irresponsible actors exploit these weaknesses to penetrate systems for profit. One of the tools cybercriminals use to commit piracy, fraud, destruction, theft, and surveillance, and to obtain confidential information, is malware. Malware, or malicious software, is software used to infiltrate an operating system. Its types include worms, viruses, trojans, spyware, rootkits, ransomware, and others. Beyond its many types and mutations, malware also grows in number from year to year. There have been many large-scale, highly damaging malware attacks aimed at institutions, banks, governments, companies, and even the general public. Major cases include WannaCry in 2017, with losses exceeding 4 billion USD; ILOVEYOU in 2000, a worm that infected 45 million people and caused losses of 15 billion USD; Petya in 2016, which attacked banks and airports with losses reaching 10 billion USD; and, most recently, CovidLock in 2020, among many others.[1] Malware is countered by developing anti-malware that performs detection on a system in order to identify threats and recover the system more quickly. Malware detection with machine learning is a topic of current interest in the cyber security community.[2] This work requires datasets, and one form of dataset currently in demand is the image dataset.
Image datasets present malware as visualizations that can be inspected by eye, and many varied models and supporting tools are available for them, with good accuracy reported in previous studies.[3][4] One malware image dataset still in use today is the Malimg dataset, which contains 9,339 grayscale image samples divided into 25 family classes.[5] To give researchers additional insight and to provide a more recent dataset, this final project created a malware image dataset from 1,907 samples, grouped into malicious and benign and each visualized as two images, one RGB and one grayscale. The dataset is produced by four sub-systems. The first sub-system collects samples of Windows PE files by downloading and cloning them from public repositories. The second sub-system hashes each binary to obtain its SHA-256 hash. The third sub-system submits each hash to VirusTotal to determine whether the sample is malicious or benign, then groups the samples according to the scan results. The fourth sub-system converts the malicious and benign samples into RGB and grayscale images. The resulting dataset contains 858 benign and 1,025 malicious RGB images, and 872 benign and 1,033 malicious grayscale images. The dataset was tested with a simple Convolutional Neural Network (CNN) to confirm that it is feasible to use.
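
The hashing and image-conversion steps described above can be illustrated with a minimal Python sketch. It is not the project's actual implementation: the function names, the fixed image width of 256 pixels, the zero padding of the last row, and the Netpbm (PGM/PPM) output format are all assumptions made for illustration, since the summary does not state them. The VirusTotal lookup is omitted here.

```python
import hashlib
import math


def sha256_of_bytes(data: bytes) -> str:
    """Return the SHA-256 hex digest of a sample's raw bytes."""
    return hashlib.sha256(data).hexdigest()


def bytes_to_image(data: bytes, channels: int = 1, width: int = 256):
    """Map raw bytes to pixel data.

    channels=1 treats each byte as one grayscale pixel;
    channels=3 treats each group of three bytes as one RGB pixel.
    The width of 256 and the zero padding are assumptions.
    Returns (width, height, pixels) with pixels as a bytes object.
    """
    row_bytes = width * channels
    height = max(1, math.ceil(len(data) / row_bytes))
    # Pad the final row with zero bytes so the image is rectangular.
    pixels = data.ljust(height * row_bytes, b"\x00")
    return width, height, pixels


def write_netpbm(path: str, width: int, height: int,
                 pixels: bytes, channels: int) -> None:
    """Write binary PGM (grayscale, P5) or PPM (RGB, P6) using only the stdlib."""
    magic = b"P5" if channels == 1 else b"P6"
    with open(path, "wb") as f:
        f.write(magic + b"\n%d %d\n255\n" % (width, height))
        f.write(pixels)
```

For one sample, a caller would read the file's bytes, record `sha256_of_bytes(data)` as its identifier, and then call `bytes_to_image` twice, once with `channels=1` and once with `channels=3`, to obtain the grayscale and RGB variants.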