Calculating distances between Windows malware using siamese neural network embeddings

The number of unique Windows malware samples has grown rapidly in recent years, making manual inspection of every sample impossible. One way to mitigate this problem is to automatically cluster unknown malware samples into groups of similar file...

Bibliographic Details
Main Author: Sison, Marc Oliver Tan
Format: text
Language: English
Published: Animo Repository 2021
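Subjects: Malware (Computer software); Neural networks (Computer science); Machine learning; Computer Sciences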
Online Access: https://animorepository.dlsu.edu.ph/etdm_comsci/12
https://animorepository.dlsu.edu.ph/cgi/viewcontent.cgi?article=1014&context=etdm_comsci
Institution: De La Salle University
id oai:animorepository.dlsu.edu.ph:etdm_comsci-1014
record_format eprints
spelling oai:animorepository.dlsu.edu.ph:etdm_comsci-1014
2022-01-07T07:11:29Z
2021-09-10T07:00:00Z
text
application/pdf
Computer Science Master's Theses
institution De La Salle University
building De La Salle University Library
continent Asia
country Philippines
content_provider De La Salle University Library
collection DLSU Institutional Repository
language English
topic Malware (Computer software)
Neural networks (Computer science)
Machine learning
Computer Sciences
description The number of unique Windows malware samples has grown rapidly in recent years, making manual inspection of every sample impossible. One way to mitigate this problem is to automatically cluster unknown malware samples into groups of similar files. Such clustering would allow malware researchers to identify large clusters and to analyze an entire cluster using only a few representative samples. Much machine learning work has addressed the problem of clustering malware samples. However, previous work has mostly focused on clustering into known malware families, or requires dynamic features that are prohibitively slow to extract given the number of new malware samples. This paper proposes training a siamese neural network on engineered static features to generate embeddings that can be used to calculate distances between malware files. The engineered features would be chosen so that distances calculated from the resulting embeddings are resistant to a certain degree of malware metamorphism and generalize well to Windows files as a whole rather than to specific malware families. This would also enable a form of one-shot learning detection, in which multiple unknown malware samples can be detected by their distance from a known malicious file.
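The distance-based idea in the description can be illustrated with a minimal sketch. The abstract does not state the loss function, the distance metric, the embedding size, or any threshold, so everything below is an assumption: Euclidean distance, a contrastive loss (a common choice for siamese networks, not necessarily the one used in the thesis), hypothetical 3-dimensional embeddings, and an arbitrary threshold of 0.5. The function names are illustrative only.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def contrastive_loss(distance, is_similar, margin=1.0):
    """Contrastive loss for one pair (assumed training objective).

    Similar pairs are penalized by their squared distance; dissimilar
    pairs are penalized only when closer than `margin`.
    """
    if is_similar:
        return distance ** 2
    return max(0.0, margin - distance) ** 2


def flag_near_known_malware(known_embedding, unknown_embeddings, threshold):
    """Indices of unknown samples within `threshold` of one known sample."""
    return [
        i
        for i, emb in enumerate(unknown_embeddings)
        if euclidean_distance(known_embedding, emb) <= threshold
    ]


# Hypothetical embeddings; a trained network would produce these
# from static file features.
known = [0.0, 0.0, 0.0]
unknown = [[0.1, 0.0, 0.0], [3.0, 4.0, 0.0], [0.0, 0.2, 0.1]]

flagged = flag_near_known_malware(known, unknown, threshold=0.5)
print(flagged)
```

In this sketch a single known malicious embedding is enough to flag nearby unknown samples, which is the one-shot style of detection the description refers to. In practice the threshold would have to be tuned on labeled pairs.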
format text
author Sison, Marc Oliver Tan
title Calculating distances between Windows malware using siamese neural network embeddings
publisher Animo Repository
publishDate 2021