Machine learning based approaches towards robust Android malware detection

Bibliographic Details
Main Author: XU, Jiayun
Format: text
Language: English
Published: Institutional Knowledge at Singapore Management University 2021
Subjects:
Online Access: https://ink.library.smu.edu.sg/etd_coll/320
https://ink.library.smu.edu.sg/cgi/viewcontent.cgi?article=1320&context=etd_coll
Institution: Singapore Management University
Description
Summary: The Android platform has become increasingly popular, and organizations have developed numerous applications (apps) over the years to meet ever-increasing market demand. Naturally, security and privacy concerns about Android apps have attracted considerable attention from both the academic and industrial communities. Many approaches have been proposed to detect Android malware, and most of them perform satisfactorily under the given Android environment settings and labelled samples. However, existing approaches suffer from the following robustness problems. First, many Android malware detection approaches build their feature sets from specific API calls, and those feature sets are fixed once the model has been trained. Such feature sets lack robustness against changes in the available APIs, because new APIs are continually released and old ones deprecated as the Android specifications evolve. If developers switch from old APIs to new ones in app development, detection models trained before the release of the new APIs may no longer be effective, because the new APIs are not included in the previously fixed feature sets. Second, existing approaches lack robustness against label noise. Recent research has found that sample labels provided by malware detection websites are not always reliable, and in our experiments 10% of the sample labels provided by VirusTotal changed over a period of 2 years. This indicates that label noise cannot be ignored when training Android malware detection models, and existing approaches that directly use the provided labels suffer from it. Third, even if the sample labels are correct, the sample labels and the generated feature vectors may still be inconsistent in dynamic-based Android malware detection approaches.
Such inconsistencies arise because no triggering module can perfectly trigger all potential malicious behaviors, and anti-analysis techniques are common in apps. In this case, the behavior traces collected from samples labelled as “malware” may not contain “malicious” behaviors, so feature vectors built from such traces can act as noise in model training. To address the above problems, this dissertation presents three works that improve the robustness of Android malware detection in different ways. The first work proposes a slow-aging Android malware detection solution named SDAC. To address the model aging problem, SDAC evolves its feature set by evaluating new APIs’ contributions to malware detection based on existing APIs’ contributions. In detail, SDAC evaluates the contributions of APIs using their contexts in API call sequences, which are extracted from Android apps and show how the APIs are used in real-world cases. Based on these sequences, an embedding algorithm named API2Vec maps APIs into a vector space in which the differences among API vectors are regarded as semantic distances. SDAC then clusters the APIs based on these semantic distances to create a feature set in the training phase, and extends the feature set to include all new APIs in the detecting phase. Through this feature extension, SDAC adapts to changes in the Android specifications and thus remains robust against them. The second work is Differential Training, a general framework designed to reduce the noise level of training data for any machine learning-based Android malware detection approach. We discover that the labels of samples provided by anti-virus organizations change over time.
These changes imply that certain labels are erroneous and thus distort performance when such labels are used to train Android malware detection models. As a general framework, Differential Training can detect label noise with different Android malware detection approaches. For the input sample apps, Differential Training first generates noise detection feature vectors from all the intermediate states of two identical deep learning classification models. It then applies outlier detection algorithms to these noise detection feature vectors, and the detected outliers are regarded as arising from label noise. With the label noise detected and reduced, Differential Training helps improve the detection accuracy of Android malware detection approaches. The third work is a noise-tolerant dynamic-based Android malware detection approach named Dynamic Attention. In dynamic-based Android malware detection approaches, the behavior traces collected from samples with “malware” labels may not contain “malicious” behaviors because of the imperfect trigger procedure or anti-analysis methods, so they are in fact mislabelled when used to train Android malware detection models. Dynamic Attention is designed to solve this mislabelling problem: it identifies label noise based on the variances of the attention weights within the behavior traces derived from malicious apps, and assigns correctly-labelled behavior traces high weights and wrongly-labelled ones low weights during model training. By doing so, Dynamic Attention makes the classification model learn less from wrongly-labelled feature vectors and gain resistance to the noise. This approach is also highly practical, since it relies on neither domain knowledge nor manual inspection in model training. This dissertation contributes to the robustness of Android malware detection approaches in various ways.
In particular, SDAC is robust to changes in the Android specifications, Differential Training provides robustness against label noise for Android malware detection in static analysis, and Dynamic Attention achieves the same goal for Android malware detection in dynamic analysis.
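
Illustrative sketch (not part of the original record): the feature-set extension described for SDAC can be pictured roughly as follows. The summary states only that APIs are clustered by semantic distance and that the feature set is extended to cover new APIs. The nearest-centroid assignment, the one-binary-feature-per-cluster encoding, the 2-D vectors, and the API names below are all assumptions made for illustration, not the dissertation's actual method or data.

```python
import math

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def extend_feature_set(clusters, embeddings, new_apis):
    """Return a copy of `clusters` with each new API added to the cluster
    whose centroid is nearest in the embedding space (an assumed rule)."""
    centroids = {
        cid: centroid([embeddings[api] for api in members])
        for cid, members in clusters.items()
    }
    extended = {cid: list(members) for cid, members in clusters.items()}
    for api in new_apis:
        nearest = min(
            centroids,
            key=lambda cid: math.dist(embeddings[api], centroids[cid]),
        )
        extended[nearest].append(api)
    return extended

def feature_vector(app_apis, clusters):
    """One binary feature per cluster: 1 if the app calls any API in it."""
    called = set(app_apis)
    return [
        int(any(api in called for api in clusters[cid]))
        for cid in sorted(clusters)
    ]

# Hypothetical API names and embedding vectors, standing in for API2Vec output.
embeddings = {
    "sms.send_old": [0.9, 0.1],
    "net.open_old": [0.1, 0.9],
    "sms.send_new": [0.8, 0.2],  # released after training
}
clusters = {0: ["sms.send_old"], 1: ["net.open_old"]}

extended = extend_feature_set(clusters, embeddings, ["sms.send_new"])
app = ["sms.send_new"]  # an app that uses only the new API
print(feature_vector(app, clusters))  # fixed feature set misses the new API
print(feature_vector(app, extended))  # extended feature set captures it
```

With the fixed clusters the app's feature vector is all zeros, while the extended clusters map the new API onto the feature of its semantically closest old APIs. This is the aging effect the summary describes.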