DexBERT: Effective, task-agnostic and fine-grained representation learning of Android bytecode

Bibliographic Details
Main Authors: SUN, Tiezhu, ALLIX, Kevin, KIM, Kisub, ZHOU, Xin, KIM, Dongsun, LO, David, BISSYANDE, Tegawendé F., KLEIN, Jacques
Format: text
Language: English
Published: Institutional Knowledge at Singapore Management University 2023
Subjects:
Online Access: https://ink.library.smu.edu.sg/sis_research/8509
https://ink.library.smu.edu.sg/context/sis_research/article/9512/viewcontent/DexBERT_Effective_Task_Agnostic_and_Fine_Grained_Representation_Learning_of_Android_Bytecode.pdf
Institution: Singapore Management University
id sg-smu-ink.sis_research-9512
record_format dspace
spelling sg-smu-ink.sis_research-9512 2024-01-22T15:11:03Z 2023-10-01T07:00:00Z text application/pdf https://ink.library.smu.edu.sg/sis_research/8509 info:doi/10.1109/TSE.2023.3310874 http://creativecommons.org/licenses/by-nc-nd/4.0/ Research Collection School Of Computing and Information Systems eng Institutional Knowledge at Singapore Management University
institution Singapore Management University
building SMU Libraries
continent Asia
country Singapore
content_provider SMU Libraries
collection InK@SMU
language English
topic Representation learning
Android app analysis
Code representation
Malicious code localization
Defect prediction
Predictive models
Operating systems
Software engineering
Artificial Intelligence and Robotics
OS and Networks
Software Engineering
description The automation of an increasingly large number of software engineering tasks is becoming possible thanks to Machine Learning (ML). One foundational building block in applying ML to software artifacts is the representation of these artifacts (e.g., source code or executable code) in a form that is suitable for learning. Traditionally, researchers and practitioners have relied on manually selected features, based on expert knowledge, for the task at hand. Such knowledge is sometimes imprecise and generally incomplete. To overcome this limitation, many studies have leveraged representation learning, delegating to ML itself the job of automatically devising suitable representations and selecting the most relevant features. Yet, in the context of Android problems, existing models are either limited to the coarse-grained whole-app level (e.g., apk2vec) or designed for one specific downstream task (e.g., smali2vec). Thus, the representations they produce may be unsuitable for fine-grained tasks or may not generalize beyond the task they were trained on. Our work is part of a new line of research that investigates effective, task-agnostic, and fine-grained universal representations of bytecode to mitigate both limitations. Such representations aim to capture information relevant to various low-level downstream tasks (e.g., at the class level). We are inspired by the field of Natural Language Processing, where the problem of universal representation was addressed by building Universal Language Models, such as BERT, whose goal is to capture abstract semantic information about sentences in a way that is reusable for a variety of tasks. We propose DexBERT, a BERT-like Language Model dedicated to representing chunks of DEX bytecode, the main binary format used in Android applications.
We empirically assess whether DexBERT is able to model the DEX language and evaluate the suitability of our model on three distinct class-level software engineering tasks: Malicious Code Localization, Defect Prediction, and Component Type Classification. We also experiment with strategies for handling apps of vastly different sizes, and we demonstrate one example of using our technique to investigate what information is relevant to a given task.
format text
author SUN, Tiezhu
ALLIX, Kevin
KIM, Kisub
ZHOU, Xin
KIM, Dongsun
LO, David
BISSYANDE, Tegawendé F.
KLEIN, Jacques
title DexBERT: Effective, task-agnostic and fine-grained representation learning of Android bytecode
publisher Institutional Knowledge at Singapore Management University
publishDate 2023
_version_ 1789483255968301056