Unleashing the power of pseudo-code for binary code similarity analysis

Code similarity analysis has become more popular due to its significant applications, including vulnerability detection, malware detection, and patch analysis. Since the source code of the software is difficult to obtain under most circumstances, binary-level code similarity analysis (BCSA) has bee...

Bibliographic Details
Main Authors:	Zhang, Weiwei; Xu, Zhengzi; Xiao, Yang; Xue, Yinxing
Other Authors:	School of Computer Science and Engineering
Format:	Article
Language:	English
Published:	2023
Subjects:	Engineering::Computer science and engineering; Binary Code Similarity; Machine Learning
Online Access:	https://hdl.handle.net/10356/165104
Institution:	Nanyang Technological University

id	sg-ntu-dr.10356-165104
record_format	dspace
spelling	sg-ntu-dr.10356-165104 2023-03-17T15:35:49Z
Published version 2023-03-13T04:41:32Z 2023-03-13T04:41:32Z 2022 Journal Article Zhang, W., Xu, Z., Xiao, Y. & Xue, Y. (2022). Unleashing the power of pseudo-code for binary code similarity analysis. Cybersecurity, 5(1). https://dx.doi.org/10.1186/s42400-022-00121-0 2523-3246 https://hdl.handle.net/10356/165104 10.1186/s42400-022-00121-0 2-s2.0-85142935849 1 5 en Cybersecurity © The Author(s) 2022. Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. application/pdf
institution	Nanyang Technological University
building	NTU Library
continent	Asia
country	Singapore
content_provider	NTU Library
collection	DR-NTU
language	English
topic	Engineering::Computer science and engineering; Binary Code Similarity; Machine Learning
description	Code similarity analysis has become more popular due to its significant applications, including vulnerability detection, malware detection, and patch analysis. Since the source code of software is difficult to obtain under most circumstances, binary-level code similarity analysis (BCSA) has received much attention. In recent years, many BCSA studies incorporating AI techniques have focused on deriving semantic information from binary functions, using code representations such as assembly code, intermediate representations, and control flow graphs to measure similarity. However, due to the impact of different compilers, architectures, and obfuscations, binaries compiled from the same source code may vary considerably, which is the major obstacle to obtaining robust features in these works. In this paper, we propose a solution named UPPC (Unleashing the Power of Pseudo-code), which takes the pseudo-code of binary functions as input to address the binary code similarity analysis challenge, since pseudo-code is more abstract than binary instructions and is platform-independent. UPPC selectively inlines functions to capture the full function semantics across different compiler optimization levels and uses a deep pyramidal convolutional neural network to obtain the semantic embedding of each function. We evaluated UPPC on a data set containing vulnerabilities and a data set covering different architectures (X86, ARM), optimization options (O0-O3), compilers (GCC, Clang), and four obfuscation strategies. The experimental results show that the accuracy of UPPC in function search is 33.2% higher than that of existing methods.
author2	School of Computer Science and Engineering
author_facet	School of Computer Science and Engineering; Zhang, Weiwei; Xu, Zhengzi; Xiao, Yang; Xue, Yinxing
format	Article
author	Zhang, Weiwei; Xu, Zhengzi; Xiao, Yang; Xue, Yinxing
author_sort	Zhang, Weiwei
title	Unleashing the power of pseudo-code for binary code similarity analysis
publishDate	2023
url	https://hdl.handle.net/10356/165104
_version_	1761781299191742464

Similar Items