Learning program semantics via exploring program structures with deep learning

Bibliographic Details
Main Author: Liu, Shangqing
Other Authors: Liu, Yang
Format: Thesis-Doctor of Philosophy
Language: English
Published: Nanyang Technological University 2023
Subjects:
Online Access:https://hdl.handle.net/10356/164454
Institution: Nanyang Technological University
Description
Summary: The ubiquity of software in modern society and the boom in open-source software have ushered software engineering into the “big code” era. Code-related data are available at a massive scale (e.g., billions of lines of code and millions of code changes, bug fixes, and code documents), which raises a hot topic in both academia and industry: how to adopt data-driven approaches (e.g., deep learning) to solve conventional software engineering (SE) problems. Many works treat programs as sequential text, analogous to natural language, and apply sequence-based models such as RNNs, LSTMs, or Transformers to learn program semantics. However, although a program can roughly be viewed as a flat sequence, it also embodies rich structures such as the abstract syntax tree, control flow, and data flow. In this thesis, we explore how to encode program structures beyond the program text with deep learning techniques to learn program semantics for various applications.

First, we explore how to inject program structures into sequence-based models such as LSTMs to improve how they learn program semantics. With the rapid growth of open-source software, millions of projects are now hosted on platforms such as GitHub, and describing the code changes between software versions in natural language (i.e., commit messages) can greatly help users and developers understand version evolution quickly. Hence, we target commit message generation. Specifically, since the abstract syntax tree (AST) is the first representation a parser builds to understand a program's semantics, we extract a set of AST paths underlying the code changes and feed them into an LSTM-based encoder-decoder framework for message generation (a sketch of the path encoder appears below). We evaluate our approach against sequence-based techniques that feed the text of the changed code to LSTMs or their variants. The results demonstrate that extracting the structures hidden in a program helps the neural model learn program semantics and yields more accurate results.

Second, besides the AST, a program can also be represented by other structures such as the control flow graph and the data flow graph. We propose Devign, a composite program representation technique that combines the code property graph (CPG), consisting of the abstract syntax tree (AST), the control flow graph (CFG), and the data flow graph (DFG), with the natural code sequence (NCS), and leverages graph neural networks (GNNs) for vulnerability identification. Software vulnerability identification is a complicated task that requires the model to learn comprehensive program semantics for accurate identification. To address the lack of a public large-scale dataset of real vulnerabilities, we design a data-collection algorithm with a cross-validation process involving four security experts to ensure reliability, and we have made part of our data public to benefit academia and industry. In addition, we propose a convolution module on top of a gated graph neural network (GGNN) to learn fine-grained vulnerability features (sketched below). Extensive experimental results in terms of accuracy and F1 confirm that combining all of these program structures outperforms using any single one of them and achieves the best performance for vulnerability identification.
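As a concrete illustration of the first contribution, the sketch below encodes a set of AST paths with an LSTM and pools them into a single change representation that an encoder-decoder could consume. It is a minimal sketch assuming PyTorch; the vocabulary size, dimensions, and mean-pooling choice are illustrative assumptions, not the thesis's actual configuration.

```python
import torch
import torch.nn as nn

class ASTPathEncoder(nn.Module):
    """Encode each AST path with an LSTM, then pool the path set."""
    def __init__(self, vocab_size=1000, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

    def forward(self, paths):
        # paths: (num_paths, path_len) ids of AST node types along each path
        _, (h, _) = self.lstm(self.embed(paths))
        # h[-1]: (num_paths, hidden_dim); mean-pool into one change vector
        # that a decoder can condition on when generating the commit message.
        return h[-1].mean(dim=0)

encoder = ASTPathEncoder()
fake_paths = torch.randint(0, 1000, (5, 12))  # 5 AST paths of length 12
print(encoder(fake_paths).shape)              # torch.Size([256])
```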
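For the second contribution, the following sketch shows the general shape of a Devign-style pipeline: gated message passing over a code property graph, then a 1-D convolution readout for graph-level classification. Dense adjacency, a single edge type, and all sizes are simplifying assumptions for illustration, not the published Devign implementation.

```python
import torch
import torch.nn as nn

class GatedGraphLayer(nn.Module):
    """GGNN-style propagation: neighbour messages plus a GRU state update."""
    def __init__(self, dim, steps=4):
        super().__init__()
        self.steps = steps
        self.msg = nn.Linear(dim, dim)
        self.gru = nn.GRUCell(dim, dim)

    def forward(self, x, adj):
        # x: (num_nodes, dim); adj: (num_nodes, num_nodes) dense adjacency
        for _ in range(self.steps):
            m = adj @ self.msg(x)   # aggregate messages from neighbours
            x = self.gru(m, x)      # gated update of each node state
        return x

class ConvReadout(nn.Module):
    """1-D convolution over node states, then pooling to a graph label."""
    def __init__(self, dim):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, kernel_size=3, padding=1)
        self.cls = nn.Linear(dim, 2)  # vulnerable vs. non-vulnerable

    def forward(self, x):
        h = self.conv(x.t().unsqueeze(0)).squeeze(0).t()
        return self.cls(h.max(dim=0).values)

nodes, dim = 10, 64
x, adj = torch.randn(nodes, dim), (torch.rand(nodes, nodes) > 0.7).float()
logits = ConvReadout(dim)(GatedGraphLayer(dim)(x, adj))
print(logits.shape)  # torch.Size([2])
```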
Third, source code summarization aims to describe the functionality of a code snippet in natural language, and an accurate code summarization system can help developers understand the functional logic without reading the code. A reliable system requires the model to learn program semantics well enough to generate an accurate description. Inspired by our previous work Devign, we utilize the CPG, which incorporates the AST, CFG, and DFG, to represent the code snippet. We further propose a graph-based encoder-decoder model, HGNN, for code summarization, where the input is the code property graph and the output is the code summary. To improve the model's learning capacity, we propose global attention over every pair of nodes in the graph, mitigating the limitation of conventional GNN models, which capture only local neighbourhood information (sketched below). In addition, we design a retrieval-based augmentation mechanism that retrieves the source code most similar to the current program from a retrieval database and combines the retrieved code and its summary to further improve the quality of the generated summary. Extensive evaluation demonstrates that HGNN is more powerful than conventional GNNs such as the graph convolutional network (GCN) and the graph attention network (GAT), and that our code summarization system generates higher-quality summaries.

Fourth, code search aims to retrieve semantically equivalent code snippets from a large code corpus given a natural language query. This is especially critical in the “big code” era, since studies have confirmed that more than 90% of software developers' effort goes into reusing existing code. Besides constructing a program graph with various types of edges over the AST to represent a program, we further convert the natural language query into a dependency graph that explicitly describes the token relations within it, and we design GraphSearchNet, a neural architecture that treats both the program and the query as graphs and combines a bidirectional gated graph neural network (BiGGNN) with multi-head attention for semantic code search (a matching sketch appears below). Extensive experiments against 11 baseline approaches demonstrate that jointly learning the structures behind the program and the natural language query makes their semantic mapping easy to capture.
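The global attention in the third contribution can be pictured as plain multi-head self-attention over all graph nodes, so every node attends to every other node instead of only its GNN neighbourhood. Below is a minimal sketch assuming PyTorch; fusing by addition and the dimensions are illustrative assumptions, not HGNN's exact design.

```python
import torch
import torch.nn as nn

nodes, dim = 16, 64
x = torch.randn(1, nodes, dim)                 # (batch, nodes, dim) node states
global_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

local_out = x                                  # stand-in for a local GNN layer's output
global_out, weights = global_attn(x, x, x)     # attention over every node pair
fused = local_out + global_out                 # combine local and global views
print(fused.shape, weights.shape)              # (1, 16, 64) (1, 16, 16)
```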
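For the fourth contribution, the matching setup can be sketched as two graph encoders, one for the program graph and one for the query's dependency graph, whose pooled vectors are compared by cosine similarity to rank candidates. The mean-pool encoder here is a hypothetical stand-in for the BiGGNN-plus-attention encoders, not GraphSearchNet's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphEncoder(nn.Module):
    """Project node features and mean-pool them into one graph vector."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, node_feats):                    # (num_nodes, dim)
        return self.proj(node_feats).mean(dim=0)      # pooled graph vector

dim = 64
code_enc, query_enc = GraphEncoder(dim), GraphEncoder(dim)
code_graphs = [torch.randn(n, dim) for n in (8, 12, 5)]  # tiny code corpus
query_graph = torch.randn(6, dim)                         # query dependency graph

q = query_enc(query_graph)
scores = torch.stack([F.cosine_similarity(q, code_enc(g), dim=0)
                      for g in code_graphs])
print(scores.argmax().item())  # index of the best-matching snippet
```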
Finally, the aforementioned models are all trained in a supervised manner. Recently, however, unsupervised techniques have demonstrated their superiority over supervised ones, and large-scale pre-trained code models such as CodeBERT and GraphCodeBERT have provided significant improvements on various downstream software engineering tasks such as clone detection and code translation. Specifically, GraphCodeBERT incorporates the program's data flow into the neural network during pre-training to learn general code representations and has achieved state-of-the-art performance on different downstream tasks. We explore further questions about these large-scale pre-trained code models, such as their robustness. Our preliminary findings show that these models are not robust to label-preserving program mutations: a simple variable renaming can degrade their performance significantly, which indicates that program semantics are not captured well by these pre-trained models.

To address this limitation, we propose ContraBERT, a framework that enhances the robustness of existing pre-trained models. We design a set of data augmentation operators to construct different program variants, combine these variants with the original data, and further train the existing pre-trained models with masked language modeling (MLM) and an InfoNCE contrastive objective to enhance robustness (a minimal loss sketch follows). Experiments on the open-source benchmark CodeXGLUE confirm that the robustness of the pre-trained models is improved. In addition, the robustness-enhanced models provide significant improvements on various downstream tasks compared with the original models.
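As an illustration of the training objective, the sketch below computes a standard InfoNCE loss in the ContraBERT spirit: the embeddings of a program and of its label-preserving variant (e.g., after variable renaming) form a positive pair, while the other programs in the batch act as negatives. The temperature value and shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def info_nce(orig, aug, temperature=0.07):
    # orig, aug: (batch, dim) embeddings of programs and their variants;
    # the positive pair for row i sits on the diagonal of the logits.
    orig, aug = F.normalize(orig, dim=1), F.normalize(aug, dim=1)
    logits = orig @ aug.t() / temperature     # pairwise cosine similarities
    targets = torch.arange(orig.size(0))      # matching variant is the label
    return F.cross_entropy(logits, targets)

batch, dim = 8, 128
loss = info_nce(torch.randn(batch, dim), torch.randn(batch, dim))
print(loss.item())
```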