Leveraging deep learning techniques to secure software development

Bibliographic Details
Main Author: Siow, Jing Kai
Other Authors: Liu Yang
Format: Thesis-Doctor of Philosophy
Language: English
Published: Nanyang Technological University 2022
Online Access: https://hdl.handle.net/10356/160286
Institution: Nanyang Technological University
Description
Summary: Adversarial threats have grown rapidly in recent years, making software security increasingly important. Many processes in the software development life cycle seek to reduce the attack surface of the codebase, e.g., black-box testing, code reviews, and static code analysis. The key idea is to reduce implementation bugs, such as out-of-bounds bugs and memory leaks, during the development phase. However, these processes often require an intensive amount of resources, which reduces the productivity of developers. Automation of software development processes is therefore highly sought after. The vast amount of code-related data contributes massively to the domain of software security. With open-source information widely available, e.g., vulnerability databases, open-source codebases, and security patches, many data-driven approaches are employed to enhance the security of our cyberspace. This thesis presents my approach to enhancing software security throughout multiple development stages with code intelligence tasks. The main objective of this thesis is to increase the security of the codebase during software development by leveraging data-driven and deep learning techniques.

Vulnerabilities commonly arise in the software development phase and might persist after deployment. We propose a data-driven approach to increase the quality of the source code, reducing the number of security bugs and errors that might occur during the development phase. Specifically, we propose a deep learning approach, CORE, for automating the code review process. CORE employs a multi-level embedding layer to represent and learn the relevancy between source code and its submitted reviews. During inference, it suggests the most relevant reviews for a given code submission. Our experiments further show that CORE achieves up to 0.234 in MRR and 0.482 in Recall@10 at suggesting reviews.

Patch management is a common process in software security.
It ensures that all software is up to date and not exposed to known vulnerabilities. However, the set of officially published security patches is far from complete. Hence, we propose our work on patch curation and security patch identification, SPI, which aims to collect unofficial security patches that lurk silently in open-source projects. Because data-driven approaches require vast amounts of data, the available datasets of security patches are still lacking. To enable an effective patching strategy, we propose an approach to finding security patches amidst open-source projects through a sophisticated deep-learning mining pipeline. We propose a three-step process for identifying security patches: Keyword Filtering, Manual Verification, and Deep Learning Patch Identification. Our experiments demonstrate the high performance of SPI, which achieves up to 87.93% F1-score in identifying security patches. We further evaluate SPI in a production environment, showing that SPI can benefit both researchers and developers in future research and patch management.

Even though security measures are always in place, vulnerabilities still sneak past them and appear in published software. During the maintenance phase of software development, developers resolve vulnerabilities and bugs, ensuring that the codebase is secure and correct. However, the time taken to resolve a vulnerability is crucial, as this vulnerable period exposes the software to cyber-attacks and adversarial threats. Hence, to shorten this period, we present our approach to automated program repair, Ratchet. We employ a learning-based approach to repairing programs so that real patches can be generated effectively and without manual effort. Specifically, we present a deep learning-based transformer model that learns and generates patches from open-source projects.
Furthermore, we augment the generation process with retrieved information to enhance patch generation. Ratchet outperforms deep learning approaches on fault localization, with 39.8-96.4% accuracy, and on patch generation, with 18.4-46.4% repair accuracy.

Although deep learning techniques perform well on software engineering tasks, they can be built on various code representations. Different representations inherently convey different meanings and semantics of the source code. To facilitate future work on software security and code intelligence tasks, we conclude the thesis with an empirical study of code representations across three code intelligence tasks: Code Classification, Vulnerability Detection, and Clone Detection. Our study shows that graph representations are superior to other forms of code representation, indicating huge potential in representing source code as program graphs. This work serves as a foundation for potential research directions, enabling deeper investigation into better code representations.
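
The summary reports CORE's review-suggestion quality in terms of MRR and Recall@10. As a rough illustration of how such ranking metrics are commonly computed, here is a minimal Python sketch. It is not taken from the thesis: the function names, the sample data, and the choice to count a query as a hit when at least one relevant review appears in the top k are all assumptions made for this example.

```python
from typing import List, Sequence, Set, Tuple

# One query = (ranked list of suggested review ids, set of relevant review ids).
# This data layout is hypothetical and chosen only for illustration.
Query = Tuple[Sequence[str], Set[str]]


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """Return 1/rank of the first relevant item, or 0.0 if none was retrieved."""
    for position, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(queries: List[Query]) -> float:
    """Average the reciprocal rank over all queries."""
    return sum(reciprocal_rank(r, rel) for r, rel in queries) / len(queries)


def recall_at_k(queries: List[Query], k: int = 10) -> float:
    """Fraction of queries with at least one relevant item in the top k.

    Assumption: a hit-rate style definition, which matches settings where
    each code submission has a single ground-truth review.
    """
    hits = sum(1 for r, rel in queries if any(item in rel for item in r[:k]))
    return hits / len(queries)


# Made-up example: three code submissions with ranked review suggestions.
sample_queries: List[Query] = [
    (["r3", "r1", "r7"], {"r1"}),  # first relevant review at rank 2
    (["r2", "r9"], {"r2"}),        # first relevant review at rank 1
    (["r4", "r5"], {"r8"}),        # relevant review never suggested
]

if __name__ == "__main__":
    print("MRR:", mean_reciprocal_rank(sample_queries))
    print("Recall@10:", recall_at_k(sample_queries, k=10))
```

Under these definitions both metrics lie between 0 and 1, where 1 means a relevant review is always ranked first (MRR) or always appears in the top k (Recall@k).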