Leveraging deep learning techniques to secure software development

Adversarial threats have grown rapidly in recent years, making software security increasingly important. Many processes in the software development life cycle, e.g., black-box testing, code reviews, and static code analysis, seek to reduce attack surfaces in the codebase. The key idea is to...

Bibliographic Details
Main Author: Siow, Jing Kai
Other Authors: Liu Yang
Format: Thesis-Doctor of Philosophy
Language: English
Published: Nanyang Technological University 2022
Subjects: Engineering::Computer science and engineering::Computing methodologies::Artificial intelligence
Online Access: https://hdl.handle.net/10356/160286
Institution: Nanyang Technological University
id sg-ntu-dr.10356-160286
record_format dspace
building NTU Library
continent Asia
country Singapore
content_provider NTU Library
collection DR-NTU
topic Engineering::Computer science and engineering::Computing methodologies::Artificial intelligence
description Adversarial threats have grown rapidly in recent years, making software security increasingly important. Many processes in the software development life cycle, e.g., black-box testing, code reviews, and static code analysis, seek to reduce attack surfaces in the codebase. The key idea is to reduce implementation bugs, such as out-of-bounds bugs and memory leaks, during the development phase. However, these processes often require intensive resources, reducing developer productivity. Automation of software development processes is therefore highly sought after. The vast amount of code-related data contributes massively to the domain of software security. With open-source information widely available, e.g., vulnerability databases, open-source codebases, and security patches, many data-driven approaches are employed to enhance the security of cyberspace. This thesis presents my approach to enhancing software security across multiple development stages through code intelligence tasks. Its main objective is to increase the security of the codebase during software development by leveraging data-driven and deep learning techniques.
Vulnerabilities commonly arise in the software development phase and may persist after deployment. We propose a data-driven approach to improve the quality of source code, reducing the number of security bugs and errors introduced during development. Specifically, we propose CORE, a deep learning approach to automating the code review process. CORE employs a multi-level embedding layer to represent and learn the relevance between source code and its submitted reviews. At inference time, it suggests the most relevant reviews for a given code submission. Our experiments show that CORE achieves up to 0.234% in MRR and 0.482% in Recall@10 at suggesting reviews.
Patch management is a common process in software security. It ensures that all software is up to date and is not exposed to vulnerabilities. However, the set of officially published security patches is far from complete. Hence, we propose our work on patch curation and Security Patch Identification, SPI, which aims to collect unofficial security patches that lurk silently in open-source projects. Data-driven approaches need vast amounts of data, yet datasets of security patches are still lacking. To enable an effective patching strategy, we propose an approach to finding security patches in open-source projects through a deep learning mining pipeline. It identifies security patches in a three-step process: Keyword Filtering, Manual Verification, and Deep Learning Patch Identification. Our experiments demonstrate the high performance of SPI, which achieves up to 87.93% F1-score in identifying security patches. We further evaluate SPI in a production environment, showing that it can benefit both researchers and developers in future research and patch management.
Even though security measures are in place, vulnerabilities still slip past them and appear in published software. During the maintenance phase of software development, developers resolve vulnerabilities and bugs, ensuring that the codebase is secure and correct. However, the time taken to resolve a vulnerability is crucial, as this vulnerable period exposes the software to cyber-attacks and adversarial threats. To shorten this period, we present Ratchet, our approach to automated program repair. We employ a learning-based approach to repairing programs so that real patches can be generated effectively and without manual effort. Specifically, we present a deep learning-based transformer model that learns and generates patches from open-source projects. Furthermore, we augment the generation process with retrieval information to enhance patch generation. Ratchet outperforms deep learning approaches on fault localization, with 39.8-96.4% accuracy, and on patch generation, with 18.4-46.4% repair accuracy.
Although deep learning techniques perform well on software engineering tasks, various code representations can be employed. Different representations inherently convey different meanings and semantics of the source code. To facilitate future work on code intelligence tasks in software security and engineering, we conclude the thesis with an empirical study of code representations across three code intelligence tasks: Code Classification, Vulnerability Detection, and Clone Detection. Our study shows that graph representations are superior to other forms of code representation, showing great potential in representing source code as program graphs. This work serves as a foundation for potential research directions, enabling deeper investigation into better code representations.
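The description above reports CORE's review-suggestion quality as MRR and Recall@10. As a reading aid, the following is a minimal sketch of how these two ranking metrics are commonly computed. It is not taken from the thesis: the function names and the toy review identifiers are hypothetical, and the thesis may use a different variant of Recall@k (for example, a hit rate over queries).

```python
from typing import Hashable, Sequence, Set


def reciprocal_rank(ranked: Sequence[Hashable], relevant: Set[Hashable]) -> float:
    """Return 1/rank of the first relevant item, or 0.0 if none is retrieved."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(
    rankings: Sequence[Sequence[Hashable]],
    relevants: Sequence[Set[Hashable]],
) -> float:
    """Average the reciprocal rank over all queries (here: code submissions)."""
    scores = [reciprocal_rank(r, rel) for r, rel in zip(rankings, relevants)]
    return sum(scores) / len(scores)


def recall_at_k(ranked: Sequence[Hashable], relevant: Set[Hashable], k: int = 10) -> float:
    """Fraction of the relevant items that appear among the top k results."""
    if not relevant:
        return 0.0
    hits = len(set(ranked[:k]) & relevant)
    return hits / len(relevant)


if __name__ == "__main__":
    # Hypothetical toy data: two code submissions with ranked review suggestions.
    rankings = [["r3", "r1", "r2"], ["r5", "r4"]]
    relevants = [{"r1"}, {"r9"}]
    print(mean_reciprocal_rank(rankings, relevants))
    print(recall_at_k(rankings[0], relevants[0], k=10))
```

Both metrics lie between 0 and 1, with higher values meaning that relevant reviews are ranked closer to the top.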
spelling sg-ntu-dr.10356-160286 2022-08-01T05:07:18Z
Affiliation: School of Computer Science and Engineering; Cybersecurity Lab
Contact: yangliu@ntu.edu.sg
Degree: Doctor of Philosophy
Date: 2022-07-20T05:40:42Z
Citation: Siow, J. K. (2022). Leveraging deep learning techniques to secure software development. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/160286
DOI: 10.32657/10356/160286
License: This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).
File format: application/pdf