Leveraging deep learning techniques to secure software development

Bibliographic Details
Main Author:	Siow, Jing Kai
Other Authors:	Liu Yang
Format:	Thesis-Doctor of Philosophy
Language:	English
Published:	Nanyang Technological University 2022
Subjects:	Engineering::Computer science and engineering::Computing methodologies::Artificial intelligence
Online Access:	https://hdl.handle.net/10356/160286
Institution:	Nanyang Technological University

id	sg-ntu-dr.10356-160286
record_format	dspace
institution	Nanyang Technological University
building	NTU Library
continent	Asia
country	Singapore
content_provider	NTU Library
collection	DR-NTU
language	English
topic	Engineering::Computer science and engineering::Computing methodologies::Artificial intelligence
spellingShingle	Engineering::Computer science and engineering::Computing methodologies::Artificial intelligence Siow, Jing Kai Leveraging deep learning techniques to secure software development
description	Adversarial threats have grown rapidly in recent years, making software security increasingly important. Many processes in the software development life cycle, e.g., black-box testing, code reviews, and static code analysis, seek to reduce attack surfaces in the codebase. The key idea is to reduce implementation bugs, such as out-of-bounds bugs and memory leaks, during the development phase. However, these processes often require intensive resources, which reduces developer productivity. Automation of software development processes is therefore highly sought after. The vast amount of code-related data now available contributes greatly to the domain of software security. With open-source information widely available, e.g., vulnerability databases, open-source codebases, and security patches, many data-driven approaches are employed to enhance the security of our cyberspace. This thesis presents my approach to enhancing software security across multiple development stages through code intelligence tasks. The main objective of this thesis is to increase the security of the codebase during software development by leveraging data-driven and deep learning techniques. Vulnerabilities commonly arise in the software development phase and might persist after deployment. We propose a data-driven approach to increase the quality of the source code, reducing the number of security bugs and errors that might occur during the development phase. Specifically, we propose a deep learning approach, CORE, to automate the code review process. CORE employs a multi-level embedding layer to represent and learn the relevancy between source code and its submitted reviews. During inference, it suggests the most relevant reviews for a given code submission. Our experiments further show that CORE achieves up to 0.234% in MRR and 0.482% in Recall@10 at suggesting reviews.
Patch management is a common process in software security. It ensures that all software is up to date and is not exposed to known vulnerabilities. However, the set of officially published security patches is far from complete. Hence, we propose our work in Patch Curation and Security Patch Identification, SPI, which aims to collect unofficial security patches that lurk silently in open-source projects. Because data-driven approaches require vast amounts of data, datasets of security patches are still lacking. To enable an effective patching strategy, we propose an approach to finding security patches in open-source projects through a sophisticated deep-learning mining pipeline. We propose a three-step process for identifying security patches: Keyword Filtering, Manual Verification, and Deep Learning Patch Identification. Our experiments demonstrate the high performance of SPI, which achieves up to 87.93% F1-score in identifying security patches. We further evaluate SPI in a production environment, showing that SPI can benefit both researchers and developers in future research and patch management.
Even though security measures are always in place, vulnerabilities still slip past them and appear in published software. During the maintenance phase of software development, developers resolve vulnerabilities and bugs, ensuring that the codebase is secure and correct. However, the time taken to resolve a vulnerability is crucial, as this vulnerable period exposes the software to cyber-attacks and adversarial threats. Hence, to shorten this period, we present our approach to automated program repair, Ratchet. We employ a learning-based approach to repairing programs so that real patches can be generated effectively and without manual effort. Specifically, we present a deep learning-based transformer model that learns and generates patches from open-source projects. Furthermore, we augment the generation process with retrieved information to enhance patch generation. Ratchet outperforms deep learning approaches on fault localization with 39.8-96.4% in accuracy and on patch generation with 18.4-46.4% in repair accuracy.
Although deep learning techniques perform well in software engineering tasks, various code representations can be employed. Different representations inherently convey different meanings and semantics of the source code. To facilitate future work on code intelligence tasks in software security and engineering, we conclude the thesis with an empirical study of code representations across three code intelligence tasks: Code Classification, Vulnerability Detection, and Clone Detection. Our study shows that graph representations are superior to other forms of code representation, showing huge potential in representing source code as program graphs. This work serves as a foundation for potential research directions, enabling deeper investigation into better code representations.
author2	Liu Yang
author_facet	Liu Yang Siow, Jing Kai
format	Thesis-Doctor of Philosophy
author	Siow, Jing Kai
author_sort	Siow, Jing Kai
title	Leveraging deep learning techniques to secure software development
title_short	Leveraging deep learning techniques to secure software development
title_full	Leveraging deep learning techniques to secure software development
title_fullStr	Leveraging deep learning techniques to secure software development
title_full_unstemmed	Leveraging deep learning techniques to secure software development
title_sort	leveraging deep learning techniques to secure software development
publisher	Nanyang Technological University
publishDate	2022
url	https://hdl.handle.net/10356/160286
_version_	1743119495646412800
spelling	sg-ntu-dr.10356-160286 2022-08-01T05:07:18Z Leveraging deep learning techniques to secure software development Siow, Jing Kai Liu Yang School of Computer Science and Engineering Cybersecurity Lab yangliu@ntu.edu.sg Engineering::Computer science and engineering::Computing methodologies::Artificial intelligence Doctor of Philosophy 2022-07-20T05:40:42Z 2022-07-20T05:40:42Z 2022 Thesis-Doctor of Philosophy Siow, J. K. (2022). Leveraging deep learning techniques to secure software development. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/160286 https://hdl.handle.net/10356/160286 10.32657/10356/160286 en This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0). application/pdf Nanyang Technological University
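
The description reports CORE's review-suggestion quality as MRR and Recall@10. As a rough illustration of how these two ranking metrics are commonly computed, the sketch below assumes that each code change has exactly one ground-truth review and that a model returns a ranked list of candidate review IDs. The function names and the sample data are hypothetical and are not taken from the thesis.

```python
from typing import Hashable, Sequence


def reciprocal_rank(ranked: Sequence[Hashable], relevant: Hashable) -> float:
    """Return 1/rank of the relevant item (ranks start at 1), or 0.0 if it is absent."""
    for position, candidate in enumerate(ranked, start=1):
        if candidate == relevant:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(
    rankings: Sequence[Sequence[Hashable]], relevants: Sequence[Hashable]
) -> float:
    """Average the reciprocal rank over all queries (here: code changes)."""
    scores = [reciprocal_rank(r, g) for r, g in zip(rankings, relevants)]
    return sum(scores) / len(scores)


def recall_at_k(
    rankings: Sequence[Sequence[Hashable]], relevants: Sequence[Hashable], k: int = 10
) -> float:
    """Fraction of queries whose relevant item appears among the top-k suggestions.

    Assumption: one relevant item per query, so per-query recall is 0 or 1.
    """
    hits = [g in list(r)[:k] for r, g in zip(rankings, relevants)]
    return sum(hits) / len(hits)


if __name__ == "__main__":
    # Hypothetical ranked review IDs for three code changes.
    rankings = [["r3", "r1", "r2"], ["r5", "r4"], ["r9", "r8", "r7"]]
    # Hypothetical ground-truth review for each change ("r0" is never suggested).
    relevants = ["r1", "r5", "r0"]
    print(mean_reciprocal_rank(rankings, relevants))
    print(recall_at_k(rankings, relevants, k=2))
```

Under this definition both metrics lie between 0 and 1, and a higher value means the correct review tends to appear nearer the top of the suggestion list.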
