Statistical and deep learning models for software engineering corpora

This dissertation focuses on proposing statistical and deep learning models for software engineering corpora to detect bugs in software systems. The dissertation aims to solve three main software engineering problems, i.e., bug localization (locating the potential buggy source files in a software pro...

Bibliographic Details
Main Author: HOANG, Van Duc Thong
Format: text
Language: English
Published: Institutional Knowledge at Singapore Management University 2020
Subjects: Software Engineering
Online Access: https://ink.library.smu.edu.sg/etd_coll/307
https://ink.library.smu.edu.sg/cgi/viewcontent.cgi?article=1313&context=etd_coll
Institution: Singapore Management University
id sg-smu-ink.etd_coll-1313
record_format dspace
institution Singapore Management University
building SMU Libraries
continent Asia
country Singapore
content_provider SMU Libraries
collection InK@SMU
language English
topic Software Engineering
description This dissertation proposes statistical and deep learning models for software engineering corpora to detect bugs in software systems. It addresses three main software engineering problems: bug localization (locating the potentially buggy source files in a software project given a bug report or failing test cases), just-in-time defect prediction (identifying potentially defective commits as they are introduced into a version control system), and bug fixing patch identification (identifying commits that repair bugs so that they can be propagated to versions maintained in parallel). The goal is to save developers’ time and effort in improving software system quality. I also propose a neural network model that learns a vector representation of code changes based on their commit messages. This vector representation captures the semantic intent of a code change and can be used to improve the performance of just-in-time defect prediction and bug fixing patch identification. It may also be applicable to many other software engineering problems related to code changes, such as tangled change prediction and the recommendation of a code reviewer for a patch. The dissertation develops one statistical model and three deep learning models. The first is a statistical model: a novel multi-modal approach to the bug localization problem. It uses information from both bug reports and program spectra (or program elements) to localize bugs in programs effectively. Unlike other multi-modal approaches for bug localization, which treat bug reports (or program elements) as independent, my approach considers similarities between bug reports (or program elements), on the premise that similar bugs should have model parameters that are close together. It employs network Lasso regularization to encourage the model parameters of similar bug reports (or program elements) to be close together. The second is a novel deep learning framework to find likely defective code early, a problem commonly referred to as Just-In-Time (JIT) defect prediction. Most existing JIT defect prediction approaches involve a manual feature engineering step, in which researchers propose a number of features extracted from commits (e.g., the number of deleted and added lines, the number of files, and information about authors and code reviewers). In contrast, I introduce an end-to-end deep learning framework, DeepJIT, which automatically extracts features from the commit messages and code changes in commits and then uses them to identify defects. The third is a hierarchical deep learning-based approach, PatchNet, for finding bug fixing patches in the Linux kernel. Bug fixing patch identification and JIT defect prediction are closely related, as they take the same type of data as input (i.e., commits to version control systems). While DeepJIT simply merges the removed and added code in a code change, PatchNet separates the removed and added code and takes the hierarchical structure of each into account. Finally, the fourth is a neural network model, CC2Vec, that learns a representation of code changes based on the semantic information in commit messages. Unlike DeepJIT and PatchNet, each of which solves one specific software engineering task (just-in-time defect prediction or bug fixing patch identification, respectively), CC2Vec produces a vector representation that captures the semantic meaning of code changes and can be used for a number of software engineering problems related to commits (e.g., just-in-time defect prediction, identification of bug fixing patches, and tangled change prediction).
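As a rough illustration of the network Lasso regularization mentioned in the description, the sketch below computes the standard network Lasso penalty, lam * sum over edges (j, k) of w_jk * ||x_j - x_k||_2, which grows as the parameter vectors of linked (similar) items move apart. This is not the dissertation's implementation: the function name, the toy parameter vectors, the edge weights, and the value of lam are all assumptions made for the example.

```python
import numpy as np

def network_lasso_penalty(params, edges, lam=1.0):
    """Return lam * sum over edges (j, k, w) of w * ||params[j] - params[k]||_2.

    params: array of shape (n_items, n_features), one parameter vector per item
            (here, hypothetically, one per bug report).
    edges:  iterable of (j, k, w) with w a non-negative similarity weight.
    """
    total = 0.0
    for j, k, w in edges:
        total += w * np.linalg.norm(params[j] - params[k])
    return lam * total

# Three hypothetical bug reports, each with a 2-dimensional parameter vector.
params = np.array([[1.0, 0.0], [1.0, 0.0], [4.0, 4.0]])
# Hypothetical similarity graph: reports 0 and 1 are strongly linked,
# reports 1 and 2 more weakly.
edges = [(0, 1, 1.0), (1, 2, 0.5)]

penalty = network_lasso_penalty(params, edges, lam=2.0)
print(penalty)
```

Reports 0 and 1 share identical parameters and contribute nothing to the penalty, while the weaker link between reports 1 and 2 is penalized in proportion to the distance between their parameter vectors. Adding this term to a per-item loss is what pulls the parameters of similar items together.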
format text
author HOANG, Van Duc Thong
title Statistical and deep learning models for software engineering corpora
publisher Institutional Knowledge at Singapore Management University
publishDate 2020
url https://ink.library.smu.edu.sg/etd_coll/307
https://ink.library.smu.edu.sg/cgi/viewcontent.cgi?article=1313&context=etd_coll
_version_ 1712300951982833664
spelling sg-smu-ink.etd_coll-1313 2021-03-17T09:14:54Z Statistical and deep learning models for software engineering corpora HOANG, Van Duc Thong 2020-08-01T07:00:00Z text application/pdf https://ink.library.smu.edu.sg/etd_coll/307 https://ink.library.smu.edu.sg/cgi/viewcontent.cgi?article=1313&context=etd_coll http://creativecommons.org/licenses/by-nc-nd/4.0/ Dissertations and Theses Collection (Open Access) eng Institutional Knowledge at Singapore Management University Software Engineering