Statistical and deep learning models for software engineering corpora
Main Author: | |
---|---|
Format: | text |
Language: | English |
Published: | Institutional Knowledge at Singapore Management University, 2020 |
Subjects: | |
Online Access: | https://ink.library.smu.edu.sg/etd_coll/307 https://ink.library.smu.edu.sg/cgi/viewcontent.cgi?article=1313&context=etd_coll |
Institution: | Singapore Management University |
Summary: | This dissertation proposes statistical and deep learning models for software engineering corpora to detect bugs in software systems. It addresses three main software engineering problems: bug localization (locating the potentially buggy source files in a software project given a bug report or failing test cases), just-in-time defect prediction (identifying potentially defective commits as they are introduced into a version control system), and bug fixing patch identification (identifying bug-repairing commits so that they can be propagated to versions maintained in parallel). The goal is to save developers' time and effort in improving software system quality. Moreover, I also propose a neural network model that learns a vector representation of code changes based on their commit messages. This vector representation captures the semantic intent of the code changes and can be used to improve the performance of just-in-time defect prediction and bug fixing patch identification. It may also be applicable to many other software engineering problems related to code changes, such as tangled change prediction and the recommendation of a code reviewer for a patch.
My dissertation develops one statistical model and three deep learning models for various software engineering tasks. The first is a statistical model: a novel multi-modal approach to the bug localization problem. It uses information from both bug reports and program spectra (or program elements) to effectively localize bugs in programs. Unlike other multi-modal approaches for bug localization, which treat bug reports (or program elements) as independent, my approach considers similarities between bug reports (or program elements), on the premise that similar bugs should have model parameters that are close together. To this end, it employs network Lasso regularization to encourage the model parameters of similar bug reports (or program elements) to be close together.
The second one presents a novel deep learning framework to find likely defective code early, a problem commonly referred to as Just-In-Time (JIT) defect prediction. While most existing JIT defect prediction approaches involve a manual feature engineering step, in which researchers propose features extracted from commits (e.g., the number of deleted and added lines, the number of files, and information about authors and code reviewers), I introduce an end-to-end deep learning framework, namely DeepJIT, which automatically extracts features from the commit messages and code changes in commits and then uses them to identify defects.
The third one introduces a hierarchical deep learning-based approach, namely PatchNet, to find bug fixing patches in the Linux kernel. Bug fixing patch identification and JIT defect prediction are closely related, as they take the same type of data as input (i.e., commits to version control systems). While DeepJIT simply merges the removed and added code in a code change, PatchNet separates the removed code from the added code and takes into account the hierarchical structure of each.
Finally, the last one presents a neural network model, namely CC2Vec, that learns a representation of code changes based on the semantic information in commit messages. Unlike DeepJIT and PatchNet, each of which solves only one specific software engineering task (just-in-time defect prediction or bug fixing patch identification, respectively), the vector representation captures the semantic meaning of code changes and can be used to solve a number of software engineering problems related to commits (e.g., just-in-time defect prediction, identification of bug fixing patches, and tangled change prediction). |
---|
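
The summary describes network Lasso regularization as a way to pull the model parameters of similar bug reports (or program elements) toward each other. As a rough illustration of that idea only, and not the dissertation's actual model or code, the sketch below computes a network Lasso style penalty in its commonly stated form: a weighted sum of Euclidean distances between the parameter vectors of connected items. The function name, the item names, the edge weights, and the `lam` parameter are all hypothetical.

```python
import math


def network_lasso_penalty(params, edges, lam=1.0):
    """Return lam * sum of w * ||params[i] - params[j]||_2 over edges (i, j, w).

    params: dict mapping an item id to its parameter vector (list of floats).
    edges:  iterable of (i, j, w) tuples; w is a non-negative similarity weight.
    lam:    overall regularization strength (hypothetical name).
    """
    total = 0.0
    for i, j, w in edges:
        diff = [a - b for a, b in zip(params[i], params[j])]
        total += w * math.sqrt(sum(d * d for d in diff))
    return lam * total


# Hypothetical example: three bug reports, each with a 2-d parameter vector.
params = {
    "bug_a": [1.0, 0.0],
    "bug_b": [1.0, 0.0],  # identical to bug_a, so this edge adds no penalty
    "bug_c": [4.0, 4.0],  # differs from bug_a, so this edge is penalized
}
# Edge weights stand in for a similarity score between two bug reports.
edges = [("bug_a", "bug_b", 0.9), ("bug_a", "bug_c", 0.1)]

penalty = network_lasso_penalty(params, edges, lam=2.0)
print(penalty)
```

In a full model this penalty would be added to a per-item loss and minimized jointly; because the distance is not squared, connected items with a large enough weight tend to end up sharing exactly the same parameters rather than merely similar ones.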