HYDRA: Massively compositional model for cross-project defect prediction

Most software defect prediction approaches are trained and applied on data from the same project. However, often a new project does not have enough training data. Cross-project defect prediction, which uses data from other projects to predict defects in a particular project, provides a new perspecti...

Bibliographic Details
Main Authors:	XIA, Xin; LO, David; PAN, Sinno Jialin; NAGAPPAN, Nachiappan; WANG, Xinyu
Format:	text
Language:	English
Published:	Institutional Knowledge at Singapore Management University 2016
Subjects:	Ensemble Learning; Cross-project Defect Prediction; Transfer Learning; Genetic Algorithm; Computer Sciences; Software Engineering; Theory and Algorithms
Online Access:	https://ink.library.smu.edu.sg/sis_research/3415 https://ink.library.smu.edu.sg/context/sis_research/article/4416/viewcontent/HYDRA_Massively_2016_afv.pdf
Institution:	Singapore Management University

id	sg-smu-ink.sis_research-4416
record_format	dspace
spelling	sg-smu-ink.sis_research-4416 2017-03-31T09:31:09Z HYDRA: Massively compositional model for cross-project defect prediction XIA, Xin; LO, David; PAN, Sinno Jialin; NAGAPPAN, Nachiappan; WANG, Xinyu 2016-10-01T07:00:00Z text application/pdf https://ink.library.smu.edu.sg/sis_research/3415 info:doi/10.1109/TSE.2016.2543218 https://ink.library.smu.edu.sg/context/sis_research/article/4416/viewcontent/HYDRA_Massively_2016_afv.pdf http://creativecommons.org/licenses/by-nc-nd/4.0/ Research Collection School Of Computing and Information Systems eng Institutional Knowledge at Singapore Management University
institution	Singapore Management University
building	SMU Libraries
continent	Asia
country	Singapore
content_provider	SMU Libraries
collection	InK@SMU
language	English
topic	Ensemble Learning; Cross-project Defect Prediction; Transfer Learning; Genetic Algorithm; Computer Sciences; Software Engineering; Theory and Algorithms
description	Most software defect prediction approaches are trained and applied on data from the same project. However, a new project often does not have enough training data. Cross-project defect prediction, which uses data from other projects to predict defects in a particular project, provides a new perspective on defect prediction. In this work, we propose a HYbrid moDel Reconstruction Approach (HYDRA) for cross-project defect prediction, which includes two phases: a genetic algorithm (GA) phase and an ensemble learning (EL) phase. These two phases create a massive composition of classifiers. To examine the benefits of HYDRA, we perform experiments on 29 datasets from the PROMISE repository, which contain a total of 11,196 instances (i.e., Java classes) labeled as defective or clean. We experiment with logistic regression as the underlying classification algorithm of HYDRA. We compare our approach with the most recently proposed cross-project defect prediction approaches: TCA+ by Nam et al., Peters filter by Peters et al., GP by Liu et al., MO by Canfora et al., and CODEP by Panichella et al. Our results show that HYDRA achieves an average F1-score of 0.544. On average, across the 29 datasets, these results correspond to improvements in F1-score of 26.22%, 34.99%, 47.43%, 28.61%, and 30.14% over TCA+, Peters filter, GP, MO, and CODEP, respectively. In addition, HYDRA on average can discover 33% of all bugs if developers inspect the top 20% of lines of code, which improves on the best baseline approach (TCA+) by 44.41%. We also find that HYDRA improves the F1-score of Zero-R, which predicts all instances to be defective, by 5.42%, but improves on Zero-R by 58.65% when inspecting the top 20% of lines of code. In practice, Zero-R can be hard to use since it simply predicts all of the instances to be defective, and thus developers have to inspect all of the instances to find the defective ones. Moreover, we notice that the improvements of HYDRA over the other baseline approaches, both in terms of F1-score and when inspecting the top 20% of lines of code, are substantial, and in most cases the improvements are significant and have large effect sizes across the 29 datasets.
format	text
author	XIA, Xin; LO, David; PAN, Sinno Jialin; NAGAPPAN, Nachiappan; WANG, Xinyu
author_facet	XIA, Xin; LO, David; PAN, Sinno Jialin; NAGAPPAN, Nachiappan; WANG, Xinyu
author_sort	XIA, Xin
title	HYDRA: Massively compositional model for cross-project defect prediction
publisher	Institutional Knowledge at Singapore Management University
publishDate	2016
url	https://ink.library.smu.edu.sg/sis_research/3415 https://ink.library.smu.edu.sg/context/sis_research/article/4416/viewcontent/HYDRA_Massively_2016_afv.pdf
_version_	1770573193646440448

Similar Items