The Significant Effects of Data Sampling Approaches on Software Defect Prioritization and Classification

© 2017 IEEE. Context: Recent studies have shown that the performance of defect prediction models can be affected when data sampling approaches are applied to the imbalanced training data used to build them. However, the magnitude (degree and power) of the effect of these sampling methods...

Bibliographic Details
Main Authors:	Kwabena Ebo Bennin, Jacky Keung, Akito Monden, Passakorn Phannachitta, Solomon Mensah
Format:	Conference Proceeding
Published:	2018
Subjects:	Computer Science
Online Access:	https://www.scopus.com/inward/record.uri?partnerID=HzOxMe3b&scp=85042378748&origin=inward http://cmuir.cmu.ac.th/jspui/handle/6653943832/57025
Institution:	Chiang Mai University

id	th-cmuir.6653943832-57025
record_format	dspace
spelling	th-cmuir.6653943832-57025 2018-09-05T03:34:06Z 2017-12-07 Conference Proceeding 19493789 19493770 2-s2.0-85042378748 10.1109/ESEM.2017.50
institution	Chiang Mai University
building	Chiang Mai University Library
country	Thailand
collection	CMU Intellectual Repository
topic	Computer Science
description	© 2017 IEEE. Context: Recent studies have shown that the performance of defect prediction models can be affected when data sampling approaches are applied to the imbalanced training data used to build them. However, the magnitude (degree and power) of the effect of these sampling methods on the classification and prioritization performance of defect prediction models is still unknown. Goal: To investigate the statistical and practical significance of using resampled data for constructing defect prediction models. Method: We examine the practical effects of six data sampling methods on the performance of five defect prediction models. The prediction performance of models trained on default datasets (no sampling method) is compared with that of models trained on resampled datasets (sampling methods applied). To decide whether the performance changes are significant, robust statistical tests are performed and effect sizes computed. Twenty releases of ten open source projects extracted from the PROMISE repository are considered and evaluated using the AUC, pd, pf and G-mean performance measures. Results: There are statistically significant differences and practical effects on classification performance (pd, pf and G-mean) between models trained on resampled datasets and those trained on the default datasets. However, sampling methods have no statistically or practically significant effect on defect prioritization performance (AUC), with small or no effect sizes obtained from the models trained on the resampled datasets. Conclusions: Existing sampling methods can properly set the threshold between buggy and clean samples, but they cannot improve the prediction of defect-proneness itself. Sampling methods are highly recommended for defect classification purposes when all faulty modules are to be considered for testing.
format	Conference Proceeding
author	Kwabena Ebo Bennin Jacky Keung Akito Monden Passakorn Phannachitta Solomon Mensah
author_sort	Kwabena Ebo Bennin
title	The Significant Effects of Data Sampling Approaches on Software Defect Prioritization and Classification
publishDate	2018
url	https://www.scopus.com/inward/record.uri?partnerID=HzOxMe3b&scp=85042378748&origin=inward http://cmuir.cmu.ac.th/jspui/handle/6653943832/57025
_version_	1681424801797767168
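
The abstract names four measures (pd, pf, G-mean and AUC) without defining them. The following is a minimal sketch, not the authors' code, of how these measures are commonly computed. It assumes pd is the recall on buggy modules, pf is the false alarm rate on clean modules, and G-mean is sqrt(pd * (1 - pf)); the record itself does not state these definitions. The labels, scores and thresholds are hypothetical toy values. Lowering the decision threshold is used only as a stand-in for the kind of threshold shift the conclusions attribute to sampling; no resampling is performed here.

```python
from math import sqrt


def classification_measures(labels, scores, threshold):
    """Return (pd, pf, g_mean) for one decision threshold.

    labels: 1 = buggy, 0 = clean. Assumes both classes are present,
    otherwise the divisions below would fail.
    """
    tp = fp = tn = fn = 0
    for y, s in zip(labels, scores):
        predicted_buggy = s >= threshold
        if y == 1 and predicted_buggy:
            tp += 1
        elif y == 1:
            fn += 1
        elif predicted_buggy:
            fp += 1
        else:
            tn += 1
    pd = tp / (tp + fn)  # probability of detection (recall on buggy modules)
    pf = fp / (fp + tn)  # probability of false alarm
    return pd, pf, sqrt(pd * (1 - pf))


def auc(labels, scores):
    """Rank-based AUC: share of (buggy, clean) pairs ranked correctly.

    Ties count as half a correct pair. No threshold is involved.
    """
    buggy = [s for y, s in zip(labels, scores) if y == 1]
    clean = [s for y, s in zip(labels, scores) if y == 0]
    correct = sum(
        1.0 if b > c else 0.5 if b == c else 0.0
        for b in buggy
        for c in clean
    )
    return correct / (len(buggy) * len(clean))


# Hypothetical toy data: 3 buggy and 7 clean modules (imbalanced).
labels = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
scores = [0.70, 0.40, 0.30, 0.45, 0.35, 0.20, 0.15, 0.10, 0.10, 0.05]

default_measures = classification_measures(labels, scores, threshold=0.50)
lowered_measures = classification_measures(labels, scores, threshold=0.25)
ranking_quality = auc(labels, scores)

print("threshold 0.50: pd=%.2f pf=%.2f G-mean=%.2f" % default_measures)
print("threshold 0.25: pd=%.2f pf=%.2f G-mean=%.2f" % lowered_measures)
print("AUC (same for both thresholds): %.3f" % ranking_quality)
```

In this toy example pd, pf and G-mean change when the threshold moves, while AUC depends only on how the scores rank the modules and is therefore identical for both thresholds. This is one way to read the distinction the abstract draws between classification and prioritization performance.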
