MAHAKIL: Diversity based Oversampling Approach to Alleviate the Class Imbalance Issue in Software Defect Prediction

IEEE Highly imbalanced data typically make accurate predictions difficult. Unfortunately, software defect datasets tend to have fewer defective modules than non-defective modules. Synthetic oversampling approaches address this concern by creating new minority defective modules to balance the class d...

Bibliographic Details
Main Authors: Kwabena Ebo Bennin, Jacky Keung, Passakorn Phannachitta, Akito Monden, Solomon Mensah
Format: Journal
Published: 2018
Subjects:
Online Access: https://www.scopus.com/inward/record.uri?partnerID=HzOxMe3b&scp=85028936214&origin=inward
http://cmuir.cmu.ac.th/jspui/handle/6653943832/46651
Institution: Chiang Mai University
id th-cmuir.6653943832-46651
record_format dspace
spelling th-cmuir.6653943832-46651 2018-04-25T07:22:24Z MAHAKIL: Diversity based Oversampling Approach to Alleviate the Class Imbalance Issue in Software Defect Prediction Kwabena Ebo Bennin Jacky Keung Passakorn Phannachitta Akito Monden Solomon Mensah Agricultural and Biological Sciences IEEE Highly imbalanced data typically make accurate predictions difficult. Unfortunately, software defect datasets tend to have fewer defective modules than non-defective modules. Synthetic oversampling approaches address this concern by creating new minority defective modules to balance the class distribution before a model is trained. Notwithstanding the successes achieved by these approaches, they mostly result in over-generalization (high rates of false alarms) and generate near-duplicated data instances (less diverse data). In this study, we introduce MAHAKIL, a novel and efficient synthetic oversampling approach for software defect datasets that is based on the chromosomal theory of inheritance. Exploiting this theory, MAHAKIL interprets two distinct sub-classes as parents and generates a new instance that inherits different traits from each parent and contributes to the diversity within the data distribution. We extensively compare MAHAKIL with SMOTE, Borderline-SMOTE, ADASYN, Random Oversampling and the No sampling approach using 20 releases of defect datasets from the PROMISE repository and five prediction models. Our experiments indicate that MAHAKIL improves the prediction performance for all the models and achieves better and more significant pf values than the other oversampling approaches, based on Brunner's statistical significance test and Cliff's effect sizes. Therefore, MAHAKIL is strongly recommended as an efficient alternative for defect prediction models built on highly imbalanced datasets.
2018-04-25T06:59:02Z 2018-04-25T06:59:02Z 2017-07-25 Journal 00985589 2-s2.0-85028936214 10.1109/TSE.2017.2731766 https://www.scopus.com/inward/record.uri?partnerID=HzOxMe3b&scp=85028936214&origin=inward http://cmuir.cmu.ac.th/jspui/handle/6653943832/46651
institution Chiang Mai University
building Chiang Mai University Library
country Thailand
collection CMU Intellectual Repository
topic Agricultural and Biological Sciences
spellingShingle Agricultural and Biological Sciences
Kwabena Ebo Bennin
Jacky Keung
Passakorn Phannachitta
Akito Monden
Solomon Mensah
MAHAKIL: Diversity based Oversampling Approach to Alleviate the Class Imbalance Issue in Software Defect Prediction
description IEEE Highly imbalanced data typically make accurate predictions difficult. Unfortunately, software defect datasets tend to have fewer defective modules than non-defective modules. Synthetic oversampling approaches address this concern by creating new minority defective modules to balance the class distribution before a model is trained. Notwithstanding the successes achieved by these approaches, they mostly result in over-generalization (high rates of false alarms) and generate near-duplicated data instances (less diverse data). In this study, we introduce MAHAKIL, a novel and efficient synthetic oversampling approach for software defect datasets that is based on the chromosomal theory of inheritance. Exploiting this theory, MAHAKIL interprets two distinct sub-classes as parents and generates a new instance that inherits different traits from each parent and contributes to the diversity within the data distribution. We extensively compare MAHAKIL with SMOTE, Borderline-SMOTE, ADASYN, Random Oversampling and the No sampling approach using 20 releases of defect datasets from the PROMISE repository and five prediction models. Our experiments indicate that MAHAKIL improves the prediction performance for all the models and achieves better and more significant pf values than the other oversampling approaches, based on Brunner's statistical significance test and Cliff's effect sizes. Therefore, MAHAKIL is strongly recommended as an efficient alternative for defect prediction models built on highly imbalanced datasets.
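The description above says only that two distinct sub-classes of the minority data act as parents and that each new instance inherits traits from both. The following Python sketch illustrates that idea under stated assumptions and should not be read as the authors' implementation. The function name `diversity_oversample`, the use of Mahalanobis distance to split the minority class into two groups, the averaging of paired parents, and the way later generations are formed are all assumptions made for illustration; none of them is specified in this record.

```python
import numpy as np


def diversity_oversample(X_min, n_new):
    """Hypothetical sketch: build n_new synthetic rows from minority rows X_min."""
    X_min = np.asarray(X_min, dtype=float)
    if len(X_min) < 2:
        raise ValueError("need at least two minority samples")

    # Assumption: rank samples by squared Mahalanobis distance from the
    # minority-class mean, so the two parent groups differ from each other.
    centered = X_min - X_min.mean(axis=0)
    cov = np.atleast_2d(np.cov(X_min, rowvar=False))
    inv_cov = np.linalg.pinv(cov)  # pseudo-inverse tolerates a singular covariance
    dist = np.einsum("ij,jk,ik->i", centered, inv_cov, centered)
    order = np.argsort(-dist)  # farthest from the mean first

    # Split the ranked samples into two equally sized "parent" sub-classes.
    half = len(order) // 2
    parents_a = X_min[order[:half]]
    parents_b = X_min[order[half:2 * half]]

    # Assumption: a child is the feature-wise average of one parent from each
    # group. Later generations here pair the previous children with group B,
    # which is a simplification chosen to keep the sketch short.
    children = []
    current = parents_a
    while len(children) < n_new:
        current = (current + parents_b) / 2.0
        children.extend(current)
    return np.array(children[:n_new])


if __name__ == "__main__":
    # Toy minority class with two features (made-up values, not defect data).
    X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 5.0],
                  [4.0, 4.0], [5.0, 7.0], [6.0, 6.0]])
    print(diversity_oversample(X, 5))
```

Because every child is an average of existing rows, the synthetic instances stay inside the range of the original minority data. In practice the synthetic rows would be appended to the training set only, never to the test set.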
format Journal
author Kwabena Ebo Bennin
Jacky Keung
Passakorn Phannachitta
Akito Monden
Solomon Mensah
author_facet Kwabena Ebo Bennin
Jacky Keung
Passakorn Phannachitta
Akito Monden
Solomon Mensah
author_sort Kwabena Ebo Bennin
title MAHAKIL: Diversity based Oversampling Approach to Alleviate the Class Imbalance Issue in Software Defect Prediction
title_short MAHAKIL: Diversity based Oversampling Approach to Alleviate the Class Imbalance Issue in Software Defect Prediction
title_full MAHAKIL: Diversity based Oversampling Approach to Alleviate the Class Imbalance Issue in Software Defect Prediction
title_fullStr MAHAKIL: Diversity based Oversampling Approach to Alleviate the Class Imbalance Issue in Software Defect Prediction
title_full_unstemmed MAHAKIL: Diversity based Oversampling Approach to Alleviate the Class Imbalance Issue in Software Defect Prediction
title_sort mahakil:diversity based oversampling approach to alleviate the class imbalance issue in software defect prediction
publishDate 2018
url https://www.scopus.com/inward/record.uri?partnerID=HzOxMe3b&scp=85028936214&origin=inward
http://cmuir.cmu.ac.th/jspui/handle/6653943832/46651
_version_ 1681422914008645632