Data quality matters: A case study on data label correctness for security bug report prediction

In mining software repositories research, a large amount of data must be labeled to construct a predictive model, and the correctness of those labels substantially affects model performance. However, few studies have investigated the impact of mislabeled instances on a predictive model. To bridge this gap, we perform a case study on security bug report (SBR) prediction. We find that five publicly available datasets for SBR prediction contain many mislabeled instances, which leads to the poor performance of the SBR prediction models in recent studies (e.g., the work of Peters et al. and Shu et al.) and may mislead the research direction of SBR prediction. We first improve the label correctness of these five datasets by manually analyzing each bug report, identifying 749 SBRs that were originally mislabeled as non-SBRs (NSBRs). We then evaluate the impact of label correctness by comparing the performance of classification models on both the noisy (i.e., before our correction) and the clean (i.e., after our correction) datasets. The results show that the cleaned datasets improve the performance of classification models: the approaches proposed by Peters et al. and Shu et al. perform much better on the clean datasets than on the noisy ones, and, with the clean datasets, simple text classification models significantly outperform the security keyword-matrix-based approaches applied by Peters et al. and Shu et al.
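The abstract's core experiment is to train the same text classifier on two versions of a dataset, one with the original (noisy) labels and one with manually corrected (clean) labels, and compare performance. The paper itself does not include code here; below is a minimal sketch of that comparison, assuming a simple TF-IDF plus logistic regression classifier (one of the "simple text classification models" the abstract mentions, not the authors' exact pipeline) and hypothetical CSV files with 'text' and 'label' columns.

```python
# Minimal sketch (not the authors' pipeline): compare a simple SBR text
# classifier under noisy vs. manually cleaned labels.
# File names and column names are hypothetical placeholders.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split


def evaluate(csv_path: str) -> float:
    """Train/test a TF-IDF + logistic regression classifier; return F1."""
    df = pd.read_csv(csv_path)  # assumed columns: 'text', 'label' (1 = SBR)
    X_train, X_test, y_train, y_test = train_test_split(
        df["text"], df["label"],
        test_size=0.3, random_state=42, stratify=df["label"],
    )
    vec = TfidfVectorizer(stop_words="english", max_features=5000)
    clf = LogisticRegression(max_iter=1000, class_weight="balanced")
    clf.fit(vec.fit_transform(X_train), y_train)
    preds = clf.predict(vec.transform(X_test))
    return f1_score(y_test, preds)


# Same bug reports, two label sets: before and after manual correction.
print("noisy labels F1:", evaluate("chromium_noisy.csv"))  # hypothetical path
print("clean labels F1:", evaluate("chromium_clean.csv"))  # hypothetical path
```

Under the paper's finding, the second F1 score should be noticeably higher, since mislabeled NSBRs both corrupt training and penalize correct predictions at evaluation time.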

Bibliographic Details
Main Authors: WU, Xiaoxue, ZHENG, Wei, XIA, Xin, LO, David
Format: text
Language: English
Published: Institutional Knowledge at Singapore Management University, 2022
DOI: 10.1109/TSE.2021.3063727
License: CC BY-NC-ND 4.0 (http://creativecommons.org/licenses/by-nc-nd/4.0/)
Subjects: Computer bugs; Noise measurement; Predictive models; Security; Chromium; Tuning; Data models; Security bug report prediction; data quality; label correctness; Software Engineering
Online Access: https://ink.library.smu.edu.sg/sis_research/7436
https://ink.library.smu.edu.sg/context/sis_research/article/8439/viewcontent/DataQualityMatters_2022_av.pdf
Institution: Singapore Management University