A fine-grained data set and analysis of tangling in bug fixing commits

Context: Tangled commits are changes to software that address multiple concerns at once. For researchers interested in bugs, tangled commits mean that they actually study not only bugs, but also other concerns irrelevant for the study of bugs. Objective: We want to improve our understanding of the pr...

Bibliographic Details
Main Authors: HERBOLD, Steffen, TRAUTSCH, Alexander, LEDEL, Benjamin, AGHAMOHAMMADI, Alireza, GHALEB, Taher Ahmed, KAUR CHAHAL, Kuljit, BOSSENMAIER, Tim, NAGARIA, Bhaveet, MAKEDONSKI, Philip, AHMADABADI, Matin Nili, SZABADOS, Kristóf, SPIEKER, Helge, MADEJA, Matej, HOY, Nathaniel G., TREUDE, Christoph, WANG, Shangwen, RODRÍGUEZ-PÉREZ, Gema, COLOMO-PALACIOS, Ricardo, VERDECCHIA, Roberto, SINGH, Paramvir
Format: text
Language: English
Published: Institutional Knowledge at Singapore Management University 2022
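Subjects: tangled changes; tangled commits; bug fix; manual validation; research turk; registered report; Databases and Information Systems; Software Engineering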
Online Access: https://ink.library.smu.edu.sg/sis_research/8762
https://ink.library.smu.edu.sg/context/sis_research/article/9765/viewcontent/s10664_021_10083_5.pdf
Institution: Singapore Management University
id sg-smu-ink.sis_research-9765
record_format dspace
spelling sg-smu-ink.sis_research-9765
2024-05-23T05:41:25Z
2022-11-01T07:00:00Z
text
application/pdf
info:doi/10.1007/s10664-021-10083-5
http://creativecommons.org/licenses/by-nc-nd/4.0/
Research Collection School Of Computing and Information Systems
eng
institution Singapore Management University
building SMU Libraries
continent Asia
country Singapore
content_provider SMU Libraries
collection InK@SMU
language English
topic tangled changes
tangled commits
bug fix
manual validation
research turk
registered report
Databases and Information Systems
Software Engineering
description Context: Tangled commits are changes to software that address multiple concerns at once. For researchers interested in bugs, tangled commits mean that they actually study not only bugs, but also other concerns irrelevant for the study of bugs. Objective: We want to improve our understanding of the prevalence of tangling and the types of changes that are tangled within bug fixing commits. Methods: We use a crowd sourcing approach for manual labeling to validate which changes contribute to bug fixes for each line in bug fixing commits. Each line is labeled by four participants. If at least three participants agree on the same label, we have consensus. Results: We estimate that between 17% and 32% of all changes in bug fixing commits modify the source code to fix the underlying problem. However, when we only consider changes to the production code files this ratio increases to 66% to 87%. We find that about 11% of lines are hard to label leading to active disagreements between participants. Due to confirmed tangling and the uncertainty in our data, we estimate that 3% to 47% of data is noisy without manual untangling, depending on the use case. Conclusion: Tangled commits have a high prevalence in bug fixes and can lead to a large amount of noise in the data. Prior research indicates that this noise may alter results. As researchers, we should be skeptics and assume that unvalidated data is likely very noisy, until proven otherwise.
format text
author HERBOLD, Steffen
TRAUTSCH, Alexander
LEDEL, Benjamin
AGHAMOHAMMADI, Alireza
GHALEB, Taher Ahmed
KAUR CHAHAL, Kuljit
BOSSENMAIER, Tim
NAGARIA, Bhaveet
MAKEDONSKI, Philip
AHMADABADI, Matin Nili
SZABADOS, Kristóf
SPIEKER, Helge
MADEJA, Matej
HOY, Nathaniel G.
TREUDE, Christoph
WANG, Shangwen
RODRÍGUEZ-PÉREZ, Gema
COLOMO-PALACIOS, Ricardo
VERDECCHIA, Roberto
SINGH, Paramvir
author_sort HERBOLD, Steffen
title A fine-grained data set and analysis of tangling in bug fixing commits
publisher Institutional Knowledge at Singapore Management University
publishDate 2022
url https://ink.library.smu.edu.sg/sis_research/8762
https://ink.library.smu.edu.sg/context/sis_research/article/9765/viewcontent/s10664_021_10083_5.pdf
_version_ 1814047521830338560
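
The consensus rule stated in the description (each line labeled by four participants, consensus when at least three agree on the same label) can be illustrated with a minimal sketch. The function name, the `min_agree` parameter, and the example label strings are assumptions made for illustration; they are not taken from the record or from the authors' tooling.

```python
from collections import Counter
from typing import Optional, Sequence


def consensus_label(labels: Sequence[str], min_agree: int = 3) -> Optional[str]:
    """Return the label shared by at least `min_agree` participants, else None.

    Hypothetical helper: the record only states the rule (four labels per
    line, consensus at three or more matching labels), not an implementation.
    """
    if not labels:
        return None
    # most_common(1) yields the single most frequent label and its count.
    label, count = Counter(labels).most_common(1)[0]
    return label if count >= min_agree else None


# Example label strings are made up for illustration.
print(consensus_label(["bugfix", "bugfix", "bugfix", "test"]))  # three of four agree
print(consensus_label(["bugfix", "bugfix", "test", "test"]))    # two-two split, no consensus
```

With four labels per line and a threshold of three, at most one label can reach consensus, so taking the single most frequent label is sufficient.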