Towards More Accurate Retrieval of Duplicate Bug Reports

In a bug tracking system, different testers or users may submit multiple reports on the same bugs, referred to as duplicates, which may cost extra maintenance efforts in triaging and fixing bugs. In order to identify such duplicates accurately, in this paper we propose a retrieval function (REP) to...

Bibliographic Details
Main Authors:	SUN, Chengnian; LO, David; KHOO, Siau-Cheng; JIANG, Jing
Format:	text
Language:	English
Published:	Institutional Knowledge at Singapore Management University 2011
Subjects:	Software Engineering
Online Access:	https://ink.library.smu.edu.sg/sis_research/1402 http://doi.ieeecomputersociety.org/10.1109/ASE.2011.6100061
Institution:	Singapore Management University

id	sg-smu-ink.sis_research-2401
record_format	dspace
spelling	sg-smu-ink.sis_research-2401 2012-12-07T08:57:32Z Towards More Accurate Retrieval of Duplicate Bug Reports SUN, Chengnian; LO, David; KHOO, Siau-Cheng; JIANG, Jing 2011-11-01T07:00:00Z text https://ink.library.smu.edu.sg/sis_research/1402 info:doi/10.1109/ASE.2011.6100061 http://doi.ieeecomputersociety.org/10.1109/ASE.2011.6100061 Research Collection School Of Computing and Information Systems eng Institutional Knowledge at Singapore Management University Software Engineering
institution	Singapore Management University
building	SMU Libraries
continent	Asia
country	Singapore
content_provider	SMU Libraries
collection	InK@SMU
language	English
topic	Software Engineering
description	In a bug tracking system, different testers or users may submit multiple reports on the same bug. Such reports, referred to as duplicates, add maintenance effort to triaging and fixing bugs. To identify duplicates accurately, in this paper we propose a retrieval function (REP) that measures the similarity between two bug reports. REP fully uses the information available in a bug report, covering not only the similarity of textual content in the summary and description fields but also the similarity of non-textual fields such as product, component, and version. For more accurate measurement of textual similarity, we extend BM25F, an effective similarity formula from the information retrieval community, specifically for duplicate report retrieval. Lastly, we use a two-round stochastic gradient descent to automatically optimize REP for specific bug repositories in a supervised learning manner. We have validated our technique on three large software bug repositories from Mozilla, Eclipse and OpenOffice. The experiments show a 10–27% relative improvement in recall rate@k and a 17–23% relative improvement in mean average precision over our previous model. We also applied our technique to a very large dataset consisting of 209,058 reports from Eclipse, resulting in a recall rate@k of 37–71% and a mean average precision of 47%.
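The abstract reports results as recall rate@k and mean average precision. The record does not define either metric, so the following is only a minimal sketch under stated assumptions: each duplicate report has exactly one master report, recall rate@k is the fraction of duplicates whose master appears among the top k retrieved candidates, and average precision for a query with a single relevant item reduces to the reciprocal of the master's rank. The function names and the toy data are hypothetical and are not taken from the paper.

```python
from typing import Dict, List


def recall_rate_at_k(rankings: Dict[str, List[str]],
                     masters: Dict[str, str], k: int) -> float:
    """Fraction of duplicate reports whose master is in the top-k candidates."""
    hits = sum(1 for dup, ranked in rankings.items()
               if masters[dup] in ranked[:k])
    return hits / len(rankings)


def mean_average_precision(rankings: Dict[str, List[str]],
                           masters: Dict[str, str]) -> float:
    """Assumes one relevant report (the master) per query, so average
    precision is 1 / rank of the master, or 0 if it was not retrieved."""
    total = 0.0
    for dup, ranked in rankings.items():
        if masters[dup] in ranked:
            total += 1.0 / (ranked.index(masters[dup]) + 1)
    return total / len(rankings)


# Hypothetical toy data: ranked candidate lists for three duplicate reports.
rankings = {
    "d1": ["m1", "x"],       # master at rank 1
    "d2": ["x", "m2", "y"],  # master at rank 2
    "d3": ["x", "y", "z"],   # master not retrieved
}
masters = {"d1": "m1", "d2": "m2", "d3": "m3"}

print(recall_rate_at_k(rankings, masters, k=1))
print(recall_rate_at_k(rankings, masters, k=2))
print(mean_average_precision(rankings, masters))
```

Under these assumptions, recall rate@k can only stay level or rise as k grows, which is consistent with the abstract reporting it as a range rather than a single number.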
format	text
author	SUN, Chengnian; LO, David; KHOO, Siau-Cheng; JIANG, Jing
author_sort	SUN, Chengnian
title	Towards More Accurate Retrieval of Duplicate Bug Reports
title_sort	towards more accurate retrieval of duplicate bug reports
publisher	Institutional Knowledge at Singapore Management University
publishDate	2011
url	https://ink.library.smu.edu.sg/sis_research/1402 http://doi.ieeecomputersociety.org/10.1109/ASE.2011.6100061
_version_	1770571108592910336