Multi-Factor Duplicate Question Detection in Stack Overflow

Stack Overflow is a popular on-line question and answer site for software developers to share their experience and expertise. Among the numerous questions posted in Stack Overflow, two or more of them may express the same point and thus are duplicates of one another. Duplicate questions make Stack O...

Bibliographic Details
Main Authors:	ZHANG, Yun; LO, David; XIA, Xin; SUN, Jian Ling
Format:	text
Language:	English
Published:	Institutional Knowledge at Singapore Management University 2015
Subjects:	duplicate question; DupPredictor; software information site; Stack Overflow; Databases and Information Systems
Online Access:	https://ink.library.smu.edu.sg/sis_research/3195 https://ink.library.smu.edu.sg/context/sis_research/article/4196/viewcontent/jcst_duplicateqns_av.pdf
Institution:	Singapore Management University

id	sg-smu-ink.sis_research-4196
record_format	dspace
spelling	sg-smu-ink.sis_research-4196 2020-01-11T00:42:10Z Multi-Factor Duplicate Question Detection in Stack Overflow ZHANG, Yun; LO, David; XIA, Xin; SUN, Jian Ling 2015-09-01T07:00:00Z text application/pdf https://ink.library.smu.edu.sg/sis_research/3195 info:doi/10.1007/s11390-015-1576-4 https://ink.library.smu.edu.sg/context/sis_research/article/4196/viewcontent/jcst_duplicateqns_av.pdf http://creativecommons.org/licenses/by-nc-nd/4.0/ Research Collection School Of Computing and Information Systems eng Institutional Knowledge at Singapore Management University duplicate question DupPredictor software information site Stack Overflow Databases and Information Systems
institution	Singapore Management University
building	SMU Libraries
continent	Asia
country	Singapore
content_provider	SMU Libraries
collection	InK@SMU
language	English
topic	duplicate question; DupPredictor; software information site; Stack Overflow; Databases and Information Systems
spellingShingle	duplicate question DupPredictor software information site Stack Overflow Databases and Information Systems ZHANG, Yun David LO, XIA, Xin SUN, Jian Ling Multi-Factor Duplicate Question Detection in Stack Overflow
description	Stack Overflow is a popular on-line question and answer site for software developers to share their experience and expertise. Among the numerous questions posted in Stack Overflow, two or more of them may express the same point and thus are duplicates of one another. Duplicate questions make Stack Overflow site maintenance harder, waste resources that could have been used to answer other questions, and cause developers to unnecessarily wait for answers that are already available. To reduce the problem of duplicate questions, Stack Overflow allows questions to be manually marked as duplicates of others. Since thousands of questions are submitted to Stack Overflow every day, manually identifying duplicate questions is difficult work. Thus, there is a need for an automated approach that can help in detecting these duplicate questions. To address this need, in this paper, we propose an automated approach named DupPredictor that takes a new question as input and detects potential duplicates of this question by considering multiple factors. DupPredictor extracts the title and description of a question and also the tags that are attached to the question. These pieces of information (title, description, and a few tags) are mandatory information that a user needs to input when posting a question. DupPredictor then computes the latent topics of each question by using a topic model. Next, for each pair of questions, it computes four similarity scores by comparing their titles, descriptions, latent topics, and tags. These four similarity scores are finally combined into a new similarity score that comprehensively considers the multiple factors. To examine the benefit of DupPredictor, we perform an experiment on a Stack Overflow dataset which contains a total of more than two million questions. The result shows that DupPredictor can achieve a recall-rate@20 score of 63.8%. We compare our approach with the standard search engine of Stack Overflow, and DupPredictor improves its recall-rate@10 score by 40.63%. We also compare our approach with approaches that only use title, description, topic, and tag similarity and with Runeson et al.’s approach that has been used to detect duplicate bug reports, and DupPredictor improves their recall-rate@10 scores by 27.2%, 97.4%, 746.0%, 231.1%, and 16.4% respectively.
format	text
author	ZHANG, Yun; LO, David; XIA, Xin; SUN, Jian Ling
author_facet	ZHANG, Yun; LO, David; XIA, Xin; SUN, Jian Ling
author_sort	ZHANG, Yun
title	Multi-Factor Duplicate Question Detection in Stack Overflow
title_short	Multi-Factor Duplicate Question Detection in Stack Overflow
title_full	Multi-Factor Duplicate Question Detection in Stack Overflow
title_fullStr	Multi-Factor Duplicate Question Detection in Stack Overflow
title_full_unstemmed	Multi-Factor Duplicate Question Detection in Stack Overflow
title_sort	multi-factor duplicate question detection in stack overflow
publisher	Institutional Knowledge at Singapore Management University
publishDate	2015
url	https://ink.library.smu.edu.sg/sis_research/3195 https://ink.library.smu.edu.sg/context/sis_research/article/4196/viewcontent/jcst_duplicateqns_av.pdf
_version_	1770572975256371200
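
The abstract describes combining four per-pair similarity scores (title, description, latent topics, tags) into one score and evaluating with recall-rate@k. The following is a minimal sketch of that idea only, not the authors' implementation. It assumes cosine similarity over simple bag-of-words vectors, topic distributions supplied as dictionaries, and a weighted sum with placeholder equal weights; none of these details are stated in this record. The names `cosine`, `bag_of_words`, `combined_score`, and `recall_rate_at_k` are hypothetical, and recall-rate@k is assumed to mean the fraction of duplicate questions whose known original appears among the top k candidates.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def bag_of_words(text):
    """Lowercased whitespace tokens with counts (assumed preprocessing)."""
    return Counter(text.lower().split())


def combined_score(q1, q2, weights=(0.25, 0.25, 0.25, 0.25)):
    """Weighted sum of title, description, topic, and tag similarity.

    The equal weights are placeholders, not values from the paper.
    Each question is a dict with 'title', 'description', 'topics'
    (topic id -> probability), and 'tags' (list of strings).
    """
    scores = (
        cosine(bag_of_words(q1["title"]), bag_of_words(q2["title"])),
        cosine(bag_of_words(q1["description"]), bag_of_words(q2["description"])),
        cosine(q1["topics"], q2["topics"]),
        cosine(Counter(q1["tags"]), Counter(q2["tags"])),
    )
    return sum(w * s for w, s in zip(weights, scores))


def recall_rate_at_k(ranked_lists, true_duplicates, k):
    """Fraction of queries whose known original is in the top-k candidates.

    ranked_lists: query id -> candidate ids, best first.
    true_duplicates: query id -> id of the known original question.
    """
    hits = sum(
        1 for qid, ranked in ranked_lists.items()
        if true_duplicates[qid] in ranked[:k]
    )
    return hits / len(ranked_lists)
```

In such a setup, a new question would be scored against every existing question with `combined_score`, the candidates sorted by descending score, and the top k shown as potential duplicates.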

Similar Items