Prediction of relatedness in stack overflow: Deep learning vs. SVM: A reproducibility study

Background: Xu et al. used a deep neural network (DNN) technique to classify the degree of relatedness between two knowledge units (question-answer threads) on Stack Overflow. More recently, extending Xu et al.'s work, Fu and Menzies proposed a simpler classification technique based on a fine-tu...

Bibliographic Details
Main Authors:	XU, Bowen; SHIRANI, Amirreza; LO, David; ALIPOUR, Mohammad Amin
Format:	text
Language:	English
Published:	Institutional Knowledge at Singapore Management University 2018
Subjects:	Relatedness Prediction; Deep Learning; Support Vector Machine; Databases and Information Systems
Online Access:	https://ink.library.smu.edu.sg/sis_research/4293 https://ink.library.smu.edu.sg/context/sis_research/article/5296/viewcontent/a21_xu.pdf
Institution:	Singapore Management University

id	sg-smu-ink.sis_research-5296
record_format	dspace
spelling	sg-smu-ink.sis_research-5296 2019-06-06T06:13:02Z 2018-10-01T07:00:00Z text application/pdf https://ink.library.smu.edu.sg/sis_research/4293 info:doi/10.1145/3239235.3240503 https://ink.library.smu.edu.sg/context/sis_research/article/5296/viewcontent/a21_xu.pdf http://creativecommons.org/licenses/by-nc-nd/4.0/ Research Collection School Of Computing and Information Systems eng
institution	Singapore Management University
building	SMU Libraries
continent	Asia
country	Singapore
content_provider	SMU Libraries
collection	InK@SMU
language	English
topic	Relatedness Prediction; Deep Learning; Support Vector Machine; Databases and Information Systems
description	Background: Xu et al. used a deep neural network (DNN) technique to classify the degree of relatedness between two knowledge units (question-answer threads) on Stack Overflow. More recently, extending Xu et al.'s work, Fu and Menzies proposed a simpler classification technique based on a fine-tuned support vector machine (SVM) that achieves similar performance in a much shorter time. They therefore suggested that researchers compare their sophisticated methods against simpler alternatives. Aim: This work replicates the previous studies and further investigates the validity of Fu and Menzies' claim by evaluating the DNN- and SVM-based approaches on a larger dataset. We also compare the effectiveness of these two approaches against SimBow, a lightweight SVM-based method previously used for general community question answering. Method: We (1) collect a large dataset containing knowledge units from Stack Overflow, (2) show the value of the new dataset in addressing shortcomings of the original one, (3) re-evaluate both the DNN- and SVM-based approaches on the new dataset, and (4) compare the performance of the two approaches against that of SimBow. Results: We find that (1) the original dataset used in the previous studies has several limitations; (2) the effectiveness of both Xu et al.'s and Fu and Menzies' approaches, as measured by F1-score, drops sharply on the new dataset; (3) consistent with the previous finding, the SVM-based approaches (Fu and Menzies' approach and SimBow) perform slightly better than the DNN-based approach; (4) contrary to the previous findings, Fu and Menzies' approach runs much slower than the DNN-based approach on the larger dataset, and its runtime grows sharply as the dataset size increases; and (5) SimBow outperforms both Xu et al.'s and Fu and Menzies' approaches in terms of runtime. Conclusion: We conclude that, for this task, simpler approaches based on SVM perform adequately well. We also illustrate the challenges brought by the increased size of the dataset and show the benefit of a lightweight SVM-based approach for this task.
format	text
author	XU, Bowen; SHIRANI, Amirreza; LO, David; ALIPOUR, Mohammad Amin
author_sort	XU, Bowen
title	Prediction of relatedness in stack overflow: Deep learning vs. SVM: A reproducibility study
publisher	Institutional Knowledge at Singapore Management University
publishDate	2018
url	https://ink.library.smu.edu.sg/sis_research/4293 https://ink.library.smu.edu.sg/context/sis_research/article/5296/viewcontent/a21_xu.pdf
_version_	1770574602277224448

Similar Items