Answer summarization for technical queries: Benchmark and new approach

Prior studies have demonstrated that approaches that generate an answer summary for a given technical query on Software Question and Answer (SQA) sites are desirable. We find that existing approaches are assessed solely through user studies, so a new user study must be performed every time a new...

Full description

Bibliographic Details
Main Authors: YANG, Chengran, XU, Bowen, THUNG, Ferdian, SHI, Yucen, ZHANG, Ting, YANG, Zhou, ZHOU, Xin, SHI, Jieke, HE, Junda, HAN, DongGyun, LO, David
Format: text
Language: English
Published: Institutional Knowledge at Singapore Management University 2022
Subjects: Summarization; Question retrieval; Pre-trained models; Artificial Intelligence and Robotics; Software Engineering
Online Access:https://ink.library.smu.edu.sg/sis_research/7714
https://ink.library.smu.edu.sg/context/sis_research/article/8717/viewcontent/2209.10868.pdf
Institution: Singapore Management University
Language: English
id sg-smu-ink.sis_research-8717
record_format dspace
spelling sg-smu-ink.sis_research-8717 2023-09-12T07:38:19Z Answer summarization for technical queries: Benchmark and new approach 2022-10-01T07:00:00Z text application/pdf https://ink.library.smu.edu.sg/sis_research/7714 info:doi/10.1145/3551349.3560421 https://ink.library.smu.edu.sg/context/sis_research/article/8717/viewcontent/2209.10868.pdf http://creativecommons.org/licenses/by/4.0/ Research Collection School Of Computing and Information Systems eng Institutional Knowledge at Singapore Management University
institution Singapore Management University
building SMU Libraries
continent Asia
country Singapore
content_provider SMU Libraries
collection InK@SMU
language English
topic Summarization
Question retrieval
Pre-trained models
Artificial Intelligence and Robotics
Software Engineering
description Prior studies have demonstrated that approaches that generate an answer summary for a given technical query on Software Question and Answer (SQA) sites are desirable. We find that existing approaches are assessed solely through user studies, so a new user study must be performed every time a new approach is introduced; this is time-consuming, slows the development of new approaches, and yields results that may not be comparable across studies. There is a need for a benchmark with ground-truth summaries to complement assessment through user studies. Unfortunately, no such benchmark exists for answer summarization of technical queries from SQA sites. To fill this gap, we manually construct a high-quality benchmark that enables automatic evaluation of answer summarization for technical queries on SQA sites. It contains 111 query-summary pairs extracted from 382 Stack Overflow answers comprising 2,014 candidate sentences. Using the benchmark, we comprehensively evaluate existing approaches and find that there is still substantial room for improvement. Motivated by these results, we propose a new approach, TechSumBot, with three key modules: 1) a Usefulness Ranking module, 2) a Centrality Estimation module, and 3) a Redundancy Removal module. We evaluate TechSumBot both automatically (i.e., using our benchmark) and manually (i.e., via a user study). The results of both evaluations consistently demonstrate that TechSumBot outperforms the best-performing baseline approaches from both the SE and NLP domains by a large margin: 10.83%–14.90%, 32.75%–36.59%, and 12.61%–17.54% in ROUGE-1, ROUGE-2, and ROUGE-L in the automatic evaluation, and 5.79%–9.23% and 17.03%–17.68% in average usefulness and diversity scores in the human evaluation. This highlights that automatic evaluation on our benchmark can uncover findings similar to those found through user studies, and at a much lower cost, especially when assessing a new approach. Additionally, an ablation study demonstrates that each module contributes to TechSumBot's overall performance. We release the benchmark and the replication package of our experiments at https://github.com/TechSumBot/TechSumBot.
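The description above reports automatic evaluation in terms of ROUGE-1, ROUGE-2, and ROUGE-L against the benchmark's ground-truth summaries. As a rough illustration of how such scores can be computed, here is a minimal sketch (not the authors' evaluation code), assuming Google's open-source rouge-score Python package and two hypothetical placeholder summaries:

from rouge_score import rouge_scorer

# Hypothetical placeholder texts, not drawn from the TechSumBot benchmark.
reference = (
    "Use a virtual environment to isolate dependencies. "
    "Pin package versions in a requirements file."
)
candidate = (
    "Isolate project dependencies with a virtual environment "
    "and pin versions in requirements.txt."
)

# Compute ROUGE-1, ROUGE-2, and ROUGE-L with stemming enabled.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, candidate)

for metric, result in scores.items():
    # Each result carries precision, recall, and F1; F1 is the figure
    # most summarization papers report.
    print(f"{metric}: F1 = {result.fmeasure:.4f}")

In practice, such per-pair scores would be averaged over all 111 query-summary pairs in the benchmark to obtain aggregate figures like those quoted above.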
format text
author YANG, Chengran
XU, Bowen
THUNG, Ferdian
SHI, Yucen
ZHANG, Ting
YANG, Zhou
ZHOU, Xin
SHI, Jieke
HE, Junda
HAN, DongGyun
LO, David
author_sort YANG, Chengran
title Answer summarization for technical queries: Benchmark and new approach
publisher Institutional Knowledge at Singapore Management University
publishDate 2022
url https://ink.library.smu.edu.sg/sis_research/7714
https://ink.library.smu.edu.sg/context/sis_research/article/8717/viewcontent/2209.10868.pdf
_version_ 1779157129642377216