Representation learning for Stack Overflow posts: How far are we?

The tremendous success of Stack Overflow has accumulated an extensive corpus of software engineering knowledge, motivating researchers to propose various solutions for analyzing its content. The performance of such solutions hinges significantly on the selection of representation models for Stack Overflow posts.


Bibliographic Details
Main Authors: HE, Junda; ZHOU, Xin; XU, Bowen; ZHANG, Ting; KIM, Kisub; YANG, Zhou; THUNG, Ferdian; IRSAN, Ivana Clairine; LO, David
Format: text
Language: English
Published: Institutional Knowledge at Singapore Management University 2024
Subjects: Computing methodologies; Knowledge representation and reasoning; Software and its engineering; Software development process management; Software Engineering
Online Access: https://ink.library.smu.edu.sg/sis_research/9232
https://ink.library.smu.edu.sg/context/sis_research/article/10232/viewcontent/3635711.pdf
id sg-smu-ink.sis_research-10232
record_format dspace
spelling sg-smu-ink.sis_research-10232 2024-09-02T06:51:28Z
Representation learning for Stack Overflow posts: How far are we?
HE, Junda; ZHOU, Xin; XU, Bowen; ZHANG, Ting; KIM, Kisub; YANG, Zhou; THUNG, Ferdian; IRSAN, Ivana Clairine; LO, David
2024-03-01T08:00:00Z text application/pdf
https://ink.library.smu.edu.sg/sis_research/9232
info:doi/10.1145/3635711
https://ink.library.smu.edu.sg/context/sis_research/article/10232/viewcontent/3635711.pdf
http://creativecommons.org/licenses/by-nc-nd/4.0/
Research Collection School Of Computing and Information Systems
eng
Institutional Knowledge at Singapore Management University
Computing methodologies; Knowledge representation and reasoning; Software and its engineering; Software development process management; Software Engineering
institution Singapore Management University
building SMU Libraries
continent Asia
country Singapore
Singapore
content_provider SMU Libraries
collection InK@SMU
language English
topic Computing methodologies
Knowledge representation and reasoning
Software and its engineering
Software development process management
Software Engineering
description The tremendous success of Stack Overflow has accumulated an extensive corpus of software engineering knowledge, motivating researchers to propose various solutions for analyzing its content. The performance of such solutions hinges significantly on the selection of representation models for Stack Overflow posts. As the volume of literature on Stack Overflow continues to grow, so does the need for a powerful post representation model, driving researchers’ interest in developing specialized models that can adeptly capture the intricacies of Stack Overflow posts. The state-of-the-art (SOTA) Stack Overflow post representation models are Post2Vec and BERTOverflow, which are built upon neural architectures such as convolutional neural networks (CNNs) and Transformers (e.g., BERT). Despite their promising results, these representation methods have not been evaluated under the same experimental setting. To fill this research gap, we first empirically compare the performance of the representation models designed specifically for Stack Overflow posts (Post2Vec and BERTOverflow) on a wide range of related tasks: tag recommendation, relatedness prediction, and API recommendation. The results show that Post2Vec cannot further improve the SOTA techniques on the considered downstream tasks, and BERTOverflow shows surprisingly poor performance. To find more suitable representation models for the posts, we further explore a diverse set of Transformer-based models, including (1) general-domain language models (RoBERTa, Longformer, and GPT-2) and (2) language models built with software engineering-related textual artifacts (CodeBERT, GraphCodeBERT, seBERT, CodeT5, PLBART, and CodeGen). This exploration shows that models like CodeBERT and RoBERTa are suitable for representing Stack Overflow posts. However, it also illustrates the “No Silver Bullet” concept: none of the models consistently outperforms all the others. Inspired by these findings, we propose SOBERT, which employs a simple yet effective strategy to improve representation models for Stack Overflow posts: continuing the pre-training phase with textual artifacts from Stack Overflow. The overall experimental results demonstrate that SOBERT consistently outperforms the considered models and significantly improves the SOTA performance on all the downstream tasks.
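The strategy behind SOBERT, continuing a model's pre-training on Stack Overflow text, lends itself to a short illustration. The sketch below uses the HuggingFace transformers and datasets libraries for continued masked-language-model pre-training; the corpus file name (stackoverflow_posts.txt), the choice of microsoft/codebert-base as the starting checkpoint, and all hyperparameters are illustrative assumptions rather than settings reported in the paper.

# A minimal sketch of SOBERT-style continued pre-training. Assumptions:
# "stackoverflow_posts.txt" holds one Stack Overflow post per line, and
# the hyperparameters below are illustrative, not taken from the paper.
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

checkpoint = "microsoft/codebert-base"  # assumed starting checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)

# Load the Stack Overflow corpus and tokenize each post.
dataset = load_dataset("text", data_files={"train": "stackoverflow_posts.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

train_set = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

# Masked-language-modeling objective: randomly mask 15% of the tokens.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="sobert-checkpoint",
        per_device_train_batch_size=16,
        num_train_epochs=1,
        learning_rate=5e-5,
    ),
    train_dataset=train_set,
    data_collator=collator,
)
trainer.train()

The resulting checkpoint can then be fine-tuned on each downstream task considered in the paper (tag recommendation, relatedness prediction, and API recommendation) in the usual way.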
format text
author HE, Junda
ZHOU, Xin
XU, Bowen
ZHANG, Ting
KIM, Kisub
YANG, Zhou
THUNG, Ferdian
IRSAN, Ivana Clairine
LO, David
author_sort HE, Junda
title Representation learning for Stack Overflow posts: How far are we?
title_sort representation learning for stack overflow posts: how far are we?
publisher Institutional Knowledge at Singapore Management University
publishDate 2024
url https://ink.library.smu.edu.sg/sis_research/9232
https://ink.library.smu.edu.sg/context/sis_research/article/10232/viewcontent/3635711.pdf
_version_ 1814047839695667200