Assessing generalizability of CodeBERT

Pre-trained models like BERT have achieved strong improvements on many natural language processing (NLP) tasks, demonstrating their great generalizability. The success of pre-trained models in NLP has inspired pre-trained models for programming languages. Recently, CodeBERT, a model for both natural language (NL) and programming language (PL) pre-trained on a code search dataset, was proposed. Although promising, CodeBERT has not been evaluated beyond its pre-training dataset for NL-PL tasks. Moreover, it has only been shown effective on two tasks that are close in nature to its pre-training data. This raises two questions: Can CodeBERT generalize beyond its pre-training data? Can it generalize to various software engineering tasks involving NL and PL? Our work answers these questions through an empirical investigation into the generalizability of CodeBERT. First, we assess the generalizability of CodeBERT to datasets other than its pre-training data. Specifically, for the code search task, we conduct experiments on another dataset containing Python code snippets and their corresponding documentation. We also consider a further dataset of questions and answers about Python programming collected from Stack Overflow. Second, to assess the generalizability of CodeBERT to various software engineering tasks, we apply CodeBERT to the just-in-time defect prediction task. Our empirical results support the generalizability of CodeBERT on the additional data and task. CodeBERT-based solutions achieve performance higher than or comparable to that of specialized solutions designed for the code search and just-in-time defect prediction tasks. However, the superior performance of CodeBERT comes with a tradeoff; for example, it requires far more computational resources than specialized code search approaches.
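As context for the code search task the abstract describes, the sketch below illustrates the retrieval step of a bi-encoder pipeline: rank candidate code snippets by cosine similarity between a query embedding and snippet embeddings. In a CodeBERT-based pipeline these vectors would come from the model; the small NumPy arrays here are illustrative stand-ins, not the paper's actual data.

```python
import numpy as np

def rank_snippets(query_vec, snippet_vecs):
    """Rank code snippets by cosine similarity to a query embedding.

    In a CodeBERT-style code search pipeline, query_vec would embed a
    natural-language query and snippet_vecs the candidate code snippets;
    here they are plain NumPy stand-ins for illustration.
    """
    q = query_vec / np.linalg.norm(query_vec)
    s = snippet_vecs / np.linalg.norm(snippet_vecs, axis=1, keepdims=True)
    scores = s @ q                      # cosine similarity per snippet
    return np.argsort(-scores)          # snippet indices, best match first

# Toy example: three snippet embeddings; the second is closest to the query.
query = np.array([1.0, 0.0, 1.0])
snippets = np.array([[0.0, 1.0, 0.0],
                     [1.0, 0.1, 1.0],
                     [0.5, 1.0, 0.0]])
print(rank_snippets(query, snippets))  # → [1 2 0]
```

The same ranking logic applies regardless of which encoder produces the vectors, which is what lets a generic retrieval set-up swap a specialized code search model for CodeBERT.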


Bibliographic Details
Main Authors: ZHOU, Xin, HAN, DongGyun, LO, David
Format: text
Language:English
Published: Institutional Knowledge at Singapore Management University 2021
Subjects: pre-trained model; generalizability; CodeBERT; Databases and Information Systems; Software Engineering
Online Access:https://ink.library.smu.edu.sg/sis_research/6854
https://ink.library.smu.edu.sg/context/sis_research/article/7857/viewcontent/288200a425.pdf
Institution: Singapore Management University
id sg-smu-ink.sis_research-7857
record_format dspace
spelling sg-smu-ink.sis_research-7857 2024-05-31T07:40:25Z Assessing generalizability of CodeBERT ZHOU, Xin HAN, DongGyun LO, David Pre-trained models like BERT have achieved strong improvements on many natural language processing (NLP) tasks, demonstrating their great generalizability. The success of pre-trained models in NLP has inspired pre-trained models for programming languages. Recently, CodeBERT, a model for both natural language (NL) and programming language (PL) pre-trained on a code search dataset, was proposed. Although promising, CodeBERT has not been evaluated beyond its pre-training dataset for NL-PL tasks. Moreover, it has only been shown effective on two tasks that are close in nature to its pre-training data. This raises two questions: Can CodeBERT generalize beyond its pre-training data? Can it generalize to various software engineering tasks involving NL and PL? Our work answers these questions through an empirical investigation into the generalizability of CodeBERT. First, we assess the generalizability of CodeBERT to datasets other than its pre-training data. Specifically, for the code search task, we conduct experiments on another dataset containing Python code snippets and their corresponding documentation. We also consider a further dataset of questions and answers about Python programming collected from Stack Overflow. Second, to assess the generalizability of CodeBERT to various software engineering tasks, we apply CodeBERT to the just-in-time defect prediction task. Our empirical results support the generalizability of CodeBERT on the additional data and task. CodeBERT-based solutions achieve performance higher than or comparable to that of specialized solutions designed for the code search and just-in-time defect prediction tasks. However, the superior performance of CodeBERT comes with a tradeoff; for example, it requires far more computational resources than specialized code search approaches.
2021-10-01T07:00:00Z text application/pdf https://ink.library.smu.edu.sg/sis_research/6854 info:doi/10.1109/ICSME52107.2021.00044 https://ink.library.smu.edu.sg/context/sis_research/article/7857/viewcontent/288200a425.pdf http://creativecommons.org/licenses/by-nc-nd/4.0/ Research Collection School Of Computing and Information Systems eng Institutional Knowledge at Singapore Management University pre-trained model generalizability CodeBERT Databases and Information Systems Software Engineering
institution Singapore Management University
building SMU Libraries
continent Asia
country Singapore
content_provider SMU Libraries
collection InK@SMU
language English
topic pre-trained model
generalizability
CodeBERT
Databases and Information Systems
Software Engineering
spellingShingle pre-trained model
generalizability
CodeBERT
Databases and Information Systems
Software Engineering
ZHOU, Xin
HAN, DongGyun
LO, David
Assessing generalizability of CodeBERT
description Pre-trained models like BERT have achieved strong improvements on many natural language processing (NLP) tasks, demonstrating their great generalizability. The success of pre-trained models in NLP has inspired pre-trained models for programming languages. Recently, CodeBERT, a model for both natural language (NL) and programming language (PL) pre-trained on a code search dataset, was proposed. Although promising, CodeBERT has not been evaluated beyond its pre-training dataset for NL-PL tasks. Moreover, it has only been shown effective on two tasks that are close in nature to its pre-training data. This raises two questions: Can CodeBERT generalize beyond its pre-training data? Can it generalize to various software engineering tasks involving NL and PL? Our work answers these questions through an empirical investigation into the generalizability of CodeBERT. First, we assess the generalizability of CodeBERT to datasets other than its pre-training data. Specifically, for the code search task, we conduct experiments on another dataset containing Python code snippets and their corresponding documentation. We also consider a further dataset of questions and answers about Python programming collected from Stack Overflow. Second, to assess the generalizability of CodeBERT to various software engineering tasks, we apply CodeBERT to the just-in-time defect prediction task. Our empirical results support the generalizability of CodeBERT on the additional data and task. CodeBERT-based solutions achieve performance higher than or comparable to that of specialized solutions designed for the code search and just-in-time defect prediction tasks. However, the superior performance of CodeBERT comes with a tradeoff; for example, it requires far more computational resources than specialized code search approaches.
format text
author ZHOU, Xin
HAN, DongGyun
LO, David
author_facet ZHOU, Xin
HAN, DongGyun
LO, David
author_sort ZHOU, Xin
title Assessing generalizability of CodeBERT
title_short Assessing generalizability of CodeBERT
title_full Assessing generalizability of CodeBERT
title_fullStr Assessing generalizability of CodeBERT
title_full_unstemmed Assessing generalizability of CodeBERT
title_sort assessing generalizability of codebert
publisher Institutional Knowledge at Singapore Management University
publishDate 2021
url https://ink.library.smu.edu.sg/sis_research/6854
https://ink.library.smu.edu.sg/context/sis_research/article/7857/viewcontent/288200a425.pdf
_version_ 1814047562676568064