Assessing the generalizability of code2vec token embeddings

Many Natural Language Processing (NLP) tasks, such as sentiment analysis or syntactic parsing, have benefited from the development of word embedding models. In particular, regardless of the training algorithms, the learned embeddings have often been shown to be generalizable to different NLP tasks. In contrast, despite recent momentum on word embeddings for source code, the literature lacks evidence of their generalizability beyond the example task they have been trained for. In this experience paper, we identify three potential downstream tasks, namely code comment generation, code authorship identification, and code clone detection, that source code token embedding models can be applied to. We empirically assess a recently proposed code token embedding model, namely code2vec's token embeddings. Code2vec was trained on the task of predicting method names, and while there is potential for using the vectors it learns on other tasks, this has not been explored in the literature. We therefore fill this gap by focusing on its generalizability for the tasks we have identified. Ultimately, we show that source code token embeddings cannot be readily leveraged for these downstream tasks. Our experiments even show that our attempts to use them do not result in any improvements over less sophisticated methods. We call for more research into the effective and general use of code embeddings.
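To make the idea being evaluated concrete: reusing pretrained token embeddings for a downstream task typically means looking up each code token's vector and pooling them into a feature vector. The sketch below is only an illustration of that general recipe, not the paper's actual pipeline; it assumes embeddings are available in word2vec text format, and the file name token_vecs.txt and the example token sequences are hypothetical.

    # Sketch: pretrained code token embeddings as downstream features,
    # here a naive similarity check between two token sequences.
    # Assumes a word2vec-format text file; "token_vecs.txt" is a
    # hypothetical name, not necessarily the paper's artifact.
    import numpy as np
    from gensim.models import KeyedVectors

    vecs = KeyedVectors.load_word2vec_format("token_vecs.txt", binary=False)

    def embed(tokens):
        # Average the vectors of in-vocabulary tokens into one feature vector.
        known = [vecs[t] for t in tokens if t in vecs.key_to_index]
        return np.mean(known, axis=0) if known else np.zeros(vecs.vector_size)

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

    # Hypothetical token sequences drawn from two methods:
    sim = cosine(embed(["open", "file", "read"]),
                 embed(["load", "file", "contents"]))
    print(f"similarity: {sim:.3f}")

Averaging is the simplest pooling choice; the paper's finding is precisely that such straightforward reuse of code2vec's token embeddings did not beat less sophisticated baselines.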

Bibliographic Details
Main Authors: KANG, Hong Jin, BISSYANDE, Tegawende F., LO, David
Format: text (application/pdf)
Language: English
Published: Institutional Knowledge at Singapore Management University, 2019
Collection: Research Collection School Of Computing and Information Systems
Subjects: Code Embeddings; Distributed Representations; Big Code; Software Engineering
DOI: 10.1109/ASE.2019.00011
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Online Access: https://ink.library.smu.edu.sg/sis_research/4493
https://ink.library.smu.edu.sg/context/sis_research/article/5496/viewcontent/ase19_code2vec.pdf
Institution: Singapore Management University