InferCode: Self-supervised learning of code representations by predicting subtrees

Learning code representations has found many uses in software engineering, such as code classification, code search, code comment generation, and bug prediction. Although representations of code in tokens, syntax trees, dependency graphs, paths in trees, or the combinations of their variants have be...

Bibliographic Details
Main Authors:	BUI, Duy Quoc Nghi; YU, Yijun; JIANG, Lingxiao
Format:	text
Language:	English
Published:	Institutional Knowledge at Singapore Management University 2021
Subjects:	code search; self supervised; code clone detection; cross language; fine tuning; code retrieval; unlabel data; unlabelled data; Software Engineering
Online Access:	https://ink.library.smu.edu.sg/sis_research/6716 https://ink.library.smu.edu.sg/context/sis_research/article/7719/viewcontent/ICSE21InferCode_preprint.pdf
Institution:	Singapore Management University

id	sg-smu-ink.sis_research-7719
record_format	dspace
spelling	sg-smu-ink.sis_research-7719 2023-04-04T03:02:12Z InferCode: Self-supervised learning of code representations by predicting subtrees BUI, Duy Quoc Nghi YU, Yijun JIANG, Lingxiao 2021-05-01T07:00:00Z text application/pdf https://ink.library.smu.edu.sg/sis_research/6716 info:doi/10.1109/ICSE43902.2021.00109 https://ink.library.smu.edu.sg/context/sis_research/article/7719/viewcontent/ICSE21InferCode_preprint.pdf http://creativecommons.org/licenses/by-nc-nd/4.0/ Research Collection School Of Computing and Information Systems eng Institutional Knowledge at Singapore Management University code search self supervised code clone detection cross language fine tuning code retrieval unlabel data unlabelled data Software Engineering
institution	Singapore Management University
building	SMU Libraries
continent	Asia
country	Singapore
content_provider	SMU Libraries
collection	InK@SMU
language	English
topic	code search; self supervised; code clone detection; cross language; fine tuning; code retrieval; unlabel data; unlabelled data; Software Engineering
spellingShingle	code search self supervised code clone detection cross language fine tuning code retrieval unlabel data unlabelled data Software Engineering BUI, Duy Quoc Nghi YU, Yijun JIANG, Lingxiao InferCode: Self-supervised learning of code representations by predicting subtrees
description	Learning code representations has found many uses in software engineering, such as code classification, code search, code comment generation, and bug prediction. Although representations of code in tokens, syntax trees, dependency graphs, paths in trees, or combinations of their variants have been proposed, existing learning techniques have a major limitation: the models are often trained on datasets labeled for specific downstream tasks, so the code representations may not be suitable for other tasks. Even though some techniques generate representations from unlabeled code, they are far from satisfactory when applied to downstream tasks. To overcome this limitation, this paper proposes InferCode, which adapts the self-supervised learning idea from natural language processing to the abstract syntax trees (ASTs) of code. The key novelty lies in training code representations by predicting subtrees automatically identified from the context of ASTs. With InferCode, subtrees in ASTs are treated as the labels for training the code representations, without any human labeling effort or the overhead of expensive graph construction, and the trained representations are no longer tied to any specific downstream task or code unit. We have trained an instance of the InferCode model, using a Tree-Based Convolutional Neural Network (TBCNN) as the encoder, on a large set of Java code. This pre-trained model can then be applied to downstream unsupervised tasks such as code clustering, code clone detection, and cross-language code search, or be reused under a transfer learning scheme to continue training the model weights for supervised tasks such as code classification and method name prediction. Compared with prior techniques applied to the same downstream tasks, such as code2vec, code2seq, and ASTNN, our pre-trained InferCode model achieves higher performance by a significant margin for most of the tasks, including those involving different programming languages. The implementation of InferCode and the trained embeddings are made available at the anonymous link: https://github.com/ICSE21/infercode.
format	text
author	BUI, Duy Quoc Nghi; YU, Yijun; JIANG, Lingxiao
author_facet	BUI, Duy Quoc Nghi; YU, Yijun; JIANG, Lingxiao
author_sort	BUI, Duy Quoc Nghi
title	InferCode: Self-supervised learning of code representations by predicting subtrees
title_short	InferCode: Self-supervised learning of code representations by predicting subtrees
title_full	InferCode: Self-supervised learning of code representations by predicting subtrees
title_fullStr	InferCode: Self-supervised learning of code representations by predicting subtrees
title_full_unstemmed	InferCode: Self-supervised learning of code representations by predicting subtrees
title_sort	infercode: self-supervised learning of code representations by predicting subtrees
publisher	Institutional Knowledge at Singapore Management University
publishDate	2021
url	https://ink.library.smu.edu.sg/sis_research/6716 https://ink.library.smu.edu.sg/context/sis_research/article/7719/viewcontent/ICSE21InferCode_preprint.pdf
_version_	1770576052939128832
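
The abstract's central idea, that subtrees of an AST can serve as training labels with no human annotation, can be illustrated with a small sketch. This is not the authors' implementation: it uses Python's standard `ast` module in place of a Java parser, and the function names (`subtree_label`, `extract_subtree_labels`) and the choice of root node types are assumptions made only for illustration. It shows just the label-extraction step; the encoder (TBCNN in the paper) and the prediction objective are left out.

```python
import ast
from collections import Counter


def subtree_label(node: ast.AST) -> str:
    """Serialize a subtree as nested node-type names.

    Identifiers, literals, and Load/Store context markers are dropped,
    so structurally identical subtrees map to the same label.
    """
    children = [
        subtree_label(child)
        for child in ast.iter_child_nodes(node)
        if not isinstance(child, ast.expr_context)
    ]
    name = type(node).__name__
    return f"{name}({','.join(children)})" if children else name


def extract_subtree_labels(
    source: str,
    root_types=(ast.Return, ast.If, ast.For, ast.Call),  # assumed choice of roots
) -> Counter:
    """Collect pseudo-labels: one label per subtree rooted at a chosen node type."""
    tree = ast.parse(source)
    labels = Counter()
    for node in ast.walk(tree):
        if isinstance(node, root_types):
            labels[subtree_label(node)] += 1
    return labels


snippet = """
def total(xs):
    s = 0
    for x in xs:
        s = s + x
    return s
"""

labels = extract_subtree_labels(snippet)
for label, count in sorted(labels.items()):
    print(count, label)
```

In a self-supervised setup of the kind the abstract describes, a model would be trained to predict labels like these from the encoded code snippet, so the labels come from the code itself rather than from a task-specific dataset.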
