Hierarchical learning of cross-language mappings through distributed vector representations for code

Translating a program written in one programming language to another can be useful for software development tasks that need functionality implementations in different languages. Although past studies have considered this problem, they may be either specific to the language grammars, or specific to c...

Bibliographic Details
Main Authors: BUI, Nghi D. Q., JIANG, Lingxiao
Format: text
Language: English
Published: Institutional Knowledge at Singapore Management University 2018
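Subjects: Language mapping; Program translation; Software maintenance; Syntactic structure; Word2vec; Software Engineering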
Online Access: https://ink.library.smu.edu.sg/sis_research/4090
https://ink.library.smu.edu.sg/context/sis_research/article/5093/viewcontent/1803.04715.pdf
Institution: Singapore Management University
id sg-smu-ink.sis_research-5093
record_format dspace
spelling sg-smu-ink.sis_research-5093
2020-04-01T07:43:07Z
Hierarchical learning of cross-language mappings through distributed vector representations for code
BUI, Nghi D. Q.
JIANG, Lingxiao
2018-05-01T07:00:00Z
text
application/pdf
https://ink.library.smu.edu.sg/sis_research/4090
info:doi/10.1145/3183399.3183427
https://ink.library.smu.edu.sg/context/sis_research/article/5093/viewcontent/1803.04715.pdf
http://creativecommons.org/licenses/by-nc-nd/4.0/
Research Collection School Of Computing and Information Systems
eng
Institutional Knowledge at Singapore Management University
Language mapping
Program translation
Software maintenance
Syntactic structure
Word2vec
Software Engineering
institution Singapore Management University
building SMU Libraries
continent Asia
country Singapore
content_provider SMU Libraries
collection InK@SMU
language English
topic Language mapping
Program translation
Software maintenance
Syntactic structure
Word2vec
Software Engineering
description Translating a program written in one programming language into another can be useful for software development tasks that need implementations of the same functionality in different languages. Although past studies have considered this problem, they may be specific either to the language grammars or to certain kinds of code elements (e.g., tokens, phrases, API uses). This paper proposes a new approach to automatically learn cross-language representations for various kinds of structural code elements that may be used for program translation. Our key idea is twofold. First, we normalize and enrich code token streams with additional structural and semantic information, and train cross-language vector representations for the tokens (a.k.a. shared embeddings) based on word2vec, a neural-network-based technique for producing word embeddings. Second, hierarchically from the bottom up, we construct shared embeddings for code elements of higher levels of granularity (e.g., expressions, statements, methods) from the embeddings of their constituents, and then build mappings among code elements across languages based on similarities among embeddings. Our preliminary evaluations on about 40,000 Java and C# source files from 9 software projects show that our approach can automatically learn shared embeddings for various code elements in different languages and identify their cross-language mappings with reasonable Mean Average Precision scores. When compared with an existing tool for mapping library API methods, our approach identifies many more mappings accurately. The mapping results and code can be accessed at https://github.com/bdqnghi/hierarchical-programming-language-mapping. We believe that our idea for learning cross-language vector representations with code structural information can be a useful step towards automated program translation.
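The second step described above (composing embeddings bottom-up, mapping elements by similarity, and scoring with Mean Average Precision) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the token vectors are made-up toy values rather than word2vec output, averaging is only one plausible way to combine constituent embeddings (the abstract does not say which composition rule the authors use), and the names `TOKEN_VECS`, `compose`, `rank_candidates`, and the statement labels are hypothetical.

```python
import math

# Toy shared token embeddings (made-up values, NOT trained with word2vec).
# In a shared space, corresponding Java and C# tokens would lie close together.
TOKEN_VECS = {
    "System.out.println": [0.90, 0.10, 0.00],  # Java
    "Console.WriteLine": [0.88, 0.12, 0.02],   # C#
    "String": [0.10, 0.90, 0.00],              # Java
    "string": [0.12, 0.88, 0.01],              # C#
    "return": [0.00, 0.10, 0.90],              # both languages
}

def compose(tokens):
    """Embed a larger code element as the mean of its constituents' vectors.
    Averaging is an assumed composition rule, chosen for simplicity."""
    vecs = [TOKEN_VECS[t] for t in tokens]
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(len(vecs[0]))]

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def rank_candidates(query_vec, candidates):
    """Return candidate names ordered from most to least similar to the query."""
    return sorted(candidates, key=lambda n: cosine(query_vec, candidates[n]), reverse=True)

def average_precision(ranked, relevant):
    """Average precision of one ranked list against a set of correct answers."""
    hits, total = 0, 0.0
    for position, name in enumerate(ranked, start=1):
        if name in relevant:
            hits += 1
            total += hits / position
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(results):
    """Mean of average precision over (ranked list, relevant set) pairs."""
    return sum(average_precision(r, rel) for r, rel in results) / len(results)

# Hypothetical statement-level elements, each given as a list of its tokens.
java_stmts = {"java_print": ["System.out.println", "String"],
              "java_return": ["return", "String"]}
cs_stmts = {"cs_print": ["Console.WriteLine", "string"],
            "cs_return": ["return", "string"]}
gold = {"java_print": {"cs_print"}, "java_return": {"cs_return"}}

cs_vecs = {name: compose(toks) for name, toks in cs_stmts.items()}
results = [(rank_candidates(compose(toks), cs_vecs), gold[name])
           for name, toks in java_stmts.items()]
map_score = mean_average_precision(results)
```

Each Java statement is ranked against all C# candidates, and the ranking is scored against a hand-written gold mapping. The same pattern could be applied one level up, composing method embeddings from statement embeddings, although how the authors weight or combine constituents at each level is not stated in this record.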
format text
author BUI, Nghi D. Q.
JIANG, Lingxiao
title Hierarchical learning of cross-language mappings through distributed vector representations for code
publisher Institutional Knowledge at Singapore Management University
publishDate 2018
url https://ink.library.smu.edu.sg/sis_research/4090
https://ink.library.smu.edu.sg/context/sis_research/article/5093/viewcontent/1803.04715.pdf
_version_ 1770574305012219904