How to better utilize code graphs in semantic code search?
Semantic code search greatly facilitates software reuse by enabling users to find code snippets that closely match natural language queries. Owing to the rich expressive power of code graphs (e.g., control-flow graphs and program dependency graphs), both of the two mainstream research wor...
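The full description in the record below says that G2SC turns a code graph into a node sequence by graph traversal, replaces each node with its statement, and keeps the process reversible. The record does not spell out the traversal strategy, so the following is only a minimal sketch under stated assumptions: it substitutes a plain depth-first traversal with sorted edge order, assumes every node is reachable from the entry node, and stores each edge as a (label, target position) pair next to the statement. The function names `graph_to_sequence` and `sequence_to_graph` and the toy graph are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def graph_to_sequence(statements, edges, entry):
    """Flatten a small code graph into a sequence of (statement, links).

    statements: {node_id: statement text}
    edges:      iterable of (src, label, dst) triples
    entry:      id of the node where the traversal starts
    Assumption (not from the paper): plain depth-first traversal,
    and every node is reachable from the entry node.
    """
    out_edges = defaultdict(list)
    for src, label, dst in edges:
        out_edges[src].append((label, dst))

    position = {}  # node_id -> index in first-visit order
    order = []     # node ids in first-visit order

    def visit(node):
        position[node] = len(order)
        order.append(node)
        # Sorted edge order makes the traversal deterministic.
        for _label, dst in sorted(out_edges[node]):
            if dst not in position:
                visit(dst)

    visit(entry)

    sequence = []
    for node in order:
        # Each edge is kept as (label, position of the target statement).
        links = sorted((label, position[dst]) for label, dst in out_edges[node])
        sequence.append((statements[node], links))
    return sequence


def sequence_to_graph(sequence):
    """Rebuild the graph, with nodes renumbered by sequence position."""
    statements = {i: stmt for i, (stmt, _links) in enumerate(sequence)}
    edges = sorted(
        (i, label, j)
        for i, (_stmt, links) in enumerate(sequence)
        for label, j in links
    )
    return statements, edges


# Toy graph for a counting loop; "control" and "data" are illustrative labels.
statements = {0: "i = 0", 1: "while i < n:", 2: "i += 1", 3: "return i"}
edges = [
    (0, "control", 1),
    (1, "control", 2),
    (2, "control", 1),
    (1, "control", 3),
    (0, "data", 2),
]

sequence = graph_to_sequence(statements, edges, entry=0)
restored_statements, restored_edges = sequence_to_graph(sequence)
```

In this toy case the node ids already follow the visiting order, so the round trip returns the original statements and edge set unchanged. In general the sketch recovers the graph only up to a renumbering of nodes.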
Main Authors: | SHI, Yucen; YIN, Ying; WANG, Zhengkui; LO, David; ZHANG, Tao; XIA, Xin; ZHAO, Yuhai; XU, Bowen |
---|---|
Format: | text |
Language: | English |
Published: | Institutional Knowledge at Singapore Management University, 2022 |
Subjects: | Databases and Information Systems |
Online Access: | https://ink.library.smu.edu.sg/sis_research/7734 |
Institution: | Singapore Management University |
id |
sg-smu-ink.sis_research-8737 |
---|---|
record_format |
dspace |
spelling |
sg-smu-ink.sis_research-8737 2023-01-10T02:00:04Z How to better utilize code graphs in semantic code search? SHI, Yucen; YIN, Ying; WANG, Zhengkui; LO, David; ZHANG, Tao; XIA, Xin; ZHAO, Yuhai; XU, Bowen 2022-11-18T08:00:00Z text https://ink.library.smu.edu.sg/sis_research/7734 info:doi/10.1145/3540250.3549087 Research Collection School Of Computing and Information Systems eng Institutional Knowledge at Singapore Management University Databases and Information Systems |
institution |
Singapore Management University |
building |
SMU Libraries |
continent |
Asia |
country |
Singapore |
content_provider |
SMU Libraries |
collection |
InK@SMU |
language |
English |
topic |
Databases and Information Systems |
description |
Semantic code search greatly facilitates software reuse by enabling users to find code snippets that closely match natural language queries. Owing to the rich expressive power of code graphs (e.g., control-flow graphs and program dependency graphs), both mainstream lines of research (i.e., multi-modal models and pre-trained models) have attempted to incorporate code graphs for code modelling. However, they still have limitations: first, there is still much room for improvement in search effectiveness; second, they have not fully considered the unique features of code graphs. In this paper, we propose a Graph-to-Sequence Converter, namely G2SC. By converting code graphs into lossless sequences, G2SC makes it possible to address the problem of small graph learning using sequence feature learning and to capture both the edge and node attribute information of code graphs. Thus, the effectiveness of code search can be greatly improved. In particular, G2SC first converts the code graph into a unique corresponding node sequence using a specific graph traversal strategy. Then, it obtains a statement sequence by replacing each node with its corresponding statement. A set of carefully designed graph traversal strategies guarantees that the process is one-to-one and reversible. G2SC captures rich semantic relationships (i.e., control flow, data flow, node/relationship properties) and provides a model-friendly data transformation. It can be flexibly integrated with existing models to better utilize code graphs. As a proof-of-concept application, we present two G2SC-enabled models: GSMM (a G2SC-enabled multi-modal model) and GSCodeBERT (a G2SC-enabled CodeBERT model). Extensive experimental results on two real large-scale datasets demonstrate that GSMM and GSCodeBERT improve on the state-of-the-art models MMAN and GraphCodeBERT by 92% and 22% on R@1, and by 63% and 11.5% on MRR, respectively. |
format |
text |
author |
SHI, Yucen; YIN, Ying; WANG, Zhengkui; LO, David; ZHANG, Tao; XIA, Xin; ZHAO, Yuhai; XU, Bowen |
author_sort |
SHI, Yucen |
title |
How to better utilize code graphs in semantic code search? |
title_sort |
how to better utilize code graphs in semantic code search? |
publisher |
Institutional Knowledge at Singapore Management University |
publishDate |
2022 |
url |
https://ink.library.smu.edu.sg/sis_research/7734 |
_version_ |
1770576423423049728 |
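The description above reports results in terms of R@1 and MRR. As a reminder of how these two retrieval metrics are commonly computed, here is a minimal sketch that assumes each query has exactly one correct snippet and that its 1-based rank in the returned list is known. The function names and the sample ranks are hypothetical, and the record does not state the paper's exact evaluation protocol.

```python
def recall_at_k(ranks, k):
    """Fraction of queries whose correct snippet is ranked within the top k."""
    return sum(1 for rank in ranks if rank <= k) / len(ranks)


def mean_reciprocal_rank(ranks):
    """Average of 1 / rank of the correct snippet over all queries."""
    return sum(1.0 / rank for rank in ranks) / len(ranks)


# Illustrative 1-based ranks of the correct snippet for four queries.
ranks = [1, 3, 2, 1]

r_at_1 = recall_at_k(ranks, k=1)   # two of four queries hit at rank 1
mrr = mean_reciprocal_rank(ranks)  # (1 + 1/3 + 1/2 + 1) / 4
```

With these sample ranks, R@1 is 0.5 and MRR is about 0.708. Both metrics lie between 0 and 1, and higher is better.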