How to better utilize code graphs in semantic code search?

Semantic code search greatly facilitates software reuse, which enables users to find code snippets highly matching user-specified natural language queries. Due to the rich expressive power of code graphs (e.g., control-flow graph and program dependency graph), both of the two mainstream research wor...

Bibliographic Details
Main Authors: SHI, Yucen, YIN, Ying, WANG, Zhengkui, LO, David, ZHANG, Tao, XIA, Xin, ZHAO, Yuhai, XU, Bowen
Format: text
Language: English
Published: Institutional Knowledge at Singapore Management University 2022
Subjects: Databases and Information Systems
Online Access: https://ink.library.smu.edu.sg/sis_research/7734
Institution: Singapore Management University
Language: English
id sg-smu-ink.sis_research-8737
record_format dspace
spelling sg-smu-ink.sis_research-8737 2023-01-10T02:00:04Z How to better utilize code graphs in semantic code search? SHI, Yucen YIN, Ying WANG, Zhengkui LO, David ZHANG, Tao XIA, Xin ZHAO, Yuhai XU, Bowen 2022-11-18T08:00:00Z text https://ink.library.smu.edu.sg/sis_research/7734 info:doi/10.1145/3540250.3549087 Research Collection School Of Computing and Information Systems eng Institutional Knowledge at Singapore Management University Databases and Information Systems
institution Singapore Management University
building SMU Libraries
continent Asia
country Singapore
content_provider SMU Libraries
collection InK@SMU
language English
topic Databases and Information Systems
description Semantic code search greatly facilitates software reuse by enabling users to find code snippets that closely match natural language queries. Because code graphs (e.g., control-flow graphs and program dependency graphs) are highly expressive, both mainstream lines of research (i.e., multi-modal models and pre-trained models) have attempted to incorporate code graphs into code modelling. However, these approaches still have limitations. First, there is still much room for improvement in search effectiveness. Second, they have not fully considered the unique features of code graphs. In this paper, we propose a Graph-to-Sequence Converter, named G2SC. By converting code graphs into lossless sequences, G2SC addresses the problem of small graph learning through sequence feature learning and captures both the edge and node attribute information of code graphs. Thus, the effectiveness of code search can be greatly improved. In particular, G2SC first converts the code graph into a unique corresponding node sequence using a specific graph traversal strategy. It then obtains a statement sequence by replacing each node with its corresponding statement. A set of carefully designed graph traversal strategies guarantees that the process is one-to-one and reversible. G2SC captures rich semantic relationships (i.e., control flow, data flow, and node/relationship properties) and provides a data transformation that is friendly to learning models. It can be flexibly integrated with existing models to better utilize code graphs. As a proof-of-concept application, we present two G2SC-enabled models: GSMM (a G2SC-enabled multi-modal model) and GSCodeBERT (a G2SC-enabled CodeBERT model). Extensive experimental results on two real large-scale datasets demonstrate that GSMM and GSCodeBERT improve on the state-of-the-art models MMAN and GraphCodeBERT by 92% and 22% on R@1, and by 63% and 11.5% on MRR, respectively.
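The two-step idea in the description (traverse the code graph into a node sequence, then replace each node with its statement) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names, the edge labels, the toy graph, and above all the traversal itself, which is a plain deterministic depth-first walk and not the traversal strategy designed in the paper. In particular, this sketch does not reproduce the one-to-one, reversible property that the abstract attributes to G2SC.

```python
from collections import defaultdict


def graph_to_node_sequence(edges, start):
    """Hypothetical helper: deterministic depth-first traversal.

    `edges` is a list of (source, label, target) triples. Successors are
    visited in sorted (label, target) order so that the same graph always
    yields the same node sequence. This is NOT the paper's strategy.
    """
    successors = defaultdict(list)
    for src, label, dst in edges:
        successors[src].append((label, dst))

    order, seen, stack = [], set(), [start]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        order.append(node)
        # Push in reverse-sorted order so the smallest (label, target)
        # pair is popped, and therefore visited, first.
        for _, dst in sorted(successors[node], reverse=True):
            stack.append(dst)
    return order


def to_statement_sequence(node_sequence, statements):
    """Hypothetical helper: replace each node id with its statement text."""
    return [statements[node] for node in node_sequence]


# Toy code graph for a summation loop (assumed example, not from the paper).
statements = {
    0: "total = 0",
    1: "for x in xs:",
    2: "total += x",
    3: "return total",
}
edges = [
    (0, "control", 1),
    (1, "control", 2),
    (2, "control", 1),  # loop back edge
    (1, "control", 3),
    (0, "data", 2),     # total defined at 0, used at 2
    (2, "data", 3),     # total updated at 2, used at 3
]

node_seq = graph_to_node_sequence(edges, start=0)
stmt_seq = to_statement_sequence(node_seq, statements)
print(node_seq)
print(stmt_seq)
```

The resulting statement sequence is the kind of input a sequence model could consume in place of the graph. A faithful implementation would also have to encode the edge and node attributes in a way that allows the graph to be recovered from the sequence, which this sketch leaves out.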
format text
author SHI, Yucen
YIN, Ying
WANG, Zhengkui
LO, David
ZHANG, Tao
XIA, Xin
ZHAO, Yuhai
XU, Bowen
author_sort SHI, Yucen
title How to better utilize code graphs in semantic code search?
publisher Institutional Knowledge at Singapore Management University
publishDate 2022
url https://ink.library.smu.edu.sg/sis_research/7734
_version_ 1770576423423049728