CodeMatcher: Searching code based on sequential semantics of important query words

To accelerate software development, developers frequently search and reuse existing code snippets from a large-scale codebase, e.g., GitHub. Over the years, researchers have proposed many information retrieval (IR)-based models for code search, but they fail to bridge the semantic gap between query and...

Bibliographic Details
Main Authors:	LIU, Chao; XIA, Xin; LO, David; LIU, Zhiwei; HASSAN, Ahmed E.; LI, Shanping
Format:	text
Language:	English
Published:	Institutional Knowledge at Singapore Management University 2022
Subjects:	code search; code indexing; mining software repositories; information retrieval; Databases and Information Systems; Programming Languages and Compilers; Software Engineering
Online Access:	https://ink.library.smu.edu.sg/sis_research/7648 https://ink.library.smu.edu.sg/context/sis_research/article/8651/viewcontent/tosem213.pdf
Institution:	Singapore Management University

id	sg-smu-ink.sis_research-8651
record_format	dspace
spelling	sg-smu-ink.sis_research-8651 2023-01-10T03:49:55Z 2022-01-01T08:00:00Z text application/pdf info:doi/10.1145/3465403 http://creativecommons.org/licenses/by-nc-nd/4.0/ Research Collection School Of Computing and Information Systems eng
institution	Singapore Management University
building	SMU Libraries
continent	Asia
country	Singapore
content_provider	SMU Libraries
collection	InK@SMU
language	English
topic	code search; code indexing; mining software repositories; information retrieval; Databases and Information Systems; Programming Languages and Compilers; Software Engineering
description	To accelerate software development, developers frequently search and reuse existing code snippets from a large-scale codebase, e.g., GitHub. Over the years, researchers have proposed many information retrieval (IR)-based models for code search, but they fail to bridge the semantic gap between query and code. DeepCS, an early successful deep learning (DL)-based model, addressed this issue by learning the relationship between pairs of code methods and their corresponding natural language descriptions. Two major advantages of DeepCS are its capability of understanding irrelevant/noisy keywords and of capturing sequential relationships between words in query and code. In this article, we propose an IR-based model, CodeMatcher, that inherits the advantages of DeepCS (i.e., the capability of understanding the sequential semantics in important query words) while leveraging the indexing technique of IR-based models to substantially accelerate search response time. CodeMatcher first collects metadata for query words to identify irrelevant/noisy ones, then iteratively performs fuzzy search with the important query words on a codebase indexed by the Elasticsearch tool, and finally reranks the returned candidate code snippets according to how well their tokens sequentially match the important words in the query. We verified its effectiveness on a large-scale codebase with ~41K repositories. Experimental results showed that CodeMatcher achieves an MRR (a widely used accuracy measure for code search) of 0.60, outperforming DeepCS, CodeHow, and UNIF by 82%, 62%, and 46%, respectively. Our proposed model is over 1.2K times faster than DeepCS. Moreover, CodeMatcher outperforms two existing online search engines (GitHub and Google search) by 46% and 33%, respectively, in terms of MRR. We also observed that (1) fusing the advantages of IR-based and DL-based models is promising, and (2) improving the quality of method naming helps code search, since the method name plays an important role in connecting query and code.
format	text
author	LIU, Chao; XIA, Xin; LO, David; LIU, Zhiwei; HASSAN, Ahmed E.; LI, Shanping
author_sort	LIU, Chao
title	CodeMatcher: Searching code based on sequential semantics of important query words
publisher	Institutional Knowledge at Singapore Management University
publishDate	2022
url	https://ink.library.smu.edu.sg/sis_research/7648 https://ink.library.smu.edu.sg/context/sis_research/article/8651/viewcontent/tosem213.pdf
_version_	1770576408966332416
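
The description above reports results as MRR (mean reciprocal rank). As a reading aid, here is a minimal sketch of how that measure is conventionally computed; it is not taken from the paper. The function name `mean_reciprocal_rank`, the use of `None` for a query with no relevant result, and the sample ranks are illustrative assumptions, and the paper may apply a result cutoff that this sketch does not model.

```python
from typing import Optional, Sequence


def mean_reciprocal_rank(first_hit_ranks: Sequence[Optional[int]]) -> float:
    """Average 1/rank over all queries.

    Each entry is the 1-based rank of the first relevant result for one
    query, or None when no relevant result was returned (assumed here to
    contribute 0 to the sum).
    """
    if not first_hit_ranks:
        raise ValueError("at least one query is required")
    total = 0.0
    for rank in first_hit_ranks:
        if rank is None:
            continue  # no relevant result: reciprocal rank counted as 0
        if rank < 1:
            raise ValueError("ranks are 1-based")
        total += 1.0 / rank
    return total / len(first_hit_ranks)


# Hypothetical ranks for five queries (not data from the paper).
example_ranks = [1, 2, None, 1, 4]
example_mrr = mean_reciprocal_rank(example_ranks)
```

With these made-up ranks the reciprocal ranks are 1, 0.5, 0, 1, and 0.25, so the mean is 0.55. A higher MRR means the first relevant snippet tends to appear nearer the top of the result list.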

Similar Items