CodeMatcher: A tool for large-scale code search based on query semantics matching

Due to the emergence of large-scale codebases, such as GitHub and Gitee, searching and reusing existing code can help developers substantially improve software development productivity. Over the years, many code search tools have been developed. Early tools leveraged the information retrieval (IR) t...

Bibliographic Details
Main Authors: LIU, Chao; BAO, Xuanlin; XIA, Xin; YAN, Meng; LO, David; ZHANG, Ting
Format: text
Language: English
Published: Institutional Knowledge at Singapore Management University 2022
Subjects: Databases and Information Systems; Programming Languages and Compilers; Software Engineering
Online Access: https://ink.library.smu.edu.sg/sis_research/7728
https://ink.library.smu.edu.sg/context/sis_research/article/8731/viewcontent/CodeMatcher_cp_pv.pdf
Institution: Singapore Management University
id sg-smu-ink.sis_research-8731
record_format dspace
spelling sg-smu-ink.sis_research-8731 2024-02-16T06:37:58Z
2022-11-01T07:00:00Z text application/pdf https://ink.library.smu.edu.sg/sis_research/7728 info:doi/10.1145/3540250.3558935 https://ink.library.smu.edu.sg/context/sis_research/article/8731/viewcontent/CodeMatcher_cp_pv.pdf http://creativecommons.org/licenses/by-nc-nd/4.0/ Research Collection School Of Computing and Information Systems eng Institutional Knowledge at Singapore Management University Databases and Information Systems Programming Languages and Compilers Software Engineering
institution Singapore Management University
building SMU Libraries
continent Asia
country Singapore
content_provider SMU Libraries
collection InK@SMU
language English
topic Databases and Information Systems
Programming Languages and Compilers
Software Engineering
description With the emergence of large-scale codebases such as GitHub and Gitee, searching for and reusing existing code can substantially improve developers' productivity. Over the years, many code search tools have been developed. Early tools leveraged information retrieval (IR) techniques to search a frequently changing, large-scale codebase efficiently. However, their search accuracy was low because of the semantic mismatch between query and code. In recent years, many tools have leveraged deep learning (DL) techniques to address this issue, but DL-based tools are slow and their search accuracy is unstable. In this paper, we present CodeMatcher, an IR-based tool that inherits the advantages of DL-based tools in query semantics matching. CodeMatcher first builds an index for a large-scale codebase to reduce search response time. For a given search query, it addresses irrelevant and noisy words in the query, then retrieves candidate code from the indexed codebase via iterative fuzzy search, and finally reranks the candidates based on two designed measures of semantic matching between the query and the candidates. We implemented CodeMatcher as a search engine website. To verify the effectiveness of our tool, we evaluated CodeMatcher on 41k+ open-source Java repositories. Experimental results showed that CodeMatcher can achieve an industrial-level response time (0.3s) on a common server with an Intel i7 CPU. In terms of search accuracy, CodeMatcher significantly outperforms three state-of-the-art tools (DeepCS, UNIF, and CodeHow) and two online search engines (GitHub search and Google search).
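The three query-time steps named in the description (handling noisy query words, iterative fuzzy search over an index, and reranking by two matching measures) can be illustrated with a toy sketch. This is not CodeMatcher's implementation: the record does not specify the noise-word list, the fuzzy-search relaxation rule, or the two reranking measures, so everything below is an assumption made for illustration. In particular, `NOISE_WORDS`, the "drop the last keyword" relaxation, and the name-overlap and body-overlap scores are hypothetical stand-ins.

```python
import re

# Hypothetical noise-word list; the record does not say which words are treated as noise.
NOISE_WORDS = {"how", "to", "a", "an", "the", "in", "of", "do", "i", "for", "with"}


def tokenize(text):
    """Split text into lowercase word tokens, breaking camelCase identifiers apart."""
    spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", text)
    return [t.lower() for t in re.findall(r"[A-Za-z]+", spaced)]


def clean_query(query):
    """Step 1 (assumed form): drop noise words from the query."""
    return [t for t in tokenize(query) if t not in NOISE_WORDS]


def build_index(methods):
    """Inverted index built ahead of time: token -> ids of methods containing it."""
    index = {}
    for method_id, (name, body) in enumerate(methods):
        for token in set(tokenize(name) + tokenize(body)):
            index.setdefault(token, set()).add(method_id)
    return index


def fuzzy_search(index, keywords, min_hits=1):
    """Step 2 (assumed form): require every keyword first, then relax the
    query by dropping the last keyword each round until enough methods match."""
    remaining = list(keywords)
    while remaining:
        hits = set.intersection(*(index.get(k, set()) for k in remaining))
        if len(hits) >= min_hits:
            return hits
        remaining.pop()
    return set()


def rerank(methods, candidates, keywords):
    """Step 3 (assumed form): sort by two illustrative measures, the fraction of
    keywords found in the method name, then the fraction found in the body."""
    def scores(method_id):
        name, body = methods[method_id]
        name_tokens, body_tokens = set(tokenize(name)), set(tokenize(body))
        name_score = sum(k in name_tokens for k in keywords) / len(keywords)
        body_score = sum(k in body_tokens for k in keywords) / len(keywords)
        return (name_score, body_score)

    # Sorting ids first keeps the order of tied candidates deterministic.
    return sorted(sorted(candidates), key=scores, reverse=True)


def search(methods, index, query):
    """Run the three steps and return the ranked method names."""
    keywords = clean_query(query)
    if not keywords:
        return []
    candidates = fuzzy_search(index, keywords)
    return [methods[m][0] for m in rerank(methods, candidates, keywords)]


# Tiny made-up corpus of (method name, method body) pairs.
METHODS = [
    ("readFile", "return new String(Files.readAllBytes(path));"),
    ("writeFile", "Files.write(path, content.getBytes());"),
    ("parseJson", "read the file and parse json text"),
]
INDEX = build_index(METHODS)

print(search(METHODS, INDEX, "how to read a file"))  # ['readFile', 'parseJson']
```

Building the index once and answering each query with set intersections is what keeps lookups fast in a sketch like this; a real tool over tens of thousands of repositories would use a dedicated search engine for the index rather than an in-memory dictionary.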