Are the code snippets what we are searching for? A benchmark and an empirical study on code search with natural-language queries
Code search methods, especially those that allow programmers to raise queries in a natural language, play an important role in software development. They help to improve programmers' productivity by returning sample code snippets from the Internet and/or source-code repositories for their natural-language queries.
Main Authors: | YAN, Shuhan; YU, Hang; CHEN, Yuting; SHEN, Beijun |
---|---|
Format: | text |
Language: | English |
Published: | Institutional Knowledge at Singapore Management University, 2020 |
Subjects: | natural-language code search; benchmarking; empirical study; information retrieval; machine learning; deep learning; word embedding; Software Engineering |
Online Access: | https://ink.library.smu.edu.sg/sis_research/5975 https://ink.library.smu.edu.sg/context/sis_research/article/6978/viewcontent/saner20cosbench.pdf |
Institution: | Singapore Management University |
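The abstract notes that CosBench ships scripts for calculating four metrics on code search results, though the record does not name them. As a hedged illustration of the kind of ranked-retrieval metrics such scripts typically compute, here is a minimal sketch of Precision@k and Mean Reciprocal Rank over hypothetical snippet IDs (all data below is invented, not from the paper):

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k returned snippets that are relevant."""
    top_k = ranked_ids[:k]
    return sum(1 for r in top_k if r in relevant_ids) / k

def mean_reciprocal_rank(all_rankings, all_relevant):
    """Average over queries of 1/rank of the first relevant hit (0 if none)."""
    total = 0.0
    for ranked, relevant in zip(all_rankings, all_relevant):
        rr = 0.0
        for rank, snippet_id in enumerate(ranked, start=1):
            if snippet_id in relevant:
                rr = 1.0 / rank
                break
        total += rr
    return total / len(all_rankings)

# Hypothetical results: snippet IDs ranked by a search method, per query.
ranked = [["s3", "s1", "s7"], ["s2", "s9", "s4"]]
relevant = [{"s1"}, {"s5"}]
print(precision_at_k(ranked[0], relevant[0], 3))  # 1 relevant in top 3 -> 0.333...
print(mean_reciprocal_rank(ranked, relevant))     # (1/2 + 0) / 2 = 0.25
```

This is only a generic sketch of standard IR evaluation; the paper's actual four metrics and scoring scripts are defined in CosBench itself.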
id | sg-smu-ink.sis_research-6978 |
---|---|
record_format | dspace |
spelling | sg-smu-ink.sis_research-6978 2021-05-31T07:05:59Z Are the code snippets what we are searching for? A benchmark and an empirical study on code search with natural-language queries YAN, Shuhan YU, Hang CHEN, Yuting SHEN, Beijun Code search methods, especially those that allow programmers to raise queries in a natural language, play an important role in software development. They help to improve programmers' productivity by returning sample code snippets from the Internet and/or source-code repositories for their natural-language queries. Meanwhile, there are many code search methods in the literature that support natural-language queries. It is difficult to recognize the strengths and weaknesses of each method and to choose the right one for different usage scenarios, because (1) the implementations of those methods and the datasets for evaluating them are usually not publicly available, and (2) some methods leverage different training datasets or auxiliary data sources, so their effectiveness cannot be fairly measured and may be negatively affected in practical use. To build a common ground for measuring code search methods, this paper builds CosBench, a dataset that consists of 1000 projects, 52 code-independent natural-language queries with ground truths, and a set of scripts for calculating four metrics on code search results. We have evaluated four IR (Information Retrieval)-based and two DL (Deep Learning)-based code search methods on CosBench. The empirical evaluation results clearly show the usefulness of the CosBench dataset and the various strengths of each code search method. We found that DL-based methods are more suitable for queries on reusing code, and IR-based ones for queries on resolving bugs and learning API uses. 2020-02-01T08:00:00Z text application/pdf https://ink.library.smu.edu.sg/sis_research/5975 info:doi/10.1109/SANER48275.2020.9054840 https://ink.library.smu.edu.sg/context/sis_research/article/6978/viewcontent/saner20cosbench.pdf http://creativecommons.org/licenses/by-nc-nd/4.0/ Research Collection School Of Computing and Information Systems eng Institutional Knowledge at Singapore Management University natural-language code search benchmarking empirical study information retrieval machine learning deep learning word embedding Software Engineering |
institution | Singapore Management University |
building | SMU Libraries |
continent | Asia |
country | Singapore |
content_provider | SMU Libraries |
collection | InK@SMU |
language | English |
topic | natural-language code search benchmarking empirical study information retrieval machine learning deep learning word embedding Software Engineering |
spellingShingle | natural-language code search benchmarking empirical study information retrieval machine learning deep learning word embedding Software Engineering YAN, Shuhan YU, Hang CHEN, Yuting SHEN, Beijun Are the code snippets what we are searching for? A benchmark and an empirical study on code search with natural-language queries |
description | Code search methods, especially those that allow programmers to raise queries in a natural language, play an important role in software development. They help to improve programmers' productivity by returning sample code snippets from the Internet and/or source-code repositories for their natural-language queries. Meanwhile, there are many code search methods in the literature that support natural-language queries. It is difficult to recognize the strengths and weaknesses of each method and to choose the right one for different usage scenarios, because (1) the implementations of those methods and the datasets for evaluating them are usually not publicly available, and (2) some methods leverage different training datasets or auxiliary data sources, so their effectiveness cannot be fairly measured and may be negatively affected in practical use. To build a common ground for measuring code search methods, this paper builds CosBench, a dataset that consists of 1000 projects, 52 code-independent natural-language queries with ground truths, and a set of scripts for calculating four metrics on code search results. We have evaluated four IR (Information Retrieval)-based and two DL (Deep Learning)-based code search methods on CosBench. The empirical evaluation results clearly show the usefulness of the CosBench dataset and the various strengths of each code search method. We found that DL-based methods are more suitable for queries on reusing code, and IR-based ones for queries on resolving bugs and learning API uses. |
format | text |
author | YAN, Shuhan YU, Hang CHEN, Yuting SHEN, Beijun |
author_facet | YAN, Shuhan YU, Hang CHEN, Yuting SHEN, Beijun |
author_sort | YAN, Shuhan |
title | Are the code snippets what we are searching for? A benchmark and an empirical study on code search with natural-language queries |
title_short | Are the code snippets what we are searching for? A benchmark and an empirical study on code search with natural-language queries |
title_full | Are the code snippets what we are searching for? A benchmark and an empirical study on code search with natural-language queries |
title_fullStr | Are the code snippets what we are searching for? A benchmark and an empirical study on code search with natural-language queries |
title_full_unstemmed | Are the code snippets what we are searching for? A benchmark and an empirical study on code search with natural-language queries |
title_sort | are the code snippets what we are searching for? a benchmark and an empirical study on code search with natural-language queries |
publisher | Institutional Knowledge at Singapore Management University |
publishDate | 2020 |
url | https://ink.library.smu.edu.sg/sis_research/5975 https://ink.library.smu.edu.sg/context/sis_research/article/6978/viewcontent/saner20cosbench.pdf |
_version_ | 1770575711778635776 |