Siamese: scalable and incremental code clone search via multiple code representations

© 2019, Springer Science+Business Media, LLC, part of Springer Nature. This paper presents a novel code clone search technique that is accurate, incremental, and scalable to hundreds of millions of lines of code. Our technique incorporates multiple code representations (i.e., a technique to transform co...


Bibliographic Details
Main Authors: Chaiyong Ragkhitwetsagul, Jens Krinke
Other Authors: UCL
Format: Article
Published: 2020
Subjects: Computer Science
Online Access: https://repository.li.mahidol.ac.th/handle/123456789/50617
Institution: Mahidol University
id th-mahidol.50617
record_format dspace
spelling th-mahidol.50617 2020-01-27T15:18:39Z
source Empirical Software Engineering. Vol.24, No.4 (2019), 2236-2284
date_issued 2019-08-15
date_deposited 2020-01-27T08:18:39Z
doi 10.1007/s10664-019-09697-7
issn 15737616 13823256
scopus_id 2-s2.0-85062698648
scopus_url https://www.scopus.com/inward/record.uri?partnerID=HzOxMe3b&scp=85062698648&origin=inward
institution Mahidol University
building Mahidol University Library
continent Asia
country Thailand
content_provider Mahidol University Library
collection Mahidol University Institutional Repository
topic Computer Science
description © 2019, Springer Science+Business Media, LLC, part of Springer Nature. This paper presents a novel code clone search technique that is accurate, incremental, and scalable to hundreds of millions of lines of code. Our technique incorporates multiple code representations (i.e., a technique to transform code into various representations to capture different types of clones), query reduction (i.e., a technique to select clone search keywords based on their uniqueness), and a customised ranking function (i.e., a technique to allow a specific clone type to be ranked on top of the search results) to improve clone search performance. We implemented the technique in a clone search tool, called Siamese, and evaluated its search accuracy and scalability on three established clone data sets. Siamese offers the highest mean average precision of 95% and 99% on two clone benchmarks compared to seven state-of-the-art clone detection tools, and reported the largest number of Type-3 clones compared to three other code search engines. Siamese is scalable and can return cloned code snippets within 8 seconds for a code corpus of 365 million lines of code. Using an index of 130,719 GitHub projects, we demonstrate that Siamese’s incremental indexing capability dramatically decreases the index preparation time for large-scale data sets with multiple releases of software projects. The paper discusses the applications of Siamese to facilitate software development and research with two use cases including online code clone detection and clone search with automated license analysis.
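The description characterises query reduction as selecting clone search keywords based on their uniqueness. The following is a minimal illustrative sketch of that general idea, not Siamese's actual implementation: it assumes uniqueness can be approximated by document frequency (how many indexed snippets contain a token) and drops query tokens that occur in too many snippets. The function names, the toy corpus, and the `max_df_ratio` threshold are all hypothetical.

```python
from collections import Counter


def document_frequencies(corpus):
    """Count, for each token, how many indexed snippets contain it."""
    df = Counter()
    for tokens in corpus:
        df.update(set(tokens))  # set(): count each token once per snippet
    return df


def reduce_query(query_tokens, df, corpus_size, max_df_ratio=0.5):
    """Keep query tokens found in at most max_df_ratio of the snippets.

    max_df_ratio is a hypothetical threshold chosen for illustration.
    """
    limit = max_df_ratio * corpus_size
    kept = []
    for tok in query_tokens:
        if df.get(tok, 0) <= limit and tok not in kept:
            kept.append(tok)
    return kept


# Toy index of four tokenised snippets (made up for this example).
corpus = [
    ["int", "i", "=", "0", "while", "i", "<", "n"],
    ["int", "sum", "=", "0", "for", "x", "in", "xs"],
    ["int", "gcd", "=", "a", "while", "b", "!=", "0"],
    ["String", "s", "=", "name", "return", "s"],
]
df = document_frequencies(corpus)

query = ["int", "gcd", "=", "a", "while", "b", "!=", "0"]
reduced = reduce_query(query, df, len(corpus))
print(reduced)  # ['gcd', 'a', 'while', 'b', '!=']
```

In this toy index, `int`, `=`, and `0` occur in three or four of the four snippets and are dropped, so the search is driven by the rarer tokens. A real search engine would typically obtain these frequencies from its index rather than recomputing them per query.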