Modeling functional similarity in source code with graph-based Siamese networks

Code clones are duplicate code fragments that share (nearly) similar syntax or semantics. Code clone detection plays an important role in software maintenance, code refactoring, and reuse. A substantial amount of research has been conducted in the past to detect clones. A majority of these approache...

Bibliographic Details
Main Authors: MEHROTRA, Nikita, AGARWAL, Navdha, GUPTA, Piyush, ANAND, Saket, LO, David, PURANDARE, Rahul
Format: text
Language: English
Published: Institutional Knowledge at Singapore Management University 2022
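Subjects: Program representation learning; Semantic code clones; graph-based neural networks; siamese neural networks; program dependency graphs; Graphics and Human Computer Interfaces; OS and Networks; Software Engineering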
Online Access: https://ink.library.smu.edu.sg/sis_research/7658
https://ink.library.smu.edu.sg/context/sis_research/article/8661/viewcontent/2011.11228.pdf
Institution: Singapore Management University
Language: English
id sg-smu-ink.sis_research-8661
record_format dspace
spelling sg-smu-ink.sis_research-8661 2023-01-10T03:46:06Z 2022-02-01T08:00:00Z text application/pdf info:doi/10.1109/TSE.2021.3105556 http://creativecommons.org/licenses/by-nc-nd/4.0/ Research Collection School Of Computing and Information Systems eng
institution Singapore Management University
building SMU Libraries
continent Asia
country Singapore
content_provider SMU Libraries
collection InK@SMU
language English
topic Program representation learning
Semantic code clones
graph-based neural networks
siamese neural networks
program dependency graphs
Graphics and Human Computer Interfaces
OS and Networks
Software Engineering
description Code clones are duplicate code fragments that share the same or nearly the same syntax or semantics. Code clone detection plays an important role in software maintenance, code refactoring, and reuse. A substantial amount of research has been conducted to detect clones. Most of these approaches use lexical and syntactic information, and only a few target semantic clones. Recently, motivated by the success of deep learning models in other fields, including natural language processing and computer vision, researchers have attempted to adopt deep learning techniques to detect code clones. These approaches use lexical information (tokens) and/or syntactic structures such as abstract syntax trees (ASTs) to detect code clones. However, they do not make sufficient use of the available structural and semantic information, which limits their capabilities. This paper addresses the problem of semantic code clone detection using program dependency graphs and geometric neural networks, leveraging the structured syntactic and semantic information. We have developed a prototype tool, HOLMES, based on our novel approach, and empirically evaluated it on popular code clone benchmarks. Our results show that HOLMES performs considerably better than the state-of-the-art tool TBCCD. We also evaluated HOLMES on unseen projects and performed cross-dataset experiments to assess its generalizability. Our results affirm that HOLMES outperforms TBCCD, since most of the pairs that HOLMES detected were either undetected or suboptimally reported by TBCCD.
format text
author MEHROTRA, Nikita
AGARWAL, Navdha
GUPTA, Piyush
ANAND, Saket
LO, David
PURANDARE, Rahul
title Modeling functional similarity in source code with graph-based Siamese networks
publisher Institutional Knowledge at Singapore Management University
publishDate 2022