Interpreting CodeBERT for semantic code clone detection

Accurate detection of semantic code clones has many applications in software engineering but is challenging because of lexical, syntactic, or structural dissimilarities in code. CodeBERT, a popular deep neural network based pre-trained code model, can detect code clones with a high accuracy. However, its performance on unseen data is reported to be lower. A challenge is to interpret CodeBERT's clone detection behavior and isolate the causes of mispredictions. In this paper, we evaluate CodeBERT and interpret its clone detection behavior on the SemanticCloneBench dataset focusing on Java and Python clone pairs. We introduce the use of a black-box model interpretation technique, SHAP, to identify the core features of code that CodeBERT pays attention to for clone prediction. We first perform a manual similarity analysis over a sample of clone pairs to revise clone labels and to assign labels to statements indicating their contribution to core functionality. We then evaluate the correlation between the human and model's interpretation of core features of code as a measure of CodeBERT's trustworthiness. We observe only a weak correlation. Finally, we present examples on how to identify causes of mispredictions for CodeBERT. Our technique can help researchers to assess and fine-tune their models' performance.
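The abstract describes attributing a clone-detection score to individual code tokens with SHAP, a Shapley-value-based interpretation technique. As a rough intuition for what such an attribution computes, here is a minimal, self-contained sketch of exact Shapley values over code tokens. It is a toy illustration only: it does not use the SHAP library or CodeBERT, and the payoff function (Jaccard similarity between token sets, a stand-in chosen here for simplicity) is an assumption, not the paper's method.

```python
from itertools import combinations
from math import factorial

def shapley_token_attribution(tokens_a, tokens_b):
    """Exact Shapley values: each unique token of fragment A is a 'player';
    the payoff of a coalition S is the Jaccard similarity between S and
    fragment B's token set. Exponential in the number of unique tokens,
    so suitable for toy-sized inputs only."""
    b = set(tokens_b)
    players = list(dict.fromkeys(tokens_a))  # unique tokens, order preserved
    n = len(players)

    def payoff(coalition):
        s = set(coalition)
        union = s | b
        return len(s & b) / len(union) if union else 0.0

    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for r in range(len(others) + 1):
            # Standard Shapley weight for coalitions of size r
            weight = factorial(r) * factorial(n - r - 1) / factorial(n)
            for subset in combinations(others, r):
                # Marginal contribution of p to this coalition
                total += weight * (payoff(subset + (p,)) - payoff(subset))
        phi[p] = total
    return phi
```

Tokens shared between the two fragments receive positive attribution (they raise the similarity), while tokens unique to one fragment receive negative attribution; by the Shapley efficiency property, the values sum to the full-set payoff. SHAP applies the same idea, approximately and at scale, to a model's prediction function instead of a hand-written similarity.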

Bibliographic Details
Main Authors: ABID, Shamsa, CAI, Xuemeng, JIANG, Lingxiao
Format: text
Language: English
Published: Institutional Knowledge at Singapore Management University 2023
Subjects:
Online Access:https://ink.library.smu.edu.sg/sis_research/9313
https://ink.library.smu.edu.sg/context/sis_research/article/10313/viewcontent/apsec23interpretCodeBERT.pdf
Institution: Singapore Management University
id sg-smu-ink.sis_research-10313
record_format dspace
publishDate 2023-12-01T08:00:00Z
doi 10.1109/APSEC60848.2023.00033
rights http://creativecommons.org/licenses/by-nc-nd/4.0/
collection Research Collection School Of Computing and Information Systems
institution Singapore Management University
building SMU Libraries
continent Asia
country Singapore
content_provider SMU Libraries
collection InK@SMU
language English
topic Codes
Correlation
Semantics
Cloning
Predictive models
Syntactics
Software reliability
Explainable AI
Model Interpretation
Black-box
Semantic Clone Detection
Code Model
Deep Learning
Software Engineering