Learning program semantics with code representations: An empirical study

Learning program semantics is fundamental to many code intelligence tasks, e.g., vulnerability detection and clone detection. A considerable body of existing work proposes diverse approaches to learning program semantics for different tasks, and these approaches have achieved state-of-the-art performance.

Bibliographic Details
Main Authors:	SIOW, Jing Kai; LIU, Shangqing; XIE, Xiaofei; MENG, Guozhu; LIU, Yang
Format:	text
Language:	English
Published:	Institutional Knowledge at Singapore Management University 2022
Subjects:	Programming Languages and Compilers; Software Engineering
Online Access:	https://ink.library.smu.edu.sg/sis_research/7501 https://ink.library.smu.edu.sg/context/sis_research/article/8504/viewcontent/2203.11790.pdf
Institution:	Singapore Management University

id	sg-smu-ink.sis_research-8504
record_format	dspace
spelling	sg-smu-ink.sis_research-8504 2022-11-21T05:28:08Z 2022-03-01T08:00:00Z text application/pdf https://ink.library.smu.edu.sg/sis_research/7501 info:doi/10.1109/SANER53432.2022.00073 https://ink.library.smu.edu.sg/context/sis_research/article/8504/viewcontent/2203.11790.pdf http://creativecommons.org/licenses/by-nc-nd/4.0/ Research Collection School Of Computing and Information Systems eng
institution	Singapore Management University
building	SMU Libraries
continent	Asia
country	Singapore
content_provider	SMU Libraries
collection	InK@SMU
language	English
topic	Programming Languages and Compilers; Software Engineering
description	Learning program semantics is fundamental to many code intelligence tasks, e.g., vulnerability detection and clone detection. A considerable body of existing work proposes diverse approaches to learning program semantics for different tasks, and these approaches have achieved state-of-the-art performance. However, a comprehensive and systematic study that evaluates different program representation techniques across diverse tasks is still missing. Starting from this point, we conduct an empirical study to evaluate different program representation techniques. Specifically, we group current mainstream code representation techniques into four categories, i.e., Feature-based, Sequence-based, Tree-based, and Graph-based program representation techniques, and evaluate their performance on three diverse and popular code intelligence tasks, i.e., Code Classification, Vulnerability Detection, and Clone Detection, on the publicly released benchmark. We further design three research questions (RQs) and conduct a comprehensive analysis to investigate the performance. From the extensive experimental results, we conclude that (1) the graph-based representation is superior to the other selected techniques across these tasks; (2) compared with the node type information used in tree-based and graph-based representations, the node textual information is more critical to learning program semantics; and (3) different tasks require task-specific semantics to achieve their highest performance, but combining program semantics from different dimensions, such as control dependency and data dependency, can still produce promising results.
format	text
author	SIOW, Jing Kai; LIU, Shangqing; XIE, Xiaofei; MENG, Guozhu; LIU, Yang
title	Learning program semantics with code representations: An empirical study
publisher	Institutional Knowledge at Singapore Management University
publishDate	2022
url	https://ink.library.smu.edu.sg/sis_research/7501 https://ink.library.smu.edu.sg/context/sis_research/article/8504/viewcontent/2203.11790.pdf
_version_	1770576359183089664

Similar Items