Learning program semantics with code representations: An empirical study

Learning program semantics is fundamental to various code intelligence tasks, e.g., vulnerability detection and clone detection. A considerable number of existing works propose diverse approaches to learning program semantics for different tasks, and these works have achieved state-of-the-a...

Bibliographic Details
Main Authors: SIOW, Jing Kai, LIU, Shangqing, XIE, Xiaofei, MENG, Guozhu, LIU, Yang
Format: text
Language: English
Published: Institutional Knowledge at Singapore Management University 2022
Subjects: Programming Languages and Compilers; Software Engineering
Online Access: https://ink.library.smu.edu.sg/sis_research/7501
https://ink.library.smu.edu.sg/context/sis_research/article/8504/viewcontent/2203.11790.pdf
Institution: Singapore Management University
id sg-smu-ink.sis_research-8504
record_format dspace
spelling sg-smu-ink.sis_research-8504 2022-11-21T05:28:08Z
2022-03-01T08:00:00Z text application/pdf https://ink.library.smu.edu.sg/sis_research/7501 info:doi/10.1109/SANER53432.2022.00073 https://ink.library.smu.edu.sg/context/sis_research/article/8504/viewcontent/2203.11790.pdf http://creativecommons.org/licenses/by-nc-nd/4.0/ Research Collection School Of Computing and Information Systems eng Institutional Knowledge at Singapore Management University Programming Languages and Compilers Software Engineering
institution Singapore Management University
building SMU Libraries
continent Asia
country Singapore
content_provider SMU Libraries
collection InK@SMU
language English
topic Programming Languages and Compilers
Software Engineering
description Learning program semantics is fundamental to various code intelligence tasks, e.g., vulnerability detection and clone detection. A considerable number of existing works propose diverse approaches to learning program semantics for different tasks, and these works have achieved state-of-the-art performance. However, a comprehensive and systematic study evaluating different program representation techniques across diverse tasks is still missing. Starting from this point, we conduct an empirical study to evaluate different program representation techniques. Specifically, we categorize current mainstream code representation techniques into four categories, i.e., Feature-based, Sequence-based, Tree-based, and Graph-based program representation techniques, and evaluate their performance on three diverse and popular code intelligence tasks, i.e., Code Classification, Vulnerability Detection, and Clone Detection, on the publicly released benchmark. We further design three research questions (RQs) and conduct a comprehensive analysis to investigate the performance. Based on the extensive experimental results, we conclude that (1) the graph-based representation is superior to the other selected techniques across these tasks; (2) compared with the node type information used in tree-based and graph-based representations, the node textual information is more critical to learning program semantics; and (3) different tasks require task-specific semantics to achieve their highest performance; however, combining various program semantics from different dimensions, such as control dependency and data dependency, can still produce promising results.
format text
author SIOW, Jing Kai
LIU, Shangqing
XIE, Xiaofei
MENG, Guozhu
LIU, Yang
title Learning program semantics with code representations: An empirical study
publisher Institutional Knowledge at Singapore Management University
publishDate 2022
url https://ink.library.smu.edu.sg/sis_research/7501
https://ink.library.smu.edu.sg/context/sis_research/article/8504/viewcontent/2203.11790.pdf
_version_ 1770576359183089664