CodeS: Towards code model generalization under distribution shift

Distribution shift has been a longstanding challenge for the reliable deployment of deep learning (DL) models due to unexpected accuracy degradation. Although DL has become a driving force for large-scale source code analysis in the big code era, limited progress has been made on distribution shift analysis and benchmarking for source code tasks. To fill this gap, this paper proposes CodeS, a distribution shift benchmark dataset for source code learning. Specifically, CodeS supports two programming languages (Java and Python) and five shift types (task, programmer, time-stamp, token, and concrete syntax tree). Extensive experiments based on CodeS reveal that 1) out-of-distribution detectors from other domains (e.g., computer vision) do not generalize to source code, 2) all code classification models suffer from distribution shifts, 3) representation-based shifts have a greater impact on the models than other shift types, and 4) pre-trained bimodal models are relatively more resistant to distribution shifts.
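To make the shift types concrete, below is a minimal sketch, not the authors' implementation, of how a time-stamp shift split could be constructed for a source code dataset. The CodeSample fields and the timestamp_shift_split helper are hypothetical names introduced purely for illustration; the actual format of CodeS is documented with the dataset itself.

from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class CodeSample:
    source: str     # raw source code (Java or Python)
    author: str     # programmer identifier
    timestamp: int  # e.g., commit time as a Unix epoch

def timestamp_shift_split(
    samples: List[CodeSample], cutoff: int
) -> Tuple[List[CodeSample], List[CodeSample]]:
    # Train on code written before the cutoff, test on code written after:
    # coding styles, APIs, and idioms drift over time, which is the kind of
    # gap a time-stamp shift benchmark is meant to expose.
    train = [s for s in samples if s.timestamp < cutoff]
    test = [s for s in samples if s.timestamp >= cutoff]
    return train, test

A programmer shift split would partition on the author field instead, holding out code from programmers unseen during training, while the token and concrete syntax tree shifts (the representation-based ones) are defined over the code itself rather than its provenance.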


Bibliographic Details
Main Authors: HU, Qiang, GUO, Yuejun, XIE, Xiaofei, CORDY, Maxime, MA, Lei, PAPADAKIS, Mike, TRAON, Yves Le
Format: text
Language: English
Published: Institutional Knowledge at Singapore Management University 2023
Subjects: Benchmark datasets; Concrete syntax; Driving forces; Large scale source; Learning models; Model generalization; Source code analysis; Source code learning; distribution shift; Source codes; Time-stamp; Databases and Information Systems
Online Access:https://ink.library.smu.edu.sg/sis_research/8244
Institution: Singapore Management University
id sg-smu-ink.sis_research-9247
record_format dspace
spelling sg-smu-ink.sis_research-9247 2023-10-26T01:36:06Z
CodeS: Towards code model generalization under distribution shift
HU, Qiang; GUO, Yuejun; XIE, Xiaofei; CORDY, Maxime; MA, Lei; PAPADAKIS, Mike; TRAON, Yves Le
2023-05-20T07:00:00Z text https://ink.library.smu.edu.sg/sis_research/8244
info:doi/10.1109/ICSE-NIER58687.2023.00007
Research Collection School Of Computing and Information Systems eng
Institutional Knowledge at Singapore Management University
Benchmark datasets; Concrete syntax; Driving forces; Large scale source; Learning models; Model generalization; Source code analysis; Source code learning; distribution shift; Source codes; Time-stamp; Databases and Information Systems
institution Singapore Management University
building SMU Libraries
continent Asia
country Singapore
city Singapore
content_provider SMU Libraries
collection InK@SMU
language English
topic Benchmark datasets
Concrete syntax
Driving forces
Large scale source
Learning models
Model generalization
Source code analysis
Source code learning
distribution shift
Source codes
Time-stamp
Databases and Information Systems
description Distribution shift has been a longstanding challenge for the reliable deployment of deep learning (DL) models due to unexpected accuracy degradation. Although DL has become a driving force for large-scale source code analysis in the big code era, limited progress has been made on distribution shift analysis and benchmarking for source code tasks. To fill this gap, this paper proposes CodeS, a distribution shift benchmark dataset for source code learning. Specifically, CodeS supports two programming languages (Java and Python) and five shift types (task, programmer, time-stamp, token, and concrete syntax tree). Extensive experiments based on CodeS reveal that 1) out-of-distribution detectors from other domains (e.g., computer vision) do not generalize to source code, 2) all code classification models suffer from distribution shifts, 3) representation-based shifts have a greater impact on the models than other shift types, and 4) pre-trained bimodal models are relatively more resistant to distribution shifts.
format text
author HU, Qiang
GUO, Yuejun
XIE, Xiaofei
CORDY, Maxime
MA, Lei
PAPADAKIS, Mike
TRAON, Yves Le
author_sort HU, Qiang
title CodeS: Towards code model generalization under distribution shift
publisher Institutional Knowledge at Singapore Management University
publishDate 2023
url https://ink.library.smu.edu.sg/sis_research/8244
_version_ 1781793972038926336