Classifying source code: How far can compressor-based classifiers go?

Pre-trained language models of code, which rely on large-scale datasets, millions of trainable parameters, and costly computational resources, have achieved phenomenal success. Recently, researchers have proposed a compressor-based classifier (Cbc); it trains no parameters but is found to...

Bibliographic Details
Main Author:	YANG, Zhou
Format:	text
Language:	English
Published:	Institutional Knowledge at Singapore Management University 2024
Subjects:	Defect Software Prediction Efficient Learning Robustness Software Engineering
Online Access:	https://ink.library.smu.edu.sg/sis_research/8920 https://ink.library.smu.edu.sg/context/sis_research/article/9923/viewcontent/3639478.3641229_pvoa_cc_by.pdf
Institution:	Singapore Management University

id	sg-smu-ink.sis_research-9923
record_format	dspace
timestamp	2024-10-17T06:04:06Z
date	2024-04-01T07:00:00Z
mime_type	application/pdf
doi	info:doi/10.1145/3639478.3641229
rights	http://creativecommons.org/licenses/by/4.0/
source	Research Collection School Of Computing and Information Systems
institution	Singapore Management University
building	SMU Libraries
continent	Asia
country	Singapore
content_provider	SMU Libraries
collection	InK@SMU
language	English
topic	Defect Software Prediction Efficient Learning Robustness Software Engineering
description	Pre-trained language models of code, which rely on large-scale datasets, millions of trainable parameters, and costly computational resources, have achieved phenomenal success. Recently, researchers have proposed a compressor-based classifier (Cbc); it trains no parameters but is found to outperform BERT. We conduct the first empirical study to explore whether this lightweight alternative can accurately classify source code. Our study goes beyond applying Cbc to code-related tasks. We first identify an issue that causes the original implementation to overestimate Cbc's performance. After correction, Cbc's performance on defect prediction drops from 80.7% to 63.0%, which is still comparable to CodeBERT (63.7%). We also find that hyperparameter settings affect performance. In addition, the results show that Cbc can outperform CodeBERT when the training data is small, making it a good alternative in low-resource settings.
format	text
author	YANG, Zhou
title	Classifying source code: How far can compressor-based classifiers go?
publisher	Institutional Knowledge at Singapore Management University
publishDate	2024
url	https://ink.library.smu.edu.sg/sis_research/8920 https://ink.library.smu.edu.sg/context/sis_research/article/9923/viewcontent/3639478.3641229_pvoa_cc_by.pdf
