Classifying source code: How far can compressor-based classifiers go?

Pre-trained language models of code, which rely on large-scale datasets, millions of trainable parameters, and costly computational resources, have achieved phenomenal success. Recently, researchers have proposed a compressor-based classifier (Cbc); it trains no parameters but is found to...

Bibliographic Details
Main Author: YANG, Zhou
Format: text
Language: English
Published: Institutional Knowledge at Singapore Management University 2024
Subjects: Defect Software Prediction; Efficient Learning; Robustness; Software Engineering
Online Access: https://ink.library.smu.edu.sg/sis_research/8920
https://ink.library.smu.edu.sg/context/sis_research/article/9923/viewcontent/3639478.3641229_pvoa_cc_by.pdf
Institution: Singapore Management University
id sg-smu-ink.sis_research-9923
record_format dspace
spelling sg-smu-ink.sis_research-9923; 2024-10-17T06:04:06Z; 2024-04-01T07:00:00Z; text; application/pdf; info:doi/10.1145/3639478.3641229; http://creativecommons.org/licenses/by/4.0/; Research Collection School Of Computing and Information Systems; eng
institution Singapore Management University
building SMU Libraries
continent Asia
country Singapore
content_provider SMU Libraries
collection InK@SMU
language English
topic Defect Software Prediction
Efficient Learning
Robustness
Software Engineering
description Pre-trained language models of code, which rely on large-scale datasets, millions of trainable parameters, and costly computational resources, have achieved phenomenal success. Recently, researchers proposed a compressor-based classifier (Cbc) that trains no parameters yet has been reported to outperform BERT. We conduct the first empirical study of whether this lightweight alternative can accurately classify source code. Our study goes beyond applying Cbc to code-related tasks. We first identify an issue in the original implementation that causes it to overestimate Cbc's performance. After correction, Cbc's performance on defect prediction drops from 80.7% to 63.0%, which is still comparable to CodeBERT (63.7%). We also find that hyperparameter settings affect Cbc's performance. Finally, results show that Cbc can outperform CodeBERT when the training data is small, making it a good alternative in low-resource settings.
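As an illustration of the kind of classifier the description refers to, the following is a minimal sketch, assuming a gzip compressor, normalized compression distance (NCD), and k-nearest-neighbor voting. The description does not specify these details, so the compressor choice, the function names (`compressed_len`, `ncd`, `classify`), the value of `k`, and the toy code snippets are all assumptions made for illustration; this is not the study's implementation and does not reproduce its results.

```python
import gzip
from collections import Counter


def compressed_len(text: str) -> int:
    """Length in bytes of the gzip-compressed UTF-8 encoding of text."""
    return len(gzip.compress(text.encode("utf-8")))


def ncd(a: str, b: str) -> float:
    """Normalized compression distance between two strings (assumed metric)."""
    ca, cb = compressed_len(a), compressed_len(b)
    cab = compressed_len(a + " " + b)
    return (cab - min(ca, cb)) / max(ca, cb)


def classify(sample: str, train: list[tuple[str, str]], k: int = 1) -> str:
    """Label a sample by majority vote among its k nearest training items.

    No parameters are trained: the labeled examples are the whole "model".
    Votes are counted in order of increasing distance, so a tie in counts
    goes to the label that was encountered first (the nearer neighbor).
    """
    ranked = sorted(train, key=lambda pair: ncd(sample, pair[0]))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy data; the labels are illustrative only.
train = [
    ("int add(int a, int b) { return a + b; }", "clean"),
    ("char *p = malloc(10); strcpy(p, input); free(p); free(p);", "defective"),
]
print(classify("int sub(int a, int b) { return a - b; }", train, k=1))
```

In a sketch like this, `k` and the choice of compressor are the hyperparameters, which is consistent with the description's remark that hyperparameter settings affect performance.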
format text
author YANG, Zhou
title Classifying source code: How far can compressor-based classifiers go?
publisher Institutional Knowledge at Singapore Management University
publishDate 2024
url https://ink.library.smu.edu.sg/sis_research/8920
https://ink.library.smu.edu.sg/context/sis_research/article/9923/viewcontent/3639478.3641229_pvoa_cc_by.pdf
_version_ 1814047946855940096