Active code learning: Benchmarking sample-efficient training of code models

The costly human effort required to prepare training data for machine learning (ML) models hinders their practical development and use in software engineering (ML4Code), especially when budgets are limited. Efficiently training code models with less human effort has therefore become a pressing problem. Active learning is a technique that addresses this issue: it allows developers to train a model on less data while still reaching the desired performance, and it has been well studied in the computer vision and natural language processing domains. Unfortunately, no existing work explores the effectiveness of active learning for code models. In this paper, we bridge this gap by building the first benchmark for this critical problem, active code learning. Specifically, we collect 11 acquisition functions (the functions used for data selection in active learning) from existing work and adapt them to code-related tasks. We then conduct an empirical study to check whether these acquisition functions maintain their performance on code data. The results demonstrate that the choice of features used for data selection strongly affects active learning, and that using output vectors to select data is the best choice. For the code summarization task, active code learning is ineffective, producing models that fall more than 29.64% short of the expected performance. Furthermore, we explore future directions of active code learning through an exploratory study: we propose replacing distance calculation methods with evaluation metrics and find a correlation between these evaluation-based distance methods and the performance of code models.
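
As context for the abstract above, the snippet below is a minimal, generic sketch of one active-learning round using an uncertainty (entropy) acquisition function computed over the model's output vectors. It is illustrative only and not taken from the paper: the names (model, labeled_X, unlabeled_X, budget) are assumptions, the model is assumed to be any scikit-learn-style classifier exposing fit and predict_proba, and entropy stands in for the 11 acquisition functions the benchmark actually studies.

    import numpy as np

    def entropy_acquisition(probs: np.ndarray) -> np.ndarray:
        """Score each unlabeled sample by the entropy of the model's output
        vector; higher entropy means the model is less certain about it."""
        eps = 1e-12  # avoid log(0)
        return -np.sum(probs * np.log(probs + eps), axis=1)

    def active_learning_round(model, labeled_X, labeled_y, unlabeled_X, budget):
        """One round: train on the current labeled set, then pick `budget`
        samples from the unlabeled pool whose output vectors have the
        highest entropy, i.e., the samples worth sending for labeling."""
        model.fit(labeled_X, labeled_y)           # train on current labels
        probs = model.predict_proba(unlabeled_X)  # output vector per sample
        scores = entropy_acquisition(probs)
        return np.argsort(scores)[::-1][:budget]  # indices to label next

In a full loop, the returned indices would be labeled by humans, moved from the unlabeled pool into the labeled set, and the round repeated until the labeling budget is exhausted.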

Bibliographic Details
Main Authors: HU, Qiang, GUO, Yuejun, XIE, Xiaofei, CORDY, Maxime, MA, Lei, PAPADAKIS, Mike, TRAON, Yves Le
Format: text
Language: English
Published: Institutional Knowledge at Singapore Management University 2024
Subjects: Codes; Data Models; Task Analysis; Training; Feature Extraction; Training Data; Labeling; Active Learning; Machine Learning For Code; Benchmark; Empirical Analysis; Software Engineering
Online Access:https://ink.library.smu.edu.sg/sis_research/8695
https://ink.library.smu.edu.sg/context/sis_research/article/9698/viewcontent/ActiveCodeLearning_av.pdf
Institution: Singapore Management University
Language: English
id sg-smu-ink.sis_research-9698
record_format dspace
spelling sg-smu-ink.sis_research-9698 2024-03-28T08:41:03Z
Active code learning: Benchmarking sample-efficient training of code models
HU, Qiang; GUO, Yuejun; XIE, Xiaofei; CORDY, Maxime; MA, Lei; PAPADAKIS, Mike; TRAON, Yves Le
The costly human effort required to prepare training data for machine learning (ML) models hinders their practical development and use in software engineering (ML4Code), especially when budgets are limited. Efficiently training code models with less human effort has therefore become a pressing problem. Active learning is a technique that addresses this issue: it allows developers to train a model on less data while still reaching the desired performance, and it has been well studied in the computer vision and natural language processing domains. Unfortunately, no existing work explores the effectiveness of active learning for code models. In this paper, we bridge this gap by building the first benchmark for this critical problem, active code learning. Specifically, we collect 11 acquisition functions (the functions used for data selection in active learning) from existing work and adapt them to code-related tasks. We then conduct an empirical study to check whether these acquisition functions maintain their performance on code data. The results demonstrate that the choice of features used for data selection strongly affects active learning, and that using output vectors to select data is the best choice. For the code summarization task, active code learning is ineffective, producing models that fall more than 29.64% short of the expected performance. Furthermore, we explore future directions of active code learning through an exploratory study: we propose replacing distance calculation methods with evaluation metrics and find a correlation between these evaluation-based distance methods and the performance of code models.
2024-01-01T08:00:00Z text application/pdf
https://ink.library.smu.edu.sg/sis_research/8695
info:doi/10.1109/TSE.2024.3376964
https://ink.library.smu.edu.sg/context/sis_research/article/9698/viewcontent/ActiveCodeLearning_av.pdf
http://creativecommons.org/licenses/by-nc-nd/4.0/
Research Collection School Of Computing and Information Systems
eng
Institutional Knowledge at Singapore Management University
Codes; Data Models; Task Analysis; Training; Feature Extraction; Training Data; Labeling; Active Learning; Machine Learning For Code; Benchmark; Empirical Analysis; Software Engineering
institution Singapore Management University
building SMU Libraries
continent Asia
country Singapore
content_provider SMU Libraries
collection InK@SMU
language English
topic Codes
Data Models
Task Analysis
Training
Feature Extraction
Training Data
Labeling
Active Learning
Machine Learning For Code
Benchmark
Empirical Analysis
Software Engineering
description The costly human effort required to prepare training data for machine learning (ML) models hinders their practical development and use in software engineering (ML4Code), especially when budgets are limited. Efficiently training code models with less human effort has therefore become a pressing problem. Active learning is a technique that addresses this issue: it allows developers to train a model on less data while still reaching the desired performance, and it has been well studied in the computer vision and natural language processing domains. Unfortunately, no existing work explores the effectiveness of active learning for code models. In this paper, we bridge this gap by building the first benchmark for this critical problem, active code learning. Specifically, we collect 11 acquisition functions (the functions used for data selection in active learning) from existing work and adapt them to code-related tasks. We then conduct an empirical study to check whether these acquisition functions maintain their performance on code data. The results demonstrate that the choice of features used for data selection strongly affects active learning, and that using output vectors to select data is the best choice. For the code summarization task, active code learning is ineffective, producing models that fall more than 29.64% short of the expected performance. Furthermore, we explore future directions of active code learning through an exploratory study: we propose replacing distance calculation methods with evaluation metrics and find a correlation between these evaluation-based distance methods and the performance of code models.
format text
author HU, Qiang
GUO, Yuejun
XIE, Xiaofei
CORDY, Maxime
MA, Lei
PAPADAKIS, Mike
TRAON, Yves Le
author_sort HU, Qiang
title Active code learning: Benchmarking sample-efficient training of code models
title_sort active code learning: benchmarking sample-efficient training of code models
publisher Institutional Knowledge at Singapore Management University
publishDate 2024
url https://ink.library.smu.edu.sg/sis_research/8695
https://ink.library.smu.edu.sg/context/sis_research/article/9698/viewcontent/ActiveCodeLearning_av.pdf
_version_ 1795302175560171520