The devil is in the tails: How long-tailed code distributions impact large language models

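Learning-based techniques, especially advanced Large Language Models (LLMs) for code, have gained considerable popularity in various software engineering (SE) tasks. However, most existing works focus on designing better learning-based models and pay less attention to the properties of datasets. Learning-based models, including popular LLMs for code, heavily rely on data, and the data's properties (e.g., data distribution) could significantly affect their behavior. We conducted an exploratory study on the distribution of SE data and found that such data usually follows a skewed distribution (i.e., long-tailed distribution) where a small number of classes have an extensive collection of samples, while a large number of classes have very few samples. We investigate three distinct SE tasks and analyze the impacts of long-tailed distribution on the performance of LLMs for code. Our experimental results reveal that the long-tailed distribution has a substantial impact on the effectiveness of LLMs for code. Specifically, LLMs for code perform between 30.0% and 254.0% worse on data samples associated with infrequent labels compared to data samples of frequent labels. Our study provides a better understanding of the effects of long-tailed distributions on popular LLMs for code and insights for the future development of SE automation.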
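As a rough illustration of the long-tailed setting the abstract describes (a few classes with many samples, many classes with few), the Python sketch below splits labels into frequent (head) and infrequent (tail) classes and compares accuracy on each group. The helper names, the 50% head cutoff, the toy labels, and the choice to express the gap relative to the tail score are all assumptions made for illustration; this record does not say how the authors define the split or compute the 30.0% to 254.0% figures.

```python
from collections import Counter


def split_head_tail(labels, head_share=0.5):
    """Split classes into head and tail by cumulative sample share.

    Classes are visited from most to least frequent. A class is put in the
    head while the classes before it cover less than `head_share` of all
    samples (an assumed cutoff); every remaining class goes to the tail.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    head, tail, covered = set(), set(), 0
    for label, n in counts.most_common():
        if covered < head_share * total:
            head.add(label)
        else:
            tail.add(label)
        covered += n
    return head, tail


def accuracy(pairs):
    """Fraction of (gold, predicted) pairs that match; pairs must be non-empty."""
    return sum(1 for gold, pred in pairs if gold == pred) / len(pairs)


# Toy data (hypothetical): class "a" dominates, "b", "c", "d" are rare.
gold = ["a"] * 6 + ["b"] * 2 + ["c", "d"]
pred = ["a"] * 6 + ["b", "a"] + ["a", "d"]

head, tail = split_head_tail(gold)
pairs = list(zip(gold, pred))
head_acc = accuracy([p for p in pairs if p[0] in head])
tail_acc = accuracy([p for p in pairs if p[0] in tail])

# Assumed definition of "worse": the head/tail gap relative to the tail score.
relative_gap = (head_acc - tail_acc) / tail_acc
print(f"head={head_acc:.2f} tail={tail_acc:.2f} gap={relative_gap:.1%}")
```

Measuring the gap against the tail score is one way a figure above 100% can arise; measuring it against the head score would cap it at 100%.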

Bibliographic Details
Main Authors: ZHOU, Xin, KIM, Kisub, XU, Bowen, LIU, Jiakun, HAN, DongGyun, LO, David
Format: text
Language: English
Published: Institutional Knowledge at Singapore Management University 2023
Subjects:
Online Access: https://ink.library.smu.edu.sg/sis_research/8568
https://ink.library.smu.edu.sg/context/sis_research/article/9571/viewcontent/The_devil_is_in_the_tails.pdf
Institution: Singapore Management University
id sg-smu-ink.sis_research-9571
record_format dspace
spelling sg-smu-ink.sis_research-9571 2024-01-25T09:00:48Z The devil is in the tails: How long-tailed code distributions impact large language models ZHOU, Xin KIM, Kisub XU, Bowen LIU, Jiakun HAN, DongGyun LO, David
2023-09-01T07:00:00Z text application/pdf https://ink.library.smu.edu.sg/sis_research/8568 info:doi/10.1109/ASE56229.2023.00157 https://ink.library.smu.edu.sg/context/sis_research/article/9571/viewcontent/The_devil_is_in_the_tails.pdf http://creativecommons.org/licenses/by-nc-nd/4.0/ Research Collection School Of Computing and Information Systems eng Institutional Knowledge at Singapore Management University Code distributions Data distribution Data properties Data sample Engineering tasks Language model Learning Based Models Long-tailed distributions Number of class Property Databases and Information Systems Software Engineering
institution Singapore Management University
building SMU Libraries
continent Asia
country Singapore
content_provider SMU Libraries
collection InK@SMU
language English
topic Code distributions
Data distribution
Data properties
Data sample
Engineering tasks
Language model
Learning Based Models
Long-tailed distributions
Number of class
Property
Databases and Information Systems
Software Engineering
description Learning-based techniques, especially advanced Large Language Models (LLMs) for code, have gained considerable popularity in various software engineering (SE) tasks. However, most existing works focus on designing better learning-based models and pay less attention to the properties of datasets. Learning-based models, including popular LLMs for code, heavily rely on data, and the data's properties (e.g., data distribution) could significantly affect their behavior. We conducted an exploratory study on the distribution of SE data and found that such data usually follows a skewed distribution (i.e., long-tailed distribution) where a small number of classes have an extensive collection of samples, while a large number of classes have very few samples. We investigate three distinct SE tasks and analyze the impacts of long-tailed distribution on the performance of LLMs for code. Our experimental results reveal that the long-tailed distribution has a substantial impact on the effectiveness of LLMs for code. Specifically, LLMs for code perform between 30.0% and 254.0% worse on data samples associated with infrequent labels compared to data samples of frequent labels. Our study provides a better understanding of the effects of long-tailed distributions on popular LLMs for code and insights for the future development of SE automation.
format text
author ZHOU, Xin
KIM, Kisub
XU, Bowen
LIU, Jiakun
HAN, DongGyun
LO, David
author_sort ZHOU, Xin
title The devil is in the tails: How long-tailed code distributions impact large language models
title_sort devil is in the tails: how long-tailed code distributions impact large language models
publisher Institutional Knowledge at Singapore Management University
publishDate 2023
url https://ink.library.smu.edu.sg/sis_research/8568
https://ink.library.smu.edu.sg/context/sis_research/article/9571/viewcontent/The_devil_is_in_the_tails.pdf
_version_ 1789483277748273152