The devil is in the tails: How long-tailed code distributions impact large language models
Learning-based techniques, especially advanced Large Language Models (LLMs) for code, have gained considerable popularity in various software engineering (SE) tasks. However, most existing works focus on designing better learning-based models and pay less attention to the properties of datasets. Learning-based models, including popular LLMs for code, heavily rely on data, and the data's properties (e.g., data distribution) can significantly affect their behavior. We conducted an exploratory study on the distribution of SE data and found that such data usually follows a skewed, long-tailed distribution in which a small number of classes have an extensive collection of samples, while a large number of classes have very few samples. We investigate three distinct SE tasks and analyze the impact of the long-tailed distribution on the performance of LLMs for code. Our experimental results reveal that the long-tailed distribution has a substantial impact on the effectiveness of LLMs for code. Specifically, LLMs for code perform between 30.0% and 254.0% worse on data samples associated with infrequent labels than on data samples with frequent labels. Our study provides a better understanding of the effects of long-tailed distributions on popular LLMs for code and offers insights for the future development of SE automation.
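The following is a minimal sketch, not the paper's released artifact, of the kind of head-versus-tail analysis the abstract describes: rank labels by frequency, treat the most frequent label types as the "head" and the rest as the "tail", and compare a model's accuracy across the two groups. The names `labels`, `predictions`, and `head_tail_gap`, the 20% head cutoff, and the toy data are all illustrative assumptions, not values or code taken from the paper.

```python
# A minimal sketch, NOT the paper's artifact: estimate how skewed a label
# distribution is and compare model accuracy on frequent ("head") vs.
# infrequent ("tail") labels, mirroring the comparison the abstract describes.
# `labels`, `predictions`, and the 20% head cutoff are hypothetical.
from collections import Counter


def head_tail_gap(labels, predictions, head_fraction=0.2):
    """Return (head accuracy, tail accuracy, relative drop in % on the tail)."""
    counts = Counter(labels)
    # Rank labels from most to least frequent; the top `head_fraction` of
    # label *types* form the head, everything else the tail.
    ranked = [label for label, _ in counts.most_common()]
    head = set(ranked[: max(1, int(len(ranked) * head_fraction))])

    def accuracy(in_head):
        pairs = [(y, p) for y, p in zip(labels, predictions) if (y in head) == in_head]
        return sum(y == p for y, p in pairs) / len(pairs) if pairs else float("nan")

    head_acc, tail_acc = accuracy(True), accuracy(False)
    # One possible reading of the "X% worse on infrequent labels" framing:
    # relative gap with respect to tail accuracy.
    rel_drop = (head_acc - tail_acc) / tail_acc * 100 if tail_acc else float("inf")
    return head_acc, tail_acc, rel_drop


# Toy usage with made-up data: one frequent label ("a") and two rare ones.
labels = ["a"] * 50 + ["b"] * 5 + ["c"] * 2
predictions = ["a"] * 50 + ["b", "b", "x", "x", "x"] + ["c", "x"]
print(head_tail_gap(labels, predictions))  # -> (1.0, ~0.43, ~133% worse on the tail)
```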
Main Authors: | ZHOU, Xin; KIM, Kisub; XU, Bowen; LIU, Jiakun; HAN, DongGyun; LO, David |
---|---|
Format: | text (application/pdf) |
Language: | English |
Published: | Institutional Knowledge at Singapore Management University, 2023 |
Collection: | Research Collection School Of Computing and Information Systems |
DOI: | 10.1109/ASE56229.2023.00157 |
License: | http://creativecommons.org/licenses/by-nc-nd/4.0/ |
Subjects: | Code distributions; Data distribution; Data properties; Data sample; Engineering tasks; Language model; Learning Based Models; Long-tailed distributions; Number of class; Property; Databases and Information Systems; Software Engineering |
Online Access: | https://ink.library.smu.edu.sg/sis_research/8568 https://ink.library.smu.edu.sg/context/sis_research/article/9571/viewcontent/The_devil_is_in_the_tails.pdf |
Record ID: | sg-smu-ink.sis_research-9571 |
Institution: | Singapore Management University |