The Devil is in the Tails: How Long-Tailed Code Distributions Impact
Large Language Models
- URL: http://arxiv.org/abs/2309.03567v1
- Date: Thu, 7 Sep 2023 08:53:16 GMT
- Title: The Devil is in the Tails: How Long-Tailed Code Distributions Impact
Large Language Models
- Authors: Xin Zhou, Kisub Kim, Bowen Xu, Jiakun Liu, DongGyun Han, David Lo
- Abstract summary: Learning-based models, including popular Large Language Models for code, heavily rely on data.
Long-tailed distribution has a substantial impact on the effectiveness of LLMs for code.
Our study provides a better understanding of the effects of long-tailed distributions on popular LLMs for code.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Learning-based techniques, especially advanced Large Language Models (LLMs)
for code, have gained considerable popularity in various software engineering
(SE) tasks. However, most existing works focus on designing better
learning-based models and pay less attention to the properties of datasets.
Learning-based models, including popular LLMs for code, heavily rely on data,
and the data's properties (e.g., data distribution) could significantly affect
their behavior. We conducted an exploratory study on the distribution of SE
data and found that such data usually follows a skewed distribution (i.e.,
long-tailed distribution) where a small number of classes have an extensive
collection of samples, while a large number of classes have very few samples.
We investigate three distinct SE tasks and analyze the impacts of long-tailed
distribution on the performance of LLMs for code. Our experimental results
reveal that the long-tailed distribution has a substantial impact on the
effectiveness of LLMs for code. Specifically, LLMs for code perform between
30.0% and 254.0% worse on data samples associated with infrequent labels
compared to data samples with frequent labels. Our study provides a better
understanding of the effects of long-tailed distributions on popular LLMs for
code and insights for the future development of SE automation.
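The skewed distribution the abstract describes can be illustrated with a minimal sketch. The snippet below uses a simple frequency-based criterion (a hypothetical choice for illustration; the paper's exact head/tail split may differ): the "head" is the smallest set of most-frequent classes covering a given fraction of all samples, and everything else is the "tail".

```python
from collections import Counter

def head_tail_split(labels, head_fraction=0.5):
    """Split label classes into 'head' and 'tail'.

    Head = the smallest set of most-frequent classes covering at
    least `head_fraction` of all samples; tail = all other classes.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    head, covered = [], 0
    for label, n in counts.most_common():
        if covered >= head_fraction * total:
            break
        head.append(label)
        covered += n
    tail = [label for label in counts if label not in head]
    return head, tail

# A toy long-tailed label set: one class dominates, while many
# classes appear only once (the long tail).
labels = ["sort"] * 50 + ["map"] * 10 + ["zip", "fold", "scan", "take", "drop"]
head, tail = head_tail_split(labels)
# One class ("sort") already covers over half the samples, so the
# remaining six classes all land in the tail.
```

Evaluating a model separately on head and tail samples, as the study does across three SE tasks, makes the effectiveness gap between frequent and infrequent labels directly measurable.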
Related papers
- Dynamic Uncertainty Ranking: Enhancing In-Context Learning for Long-Tail Knowledge in LLMs
Large language models (LLMs) can learn vast amounts of knowledge from diverse domains during pre-training.
Long-tail knowledge from specialized domains is often scarce and underrepresented, and is therefore rarely memorized by the models.
We propose a reinforcement learning-based dynamic uncertainty ranking method for ICL that accounts for the varying impact of each retrieved sample on LLM predictions.
arXiv Detail & Related papers (2024-10-31T03:42:17Z)
- Empirical Insights on Fine-Tuning Large Language Models for Question-Answering
Large language models (LLMs) encode extensive world knowledge through pre-training on massive datasets, which can be fine-tuned for the question-answering (QA) task.
We categorize supervised fine-tuning (SFT) data based on the extent of knowledge memorized by the pretrained LLMs.
Our experiments show that as few as 60 data points during the SFT stage can activate the knowledge encoded during pre-training, enabling LLMs to perform the QA task.
arXiv Detail & Related papers (2024-09-24T07:38:38Z)
- Enhancing Discriminative Tasks by Guiding the Pre-trained Language Model with Large Language Model's Experience
Large Language Models (LLMs) and pre-trained Language Models (LMs) have achieved impressive success on many software engineering tasks.
We use LLMs to generate domain-specific data, thereby improving the performance of pre-trained LMs on the target tasks.
arXiv Detail & Related papers (2024-08-16T06:37:59Z)
- The Synergy between Data and Multi-Modal Large Language Models: A Survey from Co-Development Perspective
We find that the development of models and data is not two separate paths but rather interconnected.
On the one hand, vaster and higher-quality data contribute to better performance of MLLMs; on the other hand, MLLMs can facilitate the development of data.
To promote the data-model co-development for MLLM community, we systematically review existing works related to MLLMs from the data-model co-development perspective.
arXiv Detail & Related papers (2024-07-11T15:08:11Z)
- LLM-Select: Feature Selection with Large Language Models
Large language models (LLMs) are capable of selecting the most predictive features, with performance rivaling the standard tools of data science.
Our findings suggest that LLMs may be useful not only for selecting the best features for training but also for deciding which features to collect in the first place.
arXiv Detail & Related papers (2024-07-02T22:23:40Z)
- Elephants Never Forget: Memorization and Learning of Tabular Data in Large Language Models
Large Language Models (LLMs) can be applied to a diverse set of tasks, but the critical issues of data contamination and memorization are often glossed over.
This work introduces a variety of different techniques to assess whether a language model has seen a dataset during training.
We then compare the few-shot learning performance of LLMs on datasets that were seen during training to the performance on datasets released after training.
arXiv Detail & Related papers (2024-04-09T10:58:21Z)
- On Inter-dataset Code Duplication and Data Leakage in Large Language Models
This paper explores the phenomenon of inter-dataset code duplication and its impact on evaluating large language models (LLMs).
Our findings reveal a potential threat to the evaluation of LLMs across multiple SE tasks, stemming from the inter-dataset code duplication phenomenon.
We provide evidence that open-source models could be affected by inter-dataset duplication.
arXiv Detail & Related papers (2024-01-15T19:46:40Z)
- In Search of the Long-Tail: Systematic Generation of Long-Tail Inferential Knowledge via Logical Rule Guided Search
We take the first step towards evaluating large language models (LLMs) in the long-tail distribution of inferential knowledge.
LINK is a systematic long-tail data generation framework for obtaining factually correct yet long-tail inferential statements.
We then use LINK to curate Logic-Induced-Long-Tail (LINT), a large-scale long-tail inferential knowledge dataset.
arXiv Detail & Related papers (2023-11-13T10:56:59Z)
- Ziya2: Data-centric Learning is All LLMs Need
We propose Ziya2, a 13-billion-parameter model that adopts LLaMA2 as its foundation and is further pre-trained on 700 billion tokens.
Experiments show that Ziya2 significantly outperforms other models on multiple benchmarks, with especially promising results compared to representative open-source ones.
arXiv Detail & Related papers (2023-11-06T17:49:34Z)
- Large Language Model-Aware In-Context Learning for Code Generation
Large language models (LLMs) have shown impressive in-context learning (ICL) ability in code generation.
We propose a novel learning-based selection approach named LAIL (LLM-Aware In-context Learning) for code generation.
arXiv Detail & Related papers (2023-10-15T06:12:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.