On Inter-dataset Code Duplication and Data Leakage in Large Language
Models
- URL: http://arxiv.org/abs/2401.07930v1
- Date: Mon, 15 Jan 2024 19:46:40 GMT
- Title: On Inter-dataset Code Duplication and Data Leakage in Large Language
Models
- Authors: José Antonio Hernández López, Boqi Chen, Tushar Sharma, Dániel
Varró
- Abstract summary: This paper explores the phenomenon of inter-dataset code duplication and its impact on evaluating large language models (LLMs).
We identify the intersection between the pre-training and fine-tuning datasets using a deduplication process.
We fine-tune four models pre-trained on CSN to evaluate their performance on samples encountered during pre-training and those unseen during that phase.
- Score: 5.704848262917858
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Motivation. Large language models (LLMs) have exhibited remarkable
proficiency in diverse software engineering (SE) tasks. Handling such tasks
typically involves acquiring foundational coding knowledge on large,
general-purpose datasets during a pre-training phase, and subsequently refining
on smaller, task-specific datasets as part of a fine-tuning phase.
Problem statement. Data leakage is a well-known issue in the training of machine
learning models. A manifestation of this issue is the intersection of the
training and testing splits. While intra-dataset code duplication examines this
intersection within a given dataset and has been addressed in prior research,
inter-dataset code duplication, which gauges the overlap between different
datasets, remains largely unexplored. If this phenomenon exists, it could
compromise the integrity of LLM evaluations because of the inclusion of
fine-tuning test samples that were already encountered during pre-training,
resulting in inflated performance metrics.
Contribution. This paper explores the phenomenon of inter-dataset code
duplication and its impact on evaluating LLMs across diverse SE tasks.
Study design. We conduct an empirical study using the CSN dataset, a widely
adopted pre-training dataset, and five fine-tuning datasets used for various SE
tasks. We first identify the intersection between the pre-training and
fine-tuning datasets using a deduplication process. Then, we fine-tune four
models pre-trained on CSN to evaluate their performance on samples encountered
during pre-training and those unseen during that phase.
Results. Our findings reveal a potential threat to the evaluation of various
LLMs across multiple SE tasks, stemming from the inter-dataset code duplication
phenomenon. Moreover, we demonstrate that this threat is accentuated by factors
like the LLM's size and the chosen fine-tuning technique.
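To make the deduplication-based study design concrete, the sketch below shows one way to identify the intersection between a pre-training corpus and a fine-tuning test set and then partition that test set into samples seen versus unseen during pre-training. It is a minimal illustration only: the token-level Jaccard measure, the 0.8 threshold, and the helper names (normalize, find_seen_ids) are assumptions for this sketch, not the procedure reported in the paper.

```python
# Minimal sketch of inter-dataset deduplication between a pre-training corpus
# and a fine-tuning test set. Token-level Jaccard similarity and the 0.8
# threshold are assumptions, not the paper's reported procedure.
import re
from typing import Dict, List, Set


def normalize(code: str) -> Set[str]:
    """Lowercase the code and return its set of identifier/number tokens."""
    return set(re.findall(r"[a-z_][a-z_0-9]*|\d+", code.lower()))


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity between two token sets."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def find_seen_ids(pretrain_code: List[str],
                  finetune_test: Dict[str, str],
                  threshold: float = 0.8) -> Set[str]:
    """Ids of fine-tuning test samples that near-duplicate some pre-training
    sample, i.e. samples likely already encountered during pre-training."""
    pretrain_tokens = [normalize(c) for c in pretrain_code]
    seen = set()
    for sample_id, code in finetune_test.items():
        tokens = normalize(code)
        if any(jaccard(tokens, p) >= threshold for p in pretrain_tokens):
            seen.add(sample_id)
    return seen


if __name__ == "__main__":
    pretrain = ["def add(a, b):\n    return a + b"]
    test = {
        "t1": "def add(a, b):\n    return a + b",             # duplicate of a pre-training sample
        "t2": "class Stack:\n    def push(self, item): ...",  # unseen code
    }
    seen_ids = find_seen_ids(pretrain, test)
    seen = {i: s for i, s in test.items() if i in seen_ids}
    unseen = {i: s for i, s in test.items() if i not in seen_ids}
    print(sorted(seen), sorted(unseen))  # ['t1'] ['t2']; metrics are then reported per split
```

A real study at CSN scale would replace this quadratic all-pairs scan with approximate techniques such as MinHash/LSH, but the seen/unseen partitioning of the fine-tuning test set would work the same way.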
Related papers
- Empirical Insights on Fine-Tuning Large Language Models for Question-Answering [50.12622877002846]
Large language models (LLMs) encode extensive world knowledge through pre-training on massive datasets and can be fine-tuned for the question-answering (QA) task.
We categorize supervised fine-tuning (SFT) data based on the extent of knowledge memorized by the pretrained LLMs.
Our experiments show that as few as 60 data points during the SFT stage can activate the knowledge encoded during pre-training, enabling LLMs to perform the QA task.
arXiv Detail & Related papers (2024-09-24T07:38:38Z)
- A Framework for Fine-Tuning LLMs using Heterogeneous Feedback [69.51729152929413]
We present a framework for fine-tuning large language models (LLMs) using heterogeneous feedback.
First, we combine the heterogeneous feedback data into a single supervision format, compatible with methods like SFT and RLHF.
Next, given this unified feedback dataset, we extract a high-quality and diverse subset to obtain performance increases.
arXiv Detail & Related papers (2024-08-05T23:20:32Z)
- A Comprehensive Evaluation of Large Language Models on Temporal Event Forecasting [45.0261082985087]
We conduct a comprehensive evaluation of Large Language Models (LLMs) for temporal event forecasting.
We find that directly integrating raw texts into the input of LLMs does not enhance zero-shot extrapolation performance.
In contrast, incorporating raw texts in specific complex events and fine-tuning LLMs significantly improves performance.
arXiv Detail & Related papers (2024-07-16T11:58:54Z)
- SELF-GUIDE: Better Task-Specific Instruction Following via Self-Synthetic Finetuning [70.21358720599821]
Large language models (LLMs) hold the promise of solving diverse tasks when provided with appropriate natural language prompts.
We propose SELF-GUIDE, a multi-stage mechanism in which we synthesize task-specific input-output pairs from the student LLM.
We report an absolute improvement of approximately 15% for classification tasks and 18% for generation tasks in the benchmark's metrics.
arXiv Detail & Related papers (2024-07-16T04:41:58Z)
- Entropy Law: The Story Behind Data Compression and LLM Performance [115.70395740286422]
We find that model performance is negatively correlated with the compression ratio of the training data, which usually yields a lower training loss (a rough sketch of such a compression ratio appears after this list).
Based on the findings of the entropy law, we propose a quite efficient and universal data selection method.
We also present an interesting application of entropy law that can detect potential performance risks at the beginning of model training.
arXiv Detail & Related papers (2024-07-09T08:14:29Z)
- PISTOL: Dataset Compilation Pipeline for Structural Unlearning of LLMs [31.16117964915814]
Machine unlearning, which seeks to erase specific data stored in the pre-trained or fine-tuned models, has emerged as a crucial protective measure for LLMs.
To facilitate the development of structural unlearning methods, we propose PISTOL, a pipeline for compiling multi-scenario datasets.
We conduct benchmarks with four distinct unlearning methods on both Llama2-7B and Mistral-7B models.
arXiv Detail & Related papers (2024-06-24T17:22:36Z)
- Elephants Never Forget: Testing Language Models for Memorization of Tabular Data [21.912611415307644]
Large Language Models (LLMs) can be applied to a diverse set of tasks, but the critical issues of data contamination and memorization are often glossed over.
We introduce a variety of different techniques to assess the degrees of contamination, including statistical tests for conditional distribution modeling and four tests that identify memorization.
arXiv Detail & Related papers (2024-03-11T12:07:13Z)
- TRACE: A Comprehensive Benchmark for Continual Learning in Large Language Models [52.734140807634624]
Aligned large language models (LLMs) demonstrate exceptional capabilities in task-solving, following instructions, and ensuring safety.
Existing continual learning benchmarks lack sufficient challenge for leading aligned LLMs.
We introduce TRACE, a novel benchmark designed to evaluate continual learning in LLMs.
arXiv Detail & Related papers (2023-10-10T16:38:49Z)
- Enhancing Subtask Performance of Multi-modal Large Language Model [12.033301861738952]
A Multi-modal Large Language Model (MLLM) extends a Large Language Model (LLM) with the capability to handle and reason over multi-modal data.
This study selects multiple pre-trained models focused on the same subtask, each chosen through a distinct evaluation approach.
The results these models produce for the same subtask are compared using the LLM, and the best result is chosen as the outcome for that subtask.
arXiv Detail & Related papers (2023-08-31T05:37:21Z)
- Diffusion Model is an Effective Planner and Data Synthesizer for Multi-Task Reinforcement Learning [101.66860222415512]
Multi-Task Diffusion Model (MTDiff) is a diffusion-based method that incorporates Transformer backbones and prompt learning for generative planning and data synthesis.
For generative planning, we find MTDiff outperforms state-of-the-art algorithms across 50 tasks on Meta-World and 8 maps on Maze2D.
arXiv Detail & Related papers (2023-05-29T05:20:38Z)
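The Entropy Law entry above ties model performance to how compressible the training data is. As a rough, assumed illustration of such a metric (not the paper's exact definition), a corpus compression ratio can be computed with a general-purpose compressor such as zlib:

```python
# Rough illustration of a training-data compression ratio, using zlib.
# This is an assumed stand-in for the data-compression metric mentioned in
# the Entropy Law entry; the paper's exact definition may differ.
import zlib
from typing import List


def compression_ratio(samples: List[str]) -> float:
    """Compressed size divided by raw size of the concatenated corpus.
    Lower values mean more redundant, more compressible data."""
    raw = "\n".join(samples).encode("utf-8")
    return len(zlib.compress(raw, 9)) / len(raw)


if __name__ == "__main__":
    redundant = ["def add(a, b): return a + b"] * 100            # highly repetitive toy corpus
    varied = [f"def f{i}(x): return x * {i}" for i in range(100)]  # more diverse toy corpus
    print(f"redundant corpus: {compression_ratio(redundant):.3f}")
    print(f"varied corpus:    {compression_ratio(varied):.3f}")
```

The two toy corpora show how the ratio separates highly redundant data from more diverse data; how this quantity is defined and normalized in the paper itself may differ.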