On Inter-dataset Code Duplication and Data Leakage in Large Language
Models
- URL: http://arxiv.org/abs/2401.07930v1
- Date: Mon, 15 Jan 2024 19:46:40 GMT
- Title: On Inter-dataset Code Duplication and Data Leakage in Large Language
Models
- Authors: José Antonio Hernández López, Boqi Chen, Tushar Sharma, Dániel
Varró
- Abstract summary: This paper explores the phenomenon of inter-dataset code duplication and its impact on evaluating large language models (LLMs).
We identify the intersection between the pre-training and fine-tuning datasets using a deduplication process.
We fine-tune four models pre-trained on CSN to evaluate their performance on samples encountered during pre-training and those unseen during that phase.
- Score: 5.704848262917858
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Motivation. Large language models (LLMs) have exhibited remarkable
proficiency in diverse software engineering (SE) tasks. Handling such tasks
typically involves acquiring foundational coding knowledge on large,
general-purpose datasets during a pre-training phase, and subsequently refining
on smaller, task-specific datasets as part of a fine-tuning phase.
Problem statement. Data leakage is a well-known issue in the training of machine
learning models. A manifestation of this issue is the intersection of the
training and testing splits. While intra-dataset code duplication examines this
intersection within a given dataset and has been addressed in prior research,
inter-dataset code duplication, which gauges the overlap between different
datasets, remains largely unexplored. If this phenomenon exists, it could
compromise the integrity of LLM evaluations because of the inclusion of
fine-tuning test samples that were already encountered during pre-training,
resulting in inflated performance metrics.
Contribution. This paper explores the phenomenon of inter-dataset code
duplication and its impact on evaluating LLMs across diverse SE tasks.
Study design. We conduct an empirical study using the CSN dataset, a widely
adopted pre-training dataset, and five fine-tuning datasets used for various SE
tasks. We first identify the intersection between the pre-training and
fine-tuning datasets using a deduplication process. Then, we fine-tune four
models pre-trained on CSN to evaluate their performance on samples encountered
during pre-training and those unseen during that phase.
Results. Our findings reveal a potential threat to the evaluation of various
LLMs across multiple SE tasks, stemming from the inter-dataset code duplication
phenomenon. Moreover, we demonstrate that this threat is accentuated by factors
like the LLM's size and the chosen fine-tuning technique.
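To make the deduplication-based study design concrete, the sketch below shows one way to identify the intersection between a pre-training corpus and a fine-tuning test set and then partition that test set into samples seen versus unseen during pre-training. It is a minimal illustration only: the token-level Jaccard measure, the 0.8 threshold, and the helper names (normalize, find_seen_ids) are assumptions for this sketch, not the procedure reported in the paper.

```python
# Minimal sketch of inter-dataset deduplication between a pre-training corpus
# and a fine-tuning test set. Token-level Jaccard similarity and the 0.8
# threshold are assumptions, not the paper's reported procedure.
import re
from typing import Dict, List, Set


def normalize(code: str) -> Set[str]:
    """Lowercase the code and return its set of identifier/number tokens."""
    return set(re.findall(r"[a-z_][a-z_0-9]*|\d+", code.lower()))


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity between two token sets."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def find_seen_ids(pretrain_code: List[str],
                  finetune_test: Dict[str, str],
                  threshold: float = 0.8) -> Set[str]:
    """Ids of fine-tuning test samples that near-duplicate some pre-training
    sample, i.e. samples likely already encountered during pre-training."""
    pretrain_tokens = [normalize(c) for c in pretrain_code]
    seen = set()
    for sample_id, code in finetune_test.items():
        tokens = normalize(code)
        if any(jaccard(tokens, p) >= threshold for p in pretrain_tokens):
            seen.add(sample_id)
    return seen


if __name__ == "__main__":
    pretrain = ["def add(a, b):\n    return a + b"]
    test = {
        "t1": "def add(a, b):\n    return a + b",             # duplicate of a pre-training sample
        "t2": "class Stack:\n    def push(self, item): ...",  # unseen code
    }
    seen_ids = find_seen_ids(pretrain, test)
    seen = {i: s for i, s in test.items() if i in seen_ids}
    unseen = {i: s for i, s in test.items() if i not in seen_ids}
    print(sorted(seen), sorted(unseen))  # ['t1'] ['t2']; metrics are then reported per split
```

A real study at CSN scale would replace this quadratic all-pairs scan with approximate techniques such as MinHash/LSH, but the seen/unseen partitioning of the fine-tuning test set would work the same way.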
Related papers
- Empirical Insights on Fine-Tuning Large Language Models for Question-Answering [50.12622877002846]
Large language models (LLMs) encode extensive world knowledge through pre-training on massive datasets and can be fine-tuned for the question-answering (QA) task.
We categorize supervised fine-tuning (SFT) data based on the extent of knowledge memorized by the pretrained LLMs.
Our experiments show that as few as 60 data points during the SFT stage can activate the knowledge encoded during pre-training, enabling LLMs to perform the QA task.
arXiv Detail & Related papers (2024-09-24T07:38:38Z)
- A Framework for Fine-Tuning LLMs using Heterogeneous Feedback [69.51729152929413]
We present a framework for fine-tuning large language models (LLMs) using heterogeneous feedback.
First, we combine the heterogeneous feedback data into a single supervision format, compatible with methods like SFT and RLHF.
Next, given this unified feedback dataset, we extract a high-quality and diverse subset to obtain performance increases.
arXiv Detail & Related papers (2024-08-05T23:20:32Z)
- A Comprehensive Evaluation of Large Language Models on Temporal Event Forecasting [45.0261082985087]
We conduct a comprehensive evaluation of Large Language Models (LLMs) for temporal event forecasting.
We find that directly integrating raw texts into the input of LLMs does not enhance zero-shot extrapolation performance.
In contrast, incorporating raw texts in specific complex events and fine-tuning LLMs significantly improves performance.
arXiv Detail & Related papers (2024-07-16T11:58:54Z)
- SELF-GUIDE: Better Task-Specific Instruction Following via Self-Synthetic Finetuning [70.21358720599821]
Large language models (LLMs) hold the promise of solving diverse tasks when provided with appropriate natural language prompts.
We propose SELF-GUIDE, a multi-stage mechanism in which we synthesize task-specific input-output pairs from the student LLM.
We report an absolute improvement of approximately 15% for classification tasks and 18% for generation tasks in the benchmark's metrics.
arXiv Detail & Related papers (2024-07-16T04:41:58Z)
- Entropy Law: The Story Behind Data Compression and LLM Performance [115.70395740286422]
We find that model performance is negatively correlated with the compression ratio of the training data, which usually yields a lower training loss (a rough sketch of such a compression ratio appears after this list).
Based on the findings of the entropy law, we propose a quite efficient and universal data selection method.
We also present an interesting application of entropy law that can detect potential performance risks at the beginning of model training.
arXiv Detail & Related papers (2024-07-09T08:14:29Z)
- PISTOL: Dataset Compilation Pipeline for Structural Unlearning of LLMs [31.16117964915814]
Machine unlearning, which seeks to erase specific data stored in the pre-trained or fine-tuned models, has emerged as a crucial protective measure for LLMs.
To facilitate the development of structural unlearning methods, we propose PISTOL, a pipeline for compiling multi-scenario datasets.
We conduct benchmarks with four distinct unlearning methods on both Llama2-7B and Mistral-7B models.
arXiv Detail & Related papers (2024-06-24T17:22:36Z)
- Elephants Never Forget: Testing Language Models for Memorization of Tabular Data [21.912611415307644]
Large Language Models (LLMs) can be applied to a diverse set of tasks, but the critical issues of data contamination and memorization are often glossed over.
We introduce a variety of different techniques to assess the degrees of contamination, including statistical tests for conditional distribution modeling and four tests that identify memorization.
arXiv Detail & Related papers (2024-03-11T12:07:13Z)
- TRACE: A Comprehensive Benchmark for Continual Learning in Large Language Models [52.734140807634624]
Aligned large language models (LLMs) demonstrate exceptional capabilities in task-solving, following instructions, and ensuring safety.
Existing continual learning benchmarks lack sufficient challenge for leading aligned LLMs.
We introduce TRACE, a novel benchmark designed to evaluate continual learning in LLMs.
arXiv Detail & Related papers (2023-10-10T16:38:49Z)
- Enhancing Subtask Performance of Multi-modal Large Language Model [12.033301861738952]
A Multi-modal Large Language Model (MLLM) extends a Large Language Model (LLM) with the capability to handle and reason over multi-modal data.
This study selects multiple pre-trained models focused on the same subtask, each chosen through a distinct evaluation approach.
The results these models produce for the same subtask are compared using the LLM, and the best result is chosen as the outcome for that subtask.
arXiv Detail & Related papers (2023-08-31T05:37:21Z)
- Diffusion Model is an Effective Planner and Data Synthesizer for Multi-Task Reinforcement Learning [101.66860222415512]
Multi-Task Diffusion Model (MTDiff) is a diffusion-based method that incorporates Transformer backbones and prompt learning for generative planning and data synthesis.
For generative planning, we find MTDiff outperforms state-of-the-art algorithms across 50 tasks on Meta-World and 8 maps on Maze2D.
arXiv Detail & Related papers (2023-05-29T05:20:38Z)
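The Entropy Law entry above ties model performance to how compressible the training data is. As a rough, assumed illustration of such a metric (not the paper's exact definition), a corpus compression ratio can be computed with a general-purpose compressor such as zlib:

```python
# Rough illustration of a training-data compression ratio, using zlib.
# This is an assumed stand-in for the data-compression metric mentioned in
# the Entropy Law entry; the paper's exact definition may differ.
import zlib
from typing import List


def compression_ratio(samples: List[str]) -> float:
    """Compressed size divided by raw size of the concatenated corpus.
    Lower values mean more redundant, more compressible data."""
    raw = "\n".join(samples).encode("utf-8")
    return len(zlib.compress(raw, 9)) / len(raw)


if __name__ == "__main__":
    redundant = ["def add(a, b): return a + b"] * 100            # highly repetitive toy corpus
    varied = [f"def f{i}(x): return x * {i}" for i in range(100)]  # more diverse toy corpus
    print(f"redundant corpus: {compression_ratio(redundant):.3f}")
    print(f"varied corpus:    {compression_ratio(varied):.3f}")
```

The two toy corpora show how the ratio separates highly redundant data from more diverse data; how this quantity is defined and normalized in the paper itself may differ.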