Overestimation in LLM Evaluation: A Controlled Large-Scale Study on Data Contamination's Impact on Machine Translation
- URL: http://arxiv.org/abs/2501.18771v1
- Date: Thu, 30 Jan 2025 21:51:18 GMT
- Authors: Muhammed Yusuf Kocyigit, Eleftheria Briakou, Daniel Deutsch, Jiaming Luo, Colin Cherry, Markus Freitag
- Abstract summary: We study the effects of contamination on language models at 1B and 8B scales on the machine translation task.
Our experiments reveal that contamination with both source and target substantially inflates BLEU scores.
In contrast, source-only and target-only contamination generally produce smaller, less consistent over-estimations.
- Abstract: Data contamination -- the accidental consumption of evaluation examples within the pre-training data -- can undermine the validity of evaluation benchmarks. In this paper, we present a rigorous analysis of the effects of contamination on language models at 1B and 8B scales on the machine translation task. Starting from a carefully decontaminated train-test split, we systematically introduce contamination at various stages, scales, and data formats to isolate its effect and measure its impact on performance metrics. Our experiments reveal that contamination with both source and target substantially inflates BLEU scores, and this inflation is 2.5 times larger (up to 30 BLEU points) for 8B compared to 1B models. In contrast, source-only and target-only contamination generally produce smaller, less consistent over-estimations. Finally, we study how the temporal distribution and frequency of contaminated samples influence performance over-estimation across languages with varying degrees of data resources.
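To make the measurement concrete, here is a minimal sketch of the comparison the abstract describes: score the same system on a fully held-out test split and on a split whose pairs were injected into pre-training, then report the BLEU gap. The `translate` callable is a hypothetical stand-in for the model under test; `sacrebleu` is the standard BLEU implementation.

```python
# Sketch: estimate contamination-induced BLEU inflation.
# `translate(sources) -> list[str]` is a hypothetical model wrapper.
import sacrebleu

def bleu_inflation(translate, clean_pairs, contaminated_pairs):
    """Each argument is a list of (source, reference) pairs; the
    contaminated pairs also appear in the model's pre-training data."""
    scores = {}
    for name, pairs in (("clean", clean_pairs), ("contaminated", contaminated_pairs)):
        sources = [src for src, _ in pairs]
        references = [[ref for _, ref in pairs]]   # one reference stream
        hypotheses = translate(sources)            # hypothetical model call
        scores[name] = sacrebleu.corpus_bleu(hypotheses, references).score
    return scores["contaminated"] - scores["clean"]
```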
Related papers
- Evaluation data contamination in LLMs: how do we measure it and (when) does it matter? [10.691754344782387]
It is difficult to define precisely which samples should be considered contaminated and how contamination impacts benchmark scores.
We propose a novel analysis method, ConTAM, and present a large-scale survey of evaluation data contamination metrics.
We find that contamination may have a much larger effect than reported in recent LLM releases and benefits models differently at different scales.
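As a rough illustration of the metric family such a survey compares (not ConTAM's exact definition), a per-sample contamination score can be computed as the fraction of a test sample's n-grams that appear in the pre-training corpus:

```python
# Sketch: generic n-gram-overlap contamination score (illustrative n).
def ngrams(tokens, n):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contamination_score(sample_text, corpus_ngrams, n=8):
    """Fraction of the sample's n-grams that occur in pre-training data."""
    sample = ngrams(sample_text.split(), n)
    return len(sample & corpus_ngrams) / len(sample) if sample else 0.0
```

Samples scoring above a chosen threshold are flagged as contaminated, and benchmark results can be recomputed on the remaining subset.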
arXiv Detail & Related papers (2024-11-06T13:54:08Z) - A Taxonomy for Data Contamination in Large Language Models [12.643103231497813]
A growing concern is data contamination, where evaluation datasets may be contained in the pretraining corpus.
Decontamination, the process of detecting and removing such data, is a potential solution.
How different types of contamination impact the performance of language models on downstream tasks is not fully understood.
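A minimal sketch of the decontamination step described here, using exact 13-gram matching against the evaluation set (the 13-gram choice follows GPT-3-style decontamination and is illustrative, not this paper's):

```python
# Sketch: drop pre-training documents sharing any 13-gram with the eval set.
def ngram_set(text, n=13):
    toks = text.split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def decontaminate(corpus, eval_texts, n=13):
    blocked = set()
    for text in eval_texts:
        blocked |= ngram_set(text, n)
    return [doc for doc in corpus if not (ngram_set(doc, n) & blocked)]
```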
arXiv Detail & Related papers (2024-07-11T17:50:34Z) - ConStat: Performance-Based Contamination Detection in Large Language Models [7.305342793164905]
ConStat is a statistical method that reliably detects and quantifies contamination by comparing performance between a primary and reference benchmark relative to a set of reference models.
We demonstrate the effectiveness of ConStat in an extensive evaluation of diverse model architectures, benchmarks, and contamination scenarios.
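A simplified sketch of the ConStat idea: use a pool of reference models to predict primary-benchmark performance from reference-benchmark performance, then test whether the candidate model outperforms that prediction (the paper's actual estimator and statistics are more involved):

```python
# Sketch: bootstrap test for performance unexplained by a reference benchmark.
import numpy as np

def constat_sketch(ref_scores, primary_scores, cand_ref, cand_primary,
                   n_boot=10_000, seed=0):
    """ref_scores, primary_scores: per-reference-model score arrays."""
    rng = np.random.default_rng(seed)
    ref, pri = np.asarray(ref_scores), np.asarray(primary_scores)
    excess = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, len(ref), len(ref))   # resample reference models
        slope, intercept = np.polyfit(ref[idx], pri[idx], 1)
        excess[b] = cand_primary - (slope * cand_ref + intercept)
    # Small p-value: the candidate's primary score exceeds what its
    # reference-benchmark performance predicts, suggesting contamination.
    return float(excess.mean()), float(np.mean(excess <= 0))
```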
arXiv Detail & Related papers (2024-05-25T15:36:37Z) - Generalization or Memorization: Data Contamination and Trustworthy Evaluation for Large Language Models [42.958880063727996]
CDD stands for Contamination Detection via output Distribution for LLMs.
To mitigate the impact of data contamination in evaluation, we also present TED: Trustworthy Evaluation via output Distribution.
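In the spirit of CDD, one can probe for memorization by sampling several outputs and checking how tightly they cluster around the greedy output; a sharply peaked output distribution is suspicious. `generate` is a hypothetical model wrapper and the threshold is illustrative:

```python
# Sketch: flag peaked output distributions as possible contamination.
import difflib

def peakedness(generate, prompt, k=20, temperature=1.0):
    greedy = generate(prompt, temperature=0.0)       # hypothetical model call
    ratios = [difflib.SequenceMatcher(None, greedy,
              generate(prompt, temperature=temperature)).ratio()
              for _ in range(k)]
    return sum(ratios) / k                           # near 1.0 => peaked

def looks_contaminated(generate, prompt, threshold=0.9):
    return peakedness(generate, prompt) > threshold
```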
arXiv Detail & Related papers (2024-02-24T23:54:41Z) - KIEval: A Knowledge-grounded Interactive Evaluation Framework for Large Language Models [53.84677081899392]
KIEval is a Knowledge-grounded Interactive Evaluation framework for large language models.
It incorporates an LLM-powered "interactor" role for the first time to accomplish a dynamic contamination-resilient evaluation.
Extensive experiments on seven leading LLMs across five datasets validate KIEval's effectiveness and generalization.
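A minimal sketch of the interactor pattern described here; `interactor`, `candidate`, and `judge` are hypothetical LLM-call wrappers, and the control flow is a simplification of the framework:

```python
# Sketch: dynamic, multi-turn evaluation driven by an LLM interactor.
def interactive_eval(seed_question, interactor, candidate, judge, turns=3):
    transcript = [("interactor", seed_question)]
    for _ in range(turns):
        answer = candidate(transcript)               # model under evaluation
        transcript.append(("candidate", answer))
        follow_up = interactor(transcript)           # probes beyond the seed
        transcript.append(("interactor", follow_up))
    # Scoring the whole interaction makes a memorized benchmark answer
    # insufficient: the model must sustain the knowledge across turns.
    return judge(transcript)
```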
arXiv Detail & Related papers (2024-02-23T01:30:39Z) - Data Attribution for Diffusion Models: Timestep-induced Bias in Influence Estimation [53.27596811146316]
Diffusion models operate over a sequence of timesteps rather than the instantaneous input-output relationships of previous settings.
We present Diffusion-TracIn, which incorporates these temporal dynamics, and observe that samples' loss-gradient norms are highly dependent on the timestep.
We introduce Diffusion-ReTrac as a re-normalized adaptation that enables the retrieval of training samples more targeted to the test sample of interest.
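A rough sketch of the re-normalization idea: compute TracIn-style gradient dot products per timestep, but divide out the training sample's gradient norm so high-norm timesteps do not dominate the influence score. `loss_fn(model, x, t)` is a hypothetical per-sample diffusion loss, and the paper's exact normalization may differ:

```python
# Sketch: timestep-aware, norm-re-normalized influence estimation.
import torch

def per_sample_grad(model, loss_fn, x, t):
    params = [p for p in model.parameters() if p.requires_grad]
    grads = torch.autograd.grad(loss_fn(model, x, t), params)
    return torch.cat([g.flatten() for g in grads])

def retrac_influence(model, loss_fn, x_train, x_test, timesteps):
    score = 0.0
    for t in timesteps:                              # average over timesteps
        g_tr = per_sample_grad(model, loss_fn, x_train, t)
        g_te = per_sample_grad(model, loss_fn, x_test, t)
        # Dividing by ||g_tr||^2 removes the timestep-induced norm bias
        # that a bare TracIn dot product would inherit.
        score += torch.dot(g_tr, g_te) / (g_tr.norm() ** 2 + 1e-12)
    return score / len(timesteps)
```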
arXiv Detail & Related papers (2024-01-17T07:58:18Z) - Investigating Data Contamination for Pre-training Language Models [46.335755305642564]
We explore the impact of data contamination at the pre-training stage by pre-training a series of GPT-2 models.
We highlight the effect of both text contamination (i.e., the input text of the evaluation samples) and ground-truth contamination (i.e., the prompts asked on the input and the desired outputs) from evaluation data.
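A small sketch of the two contamination formats being contrasted, as one might inject them into a pre-training corpus (the field layout is illustrative):

```python
# Sketch: build text-only vs. ground-truth contaminated pre-training data.
def contaminate(pretrain_docs, eval_samples, mode, copies=1):
    """eval_samples: list of (input_text, target_text) pairs."""
    injected = []
    for inp, tgt in eval_samples:
        if mode == "text":
            injected.append(inp)                     # evaluation input only
        elif mode == "ground_truth":
            injected.append(f"{inp}\n{tgt}")         # prompt plus desired output
        else:
            raise ValueError(f"unknown mode: {mode}")
    return pretrain_docs + injected * copies         # `copies` sets frequency
```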
arXiv Detail & Related papers (2024-01-11T17:24:49Z) - Rethinking InfoNCE: How Many Negative Samples Do You Need? [54.146208195806636]
We study how many negative samples are optimal for InfoNCE in different scenarios via a semi-quantitative theoretical framework.
We estimate the optimal negative sampling ratio using the $K$ value that maximizes the training effectiveness function.
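For reference, a standard InfoNCE loss with an explicit negative count $K$, the quantity the analysis optimizes (the training-effectiveness function itself is not reproduced here):

```python
# Sketch: InfoNCE with K explicit negatives (PyTorch).
import torch
import torch.nn.functional as F

def info_nce(anchor, positive, negatives, temperature=0.1):
    """anchor, positive: (d,); negatives: (K, d). Returns a scalar loss."""
    anchor = F.normalize(anchor, dim=-1)
    candidates = F.normalize(torch.cat([positive.unsqueeze(0), negatives]), dim=-1)
    logits = candidates @ anchor / temperature       # (K + 1,) similarities
    target = torch.zeros(1, dtype=torch.long)        # positive is at index 0
    return F.cross_entropy(logits.unsqueeze(0), target)
```

Choosing $K$ then amounts to sweeping the number of rows in `negatives` and selecting the value that maximizes the training-effectiveness function.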
arXiv Detail & Related papers (2021-05-27T08:38:29Z) - Efficient Causal Inference from Combined Observational and Interventional Data through Causal Reductions [68.6505592770171]
Unobserved confounding is one of the main challenges when estimating causal effects.
We propose a novel causal reduction method that replaces an arbitrary number of possibly high-dimensional latent confounders with a single latent confounder.
We propose a learning algorithm to estimate the parameterized reduced model jointly from observational and interventional data.
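A very rough sketch of joint estimation from both data types, shrunk to a binary treatment T, outcome Y, and single binary latent confounder Z (the paper handles far more general, high-dimensional settings):

```python
# Sketch: fit a reduced model by maximum likelihood on combined data.
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

def neg_log_lik(theta, obs, intv):
    # theta -> pi = p(Z=1), a[z] = p(T=1|Z=z), b[t, z] = p(Y=1|T=t, Z=z)
    pi, a, b = expit(theta[0]), expit(theta[1:3]), expit(theta[3:7]).reshape(2, 2)
    pz = np.array([1.0 - pi, pi])
    nll = 0.0
    for t, y in obs:    # observational: p(t, y) = sum_z p(z) p(t|z) p(y|t,z)
        pt = np.where(t, a, 1.0 - a)
        py = np.where(y, b[t], 1.0 - b[t])
        nll -= np.log((pz * pt * py).sum())
    for t, y in intv:   # interventional: p(y | do(t)) = sum_z p(z) p(y|t,z)
        py = np.where(y, b[t], 1.0 - b[t])
        nll -= np.log((pz * py).sum())
    return nll

def fit_reduced_model(obs, intv):
    """obs, intv: lists of (t, y) pairs with t, y in {0, 1}."""
    return minimize(neg_log_lik, np.zeros(7), args=(obs, intv),
                    method="L-BFGS-B").x
```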
arXiv Detail & Related papers (2021-03-08T14:29:07Z)