When Benchmarks Leak: Inference-Time Decontamination for LLMs
- URL: http://arxiv.org/abs/2601.19334v1
- Date: Tue, 27 Jan 2026 08:19:40 GMT
- Title: When Benchmarks Leak: Inference-Time Decontamination for LLMs
- Authors: Jianzhe Chai, Yu Zhe, Jun Sakuma
- Abstract summary: We propose DeconIEP, a decontamination framework that operates entirely during evaluation by applying small, bounded perturbations in the input embedding space.
- Score: 4.071875179293035
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Benchmark-based evaluation is the de facto standard for comparing large language models (LLMs). However, its reliability is increasingly threatened by test set contamination, where test samples or their close variants leak into training data and artificially inflate reported performance. To address this issue, prior work has explored two main lines of mitigation. One line attempts to identify and remove contaminated benchmark items before evaluation, but this inevitably alters the evaluation set itself and becomes unreliable when contamination is moderate or severe. The other line preserves the benchmark and instead suppresses contaminated behavior at evaluation time; however, such interventions often interfere with normal inference and lead to noticeable performance degradation on clean inputs. We propose DeconIEP, a decontamination framework that operates entirely during evaluation by applying small, bounded perturbations in the input embedding space. Guided by a relatively less-contaminated reference model, DeconIEP learns an instance-adaptive perturbation generator that steers the evaluated model away from memorization-driven shortcut pathways. Across multiple open-weight LLMs and benchmarks, extensive empirical results show that DeconIEP achieves strong decontamination effectiveness while incurring only minimal degradation in benign utility.
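The abstract gives only a high-level description of the mechanism, so the following is a minimal sketch of the idea under stated assumptions: a small generator network produces a bounded, instance-adaptive perturbation of the input embeddings and is trained so that the evaluated model's output distribution on the perturbed inputs moves toward that of a less-contaminated reference model. The module design, the L-infinity bound, the KL objective, and the HuggingFace-style `inputs_embeds` interface are illustrative assumptions, not the paper's actual implementation.

```python
# Sketch of bounded input-embedding perturbation guided by a reference model.
# Hypothetical design; the paper's actual generator, bound, and loss may differ.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PerturbationGenerator(nn.Module):
    """Maps each token embedding to a small, bounded perturbation."""
    def __init__(self, hidden_dim: int, epsilon: float = 0.05):
        super().__init__()
        self.epsilon = epsilon
        self.net = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, hidden_dim),
        )

    def forward(self, embeds: torch.Tensor) -> torch.Tensor:
        # tanh keeps every perturbation inside an L-infinity ball of radius epsilon
        return self.epsilon * torch.tanh(self.net(embeds))


def generator_step(eval_model, ref_model, generator, optimizer, embeds, attention_mask):
    """One update of the generator: steer the evaluated model's predictions on the
    perturbed embeddings toward the (less-contaminated) reference model's predictions."""
    with torch.no_grad():
        ref_logits = ref_model(inputs_embeds=embeds, attention_mask=attention_mask).logits

    perturbed = embeds + generator(embeds)
    eval_logits = eval_model(inputs_embeds=perturbed, attention_mask=attention_mask).logits

    # KL divergence between the reference and evaluated output distributions
    loss = F.kl_div(
        F.log_softmax(eval_logits, dim=-1),
        F.softmax(ref_logits, dim=-1),
        reduction="batchmean",
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In this sketch the optimizer would be built over the generator's parameters only (e.g. `torch.optim.Adam(generator.parameters())`), so the evaluated model's weights stay untouched, consistent with the abstract's framing of a purely inference-time intervention.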
Related papers
- Contamination Detection for VLMs using Multi-Modal Semantic Perturbation [73.76465227729818]
Open-source Vision-Language Models (VLMs) have achieved state-of-the-art performance on benchmark tasks. Their pretraining corpora raise a critical concern for both practitioners and users: inflated performance due to test-set leakage. We show that existing detection approaches either fail outright or exhibit inconsistent behavior. We propose a novel, simple yet effective detection method based on multi-modal semantic perturbation.
arXiv Detail & Related papers (2025-11-05T18:59:52Z) - How Much Do Large Language Models Cheat on Evaluation? Benchmarking Overestimation under the One-Time-Pad-Based Framework [8.76693832650115]
Overestimation in evaluating large language models (LLMs) has become an increasing concern. We propose ArxivRoll, a dynamic evaluation framework inspired by one-time pad encryption in cryptography.
arXiv Detail & Related papers (2025-07-25T12:39:03Z) - Simulating Training Data Leakage in Multiple-Choice Benchmarks for LLM Evaluation [6.4212082894269535]
We compare existing leakage detection techniques, namely permutation and n-gram-based methods. Our analysis shows that the n-gram method consistently achieves the highest F1-score (a minimal sketch of such an n-gram overlap check appears after this list). We create cleaned versions of MMLU and HellaSwag, and re-evaluate several LLMs.
arXiv Detail & Related papers (2025-05-30T06:37:39Z) - PaCoST: Paired Confidence Significance Testing for Benchmark Contamination Detection in Large Language Models [41.772263447213234]
Large language models (LLMs) are known to be trained on vast amounts of data, which may unintentionally or intentionally include data from commonly used benchmarks. This inclusion can lead to deceptively high scores on model leaderboards, yet disappointing performance in real-world applications. We introduce PaCoST, a Paired Confidence Significance Testing method to effectively detect benchmark contamination in LLMs (see the paired-test sketch after this list).
arXiv Detail & Related papers (2024-06-26T13:12:40Z) - Inference-Time Decontamination: Reusing Leaked Benchmarks for Large Language Model Evaluation [61.350306618479365]
Leakage of benchmarks can prevent the accurate assessment of large language models' true performance.
We propose Inference-Time Decontamination (ITD) to address this issue.
ITD reduces inflated accuracy by 22.9% on GSM8K and 19.0% on MMLU.
arXiv Detail & Related papers (2024-06-20T04:35:59Z) - KIEval: A Knowledge-grounded Interactive Evaluation Framework for Large Language Models [53.84677081899392]
KIEval is a Knowledge-grounded Interactive Evaluation framework for large language models.
For the first time, it incorporates an LLM-powered "interactor" role to enable dynamic, contamination-resilient evaluation.
Extensive experiments on seven leading LLMs across five datasets validate KIEval's effectiveness and generalization.
arXiv Detail & Related papers (2024-02-23T01:30:39Z) - AMRFact: Enhancing Summarization Factuality Evaluation with AMR-Driven Negative Samples Generation [57.8363998797433]
We propose AMRFact, a framework that generates perturbed summaries using Abstract Meaning Representations (AMRs).
Our approach parses factually consistent summaries into AMR graphs and injects controlled factual inconsistencies to create negative examples, allowing coherent but factually inconsistent summaries to be generated with high error-type coverage.
arXiv Detail & Related papers (2023-11-16T02:56:29Z) - Rethinking Benchmark and Contamination for Language Models with Rephrased Samples [49.18977581962162]
Large language models are increasingly trained on all the data ever produced by humans.
Many have raised concerns about the trustworthiness of public benchmarks due to potential contamination in pre-training or fine-tuning datasets.
arXiv Detail & Related papers (2023-11-08T17:35:20Z) - NLP Evaluation in trouble: On the Need to Measure LLM Data Contamination for each Benchmark [19.875954121100005]
We argue that the classical evaluation on Natural Language Processing (NLP) tasks using annotated benchmarks is in trouble.
The worst kind of data contamination happens when a Large Language Model (LLM) is trained on the test split of a benchmark and then evaluated on the same benchmark.
This position paper defines different levels of data contamination and argues for a community effort to measure and report it.
arXiv Detail & Related papers (2023-10-27T09:48:29Z) - LLMs as Factual Reasoners: Insights from Existing Benchmarks and Beyond [135.8013388183257]
We propose a new protocol for inconsistency detection benchmark creation and implement it in a 10-domain benchmark called SummEdits.
Most LLMs struggle on SummEdits, with performance close to random chance.
The best-performing model, GPT-4, is still 8% below estimated human performance.
arXiv Detail & Related papers (2023-05-23T21:50:06Z)
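Picking up the forward reference from the n-gram entry above: a bare-bones version of an n-gram overlap contamination check can be written in a few lines. The 8-gram size, the 0.5 threshold, whitespace tokenization, and the placeholder corpus are arbitrary assumptions for illustration; the cited paper's exact procedure may differ.

```python
# Minimal n-gram overlap check: flag a benchmark item as likely leaked if a
# large fraction of its word n-grams also occur verbatim in the training corpus.
from typing import Set, Tuple

def ngrams(text: str, n: int = 8) -> Set[Tuple[str, ...]]:
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contamination_score(item: str, corpus_ngrams: Set[Tuple[str, ...]], n: int = 8) -> float:
    """Fraction of the item's n-grams found verbatim in the training corpus."""
    item_ngrams = ngrams(item, n)
    if not item_ngrams:
        return 0.0
    return len(item_ngrams & corpus_ngrams) / len(item_ngrams)

# Usage: build the corpus n-gram set once, then flag items above a threshold.
training_documents = ["example training document text goes here ..."]   # placeholder corpus
benchmark_items = ["example benchmark question text goes here ..."]     # placeholder test set
corpus_ngrams: Set[Tuple[str, ...]] = set()
for doc in training_documents:
    corpus_ngrams |= ngrams(doc)
leaked = [q for q in benchmark_items if contamination_score(q, corpus_ngrams) > 0.5]
```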
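And for the paired-test forward reference above: one way to realize a paired confidence comparison is a one-sided paired t-test on per-item confidences for original versus rephrased benchmark items. This is a sketch of the general statistical idea, assuming `scipy` and toy confidence values; PaCoST's actual construction of counterparts and its test statistic may differ.

```python
# Paired significance test on model confidence: significantly higher confidence
# on the original items than on rephrased counterparts hints at contamination.
from scipy import stats

def paired_confidence_test(conf_original, conf_rephrased, alpha: float = 0.05):
    """conf_original / conf_rephrased: per-item confidences in matching order."""
    t_stat, p_two_sided = stats.ttest_rel(conf_original, conf_rephrased)
    # convert to a one-sided p-value for the hypothesis "original > rephrased"
    p_one_sided = p_two_sided / 2 if t_stat > 0 else 1 - p_two_sided / 2
    return p_one_sided < alpha, p_one_sided

# Toy example: nearly identical confidences, so no contamination is flagged.
flagged, p_value = paired_confidence_test([0.92, 0.88, 0.95, 0.81],
                                          [0.90, 0.87, 0.93, 0.82])
```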