LNE-Blocking: An Efficient Framework for Contamination Mitigation Evaluation on Large Language Models
- URL: http://arxiv.org/abs/2509.15218v1
- Date: Thu, 18 Sep 2025 17:59:16 GMT
- Title: LNE-Blocking: An Efficient Framework for Contamination Mitigation Evaluation on Large Language Models
- Authors: Ruijie Hou, Yueyang Jiao, Hanxu Hu, Yingming Li, Wai Lam, Huajian Zhang, Hongyuan Lu
- Abstract summary: We propose a novel framework, LNE-Blocking, to restore model performance prior to contamination on potentially leaked datasets. Our framework is the first to efficiently restore the model's greedy decoding performance.
- Score: 42.94267844722955
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The problem of data contamination is now almost inevitable during the development of large language models (LLMs): evaluation benchmarks are commonly absorbed into training data, even unintentionally. This subsequently makes it hard to benchmark LLMs fairly. Instead of constructing contamination-free datasets (which is quite hard), we propose a novel framework, LNE-Blocking, to restore model performance prior to contamination on potentially leaked datasets. Our framework consists of two components: contamination detection and a disruption operation. For a given prompt, the framework first uses the contamination detection method, LNE, to assess the extent of contamination in the model. Based on this, it adjusts the intensity of the disruption operation, Blocking, to elicit non-memorized responses from the model. Our framework is the first to efficiently restore the model's greedy decoding performance. It achieves strong performance on multiple datasets with potential leakage risks, and it consistently achieves stable recovery results across different models and varying levels of data contamination. We release the code at https://github.com/RuijieH/LNE-Blocking to facilitate research.
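The two-stage recipe described in the abstract (measure the contamination signal, then scale the disruption accordingly) can be sketched as follows. This is an illustrative sketch only: the function names `lne_score` and `blocking_intensity`, the thresholds, and the use of length-normalized binary entropy of the top-token probability as a memorization proxy are all assumptions, not the paper's actual LNE statistic; see the linked repository for the real implementation.

```python
import math


def lne_score(top_token_probs):
    """Length-normalized entropy proxy over a generated continuation.

    ASSUMPTION: we score only the top token's probability at each step
    via its binary entropy. Near-deterministic (possibly memorized)
    continuations yield probabilities close to 1 and hence a low score.
    """
    entropy = sum(
        -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))
        for p in top_token_probs
        if 0.0 < p < 1.0
    )
    return entropy / len(top_token_probs)


def blocking_intensity(score, low=0.05, high=0.5, max_block=5):
    """Map a contamination signal to a blocking strength.

    Lower entropy -> more likely memorized -> block more of the model's
    top-ranked tokens during greedy decoding. The thresholds here are
    hypothetical, chosen only to make the mapping concrete.
    """
    if score >= high:          # diffuse distribution: no intervention
        return 0
    frac = max(0.0, (high - score) / (high - low))
    return min(max_block, round(frac * max_block))


# Diffuse top-token probabilities: low contamination signal, no blocking.
print(blocking_intensity(lne_score([0.5, 0.4, 0.6])))
# Near-deterministic probabilities: strong signal, aggressive blocking.
print(blocking_intensity(lne_score([0.99, 0.98, 0.97])))
```

In an actual decoding loop, the returned intensity would govern how many of the model's top-ranked (likely memorized) tokens are suppressed at each step before re-running greedy decoding.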
Related papers
- Benchmark Leakage Trap: Can We Trust LLM-based Recommendation? [9.574427977779235]
This paper identifies and investigates a previously overlooked issue: benchmark data leakage in LLM-based recommendation. Data leakage acts as a critical, previously unaccounted-for factor in LLM-based recommendation, which could impact the true model performance.
arXiv Detail & Related papers (2026-02-14T06:34:19Z) - VeriContaminated: Assessing LLM-Driven Verilog Coding for Data Contamination [15.52442661491358]
Large Language Models (LLMs) have revolutionized code generation, achieving exceptional results on various established benchmarking frameworks. However, concerns about data contamination raise questions about the validity of these evaluations. We analyze state-of-the-art (SOTA) evaluation frameworks for Verilog code generation.
arXiv Detail & Related papers (2025-03-17T12:26:49Z) - Dynamic Benchmarking of Reasoning Capabilities in Code Large Language Models Under Data Contamination [18.006532081289627]
We propose tool, a novel benchmarking suite for evaluating Code LLMs under potential data contamination. tool employs multiple agents to extract and modify the context without altering the core logic, generating semantically equivalent variations. Results show that tool effectively benchmarks reasoning capabilities under contamination risks while generating diverse problem sets to ensure consistent and reliable evaluations.
arXiv Detail & Related papers (2025-03-06T06:56:59Z) - A Survey on Data Contamination for Large Language Models [12.431575579432458]
Large Language Models (LLMs) have demonstrated significant progress in various areas, such as text generation and code synthesis. The reliability of performance evaluation has come under scrutiny due to data contamination.
arXiv Detail & Related papers (2025-02-20T10:23:27Z) - Preference Leakage: A Contamination Problem in LLM-as-a-judge [69.96778498636071]
Large Language Models (LLMs) as judges and LLM-based data synthesis have emerged as two fundamental LLM-driven data annotation methods. In this work, we expose preference leakage, a contamination problem in LLM-as-a-judge caused by the relatedness between the synthetic data generators and LLM-based evaluators.
arXiv Detail & Related papers (2025-02-03T17:13:03Z) - AntiLeakBench: Preventing Data Contamination by Automatically Constructing Benchmarks with Updated Real-World Knowledge [68.39683427262335]
Existing studies fail to guarantee contamination-free evaluation as newly collected data may contain pre-existing knowledge. We propose AntiLeak-Bench, an automated anti-leakage benchmarking framework.
arXiv Detail & Related papers (2024-12-18T09:53:12Z) - Both Text and Images Leaked! A Systematic Analysis of Data Contamination in Multimodal LLM [53.05486269607166]
Multimodal large language models (MLLMs) have significantly enhanced performance across benchmarks. Existing detection methods for unimodal large language models (LLMs) are inadequate for MLLMs due to multimodal data complexity and multi-phase training. We analyze multimodal data contamination using our analytical framework, MM-Detect, which defines two contamination categories: unimodal and cross-modal.
arXiv Detail & Related papers (2024-11-06T10:44:15Z) - Quantifying Contamination in Evaluating Code Generation Capabilities of Language Models [27.24738197172374]
Large language models have achieved remarkable performance on various code generation benchmarks.
There have been growing concerns regarding potential contamination of these benchmarks as they may be leaked into pretraining and finetuning data.
We show that there is substantial overlap between popular code generation benchmarks and open training corpora, and models perform significantly better on the subset of the benchmarks where similar solutions are seen during training.
arXiv Detail & Related papers (2024-03-06T21:45:35Z) - KIEval: A Knowledge-grounded Interactive Evaluation Framework for Large Language Models [53.84677081899392]
KIEval is a Knowledge-grounded Interactive Evaluation framework for large language models.
It incorporates an LLM-powered "interactor" role for the first time to accomplish a dynamic contamination-resilient evaluation.
Extensive experiments on seven leading LLMs across five datasets validate KIEval's effectiveness and generalization.
arXiv Detail & Related papers (2024-02-23T01:30:39Z) - Rethinking Benchmark and Contamination for Language Models with Rephrased Samples [49.18977581962162]
Large language models are increasingly trained on all the data ever produced by humans.
Many have raised concerns about the trustworthiness of public benchmarks due to potential contamination in pre-training or fine-tuning datasets.
arXiv Detail & Related papers (2023-11-08T17:35:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.