LastingBench: Defend Benchmarks Against Knowledge Leakage
- URL: http://arxiv.org/abs/2506.21614v1
- Date: Sat, 21 Jun 2025 13:01:04 GMT
- Title: LastingBench: Defend Benchmarks Against Knowledge Leakage
- Authors: Yixiong Fang, Tianran Sun, Yuling Shi, Min Wang, Xiaodong Gu
- Abstract summary: The increasing complexity of large language models (LLMs) raises concerns about their ability to "cheat" on standard Question Answering (QA) benchmarks by memorizing task-specific data. This undermines the validity of benchmark evaluations, as they no longer reflect genuine model capabilities but instead the effects of data leakage. LastingBench is a novel framework designed to continuously reinforce and safeguard existing benchmarks against knowledge leakage.
- Score: 5.476393238638673
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The increasing complexity of large language models (LLMs) raises concerns about their ability to "cheat" on standard Question Answering (QA) benchmarks by memorizing task-specific data. This undermines the validity of benchmark evaluations, as they no longer reflect genuine model capabilities but instead the effects of data leakage. While prior work has focused on detecting such leakage, little attention has been given to mitigating its impact and preserving the long-term utility of benchmarks. In this paper, we introduce LastingBench, a novel framework designed to continuously reinforce and safeguard existing benchmarks against knowledge leakage. LastingBench identifies leakage points in the context through perturbation, then rewrites the leakage points to counterfactual ones, disrupting memorization while preserving the benchmark's original evaluative intent. Evaluations of state-of-the-art QA benchmarks show significant performance gaps, highlighting the efficacy of LastingBench in reducing memorization effects. LastingBench offers a practical and scalable solution to ensure benchmark robustness over time, promoting fairer and more interpretable evaluations of LLMs.
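The abstract describes the approach only at a high level: perturb the context to find leakage points, then rewrite those points into counterfactuals. The sketch below illustrates one way such a perturb-and-rewrite loop could look; the `query_model` and `rewrite_counterfactual` helpers, the span-dropping perturbation, and the exact-match check are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a perturb-and-rewrite loop for hardening a QA example.
# `query_model` and `rewrite_counterfactual` are hypothetical placeholders
# for LLM calls; the real LastingBench pipeline may differ substantially.

from dataclasses import dataclass


@dataclass
class QAExample:
    question: str
    context_spans: list[str]  # evidence passages the answer should rely on
    answer: str


def query_model(question: str, context: str) -> str:
    """Hypothetical LLM call: answer `question` given only `context`."""
    raise NotImplementedError


def rewrite_counterfactual(span: str, answer: str) -> tuple[str, str]:
    """Hypothetical LLM call: rewrite `span` so it supports a different,
    counterfactual answer; return the new span and the new gold answer."""
    raise NotImplementedError


def harden_example(ex: QAExample) -> QAExample:
    """Perturb the context span by span; if the model still answers correctly
    with a span removed, treat that span as a leakage point (answered from
    memory rather than evidence) and replace it with a counterfactual rewrite."""
    spans, answer = list(ex.context_spans), ex.answer
    for i, span in enumerate(spans):
        perturbed = " ".join(spans[:i] + spans[i + 1:])  # drop one evidence span
        if query_model(ex.question, perturbed).strip() == answer:
            spans[i], answer = rewrite_counterfactual(span, answer)
    return QAExample(ex.question, spans, answer)
```

Updating the gold answer alongside the rewritten span is what would let the rewritten benchmark keep its original evaluative intent while breaking memorized shortcuts, which matches the goal stated in the abstract.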
Related papers
- How Much Do Large Language Models Cheat on Evaluation? Benchmarking Overestimation under the One-Time-Pad-Based Framework [8.76693832650115]
Overestimation in evaluating large language models (LLMs) has become an increasing concern. We propose ArxivRoll, a dynamic evaluation framework inspired by one-time pad encryption in cryptography.
arXiv Detail & Related papers (2025-07-25T12:39:03Z) - The SWE-Bench Illusion: When State-of-the-Art LLMs Remember Instead of Reason [1.6249398255272318]
We introduce two diagnostic tasks: file path identification from issue descriptions alone, and ground-truth function reproduction with only the current file context and issue description. We show that state-of-the-art models achieve up to 76% accuracy in identifying buggy file paths using only issue descriptions, without access to repository structure. Performance drops to at most 53% on tasks from repositories not included in SWE-Bench, pointing to possible data contamination or memorization.
arXiv Detail & Related papers (2025-06-14T00:25:26Z) - Jailbreak Distillation: Renewable Safety Benchmarking [42.07193013496905]
Large language models (LLMs) are rapidly deployed in critical applications, raising urgent needs for robust safety benchmarking. We propose Jailbreak Distillation (JBDistill), a novel benchmark construction framework that "distills" jailbreak attacks into high-quality and easily updatable safety benchmarks.
arXiv Detail & Related papers (2025-05-28T06:59:46Z) - Log-Augmented Generation: Scaling Test-Time Reasoning with Reusable Computation [80.69067017594709]
Large language models (LLMs) and their agentic counterparts struggle to retain reasoning from previous tasks. We propose a novel framework, log-augmented generation (LAG), that directly reuses prior computation and reasoning from past logs at test time. Our method significantly outperforms standard agentic systems that do not utilize logs.
arXiv Detail & Related papers (2025-05-20T14:14:38Z) - PredictaBoard: Benchmarking LLM Score Predictability [50.47497036981544]
Large Language Models (LLMs) often fail unpredictably. This poses a significant challenge to ensuring their safe deployment. We present PredictaBoard, a novel collaborative benchmarking framework.
arXiv Detail & Related papers (2025-02-20T10:52:38Z) - Do Large Language Model Benchmarks Test Reliability? [66.1783478365998]
We investigate how well current benchmarks quantify model reliability. Motivated by this gap in the evaluation of reliability, we propose the concept of so-called platinum benchmarks. We evaluate a wide range of models on these platinum benchmarks and find that, indeed, frontier LLMs still exhibit failures on simple tasks.
arXiv Detail & Related papers (2025-02-05T18:58:19Z) - AntiLeakBench: Preventing Data Contamination by Automatically Constructing Benchmarks with Updated Real-World Knowledge [68.39683427262335]
Existing studies fail to guarantee contamination-free evaluation, as newly collected data may contain pre-existing knowledge. We propose AntiLeak-Bench, an automated anti-leakage benchmarking framework.
arXiv Detail & Related papers (2024-12-18T09:53:12Z) - Do These LLM Benchmarks Agree? Fixing Benchmark Evaluation with BenchBench [15.565644819269803]
We show how some overlooked methodological choices can significantly influence Benchmark Agreement Testing (BAT) results.
We introduce BenchBench, a Python package for BAT, and release the BenchBench-leaderboard, a meta-benchmark designed to evaluate benchmarks using their peers.
arXiv Detail & Related papers (2024-07-18T17:00:23Z) - Jailbreaking as a Reward Misspecification Problem [80.52431374743998]
We propose a novel perspective that attributes this vulnerability to reward misspecification during the alignment process. We introduce a metric, ReGap, to quantify the extent of reward misspecification and demonstrate its effectiveness. We present ReMiss, a system for automated red teaming that generates adversarial prompts in a reward-misspecified space.
arXiv Detail & Related papers (2024-06-20T15:12:27Z) - Efficient Benchmarking of Language Models [22.696230279151166]
We present the problem of Efficient Benchmarking, namely, intelligently reducing the costs of LM evaluation without compromising reliability.
Using the HELM benchmark as a test case, we investigate how different benchmark design choices affect the computation-reliability trade-off.
We propose an evaluation algorithm that, when applied to the HELM benchmark, leads to dramatic cost savings with minimal loss of benchmark reliability.
arXiv Detail & Related papers (2023-08-22T17:59:30Z) - LLMs as Factual Reasoners: Insights from Existing Benchmarks and Beyond [135.8013388183257]
We propose a new protocol for inconsistency detection benchmark creation and implement it in a 10-domain benchmark called SummEdits.
Most LLMs struggle on SummEdits, with performance close to random chance.
The best-performing model, GPT-4, is still 8% below estimated human performance.
arXiv Detail & Related papers (2023-05-23T21:50:06Z)
This list is automatically generated from the titles and abstracts of the papers on this site.