AntiLeak-Bench: Preventing Data Contamination by Automatically Constructing Benchmarks with Updated Real-World Knowledge
- URL: http://arxiv.org/abs/2412.13670v1
- Date: Wed, 18 Dec 2024 09:53:12 GMT
- Title: AntiLeak-Bench: Preventing Data Contamination by Automatically Constructing Benchmarks with Updated Real-World Knowledge
- Authors: Xiaobao Wu, Liangming Pan, Yuxi Xie, Ruiwen Zhou, Shuai Zhao, Yubo Ma, Mingzhe Du, Rui Mao, Anh Tuan Luu, William Yang Wang
- Abstract summary: Existing studies fail to guarantee contamination-free evaluation as newly collected data may contain pre-existing knowledge.
We propose AntiLeak-Bench, an automated anti-leakage benchmarking framework.
- Score: 68.39683427262335
- License:
- Abstract: Data contamination hinders fair LLM evaluation by introducing test data into newer models' training sets. Existing studies solve this challenge by updating benchmarks with newly collected data. However, they fail to guarantee contamination-free evaluation, as the newly collected data may contain pre-existing knowledge, and their benchmark updates rely on intensive human labor. To address these issues, we propose AntiLeak-Bench, an automated anti-leakage benchmarking framework. Instead of simply using newly collected data, we construct samples with explicitly new knowledge absent from LLMs' training sets, which ensures strictly contamination-free evaluation. We further design a fully automated workflow to build and update our benchmark without human labor. This significantly reduces the cost of benchmark maintenance to accommodate emerging LLMs. Through extensive experiments, we highlight that data contamination likely exists before LLMs' cutoff time and demonstrate that AntiLeak-Bench effectively overcomes this challenge.
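The construction step described above can be illustrated with a minimal, hypothetical sketch: only facts that became true after the target model's training cutoff are turned into QA samples, so the answers cannot have been seen during training. The Fact schema, field names, and question template below are assumptions for illustration, not the paper's actual pipeline or data source.

```python
# Minimal sketch (not the paper's actual pipeline) of the anti-leakage idea:
# keep only facts whose real-world validity begins after the model's training
# cutoff, so the answer cannot have appeared in the training data.
# The Fact schema, field names, and question template are hypothetical.
from dataclasses import dataclass
from datetime import date

@dataclass
class Fact:
    subject: str
    relation: str
    obj: str
    valid_from: date  # date on which this fact became true in the real world

def build_contamination_free_samples(facts: list[Fact], cutoff: date) -> list[dict]:
    """Turn post-cutoff facts into QA samples; pre-cutoff facts are discarded."""
    samples = []
    for f in facts:
        if f.valid_from <= cutoff:
            continue  # the model may already know this fact; skip it
        samples.append({
            "question": f"As of {f.valid_from.isoformat()}, what is the {f.relation} of {f.subject}?",
            "answer": f.obj,
            "needs_supporting_document": True,  # evaluated with evidence in context
        })
    return samples

if __name__ == "__main__":
    facts = [
        Fact("ExampleTeam", "head coach", "A. Person", date(2025, 3, 1)),  # after cutoff
        Fact("ExampleCity", "mayor", "B. Person", date(2021, 6, 1)),       # before cutoff
    ]
    print(build_contamination_free_samples(facts, cutoff=date(2023, 12, 31)))
```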
Related papers
- Leveraging Online Olympiad-Level Math Problems for LLMs Training and Contamination-Resistant Evaluation [55.21013307734612]
AoPS-Instruct is a dataset of more than 600,000 high-quality QA pairs.
LiveAoPSBench is an evolving evaluation set with timestamps, derived from the latest forum data.
Our work presents a scalable approach to creating and maintaining large-scale, high-quality datasets for advanced math reasoning.
arXiv Detail & Related papers (2025-01-24T06:39:38Z)
- Addressing Data Leakage in HumanEval Using Combinatorial Test Design [1.9336815376402723]
The HumanEval benchmark is composed of 164 hand-crafted tasks.
A major barrier to a fair evaluation of LLMs using benchmarks like HumanEval is data contamination.
We propose a new benchmark construction method in which a benchmark is composed of template tasks that can be instantiated into new concrete tasks (as sketched below).
arXiv Detail & Related papers (2024-12-02T14:18:32Z)
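Below is a hypothetical sketch of the template idea from the combinatorial test design entry above: one template task is instantiated over a small parameter grid into many concrete variants, so memorizing any single problem helps little. The template text and parameter values are illustrative assumptions, not the paper's actual templates.

```python
# Hypothetical sketch of template-based benchmark construction: one template
# task is expanded over a parameter grid into many concrete variants, so
# memorizing any single problem instance gains little. The template text and
# parameter values are illustrative, not the paper's actual design.
from itertools import product

TEMPLATE = (
    "def solve(xs: list[int]) -> int:\n"
    '    """Return the {aggregate} of all elements in xs that are {predicate}."""\n'
)

PARAMETERS = {
    "aggregate": ["sum", "product", "maximum"],
    "predicate": ["even", "odd", "positive"],
}

def instantiate_tasks(template: str, parameters: dict[str, list[str]]) -> list[str]:
    """Expand a template over the Cartesian product of its parameter values."""
    names = list(parameters)
    return [
        template.format(**dict(zip(names, combo)))
        for combo in product(*(parameters[n] for n in names))
    ]

if __name__ == "__main__":
    tasks = instantiate_tasks(TEMPLATE, PARAMETERS)
    print(f"{len(tasks)} concrete tasks from one template")
    print(tasks[0])
```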
- LiveBench: A Challenging, Contamination-Free LLM Benchmark [101.21578097087699]
We release LiveBench, the first benchmark that contains frequently-updated questions from recent information sources.
We evaluate many prominent closed-source models, as well as dozens of open-source models ranging from 0.5B to 110B in size.
Questions will be added and updated on a monthly basis, and we will release new tasks and harder versions of tasks over time.
arXiv Detail & Related papers (2024-06-27T16:47:42Z)
- CLEAN-EVAL: Clean Evaluation on Contaminated Large Language Models [12.367149496971408]
Clean-Eval mitigates the issue of data contamination and evaluates the models in a cleaner manner.
Clean-Eval employs an LLM to paraphrase and back-translate the contaminated data into a candidate set.
A semantic detector is then used to filter out low-quality generated samples.
The best candidate is finally selected from this set based on its BLEURT score (a sketch of this pipeline follows the entry).
arXiv Detail & Related papers (2023-11-15T17:50:30Z)
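A hedged sketch of a Clean-Eval-style rewriting step is given below. The paraphraser, semantic filter, and quality scorer are caller-supplied stand-ins; no real LLM or BLEURT API is assumed, and the function names are hypothetical.

```python
# Hedged sketch of a Clean-Eval-style decontamination step: paraphrase a
# possibly contaminated test item into candidates, drop unfaithful ones with
# a semantic filter, and pick the best remaining candidate by a BLEURT-like
# score. All three callables are caller-supplied stand-ins; no real LLM or
# BLEURT API is assumed here.
from typing import Callable

def clean_eval_rewrite(
    item: str,
    paraphrase: Callable[[str], list[str]],           # e.g. LLM paraphrase + back-translation
    is_semantically_faithful: Callable[[str, str], bool],
    quality_score: Callable[[str, str], float],       # e.g. a BLEURT-like score
) -> str:
    """Return a decontaminated rewrite of `item`, or `item` itself if no candidate survives."""
    candidates = [c for c in paraphrase(item) if is_semantically_faithful(item, c)]
    if not candidates:
        return item
    return max(candidates, key=lambda c: quality_score(item, c))

if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end.
    def toy_paraphrase(s): return [s.replace("big", "large"), s.upper()]
    def toy_filter(orig, cand): return not cand.isupper()          # reject degenerate rewrites
    def toy_score(orig, cand): return -abs(len(orig) - len(cand))  # prefer similar length
    print(clean_eval_rewrite("How big is the Earth?", toy_paraphrase, toy_filter, toy_score))
```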
- Rethinking Benchmark and Contamination for Language Models with Rephrased Samples [49.18977581962162]
Large language models are increasingly trained on all the data ever produced by humans.
Many have raised concerns about the trustworthiness of public benchmarks due to potential contamination in pre-training or fine-tuning datasets.
arXiv Detail & Related papers (2023-11-08T17:35:20Z)
- NLP Evaluation in trouble: On the Need to Measure LLM Data Contamination for each Benchmark [19.875954121100005]
We argue that the classical evaluation on Natural Language Processing (NLP) tasks using annotated benchmarks is in trouble.
The worst kind of data contamination happens when a Large Language Model (LLM) is trained on the test split of a benchmark and then evaluated on that same benchmark.
This position paper defines different levels of data contamination and argues for a community effort.
arXiv Detail & Related papers (2023-10-27T09:48:29Z)
- ReEval: Automatic Hallucination Evaluation for Retrieval-Augmented Large Language Models via Transferable Adversarial Attacks [91.55895047448249]
This paper presents ReEval, an LLM-based framework using prompt chaining to perturb the original evidence for generating new test cases.
We implement ReEval using ChatGPT and evaluate the resulting variants of two popular open-domain QA datasets.
Our generated data is human-readable and useful for triggering hallucinations in large language models (a sketch of the prompt chain follows this entry).
arXiv Detail & Related papers (2023-10-19T06:37:32Z)
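The prompt-chaining idea behind ReEval can be sketched as a sequence of dependent LLM calls, as below. The call_llm callable is a placeholder for whatever completion API is used, and the prompt wording is an illustrative assumption rather than the paper's actual prompts.

```python
# Hedged sketch of prompt chaining to perturb retrieved evidence, in the
# spirit of ReEval: each step's output feeds the next prompt, producing a new
# evidence/answer pair for probing hallucination. `call_llm` is a placeholder
# for whatever completion API is used; the prompts are illustrative.
from typing import Callable

def perturb_evidence(
    question: str,
    evidence: str,
    original_answer: str,
    call_llm: Callable[[str], str],
) -> dict:
    """Chain three prompts: invent a plausible alternative answer, rewrite the
    evidence to support it, then check that the rewrite still entails it."""
    alt_answer = call_llm(
        f"Question: {question}\nOriginal answer: {original_answer}\n"
        "Give a different but plausible answer of the same type."
    )
    new_evidence = call_llm(
        f"Rewrite this passage so it supports the answer '{alt_answer}' instead of "
        f"'{original_answer}', changing as little as possible:\n{evidence}"
    )
    verdict = call_llm(
        f"Does this passage state that the answer to '{question}' is '{alt_answer}'? "
        f"Answer yes or no:\n{new_evidence}"
    )
    return {
        "question": question,
        "evidence": new_evidence,
        "answer": alt_answer,
        "passed_check": verdict.strip().lower().startswith("yes"),
    }
```

The intent is that placing the perturbed evidence in a model's retrieval context tests whether it follows the evidence or falls back on a memorized answer.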
- Data Contamination Through the Lens of Time [21.933771085956426]
Claims about large language models (LLMs) are often supported by evaluating them on publicly available benchmarks.
This practice raises concerns of data contamination, i.e., evaluating on examples that are explicitly or implicitly included in the training data.
We conduct the first thorough longitudinal analysis of data contamination in LLMs by using the natural experiment of training cutoffs in GPT models (see the sketch after this entry).
arXiv Detail & Related papers (2023-10-16T17:51:29Z)
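The natural-experiment design mentioned above can be illustrated with a small sketch that splits benchmark examples by creation date relative to a model's training cutoff and compares accuracy on the two halves; a markedly higher pre-cutoff accuracy is consistent with contamination. The example records and field names are hypothetical.

```python
# Minimal sketch of the natural-experiment design: split benchmark examples by
# creation date relative to a model's training cutoff and compare accuracy on
# the two halves. The example records and field names are hypothetical.
from datetime import date

def split_by_cutoff(examples: list[dict], cutoff: date) -> tuple[list[dict], list[dict]]:
    """Partition examples into those created on or before the cutoff and those after."""
    before = [e for e in examples if e["created"] <= cutoff]
    after = [e for e in examples if e["created"] > cutoff]
    return before, after

def accuracy(examples: list[dict]) -> float:
    return sum(e["correct"] for e in examples) / max(len(examples), 1)

if __name__ == "__main__":
    results = [
        {"created": date(2021, 5, 1), "correct": True},
        {"created": date(2021, 8, 1), "correct": True},
        {"created": date(2023, 2, 1), "correct": False},
        {"created": date(2023, 4, 1), "correct": True},
    ]
    before, after = split_by_cutoff(results, cutoff=date(2021, 9, 1))
    print(f"pre-cutoff accuracy:  {accuracy(before):.2f}")
    print(f"post-cutoff accuracy: {accuracy(after):.2f}")
```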
- LLMs as Factual Reasoners: Insights from Existing Benchmarks and Beyond [135.8013388183257]
We propose a new protocol for inconsistency detection benchmark creation and implement it in a 10-domain benchmark called SummEdits.
Most LLMs struggle on SummEdits, with performance close to random chance.
The best-performing model, GPT-4, is still 8% below estimated human performance.
arXiv Detail & Related papers (2023-05-23T21:50:06Z)