NLP Evaluation in trouble: On the Need to Measure LLM Data Contamination for each Benchmark
- URL: http://arxiv.org/abs/2310.18018v1
- Date: Fri, 27 Oct 2023 09:48:29 GMT
- Title: NLP Evaluation in trouble: On the Need to Measure LLM Data Contamination for each Benchmark
- Authors: Oscar Sainz, Jon Ander Campos, Iker García-Ferrero, Julen Etxaniz, Oier Lopez de Lacalle, Eneko Agirre
- Abstract summary: We argue that the classical evaluation of Natural Language Processing (NLP) tasks using annotated benchmarks is in trouble.
The worst kind of data contamination happens when a Large Language Model (LLM) is trained on the test split of a benchmark and then evaluated on the same benchmark.
This position paper defines different levels of data contamination and argues for a community effort.
- Score: 19.875954121100005
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: In this position paper, we argue that the classical evaluation of Natural
Language Processing (NLP) tasks using annotated benchmarks is in trouble. The
worst kind of data contamination happens when a Large Language Model (LLM) is
trained on the test split of a benchmark and then evaluated on the same
benchmark. The extent of the problem is unknown, as it is not straightforward
to measure. Contamination causes an overestimation of the performance of a
contaminated model on the target benchmark and its associated task with respect
to non-contaminated counterparts. The consequences can be very harmful, with
incorrect scientific conclusions being published while correct ones are
discarded. This position paper defines different levels of data contamination
and argues for a community effort, including the development of automatic and
semi-automatic measures to detect when data from a benchmark was exposed to a
model, and suggestions for flagging papers whose conclusions are
compromised by data contamination.
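The abstract calls for automatic measures that detect when benchmark data was exposed to a model. As one concrete illustration, the sketch below flags test examples that share a long n-gram with a training corpus, in the spirit of the n-gram decontamination checks reported alongside several LLM releases; the 13-gram window and the helper names are illustrative choices, not a method from this paper.

```python
# A minimal sketch of an automatic contamination check, assuming access to the
# training corpus. It flags test examples that share a long n-gram with the
# corpus. The 13-gram window is an illustrative convention, not this paper's.

def ngrams(tokens, n=13):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contaminated_examples(test_examples, training_docs, n=13):
    """Return indices of test examples sharing an n-gram with training data."""
    corpus_ngrams = set()
    for doc in training_docs:
        corpus_ngrams |= ngrams(doc.lower().split(), n)
    flagged = []
    for i, example in enumerate(test_examples):
        if ngrams(example.lower().split(), n) & corpus_ngrams:
            flagged.append(i)
    return flagged

# Example: a test item copied verbatim into the training corpus is flagged.
train = ["... the quick brown fox jumps over the lazy dog near the river bank today ..."]
test = ["the quick brown fox jumps over the lazy dog near the river bank today",
        "an unrelated question"]
print(contaminated_examples(test, train))  # -> [0]
```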
Related papers
- Evaluation data contamination in LLMs: how do we measure it and (when) does it matter? [10.691754344782387]
It is difficult to define precisely which samples should be considered contaminated, and how contamination impacts benchmark scores.
We propose a novel analysis method called ConTAM and present a large-scale survey of evaluation data contamination metrics.
We find that contamination may have a much larger effect than reported in recent LLM releases and that it benefits models differently at different scales.
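The abstract does not spell out ConTAM's definitions, but a per-sample contamination metric of the kind such surveys compare can be sketched as the fraction of a sample's n-grams found in the training corpus, swept over n; every function name and value below is an illustrative assumption.

```python
# A hedged sketch of a per-sample contamination score: the fraction of a
# sample's n-grams that appear in the training corpus, computed for several n.
# The n values are illustrative; ConTAM's actual definitions are in the paper.

def corpus_ngram_set(docs, n):
    out = set()
    for doc in docs:
        toks = doc.split()
        out |= {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}
    return out

def overlap_fraction(sample_tokens, corpus_ngrams, n):
    grams = [tuple(sample_tokens[i:i + n]) for i in range(len(sample_tokens) - n + 1)]
    if not grams:
        return 0.0
    return sum(g in corpus_ngrams for g in grams) / len(grams)

# Sweeping n (and a decision threshold) is where such analyses differ.
docs = ["the treaty was signed in 1648 in westphalia"]
sample = "the treaty was signed in 1648".split()
for n in (2, 4):
    print(n, overlap_fraction(sample, corpus_ngram_set(docs, n), n))
```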
arXiv Detail & Related papers (2024-11-06T13:54:08Z)
- PaCoST: Paired Confidence Significance Testing for Benchmark Contamination Detection in Large Language Models [41.772263447213234]
Large language models (LLMs) are known to be trained on vast amounts of data, which may unintentionally or intentionally include data from commonly used benchmarks.
This inclusion can lead to deceptively high scores on model leaderboards, yet disappointing performance in real-world applications.
We introduce PaCoST, Paired Confidence Significance Testing, to effectively detect benchmark contamination in LLMs.
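The abstract gives only the high-level idea, but a paired significance test over model confidence can be sketched as follows: given the model's confidence on each original benchmark item and on a semantically equivalent rephrasing of it, a one-sided paired t-test asks whether confidence on the originals is suspiciously higher. SciPy's `ttest_rel` stands in for whatever statistic PaCoST actually uses.

```python
# A minimal sketch of paired confidence testing in the spirit of PaCoST.
# Assumes confidences for original items and their rephrasings are already
# collected; PaCoST's pairing procedure and exact statistic may differ.
# Requires scipy >= 1.6 for the `alternative` argument.
from scipy.stats import ttest_rel

def contamination_test(conf_original, conf_rephrased, alpha=0.05):
    """Flag contamination if confidence on originals is significantly higher."""
    stat, p = ttest_rel(conf_original, conf_rephrased, alternative="greater")
    return {"t": stat, "p": p, "contaminated": p < alpha}

# Made-up confidences: memorized originals score markedly higher.
print(contamination_test([0.95, 0.91, 0.97, 0.90], [0.62, 0.55, 0.70, 0.60]))
```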
arXiv Detail & Related papers (2024-06-26T13:12:40Z)
- ConStat: Performance-Based Contamination Detection in Large Language Models [7.305342793164905]
ConStat is a statistical method that reliably detects and quantifies contamination by comparing performance between a primary and reference benchmark relative to a set of reference models.
We demonstrate the effectiveness of ConStat in an extensive evaluation of diverse model architectures, benchmarks, and contamination scenarios.
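A rough sketch of the comparison the abstract describes: fit the relationship between reference-benchmark and primary-benchmark scores across trusted reference models, then ask whether a suspect model's primary score sits implausibly above what its reference score predicts. The linear fit and the residual-based threshold below are simplifications, not ConStat's actual estimator.

```python
# A simplified, ConStat-like check: reference models establish the expected
# primary-vs-reference score relationship; a suspect model far above that
# trend on the primary benchmark is flagged. Not ConStat's real statistic.
import numpy as np

def constat_like_check(ref_scores, primary_scores, suspect_ref, suspect_primary):
    slope, intercept = np.polyfit(ref_scores, primary_scores, 1)
    residuals = np.array(primary_scores) - (slope * np.array(ref_scores) + intercept)
    predicted = slope * suspect_ref + intercept
    excess = suspect_primary - predicted
    # Flag if the suspect's excess is far outside the reference residual spread.
    return excess, excess > 3 * residuals.std()

# Reference models score similarly on both benchmarks; the suspect overshoots.
excess, flagged = constat_like_check(
    [0.5, 0.6, 0.7, 0.8], [0.52, 0.61, 0.69, 0.79], 0.6, 0.85)
print(round(excess, 2), flagged)  # -> 0.24 True
```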
arXiv Detail & Related papers (2024-05-25T15:36:37Z)
- Evaluating Generative Language Models in Information Extraction as Subjective Question Correction [49.729908337372436]
Inspired by the principles of subjective question correction, we propose a new evaluation method, SQC-Score.
Results on three information extraction tasks show that human annotators prefer SQC-Score over the baseline metrics.
arXiv Detail & Related papers (2024-04-04T15:36:53Z)
- KIEval: A Knowledge-grounded Interactive Evaluation Framework for Large Language Models [53.84677081899392]
KIEval is a Knowledge-grounded Interactive Evaluation framework for large language models.
It is the first to incorporate an LLM-powered "interactor" role, enabling dynamic, contamination-resilient evaluation.
Extensive experiments on seven leading LLMs across five datasets validate KIEval's effectiveness and generalization.
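A minimal sketch of what an LLM-powered interactor loop could look like, assuming a hypothetical `chat(messages) -> str` wrapper around any LLM API; KIEval's actual prompts and scoring live in the paper. The point is that follow-up questions are generated from the candidate's own answers, so a memorized benchmark answer alone cannot carry the conversation.

```python
# A hedged sketch of an interactive, contamination-resilient evaluation loop.
# `chat` is a hypothetical LLM wrapper; prompts and turn count are illustrative.
def interactive_eval(chat, seed_question, turns=3):
    transcript = [("evaluator", seed_question)]
    for _ in range(turns):
        # The candidate model answers the latest evaluator question.
        answer = chat([{"role": "user", "content": transcript[-1][1]}])
        transcript.append(("candidate", answer))
        # The evaluator LLM probes the answer itself, not a fixed test item.
        follow_up = chat([{
            "role": "user",
            "content": f"As an examiner, ask one probing follow-up question about: {answer}",
        }])
        transcript.append(("evaluator", follow_up))
    return transcript  # e.g., passed to a judge model for scoring
```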
arXiv Detail & Related papers (2024-02-23T01:30:39Z)
- AMRFact: Enhancing Summarization Factuality Evaluation with AMR-Driven Negative Samples Generation [57.8363998797433]
We propose AMRFact, a framework that generates perturbed summaries using Abstract Meaning Representations (AMRs).
Our approach parses factually consistent summaries into AMR graphs and injects controlled factual inconsistencies to create negative examples, generating coherent yet factually inconsistent summaries with high error-type coverage.
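As a toy illustration of the perturbation step, the snippet below represents a summary's meaning as hand-written AMR-style triples and swaps agent and patient roles, one classic controlled factual inconsistency; AMRFact itself uses a real AMR parser and regenerates text from the perturbed graph.

```python
# Toy AMR-style perturbation: swapping :ARG0 and :ARG1 turns "police arrested
# the suspect" into "the suspect arrested police". Hand-written triples stand
# in for a real AMR parse; AMRFact's pipeline is more elaborate.
def swap_agent_patient(triples):
    swapped = []
    for head, role, tail in triples:
        if role == ":ARG0":
            swapped.append((head, ":ARG1", tail))
        elif role == ":ARG1":
            swapped.append((head, ":ARG0", tail))
        else:
            swapped.append((head, role, tail))
    return swapped

consistent = [("arrest-01", ":ARG0", "police"), ("arrest-01", ":ARG1", "suspect")]
print(swap_agent_patient(consistent))  # agent/patient roles now inverted
```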
arXiv Detail & Related papers (2023-11-16T02:56:29Z)
- Rethinking Benchmark and Contamination for Language Models with Rephrased Samples [49.18977581962162]
Large language models are increasingly trained on all the data ever produced by humans.
Many have raised concerns about the trustworthiness of public benchmarks due to potential contamination in pre-training or fine-tuning datasets.
arXiv Detail & Related papers (2023-11-08T17:35:20Z)
- Time Travel in LLMs: Tracing Data Contamination in Large Language Models [29.56037518816495]
We propose a straightforward yet effective method for identifying data contamination within large language models (LLMs).
At its core, our approach starts by identifying potential contamination at the instance level.
To estimate contamination of individual instances, we employ a "guided instruction": a prompt consisting of the dataset name, partition type, and a random-length initial segment of a reference instance.
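A sketch of that guided-instruction probe: reveal the dataset name, split, and a random-length prefix of a test instance, then check whether the model's continuation reproduces the held-back remainder. `complete` is a hypothetical single-prompt LLM wrapper, and the similarity measure and threshold are stand-ins for the paper's evaluation.

```python
# A sketch of the guided-instruction contamination probe. The prompt wording,
# `complete` wrapper, and similarity threshold are illustrative assumptions.
import random
from difflib import SequenceMatcher

def guided_instruction_probe(complete, dataset, split, instance, threshold=0.8):
    words = instance.split()  # assumes the instance has at least two words
    cut = random.randint(1, len(words) - 1)
    prefix, reference_tail = " ".join(words[:cut]), " ".join(words[cut:])
    prompt = (
        f"Instruction: You are given the first piece of an instance from the "
        f"{split} split of the {dataset} dataset. Complete it exactly as it "
        f"appears in the dataset.\nInstance: {prefix}"
    )
    completion = complete(prompt)
    # Near-verbatim reproduction of the held-back tail suggests memorization.
    similarity = SequenceMatcher(None, completion, reference_tail).ratio()
    return similarity >= threshold
```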
arXiv Detail & Related papers (2023-08-16T16:48:57Z)
- LLMs as Factual Reasoners: Insights from Existing Benchmarks and Beyond [135.8013388183257]
We propose a new protocol for inconsistency detection benchmark creation and implement it in a 10-domain benchmark called SummEdits.
Most LLMs struggle on SummEdits, with performance close to random chance.
The best-performing model, GPT-4, is still 8% below estimated human performance.
arXiv Detail & Related papers (2023-05-23T21:50:06Z)
- Discover, Explanation, Improvement: An Automatic Slice Detection Framework for Natural Language Processing [72.14557106085284]
Slice detection models (SDMs) automatically identify underperforming groups of datapoints.
This paper proposes a benchmark named "Discover, Explain, Improve (DEIM)" for classification NLP tasks.
Our evaluation shows that Edisa can accurately select error-prone datapoints with informative semantic features.
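A bare-bones version of what a slice detection model automates: group evaluation examples by a feature and surface the groups where accuracy falls well below the overall rate. Edisa discovers such slices from semantic features; here the feature function and gap threshold are supplied by hand for illustration.

```python
# A hand-rolled slice check: report feature values whose per-slice accuracy
# trails the overall accuracy by more than `gap`. Feature function, minimum
# slice size, and gap are illustrative, not Edisa's learned machinery.
from collections import defaultdict

def find_weak_slices(examples, feature_fn, min_size=2, gap=0.1):
    slices, overall = defaultdict(list), []
    for ex in examples:
        slices[feature_fn(ex)].append(ex["correct"])
        overall.append(ex["correct"])
    base = sum(overall) / len(overall)
    return sorted(
        (feat, sum(v) / len(v), len(v))
        for feat, v in slices.items()
        if len(v) >= min_size and sum(v) / len(v) < base - gap
    )

examples = [{"length": "short", "correct": 1}, {"length": "short", "correct": 1},
            {"length": "long", "correct": 0}, {"length": "long", "correct": 0},
            {"length": "long", "correct": 1}]
print(find_weak_slices(examples, lambda ex: ex["length"]))  # flags "long"
```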
arXiv Detail & Related papers (2022-11-08T19:00:00Z)