LatestEval: Addressing Data Contamination in Language Model Evaluation
through Dynamic and Time-Sensitive Test Construction
- URL: http://arxiv.org/abs/2312.12343v3
- Date: Fri, 1 Mar 2024 15:17:21 GMT
- Title: LatestEval: Addressing Data Contamination in Language Model Evaluation
through Dynamic and Time-Sensitive Test Construction
- Authors: Yucheng Li, Frank Guerin, Chenghua Lin
- Abstract summary: LatestEval is an automatic method that leverages the most recent texts to create uncontaminated reading comprehension evaluations.
It avoids data contamination by only using texts published within a recent time window, ensuring no overlap with the training corpora of pre-trained language models.
Our experiments demonstrate that language models exhibit negligible memorisation behaviours on LatestEval as opposed to previous benchmarks.
- Score: 21.553915781660905
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Data contamination in evaluation is getting increasingly prevalent with the
emergence of language models pre-trained on super large, automatically crawled
corpora. This problem leads to significant challenges in the accurate
assessment of model capabilities and generalisations. In this paper, we propose
LatestEval, an automatic method that leverages the most recent texts to create
uncontaminated reading comprehension evaluations. LatestEval avoids data
contamination by only using texts published within a recent time window,
ensuring no overlap with the training corpora of pre-trained language models.
We develop the LatestEval automated pipeline to 1) gather the latest texts; 2)
identify key information; and 3) construct questions targeting the information
while removing the existing answers from the context. This encourages models to
infer the answers themselves based on the remaining context, rather than just
copy-paste. Our experiments demonstrate that language models exhibit negligible
memorisation behaviours on LatestEval as opposed to previous benchmarks,
suggesting a significantly reduced risk of data contamination and leading to a
more robust evaluation. Data and code are publicly available at:
https://github.com/liyucheng09/LatestEval.
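
The abstract's three-step pipeline can be pictured with a short Python sketch. The helper names, the toy source text, and the blank-out masking strategy below are assumptions made for illustration, not the authors' implementation; the real pipeline is in the linked repository.

```python
# Hypothetical sketch of a LatestEval-style pipeline; helper names, sources,
# and the masking strategy are assumptions, not the authors' code
# (see https://github.com/liyucheng09/LatestEval for the real implementation).
import re
from dataclasses import dataclass


@dataclass
class QAItem:
    context: str   # passage with the answer span blanked out
    question: str  # question targeting the removed information
    answer: str    # gold answer, kept out of the context


def gather_latest_texts() -> list[str]:
    """Step 1 (stand-in): collect texts published within a recent time window,
    e.g. newly created Wikipedia pages or this week's news articles."""
    return ["The Artemis II mission is scheduled to launch in 2025 with four astronauts."]


def identify_key_info(text: str) -> list[str]:
    """Step 2 (stand-in): pick answer-worthy spans; here, simply 4-digit years."""
    return re.findall(r"\b\d{4}\b", text)


def construct_items(text: str) -> list[QAItem]:
    """Step 3: ask about each key span and remove it from the context, so a
    model must infer the answer from the remaining text rather than copy-paste."""
    items = []
    for span in identify_key_info(text):
        context = text.replace(span, "____")
        question = "Which value has been removed from the passage (shown as '____')?"
        items.append(QAItem(context=context, question=question, answer=span))
    return items


if __name__ == "__main__":
    for text in gather_latest_texts():
        for item in construct_items(text):
            print(item)
```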
Related papers
- VarBench: Robust Language Model Benchmarking Through Dynamic Variable Perturbation [16.889939234103153]
We propose to variabilize benchmarks and evaluate language models dynamically.
Specifically, we extract variables from each test case and define a value range for each variable.
For each evaluation, we sample new values from these value ranges to create unique test cases, thus ensuring a fresh evaluation each time.
arXiv Detail & Related papers (2024-06-25T16:13:53Z)
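A minimal sketch of the variable-perturbation scheme the VarBench entry describes, assuming a test case is stored as a template with named variables and value ranges; the structure and names are illustrative, not the VarBench API.

```python
# Illustrative sketch of dynamic variable perturbation (not the VarBench code):
# each test case is a template whose variables are re-sampled per evaluation run.
import random

test_case = {
    "template": "A train travels {speed} km/h for {hours} hours. How far does it go?",
    "variables": {"speed": range(40, 121), "hours": range(1, 9)},
    "answer": lambda v: v["speed"] * v["hours"],
}


def sample_instance(case, seed=None):
    """Draw fresh variable values so every evaluation sees an unseen test case."""
    rng = random.Random(seed)
    values = {name: rng.choice(list(choices)) for name, choices in case["variables"].items()}
    return case["template"].format(**values), case["answer"](values)


prompt, gold = sample_instance(test_case, seed=0)
```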
- Rethinking Benchmark and Contamination for Language Models with Rephrased Samples [49.18977581962162]
Large language models are increasingly trained on all the data ever produced by humans.
Many have raised concerns about the trustworthiness of public benchmarks due to potential contamination in pre-training or fine-tuning datasets.
arXiv Detail & Related papers (2023-11-08T17:35:20Z)
- RegaVAE: A Retrieval-Augmented Gaussian Mixture Variational Auto-Encoder for Language Modeling [79.56442336234221]
We introduce RegaVAE, a retrieval-augmented language model built upon the variational auto-encoder (VAE).
It encodes the text corpus into a latent space, capturing current and future information from both source and target text.
Experimental results on various datasets demonstrate significant improvements in text generation quality and hallucination removal.
arXiv Detail & Related papers (2023-10-16T16:42:01Z)
- Error Norm Truncation: Robust Training in the Presence of Data Noise for Text Generation Models [39.37532848489779]
We propose Error Norm Truncation (ENT), a robust enhancement method to the standard training objective that truncates noisy data.
We show that ENT improves generation quality over standard training and previous soft and hard truncation methods.
arXiv Detail & Related papers (2023-10-02T01:30:27Z)
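A rough sketch of the error-norm truncation idea in the entry above, under the assumption that the "error norm" is the L2 distance between the predicted token distribution and the one-hot target, and that tokens above a fixed threshold are simply dropped from the loss; both choices are illustrative, not the paper's exact formulation.

```python
# Illustrative sketch of truncating noisy tokens by error norm (assumptions:
# L2 norm between predicted distribution and one-hot target, fixed threshold).
import torch
import torch.nn.functional as F


def ent_loss(logits: torch.Tensor, targets: torch.Tensor, threshold: float = 1.2) -> torch.Tensor:
    """Cross-entropy that skips tokens whose error norm exceeds `threshold`.

    logits:  (batch, seq, vocab) raw model outputs
    targets: (batch, seq) gold token ids
    """
    probs = logits.softmax(dim=-1)
    one_hot = F.one_hot(targets, num_classes=logits.size(-1)).float()
    error_norm = (probs - one_hot).norm(p=2, dim=-1)   # (batch, seq)
    keep = (error_norm <= threshold).float()            # 1 = trusted token
    token_loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)), targets.reshape(-1), reduction="none"
    ).reshape_as(keep)
    return (token_loss * keep).sum() / keep.sum().clamp(min=1)
```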
- Bring Your Own Data! Self-Supervised Evaluation for Large Language Models [52.15056231665816]
We propose a framework for self-supervised evaluation of Large Language Models (LLMs).
We demonstrate self-supervised evaluation strategies for measuring closed-book knowledge, toxicity, and long-range context dependence.
We find strong correlations between self-supervised and human-supervised evaluations.
arXiv Detail & Related papers (2023-06-23T17:59:09Z)
- Ensemble Transfer Learning for Multilingual Coreference Resolution [60.409789753164944]
A problem that frequently occurs when working with a non-English language is the scarcity of annotated training data.
We design a simple but effective ensemble-based framework that combines various transfer learning techniques.
We also propose a low-cost TL method that bootstraps coreference resolution models by utilizing Wikipedia anchor texts.
arXiv Detail & Related papers (2023-01-22T18:22:55Z)
- mFACE: Multilingual Summarization with Factual Consistency Evaluation [79.60172087719356]
Abstractive summarization has enjoyed renewed interest in recent years, thanks to pre-trained language models and the availability of large-scale datasets.
Despite promising results, current models still suffer from generating factually inconsistent summaries.
We leverage factual consistency evaluation models to improve multilingual summarization.
arXiv Detail & Related papers (2022-12-20T19:52:41Z)
- On the Blind Spots of Model-Based Evaluation Metrics for Text Generation [79.01422521024834]
We explore a useful but often neglected methodology for robustness analysis of text generation evaluation metrics.
We design and synthesize a wide range of potential errors and check whether they result in a commensurate drop in the metric scores.
Our experiments reveal interesting insensitivities, biases, or even loopholes in existing metrics.
arXiv Detail & Related papers (2022-12-20T06:24:25Z)
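A minimal sketch of the stress-test procedure the blind-spots entry describes, assuming a generic `metric(candidate, reference) -> score` interface and a toy negation error; both are assumptions for illustration. The closing usage example shows how an insensitive word-overlap metric fails to penalise the corrupted text.

```python
# Illustrative sketch of probing a text-generation metric with a synthetic error
# (the metric interface and the perturbation are assumptions, not the paper's code).
from typing import Callable

Metric = Callable[[str, str], float]  # metric(candidate, reference) -> score


def negate(sentence: str) -> str:
    """Toy synthetic error: flip the meaning by inserting a negation."""
    return sentence.replace(" is ", " is not ", 1)


def sensitivity(metric: Metric, reference: str) -> float:
    """A robust metric should score the corrupted candidate clearly lower than
    the clean one; a small drop suggests a blind spot."""
    return metric(reference, reference) - metric(negate(reference), reference)


if __name__ == "__main__":
    # Word-overlap "metric": ignores the inserted negation, so the drop is 0.0.
    overlap = lambda cand, ref: len(set(cand.split()) & set(ref.split())) / len(set(ref.split()))
    print(sensitivity(overlap, "the model is accurate on held-out data"))
```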
- Mitigating harm in language models with conditional-likelihood filtration [4.002298833349518]
We present a methodology for identifying harmful views from webscale unfiltered datasets.
We demonstrate that models trained on this filtered dataset exhibit lower propensity to generate harmful text.
We also discuss how trigger phrases which assert specific values can be used by researchers to build language models which are more closely aligned with their values.
arXiv Detail & Related papers (2021-08-04T22:18:10Z)
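
A rough sketch of conditional-likelihood filtration as described in the last entry, assuming documents are scored by how likely a language model finds harmful trigger phrases when conditioned on them, and discarded above a threshold; the scoring interface and the threshold are assumptions, not the paper's exact procedure.

```python
# Illustrative sketch of conditional-likelihood filtration. The caller supplies
# `log_likelihood(text, condition)`, assumed to return log p_LM(text | condition).
def filter_corpus(documents, trigger_phrases, log_likelihood, threshold=-2.0):
    """Keep only documents under which no harmful trigger phrase becomes too likely.

    A document that makes a harmful phrase highly probable is treated as
    promoting that view and is dropped from the training corpus.
    """
    kept = []
    for doc in documents:
        scores = [log_likelihood(phrase, condition=doc) for phrase in trigger_phrases]
        if max(scores) < threshold:
            kept.append(doc)
    return kept
```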