Benchmarking Benchmark Leakage in Large Language Models
- URL: http://arxiv.org/abs/2404.18824v1
- Date: Mon, 29 Apr 2024 16:05:36 GMT
- Title: Benchmarking Benchmark Leakage in Large Language Models
- Authors: Ruijie Xu, Zengzhi Wang, Run-Ze Fan, Pengfei Liu
- Abstract summary: We introduce a detection pipeline utilizing Perplexity and N-gram accuracy, two simple and scalable metrics that gauge a model's prediction precision on a benchmark.
We reveal substantial instances of training-set use and even test-set misuse, resulting in potentially unfair comparisons.
We propose the "Benchmark Transparency Card" to encourage clear documentation of benchmark utilization.
- Score: 24.015208839742343
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Amid the expanding use of pre-training data, the phenomenon of benchmark dataset leakage has become increasingly prominent, exacerbated by opaque training processes and the often undisclosed inclusion of supervised data in contemporary Large Language Models (LLMs). This issue skews benchmark effectiveness and fosters potentially unfair comparisons, impeding the field's healthy development. To address this, we introduce a detection pipeline utilizing Perplexity and N-gram accuracy, two simple and scalable metrics that gauge a model's prediction precision on a benchmark, to identify potential data leakages. By analyzing 31 LLMs in the context of mathematical reasoning, we reveal substantial instances of training-set use and even test-set misuse, resulting in potentially unfair comparisons. These findings prompt us to offer several recommendations regarding model documentation, benchmark setup, and future evaluations. Notably, we propose the "Benchmark Transparency Card" to encourage clear documentation of benchmark utilization, promoting transparency and the healthy development of LLMs. We have made our leaderboard, pipeline implementation, and model predictions publicly available, fostering future research.
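The pipeline's two signals are simple enough to approximate in a few lines. The sketch below is a minimal illustration under stated assumptions, not the authors' released implementation: it assumes a Hugging Face causal LM, and the model name, n-gram length, stride, and sample text are placeholders.

```python
# Minimal sketch of the two leakage signals: perplexity and n-gram accuracy.
# Assumptions: any Hugging Face causal LM; "gpt2" stands in for the model
# under inspection, and n/stride values are illustrative, not the paper's settings.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder for the LLM being audited
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

@torch.no_grad()
def perplexity(text: str) -> float:
    """Token-level perplexity of `text`; unusually low values on benchmark
    samples suggest the model has seen them during training."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    loss = model(ids, labels=ids).loss  # mean cross-entropy over tokens
    return float(torch.exp(loss))

@torch.no_grad()
def ngram_accuracy(text: str, n: int = 5, stride: int = 10) -> float:
    """Fraction of n-grams reproduced exactly when greedily decoding n tokens
    from prefixes taken every `stride` tokens."""
    ids = tokenizer(text, return_tensors="pt").input_ids[0]
    hits, total = 0, 0
    for start in range(stride, len(ids) - n, stride):
        out = model.generate(ids[:start].unsqueeze(0), max_new_tokens=n,
                             do_sample=False,
                             pad_token_id=tokenizer.eos_token_id)
        hits += int(torch.equal(out[0, start:start + n], ids[start:start + n]))
        total += 1
    return hits / max(total, 1)

sample = "Natalia sold clips to 48 of her friends in April, ..."  # benchmark item
print(perplexity(sample), ngram_accuracy(sample))
```

Roughly speaking, a model whose perplexity on test items is markedly lower, or whose n-gram accuracy is markedly higher, than on comparable held-out or rewritten text is flagged as a potential leakage case.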
Related papers
- Benchmark Inflation: Revealing LLM Performance Gaps Using Retro-Holdouts [0.6282171844772422]
Training data for many Large Language Models (LLMs) is contaminated with test data.
Public benchmark scores do not always accurately assess model properties.
arXiv Detail & Related papers (2024-10-11T20:46:56Z) - PaCoST: Paired Confidence Significance Testing for Benchmark Contamination Detection in Large Language Models [41.772263447213234]
Large language models (LLMs) are known to be trained on vast amounts of data, which may unintentionally or intentionally include data from commonly used benchmarks.
This inclusion can lead to artificially inflated scores on model leaderboards, yet disappointing performance in real-world applications.
We introduce PaCoST, a Paired Confidence Significance Testing method, to effectively detect benchmark contamination in LLMs (a minimal sketch of the paired-testing idea appears after this list).
arXiv Detail & Related papers (2024-06-26T13:12:40Z) - Rethinking Benchmark and Contamination for Language Models with Rephrased Samples [49.18977581962162]
Large language models are increasingly trained on all the data ever produced by humans.
Many have raised concerns about the trustworthiness of public benchmarks due to potential contamination in pre-training or fine-tuning datasets.
arXiv Detail & Related papers (2023-11-08T17:35:20Z) - Don't Make Your LLM an Evaluation Benchmark Cheater [142.24553056600627]
Large language models (LLMs) have greatly advanced the frontiers of artificial intelligence, attaining remarkable improvements in model capacity.
To assess the model performance, a typical approach is to construct evaluation benchmarks for measuring the ability level of LLMs.
We discuss the potential risk and impact of inappropriately using evaluation benchmarks and misleadingly interpreting the evaluation results.
arXiv Detail & Related papers (2023-11-03T14:59:54Z) - Data Contamination Through the Lens of Time [21.933771085956426]
Claims about large language models (LLMs) are often supported by evaluations on publicly available benchmarks.
This practice raises concerns of data contamination, i.e., evaluating on examples that are explicitly or implicitly included in the training data.
We conduct the first thorough longitudinal analysis of data contamination in LLMs by using the natural experiment of training cutoffs in GPT models.
arXiv Detail & Related papers (2023-10-16T17:51:29Z) - Evaluating and Explaining Large Language Models for Code Using Syntactic Structures [74.93762031957883]
This paper introduces ASTxplainer, an explainability method specific to Large Language Models for code.
At its core, ASTxplainer provides an automated method for aligning token predictions with AST nodes.
We perform an empirical evaluation on 12 popular LLMs for code using a curated dataset of the most popular GitHub projects.
arXiv Detail & Related papers (2023-08-07T18:50:57Z) - Temporal Knowledge Graph Forecasting Without Knowledge Using In-Context Learning [23.971206470486468]
We present a framework that converts relevant historical facts into prompts and generates ranked predictions using token probabilities.
Surprisingly, we observe that LLMs, out-of-the-box, perform on par with state-of-the-art TKG models.
We also discover that using numerical indices instead of entity/relation names does not significantly affect performance.
arXiv Detail & Related papers (2023-05-17T23:50:28Z) - Can LMs Generalize to Future Data? An Empirical Analysis on Text Summarization [50.20034493626049]
Recent pre-trained language models (PLMs) achieve promising results in existing abstractive summarization datasets.
Existing summarization benchmarks overlap in time with the standard pre-training corpora and finetuning datasets.
We show that parametric knowledge stored in summarization models significantly affects the faithfulness of the generated summaries on future data.
arXiv Detail & Related papers (2023-05-03T08:08:07Z) - Exploring validation metrics for offline model-based optimisation with diffusion models [50.404829846182764]
In model-based optimisation (MBO) we are interested in using machine learning to design candidates that maximise some measure of reward with respect to a black box function called the (ground truth) oracle.
While an approximation to the ground-truth oracle can be trained and used in its place during model validation to measure the mean reward over generated candidates, the evaluation is approximate and vulnerable to adversarial examples.
This is encapsulated under our proposed evaluation framework which is also designed to measure extrapolation.
arXiv Detail & Related papers (2022-11-19T16:57:37Z) - SynBench: Task-Agnostic Benchmarking of Pretrained Representations using Synthetic Data [78.21197488065177]
Recent success in fine-tuning large models that are pretrained on broad data at scale on downstream tasks has led to a significant paradigm shift in deep learning.
This paper proposes a new task-agnostic framework, SynBench, to measure the quality of pretrained representations using synthetic data.
arXiv Detail & Related papers (2022-10-06T15:25:00Z)
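As noted in the PaCoST entry above, the paired-testing idea can be sketched briefly. The snippet below is a hypothetical illustration, not the PaCoST implementation: the confidence scores are made up, and a paired t-test stands in for whatever statistical test the method actually uses.

```python
# Hypothetical sketch of paired confidence testing for contamination:
# compare per-item confidence on the original benchmark wording against
# meaning-preserving paraphrases; systematically higher confidence on the
# original wording hints at memorisation of the benchmark.
from scipy.stats import ttest_rel

def contamination_test(conf_original, conf_paraphrased, alpha=0.05):
    """conf_* are per-item confidence scores (e.g. mean log-probability of the
    gold answer) for the same items in original and paraphrased form."""
    stat, p = ttest_rel(conf_original, conf_paraphrased, alternative="greater")
    return {"t": float(stat), "p": float(p), "flagged": p < alpha}

# toy example with fabricated scores
print(contamination_test([-1.2, -0.9, -1.0, -1.1], [-2.3, -2.0, -1.9, -2.4]))
```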