Benchmark Data Contamination of Large Language Models: A Survey
- URL: http://arxiv.org/abs/2406.04244v1
- Date: Thu, 6 Jun 2024 16:41:39 GMT
- Title: Benchmark Data Contamination of Large Language Models: A Survey
- Authors: Cheng Xu, Shuhao Guan, Derek Greene, M-Tahar Kechadi
- Abstract summary: This paper reviews the complex challenge of Benchmark Data Contamination (BDC) in Large Language Model (LLM) evaluation.
It explores alternative assessment methods to mitigate the risks associated with traditional benchmarks.
- Score: 5.806534973464769
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The rapid development of Large Language Models (LLMs) like GPT-4, Claude-3, and Gemini has transformed the field of natural language processing. However, it has also resulted in a significant issue known as Benchmark Data Contamination (BDC). This occurs when language models inadvertently incorporate evaluation benchmark information from their training data, leading to inaccurate or unreliable performance measurements during evaluation. This paper reviews the complex challenge of BDC in LLM evaluation and explores alternative assessment methods to mitigate the risks associated with traditional benchmarks. The paper also examines challenges and future directions in mitigating BDC risks, highlighting the complexity of the issue and the need for innovative solutions to ensure the reliability of LLM evaluation in real-world applications.
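A common heuristic for flagging this kind of contamination (one of several approaches discussed in the literature, not a method prescribed by this survey) is to check for verbatim n-gram overlap between benchmark items and the training corpus. The sketch below is a minimal, self-contained illustration; the toy corpus, the example item, and the 8-gram window are assumptions made purely for demonstration.

```python
# Minimal sketch of an n-gram overlap contamination check (a common heuristic,
# not the survey's own method). A benchmark item is flagged as possibly
# contaminated if it shares any n-gram with a training document.

def ngrams(tokens: list[str], n: int) -> set[tuple[str, ...]]:
    """Return the set of consecutive n-grams in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(benchmark_item: str, training_docs: list[str], n: int = 8) -> bool:
    """True if any n-gram of the benchmark item appears in the training data."""
    item_grams = ngrams(benchmark_item.lower().split(), n)
    return any(item_grams & ngrams(doc.lower().split(), n) for doc in training_docs)

# Toy example: the benchmark question was copied verbatim into a scraped document.
training_docs = ["q: what is the capital of france a: paris said a scraped forum post"]
benchmark_item = "Q: What is the capital of France A: Paris"
print(is_contaminated(benchmark_item, training_docs))  # True -> likely contaminated
```

Real detection pipelines run at corpus scale with hashing or suffix arrays; the point here is only the overlap test itself.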
Related papers
- Lessons from the Trenches on Reproducible Evaluation of Language Models [60.522749986793094]
We draw on three years of experience in evaluating large language models to provide guidance and lessons for researchers.
We present the Language Model Evaluation Harness (lm-eval), an open source library for independent, reproducible, and extensible evaluation of language models.
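For readers unfamiliar with the harness, a rough illustration of how it is typically driven from Python follows; the entry point and argument names reflect the project's README at the time of writing and may differ across versions, so treat this as an assumption rather than a stable API reference.

```python
# Hedged sketch: evaluating a small Hugging Face model on one task with lm-eval.
# Requires `pip install lm-eval`; argument names follow the project README and
# may change between releases.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                     # Hugging Face backend
    model_args="pretrained=EleutherAI/pythia-160m", # any HF causal LM checkpoint
    tasks=["hellaswag"],
    batch_size=8,
)
print(results["results"]["hellaswag"])              # per-task metric dictionary
```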
arXiv Detail & Related papers (2024-05-23T16:50:49Z)
- GRAMMAR: Grounded and Modular Methodology for Assessment of Closed-Domain Retrieval-Augmented Language Model [6.106667677504318]
Retrieval-augmented Generation (RAG) systems have been actively studied and deployed across various industries to query domain-specific knowledge bases.
However, evaluating these systems presents unique challenges due to the scarcity of domain-specific queries and corresponding ground truths.
We introduce GRAMMAR, an evaluation framework comprising two key elements: 1) a data generation process that leverages relational databases and LLMs to efficiently produce scalable query-answer pairs for evaluation; and 2) an evaluation framework that differentiates knowledge gaps from robustness and enables the identification of defective modules.
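To make the first element concrete, here is a minimal sketch of the general idea of grounding query-answer pairs in a relational database: answers come from the tables, question text comes from templates, and an LLM paraphrasing step (stubbed out below) can add surface variety. The schema, table, template, and `paraphrase` helper are hypothetical examples, not GRAMMAR's actual implementation.

```python
# Hedged sketch: generating query-answer pairs from a relational database.
# The schema and template are illustrative; the LLM paraphrase call is a stub.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (name TEXT, category TEXT, price REAL)")
conn.executemany(
    "INSERT INTO products VALUES (?, ?, ?)",
    [("WidgetPro", "tools", 19.99), ("GizmoLite", "toys", 4.50)],
)

def paraphrase(question: str) -> str:
    # Placeholder for an LLM call that rewrites the templated question.
    return question

qa_pairs = []
for name, price in conn.execute("SELECT name, price FROM products"):
    question = paraphrase(f"What is the price of {name}?")
    qa_pairs.append({"question": question, "answer": str(price)})  # ground truth from the DB

print(qa_pairs)
```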
arXiv Detail & Related papers (2024-04-30T03:29:30Z)
- Evaluating Generative Language Models in Information Extraction as Subjective Question Correction [49.729908337372436]
Inspired by the principles of subjective question correction, we propose a new evaluation method, SQC-Score.
Results on three information extraction tasks show that SQC-Score is preferred by human annotators over the baseline metrics.
arXiv Detail & Related papers (2024-04-04T15:36:53Z)
- KIEval: A Knowledge-grounded Interactive Evaluation Framework for Large Language Models [53.84677081899392]
KIEval is a Knowledge-grounded Interactive Evaluation framework for large language models.
It incorporates an LLM-powered "interactor" role for the first time to enable dynamic, contamination-resilient evaluation.
Extensive experiments on seven leading LLMs across five datasets validate KIEval's effectiveness and generalization.
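The contamination resilience comes from the fact that follow-up questions depend on the model's own previous answers, so a memorized benchmark answer is not enough on its own. The loop below is a schematic sketch of that idea only: the candidate model, interactor, and judge are stubs, and KIEval's actual prompts, scoring rubric, and implementation are not reproduced here.

```python
# Hedged sketch of knowledge-grounded interactive evaluation: an "interactor"
# poses follow-up questions conditioned on the candidate's previous answer.
# All three model roles are stubs, not KIEval's real components.

def candidate_model(prompt: str) -> str:
    return f"(candidate answer to: {prompt})"        # stub for the model under test

def interactor_followup(question: str, answer: str) -> str:
    return f"Can you justify that, given your claim {answer!r}?"  # stub LLM interactor

def judge(dialogue: list[tuple[str, str]]) -> float:
    return 1.0 if all(answer for _, answer in dialogue) else 0.0  # stub scorer

def interactive_eval(seed_question: str, turns: int = 3) -> float:
    dialogue, question = [], seed_question
    for _ in range(turns):
        answer = candidate_model(question)
        dialogue.append((question, answer))
        question = interactor_followup(question, answer)  # next turn depends on the answer
    return judge(dialogue)

print(interactive_eval("What causes seasons on Earth?"))
```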
arXiv Detail & Related papers (2024-02-23T01:30:39Z)
- Don't Make Your LLM an Evaluation Benchmark Cheater [142.24553056600627]
Large language models (LLMs) have greatly advanced the frontiers of artificial intelligence, achieving remarkable improvements in model capacity.
To assess the model performance, a typical approach is to construct evaluation benchmarks for measuring the ability level of LLMs.
We discuss the potential risk and impact of inappropriately using evaluation benchmarks and misleadingly interpreting the evaluation results.
arXiv Detail & Related papers (2023-11-03T14:59:54Z)
- Revisit Input Perturbation Problems for LLMs: A Unified Robustness Evaluation Framework for Noisy Slot Filling Task [18.623619585980688]
We propose a unified robustness evaluation framework based on the slot-filling task to evaluate the dialogue understanding capability of large language models.
Specifically, we construct an input perturbation evaluation dataset, Noise-LLM, which contains five types of single perturbation and four types of mixed perturbation data.
Our aim is to assess how well various robustness methods of LLMs perform in real-world noisy scenarios.
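As a rough illustration of what such perturbations look like in practice, the sketch below applies two simple noise operations to a slot-filling style utterance and composes them into a mixed perturbation. The two operations are illustrative assumptions; they are not the exact five single and four mixed categories defined in Noise-LLM.

```python
# Hedged sketch of input perturbation for robustness testing (illustrative
# noise types, not the Noise-LLM taxonomy).
import random

random.seed(0)

def char_typo(text: str, rate: float = 0.1) -> str:
    """Randomly swap adjacent characters to simulate typing noise."""
    chars = list(text)
    for i in range(len(chars) - 1):
        if chars[i].isalpha() and random.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def word_insertion(text: str, filler: str = "uh") -> str:
    """Insert a filler word to simulate spoken-language disfluency."""
    words = text.split()
    words.insert(random.randrange(len(words) + 1), filler)
    return " ".join(words)

utterance = "book a flight from Dublin to Paris on Friday"   # slot-filling style input
print(char_typo(utterance))                  # single perturbation
print(word_insertion(utterance))             # single perturbation
print(word_insertion(char_typo(utterance)))  # mixed perturbation
```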
arXiv Detail & Related papers (2023-10-10T10:22:05Z)
- Benchmarking Large Language Models in Retrieval-Augmented Generation [53.504471079548]
We systematically investigate the impact of Retrieval-Augmented Generation on large language models.
We analyze the performance of different large language models in 4 fundamental abilities required for RAG.
We establish Retrieval-Augmented Generation Benchmark (RGB), a new corpus for RAG evaluation in both English and Chinese.
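One of the abilities such a benchmark can probe is robustness to noisy retrieval: the model is given a mix of relevant and irrelevant passages and should still answer correctly. The sketch below shows only that test pattern with a stubbed model; RGB's actual data, prompts, and metrics are not reproduced.

```python
# Hedged sketch of a RAG noise-robustness check: pollute the retrieved context
# with distractor passages and verify the answer still matches the gold label.
# The model call is a stub, not a real RAG pipeline.

def rag_model(question: str, passages: list[str]) -> str:
    return "Paris"  # stub for an LLM answering from the provided context

def noise_robustness(question: str, relevant: str, distractors: list[str], gold: str) -> bool:
    passages = [relevant] + distractors      # context polluted with noise documents
    answer = rag_model(question, passages)
    return gold.lower() in answer.lower()    # simple containment match

ok = noise_robustness(
    question="What is the capital of France?",
    relevant="Paris is the capital and largest city of France.",
    distractors=["Berlin is the capital of Germany.", "Madrid hosts the Prado museum."],
    gold="Paris",
)
print(ok)  # True if the stubbed model resists the distractors
```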
arXiv Detail & Related papers (2023-09-04T08:28:44Z)
- Information Association for Language Model Updating by Mitigating LM-Logical Discrepancy [68.31760483418901]
Large Language Models (LLMs) struggle to provide current information due to their outdated pre-training data.
Existing methods for updating LLMs, such as knowledge editing and continual fine-tuning, have significant drawbacks in the generalizability of new information.
We identify the core challenge behind these drawbacks: the LM-logical discrepancy, i.e., the difference between language modeling probabilities and logical probabilities.
arXiv Detail & Related papers (2023-05-29T19:48:37Z)
- Evaluation of Synthetic Datasets for Conversational Recommender Systems [0.0]
The absence of robust evaluation frameworks has been a long-standing problem.
Since the quality of training data is critical for downstream applications, it is important to develop metrics that evaluate the quality holistically.
In this paper, we present a framework that takes a multi-faceted approach towards evaluating datasets produced by generative models.
arXiv Detail & Related papers (2022-12-12T18:53:10Z)