AdEval: Alignment-based Dynamic Evaluation to Mitigate Data Contamination in Large Language Models
- URL: http://arxiv.org/abs/2501.13983v1
- Date: Thu, 23 Jan 2025 06:57:24 GMT
- Title: AdEval: Alignment-based Dynamic Evaluation to Mitigate Data Contamination in Large Language Models
- Authors: Yang Fan
- Abstract summary: AdEval is a dynamic data evaluation method aimed at mitigating the impact of data contamination on evaluation reliability. It extracts key knowledge points and main ideas to align dynamically generated questions with static data's core concepts. It also leverages online search to provide detailed explanations of related knowledge points, thereby creating high-quality evaluation samples.
- Score: 2.463617251923349
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As Large Language Models (LLMs) are pretrained on massive-scale corpora, the issue of data contamination has become increasingly severe, leading to potential overestimation of model performance during evaluation. To address this, we propose AdEval (Alignment-based Dynamic Evaluation), a dynamic data evaluation method aimed at mitigating the impact of data contamination on evaluation reliability. AdEval extracts key knowledge points and main ideas to align dynamically generated questions with static data's core concepts. It also leverages online search to provide detailed explanations of related knowledge points, thereby creating high-quality evaluation samples with robust knowledge support. Furthermore, AdEval incorporates mechanisms to control the number and complexity of questions, enabling dynamic alignment and flexible adjustment. This ensures that the generated questions align with the complexity of static data while supporting varied complexity levels. Based on Bloom's taxonomy, AdEval conducts a multi-dimensional evaluation of LLMs across six cognitive levels: remembering, understanding, applying, analyzing, evaluating, and creating. Experimental results on multiple datasets demonstrate that AdEval effectively reduces the impact of data contamination on evaluation outcomes, enhancing both the fairness and reliability of the evaluation process.
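A minimal sketch of the pipeline the abstract describes (extract key knowledge points, retrieve supporting explanations via search, then generate aligned questions across Bloom's six cognitive levels). The function names, prompts, and the `llm` and `search` callables are illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical sketch of an AdEval-style dynamic evaluation loop, based only on the
# abstract above; prompts and the `llm`/`search` callables are assumptions.
from dataclasses import dataclass

BLOOM_LEVELS = ["remembering", "understanding", "applying",
                "analyzing", "evaluating", "creating"]

@dataclass
class EvalSample:
    knowledge_point: str
    question: str
    level: str       # one of BLOOM_LEVELS
    reference: str   # supporting explanation retrieved via online search

def build_dynamic_samples(static_item: str, llm, search, n_points: int = 3):
    """Generate contamination-resistant samples aligned with a static benchmark item."""
    # 1) Extract key knowledge points / main ideas from the static item.
    points = llm(f"List the key knowledge points in:\n{static_item}").splitlines()

    samples = []
    for point in points[:n_points]:
        # 2) Retrieve a detailed explanation to ground the generated questions.
        reference = search(point)
        # 3) Generate fresh questions aligned with the point at each Bloom level.
        for level in BLOOM_LEVELS:
            question = llm(f"Write a '{level}'-level question about: {point}\n"
                           f"Background: {reference}")
            samples.append(EvalSample(point, question, level, reference))
    return samples
```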
Related papers
- Efficient Multi-Agent System Training with Data Influence-Oriented Tree Search [59.75749613951193]
We propose Data Influence-oriented Tree Search (DITS) to guide both tree search and data selection.
By leveraging influence scores, we effectively identify the most impactful data for system improvement.
We derive influence score estimation methods tailored for non-differentiable metrics.
arXiv Detail & Related papers (2025-02-02T23:20:16Z)
- How Contaminated Is Your Benchmark? Quantifying Dataset Leakage in Large Language Models with Kernel Divergence [23.019102917957152]
Overlap between benchmark datasets and pre-training corpora inflates performance metrics and undermines the reliability of model evaluations.
We propose Kernel Divergence Score (KDS), a novel method that quantifies dataset contamination by computing the divergence between kernel similarity matrices of sample embeddings.
KDS demonstrates a near-perfect correlation with contamination levels and outperforms existing baselines.
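An illustrative sketch of a kernel-divergence contamination score in the spirit of the summary above. The choice of RBF kernel, the row-normalization, the symmetrized KL divergence, and the reading that the two embedding sets come from the same samples under two model states (e.g., before vs. after fine-tuning) are all assumptions; the paper's exact formulation may differ.

```python
# Sketch only: quantify contamination as a divergence between kernel similarity
# matrices built from sample embeddings under two model states (an assumption).
import numpy as np

def rbf_kernel(emb: np.ndarray, gamma: float = 1.0) -> np.ndarray:
    """Pairwise RBF kernel similarity matrix for a set of sample embeddings."""
    sq = np.sum(emb**2, 1)[:, None] + np.sum(emb**2, 1)[None, :] - 2 * emb @ emb.T
    return np.exp(-gamma * sq)

def kernel_divergence(emb_a: np.ndarray, emb_b: np.ndarray, eps: float = 1e-9) -> float:
    """Symmetrized KL divergence between row-normalized kernel matrices."""
    Ka, Kb = rbf_kernel(emb_a), rbf_kernel(emb_b)
    Pa = Ka / (Ka.sum(axis=1, keepdims=True) + eps)
    Pb = Kb / (Kb.sum(axis=1, keepdims=True) + eps)
    kl_ab = np.sum(Pa * np.log((Pa + eps) / (Pb + eps)))
    kl_ba = np.sum(Pb * np.log((Pb + eps) / (Pa + eps)))
    return 0.5 * (kl_ab + kl_ba)   # larger score = stronger divergence signal
```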
arXiv Detail & Related papers (2025-02-02T05:50:39Z)
- Testing and Improving the Robustness of Amortized Bayesian Inference for Cognitive Models [0.5223954072121659]
Contaminant observations and outliers often cause problems when estimating the parameters of cognitive models.
In this study, we test and improve the robustness of parameter estimation using amortized Bayesian inference.
The proposed method is straightforward and practical to implement and has a broad applicability in fields where outlier detection or removal is challenging.
arXiv Detail & Related papers (2024-12-29T21:22:24Z)
- Integrating Dynamic Correlation Shifts and Weighted Benchmarking in Extreme Value Analysis [1.8641315013048299]
This paper presents an innovative approach to Extreme Value Analysis (EVA) by introducing the Extreme Value Dynamic Benchmarking Method (EVDBM).
EVDBM integrates extreme value theory to detect extreme events and is coupled with the novel Dynamic Identification of Significant Correlation (DISC)-Thresholding algorithm.
The flexibility of EVDBM suggests its potential for broader applications in other sectors where decision-making sensitivity is crucial.
arXiv Detail & Related papers (2024-11-19T21:00:39Z)
- Unveiling the Achilles' Heel of NLG Evaluators: A Unified Adversarial Framework Driven by Large Language Models [52.368110271614285]
We introduce AdvEval, a novel black-box adversarial framework against NLG evaluators.
AdvEval is specially tailored to generate data that yield strong disagreements between human and victim evaluators.
We conduct experiments on 12 victim evaluators and 11 NLG datasets, spanning tasks including dialogue, summarization, and question evaluation.
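A rough sketch of the disagreement-maximizing black-box loop that the AdvEval summary suggests: keep the candidate that most widens the gap between a victim evaluator's score and a stronger judge used as a proxy for human judgment. The callables, the paraphrase prompt, and the scoring gap criterion are assumptions for illustration, not the authors' implementation.

```python
# Illustrative black-box adversarial search against an NLG evaluator (an assumption-
# laden sketch, not AdvEval's actual algorithm).
def adversarial_search(text: str, llm, victim_score, judge_score, n_rounds: int = 5) -> str:
    best, best_gap = text, abs(victim_score(text) - judge_score(text))
    for _ in range(n_rounds):
        candidate = llm(f"Paraphrase the following text, preserving its meaning:\n{best}")
        gap = abs(victim_score(candidate) - judge_score(candidate))
        if gap > best_gap:   # larger gap = stronger evaluator disagreement
            best, best_gap = candidate, gap
    return best
```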
arXiv Detail & Related papers (2024-05-23T14:48:15Z)
- FreeEval: A Modular Framework for Trustworthy and Efficient Evaluation of Large Language Models [36.273451767886726]
FreeEval is a modular and scalable framework crafted to enable trustworthy and efficient automatic evaluations of large language models.
FreeEval's unified abstractions simplify the integration and improve the transparency of diverse evaluation methodologies.
The framework integrates meta-evaluation techniques like human evaluation and data contamination detection, which, along with dynamic evaluation modules, enhance the fairness of the evaluation outcomes.
arXiv Detail & Related papers (2024-04-09T04:17:51Z)
- KIEval: A Knowledge-grounded Interactive Evaluation Framework for Large Language Models [53.84677081899392]
KIEval is a Knowledge-grounded Interactive Evaluation framework for large language models.
It incorporates an LLM-powered "interactor" role for the first time to accomplish a dynamic contamination-resilient evaluation.
Extensive experiments on seven leading LLMs across five datasets validate KIEval's effectiveness and generalization.
arXiv Detail & Related papers (2024-02-23T01:30:39Z)
- Automating Dataset Updates Towards Reliable and Timely Evaluation of Large Language Models [81.27391252152199]
Large language models (LLMs) have achieved impressive performance across various natural language benchmarks.
We propose to automate dataset updating and provide systematic analysis regarding its effectiveness.
There are two updating strategies: 1) a mimicking strategy that generates similar samples based on the original data, and 2) an extending strategy that further expands existing samples.
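A hedged sketch of the two updating strategies summarized above, expressed as prompt-based generators. The prompts and the `llm` callable are illustrative assumptions, not the authors' implementation.

```python
# Sketch of the two dataset-updating strategies (prompts are assumptions).
def mimic_sample(original: str, llm) -> str:
    """Mimicking strategy: generate a new sample similar in style and difficulty."""
    return llm("Rewrite the following benchmark item as a new item that tests the same "
               f"skill at the same difficulty, with different surface content:\n{original}")

def extend_sample(original: str, llm) -> str:
    """Extending strategy: expand the original sample into a broader or harder variant."""
    return llm("Extend the following benchmark item with additional constraints or "
               f"follow-up reasoning steps so it probes deeper understanding:\n{original}")
```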
arXiv Detail & Related papers (2024-02-19T07:15:59Z)
- From Static Benchmarks to Adaptive Testing: Psychometrics in AI Evaluation [60.14902811624433]
We discuss a paradigm shift from static evaluation methods to adaptive testing.
This involves estimating the characteristics and value of each test item in the benchmark and dynamically adjusting items in real-time.
We analyze the current approaches, advantages, and underlying reasons for adopting psychometrics in AI evaluation.
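A minimal sketch of the adaptive-testing idea described above, using a standard 2PL item response theory model to pick the next benchmark item that is most informative about a model's current ability estimate. This is a generic illustration of the psychometric paradigm, not any specific framework's code; the parameter names `a` (discrimination) and `b` (difficulty) follow textbook IRT notation.

```python
# Generic 2PL IRT item selection for adaptive evaluation (illustrative sketch).
import numpy as np

def p_correct(theta: float, a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """2PL probability that a model of ability `theta` answers each item correctly."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def item_information(theta: float, a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Fisher information of each item at the current ability estimate."""
    p = p_correct(theta, a, b)
    return a**2 * p * (1 - p)

def next_item(theta: float, a: np.ndarray, b: np.ndarray, asked: set) -> int:
    """Pick the unasked item that is most informative about the model's ability."""
    info = item_information(theta, a, b)
    info[list(asked)] = -np.inf
    return int(np.argmax(info))
```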
arXiv Detail & Related papers (2023-06-18T09:54:33Z)
- GMValuator: Similarity-based Data Valuation for Generative Models [41.76259565672285]
We introduce Generative Model Valuator (GMValuator), the first training-free and model-agnostic approach to providing data valuation for image generation tasks.
GMValuator is extensively evaluated on benchmark and high-resolution datasets to demonstrate its effectiveness.
arXiv Detail & Related papers (2023-04-21T02:02:02Z)
- Statistical Dataset Evaluation: Reliability, Difficulty, and Validity [18.36931975072938]
We propose a model-agnostic dataset evaluation framework for automatic dataset quality evaluation.
We seek the statistical properties of the datasets and address three fundamental dimensions: reliability, difficulty, and validity.
arXiv Detail & Related papers (2022-12-19T06:55:42Z)
- Evaluation of Synthetic Datasets for Conversational Recommender Systems [0.0]
The absence of robust evaluation frameworks has been a long-standing problem.
Since the quality of training data is critical for downstream applications, it is important to develop metrics that evaluate the quality holistically.
In this paper, we present a framework that takes a multi-faceted approach towards evaluating datasets produced by generative models.
arXiv Detail & Related papers (2022-12-12T18:53:10Z)
- Systematic Evaluation of Predictive Fairness [60.0947291284978]
Mitigating bias in training on biased datasets is an important open problem.
We examine the performance of various debiasing methods across multiple tasks.
We find that data conditions have a strong influence on relative model performance.
arXiv Detail & Related papers (2022-10-17T05:40:13Z)
- Robust Trajectory Prediction against Adversarial Attacks [84.10405251683713]
Trajectory prediction using deep neural networks (DNNs) is an essential component of autonomous driving systems.
These methods are vulnerable to adversarial attacks, leading to serious consequences such as collisions.
In this work, we identify two key ingredients to defend trajectory prediction models against adversarial attacks.
arXiv Detail & Related papers (2022-07-29T22:35:05Z)
- Stateful Offline Contextual Policy Evaluation and Learning [88.9134799076718]
We study off-policy evaluation and learning from sequential data.
We formalize the relevant causal structure of problems such as dynamic personalized pricing.
We show improved out-of-sample policy performance in this class of relevant problems.
arXiv Detail & Related papers (2021-10-19T16:15:56Z)