Related papers: Automating Dataset Updates Towards Reliable and Timely Evaluation of Large Language Models

Automating Dataset Updates Towards Reliable and Timely Evaluation of Large Language Models

URL: http://arxiv.org/abs/2402.11894v3
Date: Thu, 6 Jun 2024 12:49:44 GMT
Title: Automating Dataset Updates Towards Reliable and Timely Evaluation of Large Language Models
Authors: Jiahao Ying, Yixin Cao, Yushi Bai, Qianru Sun, Bo Wang, Wei Tang, Zhaojun Ding, Yizhe Yang, Xuanjing Huang, Shuicheng Yan,
Abstract summary: Large language models (LLMs) have achieved impressive performance across various natural language benchmarks. We propose to automate dataset updating and provide systematic analysis regarding its effectiveness. There are two updating strategies: 1) mimicking strategy to generate similar samples based on original data, and 2) extending strategy that further expands existing samples.
Score: 81.27391252152199
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models (LLMs) have achieved impressive performance across various natural language benchmarks, prompting a continual need to curate more difficult datasets for larger LLMs, which is costly and time-consuming. In this paper, we propose to automate dataset updating and provide systematic analysis regarding its effectiveness in dealing with benchmark leakage issue, difficulty control, and stability. Thus, once the current benchmark has been mastered or leaked, we can update it for timely and reliable evaluation. There are two updating strategies: 1) mimicking strategy to generate similar samples based on original data, preserving stylistic and contextual essence, and 2) extending strategy that further expands existing samples at varying cognitive levels by adapting Bloom's taxonomy of educational objectives. Extensive experiments on updated MMLU and BIG-Bench demonstrate the stability of the proposed strategies and find that the mimicking strategy can effectively alleviate issues of overestimation from benchmark leakage. In cases where the efficient mimicking strategy fails, our extending strategy still shows promising results. Additionally, by controlling the difficulty, we can better discern the models' performance and enable fine-grained analysis neither too difficult nor too easy an exam can fairly judge students' learning status. To the best of our knowledge, we are the first to automate updating benchmarks for reliable and timely evaluation. Our demo leaderboard can be found at https://yingjiahao14.github.io/Automating-DatasetUpdates/.

Related papers

Sticker-TTS: Learn to Utilize Historical Experience with a Sticker-driven Test-Time Scaling Framework [97.11629413081651]
We propose Sticker-TTS, a novel test-time scaling framework for large reasoning models.<n>At the core of our framework are distilled key conditions-termed stickers that drive the extraction, refinement, and reuse of critical information.<n>We show that Sticker-TTS consistently surpasses strong baselines, including self-consistency and advanced reinforcement learning approaches.
arXiv Detail & Related papers (2025-09-05T11:14:11Z)
RICO: Two Realistic Benchmarks and an In-Depth Analysis for Incremental Learning in Object Detection [54.194199777900934]
Incremental Learning (IL) trains models sequentially on new data without full retraining, offering privacy, efficiency, and scalability.<n>We introduce two Realistic Incremental Object Detection Benchmarks (RICO): Domain RICO (D-RICO) features domain shifts with a fixed class set, and Expanding-Classes RICO (EC-RICO) integrates new domains and classes per IL step.<n>Our experiments show that all IL methods underperform in adaptability and retention, while replaying a small amount of previous data already outperforms all methods.
arXiv Detail & Related papers (2025-08-19T14:42:19Z)
TAPS : Frustratingly Simple Test Time Active Learning for VLMs [0.0]
Test-Time Optimization enables models to adapt to new data during inference by updating parameters on-the-fly.<n>We propose a novel Test-Time Active Learning framework that adaptively queries uncertain samples and updates prompts dynamically.<n>Our framework provides a practical and effective solution for real-world deployment in safety-critical applications such as autonomous systems and medical diagnostics.
arXiv Detail & Related papers (2025-07-26T18:04:49Z)
Memorization vs. Reasoning: Updating LLMs with New Knowledge [12.214561228023511]
We introduce Knowledge Update Playground (KUP), an automatic pipeline for simulating realistic knowledge updates. We present a lightweight method called memory conditioned training (MCT), which conditions tokens in the update corpus on self-generated "memory" tokens during training. Our results show that (1) KUP benchmark is highly challenging, with the best CPT models achieving $2%$ in indirect probing setting (reasoning) and (2) MCT training significantly outperforms prior continued pre-training (CPT) baselines.
arXiv Detail & Related papers (2025-04-16T23:03:40Z)
Towards Robust Universal Information Extraction: Benchmark, Evaluation, and Solution [66.11004226578771]
Existing robust benchmark datasets have two key limitations. They generate only a limited range of perturbations for a single Information Extraction (IE) task. Considering the powerful generation capabilities of Large Language Models (LLMs), we introduce a new benchmark dataset for Robust UIE, called RUIE-Bench. We show that training with only textbf15% of the data leads to an average textbf7.5% relative performance improvement across three IE tasks.
arXiv Detail & Related papers (2025-03-05T05:39:29Z)
ReLearn: Unlearning via Learning for Large Language Models [64.2802606302194]
We propose ReLearn, a data augmentation and fine-tuning pipeline for effective unlearning. This framework introduces Knowledge Forgetting Rate (KFR) and Knowledge Retention Rate (KRR) to measure knowledge-level preservation. Our experiments show that ReLearn successfully achieves targeted forgetting while preserving high-quality output.
arXiv Detail & Related papers (2025-02-16T16:31:00Z)
Beyond the Singular: The Essential Role of Multiple Generations in Effective Benchmark Evaluation and Analysis [10.133537818749291]
Large language models (LLMs) have demonstrated significant utilities in real-world applications. Benchmark evaluations are crucial for assessing the capabilities of LLMs.
arXiv Detail & Related papers (2025-02-13T03:43:33Z)
Adaptive Distraction: Probing LLM Contextual Robustness with Automated Tree Search [76.54475437069395]
Large Language Models (LLMs) often struggle to maintain their original performance when faced with semantically coherent but task-irrelevant contextual information.<n>We propose a dynamic distraction generation framework based on tree search, where the generation process is guided by model behavior.
arXiv Detail & Related papers (2025-02-03T18:43:36Z)
Leveraging Online Olympiad-Level Math Problems for LLMs Training and Contamination-Resistant Evaluation [55.21013307734612]
AoPS-Instruct is a dataset of more than 600,000 high-quality QA pairs. LiveAoPSBench is an evolving evaluation set with timestamps, derived from the latest forum data. Our work presents a scalable approach to creating and maintaining large-scale, high-quality datasets for advanced math reasoning.
arXiv Detail & Related papers (2025-01-24T06:39:38Z)
Position: LLM Unlearning Benchmarks are Weak Measures of Progress [31.957968729934745]
We find that existing benchmarks provide an overly optimistic and potentially misleading view on the effectiveness of candidate unlearning methods. We identify that existing benchmarks are particularly vulnerable to modifications that introduce even loose dependencies between the forget and retain information.
arXiv Detail & Related papers (2024-10-03T18:07:25Z)
Realistic Evaluation of Test-Time Adaptation Algorithms: Unsupervised Hyperparameter Selection [1.4530711901349282]
Test-Time Adaptation (TTA) has emerged as a promising strategy for tackling the problem of machine learning model robustness under distribution shifts. We evaluate existing TTA methods using surrogate-based hp-selection strategies to obtain a more realistic evaluation of their performance.
arXiv Detail & Related papers (2024-07-19T11:58:30Z)
DARG: Dynamic Evaluation of Large Language Models via Adaptive Reasoning Graph [70.79413606968814]
We introduce Dynamic Evaluation of LLMs via Adaptive Reasoning Graph Evolvement (DARG) to dynamically extend current benchmarks with controlled complexity and diversity. Specifically, we first extract the reasoning graphs of data points in current benchmarks and then perturb the reasoning graphs to generate novel testing data. Such newly generated test samples can have different levels of complexity while maintaining linguistic diversity similar to the original benchmarks.
arXiv Detail & Related papers (2024-06-25T04:27:53Z)
Uncertainty Aware Learning for Language Model Alignment [97.36361196793929]
We propose uncertainty-aware learning (UAL) to improve the model alignment of different task scenarios. We implement UAL in a simple fashion -- adaptively setting the label smoothing value of training according to the uncertainty of individual samples. Experiments on widely used benchmarks demonstrate that our UAL significantly and consistently outperforms standard supervised fine-tuning.
arXiv Detail & Related papers (2024-06-07T11:37:45Z)
MixEval: Deriving Wisdom of the Crowd from LLM Benchmark Mixtures [57.886592207948844]
We propose MixEval, a new paradigm for establishing efficient, gold-standard evaluation by strategically mixing off-the-shelf benchmarks. It bridges (1) comprehensive and well-distributed real-world user queries and (2) efficient and fairly-graded ground-truth-based benchmarks, by matching queries mined from the web with similar queries from existing benchmarks.
arXiv Detail & Related papers (2024-06-03T05:47:05Z)
Bring Your Own Data! Self-Supervised Evaluation for Large Language Models [52.15056231665816]
We propose a framework for self-supervised evaluation of Large Language Models (LLMs) We demonstrate self-supervised evaluation strategies for measuring closed-book knowledge, toxicity, and long-range context dependence. We find strong correlations between self-supervised and human-supervised evaluations.
arXiv Detail & Related papers (2023-06-23T17:59:09Z)
Re-Benchmarking Pool-Based Active Learning for Binary Classification [27.034593234956713]
Active learning is a paradigm that significantly enhances the performance of machine learning models when acquiring labeled data. While several benchmarks exist for evaluating active learning strategies, their findings exhibit some misalignment. This discrepancy motivates us to develop a transparent and reproducible benchmark for the community.
arXiv Detail & Related papers (2023-06-15T08:47:50Z)
Information Association for Language Model Updating by Mitigating LM-Logical Discrepancy [68.31760483418901]
Large Language Models(LLMs) struggle with providing current information due to the outdated pre-training data. Existing methods for updating LLMs, such as knowledge editing and continual fine-tuning, have significant drawbacks in generalizability of new information. We identify the core challenge behind these drawbacks: the LM-logical discrepancy featuring the difference between language modeling probabilities and logical probabilities.
arXiv Detail & Related papers (2023-05-29T19:48:37Z)
Text Generation by Learning from Demonstrations [17.549815256968877]
Current approaches to text generation largely rely on autoregressive models and maximum likelihood estimation. We propose GOLD: an easy-to-optimize algorithm that learns from expert demonstrations by importance weighting. According to both automatic and human evaluation, models trained by GOLD outperform those trained by MLE and policy gradient.
arXiv Detail & Related papers (2020-09-16T17:58:37Z)

This list is automatically generated from the titles and abstracts of the papers in this site.