ThinkBench: Dynamic Out-of-Distribution Evaluation for Robust LLM Reasoning
- URL: http://arxiv.org/abs/2502.16268v1
- Date: Sat, 22 Feb 2025 15:41:51 GMT
- Title: ThinkBench: Dynamic Out-of-Distribution Evaluation for Robust LLM Reasoning
- Authors: Shulin Huang, Linyi Yang, Yan Song, Shuang Chen, Leyang Cui, Ziyu Wan, Qingcheng Zeng, Ying Wen, Kun Shao, Weinan Zhang, Jun Wang, Yue Zhang
- Abstract summary: ThinkBench is an evaluation framework for large language models (LLMs). It unifies the evaluation of reasoning models and non-reasoning models. ThinkBench effectively provides a reliable evaluation of LLMs and reduces the impact of data contamination.
- Score: 61.750373974799366
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Evaluating large language models (LLMs) poses significant challenges, particularly due to issues of data contamination and the leakage of correct answers. To address these challenges, we introduce ThinkBench, a novel evaluation framework designed to evaluate LLMs' reasoning capability robustly. ThinkBench proposes a dynamic data generation method for constructing out-of-distribution (OOD) datasets and offers an OOD dataset that contains 2,912 samples drawn from reasoning tasks. ThinkBench unifies the evaluation of reasoning models and non-reasoning models. We evaluate 16 LLMs and 4 PRMs under identical experimental conditions and show that the performance of most LLMs is far from robust and that they face a certain level of data leakage. By dynamically generating OOD datasets, ThinkBench effectively provides a reliable evaluation of LLMs and reduces the impact of data contamination.
Related papers
- Towards Robust Universal Information Extraction: Benchmark, Evaluation, and Solution [66.11004226578771]
Existing robust benchmark datasets have two key limitations.
They generate only a limited range of perturbations for a single Information Extraction (IE) task.
Considering the powerful generation capabilities of Large Language Models (LLMs), we introduce a new benchmark dataset for Robust UIE, called RUIE-Bench.
We show that training with only 15% of the data leads to an average 7.5% relative performance improvement across three IE tasks.
arXiv Detail & Related papers (2025-03-05T05:39:29Z)
- Clear Minds Think Alike: What Makes LLM Fine-tuning Robust? A Study of Token Perplexity [61.48338027901318]
We show that fine-tuning with LLM-generated data improves target task performance and reduces out-of-domain degradation.
This is the first mechanistic explanation for the superior OOD robustness conferred by LLM-generated training data.
arXiv Detail & Related papers (2025-01-24T08:18:56Z)
- UBENCH: Benchmarking Uncertainty in Large Language Models with Multiple Choice Questions [10.28688988951815]
UBENCH is a benchmark for evaluating large language models.
It includes 3,978 multiple-choice questions covering knowledge, language, understanding, and reasoning abilities.
We also evaluate the reliability of 15 popular LLMs, finding GLM4 to be the most outstanding.
arXiv Detail & Related papers (2024-06-18T16:50:38Z)
- Reinforcement Retrieval Leveraging Fine-grained Feedback for Fact Checking News Claims with Black-Box LLM [7.702325506088706]
We propose an approach leveraging Fine-grained Feedback with Reinforcement Retrieval (FFRR) to enhance fact-checking on news claims.
We evaluate our model on two public datasets for real-world news claim verification.
arXiv Detail & Related papers (2024-04-26T09:38:27Z)
- Comprehensive Reassessment of Large-Scale Evaluation Outcomes in LLMs: A Multifaceted Statistical Approach [64.42462708687921]
Evaluations have revealed that factors such as scaling, training types, and architectures profoundly impact the performance of LLMs.
Our study embarks on a thorough re-examination of these LLMs, targeting the inadequacies in current evaluation methods.
This includes the application of ANOVA, Tukey HSD tests, GAMM, and clustering techniques.
arXiv Detail & Related papers (2024-03-22T14:47:35Z)
- Benchmark Self-Evolving: A Multi-Agent Framework for Dynamic LLM Evaluation [51.99752147380505]
This paper presents a benchmark self-evolving framework to dynamically evaluate Large Language Models (LLMs).
We utilize a multi-agent system to manipulate the context or question of original instances, reframing new evolving instances with high confidence.
Our framework widens performance discrepancies both between different models and within the same model across various tasks.
arXiv Detail & Related papers (2024-02-18T03:40:06Z)
- Revisit Input Perturbation Problems for LLMs: A Unified Robustness Evaluation Framework for Noisy Slot Filling Task [18.623619585980688]
We propose a unified robustness evaluation framework based on the slot-filling task to evaluate the dialogue understanding capability of large language models.
Specifically, we construct an input perturbation evaluation dataset, Noise-LLM, which contains five types of single perturbation and four types of mixed perturbation data.
Our aim is to assess how well various robustness methods of LLMs perform in real-world noisy scenarios.
arXiv Detail & Related papers (2023-10-10T10:22:05Z)
- Revisiting Out-of-distribution Robustness in NLP: Benchmark, Analysis, and LLMs Evaluations [111.88727295707454]
This paper reexamines the research on out-of-distribution (OOD) robustness in the field of NLP.
We propose a benchmark construction protocol that ensures clear differentiation and challenging distribution shifts.
We conduct experiments on pre-trained language models for analysis and evaluation of OOD robustness.
arXiv Detail & Related papers (2023-06-07T17:47:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.