Related papers: MR-GSM8K: A Meta-Reasoning Benchmark for Large Language Model Evaluation

URL: http://arxiv.org/abs/2312.17080v4
Date: Wed, 5 Jun 2024 04:05:42 GMT
Title: MR-GSM8K: A Meta-Reasoning Benchmark for Large Language Model Evaluation
Authors: Zhongshen Zeng, Pengguang Chen, Shu Liu, Haiyun Jiang, Jiaya Jia
Abstract summary: We introduce a novel evaluation paradigm for Large Language Models (LLMs). This paradigm shifts the emphasis from result-oriented assessments, which often neglect the reasoning process, to a more comprehensive evaluation. By applying this paradigm in the GSM8K dataset, we have developed the MR-GSM8K benchmark.
Score: 60.65820977963331
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: In this work, we introduce a novel evaluation paradigm for Large Language Models (LLMs) that compels them to transition from a traditional question-answering role, akin to a student, to a solution-scoring role, akin to a teacher. This paradigm, focusing on "reasoning about reasoning," hence termed meta-reasoning, shifts the emphasis from result-oriented assessments, which often neglect the reasoning process, to a more comprehensive evaluation that effectively distinguishes between the cognitive capabilities of different models. By applying this paradigm in the GSM8K dataset, we have developed the MR-GSM8K benchmark. Our extensive analysis includes several state-of-the-art models from both open-source and commercial domains, uncovering fundamental deficiencies in their training and evaluation methodologies. Notably, while models like Deepseek-v2 and Claude3-Sonnet closely competed with GPT-4 in GSM8K, their performance disparities expanded dramatically in MR-GSM8K, with differences widening to over 20 absolute points, underscoring the significant challenge posed by our meta-reasoning approach.

Related papers

KORGym: A Dynamic Game Platform for LLM Reasoning Evaluation [78.96590724864606]
We introduce the Knowledge Orthogonal Reasoning Gymnasium (KORGym), a dynamic evaluation platform inspired by KOR-Bench and Gymnasium. KORGym offers over fifty games in either textual or visual formats and supports interactive, multi-turn assessments with reinforcement learning scenarios.
arXiv Detail & Related papers (2025-05-20T16:06:32Z)
DeepSeek vs. o3-mini: How Well can Reasoning LLMs Evaluate MT and Summarization? [17.97981669263259]
Reasoning-enabled large language models (LLMs) have recently demonstrated impressive performance in complex logical and mathematical tasks. This study systematically compares reasoning-based LLMs with their non-reasoning counterparts across machine translation (MT) and text summarization (TS) evaluation tasks.
arXiv Detail & Related papers (2025-04-10T20:39:18Z)
Exploring the Potential of Large Multimodal Models as Effective Alternatives for Pronunciation Assessment [25.13605642785304]
Large Multimodal Models (LMMs) have demonstrated exceptional performance across a wide range of domains. This paper explores their potential in pronunciation assessment tasks, with a particular focus on evaluating the capabilities of the Generative Pre-trained Transformer (GPT) model.
arXiv Detail & Related papers (2025-03-14T09:26:07Z)
MM-IQ: Benchmarking Human-Like Abstraction and Reasoning in Multimodal Models [5.02953506943752]
MM-IQ is a comprehensive evaluation framework that comprises a large-scale training set with 4,776 visual reasoning problems and 2,710 meticulously curated test items spanning 8 distinct reasoning paradigms. Our benchmark reveals striking limitations: even state-of-the-art architectures achieve only marginally superior performance to random chance. Inspired by the recent surge of large reasoning models, we also release a multimodal reasoning model as the baseline that is trained via reinforcement learning with verifiable reward functions.
arXiv Detail & Related papers (2025-02-02T07:12:03Z)
Chain-of-Reasoning: Towards Unified Mathematical Reasoning in Large Language Models via a Multi-Paradigm Perspective [90.86370957353911]
Chain-of-Reasoning (CoR) is a novel unified framework that integrates multiple reasoning paradigms. CoR generates multiple potential answers using different reasoning paradigms and synthesizes them into a coherent final solution. Experimental results demonstrate that CoR-Math-7B significantly outperforms current SOTA models.
arXiv Detail & Related papers (2025-01-19T16:53:26Z)
Towards Unifying Evaluation of Counterfactual Explanations: Leveraging Large Language Models for Human-Centric Assessments [0.7852714805965528]
We develop a set of 30 counterfactual scenarios and collect ratings across 8 evaluation metrics from 206 respondents. We fine-tuned different Large Language Models to predict average or individual human judgment across these metrics.
arXiv Detail & Related papers (2024-10-28T15:33:37Z)
On the Evaluation Consistency of Attribution-based Explanations [42.1421504321572]
We introduce Meta-Rank, an open platform for benchmarking attribution methods in the image domain. Our benchmark reveals three insights in attribution evaluation endeavors: 1) evaluating attribution methods under disparate settings can yield divergent performance rankings; 2) although inconsistent across numerous cases, the performance rankings exhibit remarkable consistency across distinct checkpoints along the same training trajectory; and 3) prior attempts at consistent evaluation fare no better than baselines when extended to more heterogeneous models and datasets.
arXiv Detail & Related papers (2024-07-28T11:49:06Z)
MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs [55.20845457594977]
Large language models (LLMs) have shown increasing capability in problem-solving and decision-making. We present a process-based benchmark MR-Ben that demands a meta-reasoning skill. Our meta-reasoning paradigm is especially suited for system-2 slow thinking.
arXiv Detail & Related papers (2024-06-20T03:50:23Z)
CoUDA: Coherence Evaluation via Unified Data Augmentation [49.37157483044349]
Coherence evaluation aims to assess the organization and structure of a discourse. We take inspiration from linguistic theory of discourse structure, and propose a data augmentation framework named CoUDA. With only 233M parameters, CoUDA achieves state-of-the-art performance in both pointwise scoring and pairwise ranking tasks.
arXiv Detail & Related papers (2024-03-31T13:19:36Z)
Diversifying the Mixture-of-Experts Representation for Language Models with Orthogonal Optimizer [59.43462055143123]
The Mixture of Experts (MoE) has emerged as a highly successful technique in deep learning. In this study, we shed light on the homogeneous representation problem, wherein experts in the MoE fail to specialize and lack diversity. We propose an alternating training strategy that encourages each expert to update in a direction orthogonal to the subspace spanned by other experts.
arXiv Detail & Related papers (2023-10-15T07:20:28Z)
DiversiGATE: A Comprehensive Framework for Reliable Large Language Models [2.616506436169964]
We introduce DiversiGATE, a unified framework that consolidates diverse methodologies for LLM verification. We propose a novel 'SelfLearner' model that conforms to the DiversiGATE framework and refines its performance over time. Our results demonstrate that our approach outperforms traditional LLMs, achieving a considerable improvement from 54.8% to 61.8% on the GSM8K benchmark.
arXiv Detail & Related papers (2023-06-22T22:29:40Z)
KPEval: Towards Fine-Grained Semantic-Based Keyphrase Evaluation [69.57018875757622]
We propose KPEval, a comprehensive evaluation framework consisting of four critical aspects: reference agreement, faithfulness, diversity, and utility. Using KPEval, we re-evaluate 23 keyphrase systems and discover that established model comparison results have blind-spots.
arXiv Detail & Related papers (2023-03-27T17:45:38Z)
Making Large Language Models Better Reasoners with Step-Aware Verifier [49.16750018427259]
DIVERSE (Diverse Verifier on Reasoning Step) is a novel approach that further enhances the reasoning capability of language models. We evaluate DIVERSE on the latest language model code-davinci and show that it achieves new state-of-the-art results on six of eight reasoning benchmarks.
arXiv Detail & Related papers (2022-06-06T03:38:36Z)

This list is automatically generated from the titles and abstracts of the papers on this site.