MR-GSM8K: A Meta-Reasoning Benchmark for Large Language Model Evaluation
- URL: http://arxiv.org/abs/2312.17080v4
- Date: Wed, 5 Jun 2024 04:05:42 GMT
- Title: MR-GSM8K: A Meta-Reasoning Benchmark for Large Language Model Evaluation
- Authors: Zhongshen Zeng, Pengguang Chen, Shu Liu, Haiyun Jiang, Jiaya Jia,
- Abstract summary: We introduce a novel evaluation paradigm for Large Language Models (LLMs)
This paradigm shifts the emphasis from result-oriented assessments, which often neglect the reasoning process, to a more comprehensive evaluation.
By applying this paradigm in the GSM8K dataset, we have developed the MR-GSM8K benchmark.
- Score: 60.65820977963331
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this work, we introduce a novel evaluation paradigm for Large Language Models (LLMs) that compels them to transition from a traditional question-answering role, akin to a student, to a solution-scoring role, akin to a teacher. This paradigm, focusing on "reasoning about reasoning," hence termed meta-reasoning, shifts the emphasis from result-oriented assessments, which often neglect the reasoning process, to a more comprehensive evaluation that effectively distinguishes between the cognitive capabilities of different models. By applying this paradigm in the GSM8K dataset, we have developed the MR-GSM8K benchmark. Our extensive analysis includes several state-of-the-art models from both open-source and commercial domains, uncovering fundamental deficiencies in their training and evaluation methodologies. Notably, while models like Deepseek-v2 and Claude3-Sonnet closely competed with GPT-4 in GSM8K, their performance disparities expanded dramatically in MR-GSM8K, with differences widening to over 20 absolute points, underscoring the significant challenge posed by our meta-reasoning approach.
Related papers
- Chain-of-Reasoning: Towards Unified Mathematical Reasoning in Large Language Models via a Multi-Paradigm Perspective [90.86370957353911]
Chain-of-Reasoning (CoR) is a novel unified framework that integrates multiple reasoning paradigms.
CoR generates multiple potential answers using different reasoning paradigms and synthesizes them into a coherent final solution.
Experimental results demonstrate that CoR-Math-7B significantly outperforms current SOTA models.
arXiv Detail & Related papers (2025-01-19T16:53:26Z) - Large Language Models for Scholarly Ontology Generation: An Extensive Analysis in the Engineering Field [0.0]
This paper offers an analysis of the ability of large models to identify semantic relationships between different research topics.
We developed a gold standard based on the IEEE Thesaurus to evaluate the task.
Several models have achieved outstanding results, including Mixtral-8x7B, Dolphin-Mistral, and Claude 3-7B.
arXiv Detail & Related papers (2024-12-11T10:11:41Z) - Towards Unifying Evaluation of Counterfactual Explanations: Leveraging Large Language Models for Human-Centric Assessments [0.7852714805965528]
We develop a set of 30 counterfactual scenarios and collect ratings across 8 evaluation metrics from 206 respondents.
We fine-tuned different Large Language Models to predict average or individual human judgment across these metrics.
arXiv Detail & Related papers (2024-10-28T15:33:37Z) - MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs [55.20845457594977]
Large language models (LLMs) have shown increasing capability in problem-solving and decision-making.
We present a process-based benchmark MR-Ben that demands a meta-reasoning skill.
Our meta-reasoning paradigm is especially suited for system-2 slow thinking.
arXiv Detail & Related papers (2024-06-20T03:50:23Z) - CoUDA: Coherence Evaluation via Unified Data Augmentation [49.37157483044349]
Coherence evaluation aims to assess the organization and structure of a discourse.
We take inspiration from linguistic theory of discourse structure, and propose a data augmentation framework named CoUDA.
With only 233M parameters, CoUDA achieves state-of-the-art performance in both pointwise scoring and pairwise ranking tasks.
arXiv Detail & Related papers (2024-03-31T13:19:36Z) - Diversifying the Mixture-of-Experts Representation for Language Models with Orthogonal Optimizer [59.43462055143123]
The Mixture of Experts (MoE) has emerged as a highly successful technique in deep learning.
In this study, we shed light on the homogeneous representation problem, wherein experts in the MoE fail to specialize and lack diversity.
We propose an alternating training strategy that encourages each expert to update in a direction to the subspace spanned by other experts.
arXiv Detail & Related papers (2023-10-15T07:20:28Z) - DiversiGATE: A Comprehensive Framework for Reliable Large Language
Models [2.616506436169964]
We introduce DiversiGATE, a unified framework that consolidates diverse methodologies for LLM verification.
We propose a novel SelfLearner' model that conforms to the DiversiGATE framework and refines its performance over time.
Our results demonstrate that our approach outperforms traditional LLMs, achieving a considerable 54.8% -> 61.8% improvement on the GSM8K benchmark.
arXiv Detail & Related papers (2023-06-22T22:29:40Z) - KPEval: Towards Fine-Grained Semantic-Based Keyphrase Evaluation [69.57018875757622]
We propose KPEval, a comprehensive evaluation framework consisting of four critical aspects: reference agreement, faithfulness, diversity, and utility.
Using KPEval, we re-evaluate 23 keyphrase systems and discover that established model comparison results have blind-spots.
arXiv Detail & Related papers (2023-03-27T17:45:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.