MR-GSM8K: A Meta-Reasoning Benchmark for Large Language Model Evaluation
- URL: http://arxiv.org/abs/2312.17080v4
- Date: Wed, 5 Jun 2024 04:05:42 GMT
- Title: MR-GSM8K: A Meta-Reasoning Benchmark for Large Language Model Evaluation
- Authors: Zhongshen Zeng, Pengguang Chen, Shu Liu, Haiyun Jiang, Jiaya Jia
- Abstract summary: We introduce a novel evaluation paradigm for Large Language Models (LLMs).
This paradigm shifts the emphasis from result-oriented assessments, which often neglect the reasoning process, to a more comprehensive evaluation.
By applying this paradigm in the GSM8K dataset, we have developed the MR-GSM8K benchmark.
- Score: 60.65820977963331
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this work, we introduce a novel evaluation paradigm for Large Language Models (LLMs) that compels them to transition from a traditional question-answering role, akin to a student, to a solution-scoring role, akin to a teacher. This paradigm, focusing on "reasoning about reasoning," hence termed meta-reasoning, shifts the emphasis from result-oriented assessments, which often neglect the reasoning process, to a more comprehensive evaluation that effectively distinguishes between the cognitive capabilities of different models. By applying this paradigm in the GSM8K dataset, we have developed the MR-GSM8K benchmark. Our extensive analysis includes several state-of-the-art models from both open-source and commercial domains, uncovering fundamental deficiencies in their training and evaluation methodologies. Notably, while models like Deepseek-v2 and Claude3-Sonnet closely competed with GPT-4 in GSM8K, their performance disparities expanded dramatically in MR-GSM8K, with differences widening to over 20 absolute points, underscoring the significant challenge posed by our meta-reasoning approach.
Related papers
- Towards Unifying Evaluation of Counterfactual Explanations: Leveraging Large Language Models for Human-Centric Assessments [0.7852714805965528]
We develop a set of 30 counterfactual scenarios and collect ratings across 8 evaluation metrics from 206 respondents.
We fine-tuned different Large Language Models to predict average or individual human judgment across these metrics.
arXiv Detail & Related papers (2024-10-28T15:33:37Z)
- On the Evaluation Consistency of Attribution-based Explanations [42.1421504321572]
We introduce Meta-Rank, an open platform for benchmarking attribution methods in the image domain.
Our benchmark reveals three insights in attribution evaluation endeavors: 1) evaluating attribution methods under disparate settings can yield divergent performance rankings; 2) although inconsistent across numerous cases, the performance rankings exhibit remarkable consistency across distinct checkpoints along the same training trajectory; and 3) prior attempts at consistent evaluation fare no better than baselines when extended to more heterogeneous models and datasets.
arXiv Detail & Related papers (2024-07-28T11:49:06Z)
- MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs [55.20845457594977]
Large language models (LLMs) have shown increasing capability in problem-solving and decision-making.
We present a process-based benchmark MR-Ben that demands a meta-reasoning skill.
Our meta-reasoning paradigm is especially suited for system-2 slow thinking.
arXiv Detail & Related papers (2024-06-20T03:50:23Z)
- CoUDA: Coherence Evaluation via Unified Data Augmentation [49.37157483044349]
Coherence evaluation aims to assess the organization and structure of a discourse.
We take inspiration from the linguistic theory of discourse structure and propose a data augmentation framework named CoUDA.
With only 233M parameters, CoUDA achieves state-of-the-art performance in both pointwise scoring and pairwise ranking tasks.
arXiv Detail & Related papers (2024-03-31T13:19:36Z)
- Diversifying the Mixture-of-Experts Representation for Language Models with Orthogonal Optimizer [59.43462055143123]
The Mixture of Experts (MoE) has emerged as a highly successful technique in deep learning.
In this study, we shed light on the homogeneous representation problem, wherein experts in the MoE fail to specialize and lack diversity.
We propose an alternating training strategy that encourages each expert to update in a direction orthogonal to the subspace spanned by other experts.
arXiv Detail & Related papers (2023-10-15T07:20:28Z)
- DiversiGATE: A Comprehensive Framework for Reliable Large Language Models [2.616506436169964]
We introduce DiversiGATE, a unified framework that consolidates diverse methodologies for LLM verification.
We propose a novel 'SelfLearner' model that conforms to the DiversiGATE framework and refines its performance over time.
Our results demonstrate that our approach outperforms traditional LLMs, improving from 54.8% to 61.8% on the GSM8K benchmark.
arXiv Detail & Related papers (2023-06-22T22:29:40Z)
- KPEval: Towards Fine-Grained Semantic-Based Keyphrase Evaluation [69.57018875757622]
We propose KPEval, a comprehensive evaluation framework consisting of four critical aspects: reference agreement, faithfulness, diversity, and utility.
Using KPEval, we re-evaluate 23 keyphrase systems and discover that established model comparison results have blind spots.
arXiv Detail & Related papers (2023-03-27T17:45:38Z)
- Making Large Language Models Better Reasoners with Step-Aware Verifier [49.16750018427259]
DIVERSE (Diverse Verifier on Reasoning Step) is a novel approach that further enhances the reasoning capability of language models.
We evaluate DIVERSE on the latest language model code-davinci and show that it achieves new state-of-the-art results on six of eight reasoning benchmarks.
arXiv Detail & Related papers (2022-06-06T03:38:36Z)