MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs
- URL: http://arxiv.org/abs/2406.13975v3
- Date: Fri, 20 Dec 2024 12:52:00 GMT
- Title: MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs
- Authors: Zhongshen Zeng, Yinhong Liu, Yingjia Wan, Jingyao Li, Pengguang Chen, Jianbo Dai, Yuxuan Yao, Rongwu Xu, Zehan Qi, Wanru Zhao, Linling Shen, Jianqiao Lu, Haochen Tan, Yukang Chen, Hao Zhang, Zhan Shi, Bailin Wang, Zhijiang Guo, Jiaya Jia,
- Abstract summary: Large language models (LLMs) have shown increasing capability in problem-solving and decision-making.
We present a process-based benchmark MR-Ben that demands a meta-reasoning skill.
Our meta-reasoning paradigm is especially suited for system-2 slow thinking.
- Score: 55.20845457594977
- License:
- Abstract: Large language models (LLMs) have shown increasing capability in problem-solving and decision-making, largely based on the step-by-step chain-of-thought reasoning processes. However, evaluating these reasoning abilities has become increasingly challenging. Existing outcome-based benchmarks are beginning to saturate, becoming less effective in tracking meaningful progress. To address this, we present a process-based benchmark MR-Ben that demands a meta-reasoning skill, where LMs are asked to locate and analyse potential errors in automatically generated reasoning steps. Our meta-reasoning paradigm is especially suited for system-2 slow thinking, mirroring the human cognitive process of carefully examining assumptions, conditions, calculations, and logic to identify mistakes.MR-Ben comprises 5,975 questions curated by human experts across a wide range of subjects, including physics, chemistry, logic, coding, and more. Through our designed metrics for assessing meta-reasoning on this benchmark, we identify interesting limitations and weaknesses of current LLMs (open-source and closed-source models). For example, with models like the o1 series from OpenAI demonstrating strong performance by effectively scrutinizing the solution space, many other state-of-the-art models fall significantly behind on MR-Ben, exposing potential shortcomings in their training strategies and inference methodologies.
Related papers
- Reasoning on a Spectrum: Aligning LLMs to System 1 and System 2 Thinking [0.9709444454602557]
Large Language Models (LLMs) exhibit impressive reasoning abilities, yet their reliance on structured step-by-step reasoning reveals a critical limitation.
This work challenges the assumption that step-by-step reasoning is always optimal and highlights the need for adapting reasoning strategies based on task demands.
arXiv Detail & Related papers (2025-02-18T02:58:37Z) - Step Back to Leap Forward: Self-Backtracking for Boosting Reasoning of Language Models [42.70951894754312]
Integration of slow-thinking mechanisms into large language models offers a promising way toward Level 2 AGI Reasoners.
We propose a self-backtracking mechanism that equips LLMs with the ability to backtrack during both training and inference.
This mechanism not only enhances reasoning ability but also efficiency by transforming slow-thinking processes into fast-thinking through self-improvement.
arXiv Detail & Related papers (2025-02-06T08:52:43Z) - Training an LLM-as-a-Judge Model: Pipeline, Insights, and Practical Lessons [9.954960702259918]
This paper introduces Themis, a fine-tuned large language model (LLMs) judge that delivers context-aware evaluations.
We provide a comprehensive overview of the development pipeline for Themis, highlighting its scenario-dependent evaluation prompts.
We introduce two human-labeled benchmarks for meta-evaluation, demonstrating that Themis can achieve high alignment with human preferences in an economical manner.
arXiv Detail & Related papers (2025-02-05T08:35:55Z) - BRiTE: Bootstrapping Reinforced Thinking Process to Enhance Language Model Reasoning [78.63421517563056]
Large Language Models (LLMs) have demonstrated remarkable capabilities in complex reasoning tasks.
We present a unified probabilistic framework that formalizes LLM reasoning through a novel graphical model.
We introduce the Bootstrapping Reinforced Thinking Process (BRiTE) algorithm, which works in two steps.
arXiv Detail & Related papers (2025-01-31T02:39:07Z) - The Lessons of Developing Process Reward Models in Mathematical Reasoning [62.165534879284735]
Process Reward Models (PRMs) aim to identify and mitigate intermediate errors in the reasoning processes.
We develop a consensus filtering mechanism that effectively integrates Monte Carlo (MC) estimation with Large Language Models (LLMs)
We release a new state-of-the-art PRM that outperforms existing open-source alternatives.
arXiv Detail & Related papers (2025-01-13T13:10:16Z) - EQUATOR: A Deterministic Framework for Evaluating LLM Reasoning with Open-Ended Questions. # v1.0.0-beta [2.1249213103048414]
We introduce the EQUATOR Evaluator, which combines deterministic scoring with a focus on factual accuracy and robust reasoning assessment.
Using a vector database, EQUATOR pairs open-ended questions with human-evaluated answers, enabling more precise and scalable evaluations.
Our results demonstrate that this framework significantly outperforms traditional multiple-choice evaluations while maintaining high accuracy standards.
arXiv Detail & Related papers (2024-12-31T03:56:17Z) - Weak-eval-Strong: Evaluating and Eliciting Lateral Thinking of LLMs with Situation Puzzles [20.18736445118689]
We introduce SPLAT, a benchmark leveraging Situation Puzzles to evaluate and elicit lateral thinking of Large Language Models (LLMs)
This benchmark, containing 975 graded situation puzzles across three difficulty levels, employs a new multi-turn player-judge framework instead of the traditional model-based evaluation.
Experiments demonstrate that a robust evaluation model, such as WizardLM-2, closely matches human judgements in both intermediate question-answering and final scenario accuracy.
arXiv Detail & Related papers (2024-10-09T10:09:11Z) - ErrorRadar: Benchmarking Complex Mathematical Reasoning of Multimodal Large Language Models Via Error Detection [60.297079601066784]
We introduce ErrorRadar, the first benchmark designed to assess MLLMs' capabilities in error detection.
ErrorRadar evaluates two sub-tasks: error step identification and error categorization.
It consists of 2,500 high-quality multimodal K-12 mathematical problems, collected from real-world student interactions.
Results indicate significant challenges still remain, as GPT-4o with best performance is still around 10% behind human evaluation.
arXiv Detail & Related papers (2024-10-06T14:59:09Z) - MindStar: Enhancing Math Reasoning in Pre-trained LLMs at Inference Time [51.5039731721706]
MindStar is a purely inference-based searching method for large language models.
It formulates reasoning tasks as searching problems and proposes two search ideas to identify the optimal reasoning paths.
It significantly enhances the reasoning abilities of open-source models, such as Llama-2-13B and Mistral-7B, and achieves comparable performance to GPT-3.5 and Grok-1.
arXiv Detail & Related papers (2024-05-25T15:07:33Z) - Beyond Task Performance: Evaluating and Reducing the Flaws of Large
Multimodal Models with In-Context Learning [105.77733287326308]
We evaluate 10 recent open-source LMMs from 3B up to 80B parameter scale, on 5 different axes; hallucinations, abstention, compositionality, explainability and instruction following.
We explore the training-free in-context learning (ICL) as a solution, and study how it affects these limitations.
Based on our ICL study, (3) we push ICL further and propose new multimodal ICL variants such as; Multitask-ICL, Chain-of-Hindsight-ICL, and Self-Correcting-ICL.
arXiv Detail & Related papers (2023-10-01T12:02:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.