What Defines Good Reasoning in LLMs? Dissecting Reasoning Steps with Multi-Aspect Evaluation
- URL: http://arxiv.org/abs/2510.20603v1
- Date: Thu, 23 Oct 2025 14:30:37 GMT
- Title: What Defines Good Reasoning in LLMs? Dissecting Reasoning Steps with Multi-Aspect Evaluation
- Authors: Heejin Do, Jaehui Hwang, Dongyoon Han, Seong Joon Oh, Sangdoo Yun
- Abstract summary: We decompose reasoning quality into two dimensions: relevance and coherence. To measure these aspects reliably, we introduce causal stepwise evaluation (CaSE). We show that curating training data with CaSE-evaluated relevance and coherence directly improves final task performance.
- Score: 67.47463575774388
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Evaluating large language models (LLMs) on final-answer correctness is the dominant paradigm. This approach, however, provides a coarse signal for model improvement and overlooks the quality of the underlying reasoning process. We argue that a more granular evaluation of reasoning offers a more effective path to building robust models. We decompose reasoning quality into two dimensions: relevance and coherence. Relevance measures if a step is grounded in the problem; coherence measures if it follows logically from prior steps. To measure these aspects reliably, we introduce causal stepwise evaluation (CaSE). This method assesses each reasoning step using only its preceding context, which avoids hindsight bias. We validate CaSE against human judgments on our new expert-annotated benchmarks, MRa-GSM8K and MRa-MATH. More importantly, we show that curating training data with CaSE-evaluated relevance and coherence directly improves final task performance. Our work provides a scalable framework for analyzing, debugging, and improving LLM reasoning, demonstrating the practical value of moving beyond validity checks.
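The abstract describes CaSE as judging each reasoning step from its preceding context only, along two aspects (relevance and coherence). A minimal Python sketch of that loop follows. It is an illustration under stated assumptions, not the authors' implementation: the `Judge` callable, the `StepScore` record, the boolean labels, and the `keep_for_training` filter rule are all hypothetical names and choices introduced here, and in practice the judge would likely be an LLM prompted with the problem and the earlier steps.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class StepScore:
    index: int
    relevant: bool   # is the step grounded in the problem?
    coherent: bool   # does the step follow from the prior steps?


# Hypothetical judge interface (an assumption, not from the paper):
# (problem, preceding_steps, current_step) -> (relevant, coherent)
Judge = Callable[[str, Sequence[str], str], Tuple[bool, bool]]


def causal_stepwise_eval(
    problem: str, steps: Sequence[str], judge: Judge
) -> List[StepScore]:
    """Score each step while showing the judge only the steps before it."""
    scores = []
    for i, step in enumerate(steps):
        # Causal context: steps[i+1:] are never passed to the judge,
        # so later steps cannot influence the verdict on step i.
        context = tuple(steps[:i])
        relevant, coherent = judge(problem, context, step)
        scores.append(StepScore(i, relevant, coherent))
    return scores


def keep_for_training(scores: Sequence[StepScore]) -> bool:
    """Assumed curation rule: keep a solution only if every step passes both aspects."""
    return all(s.relevant and s.coherent for s in scores)
```

Restricting the judge's input to `steps[:i]` is the part that corresponds to the abstract's point about hindsight bias. The all-steps-must-pass filter is only one possible way to turn step scores into a data-curation decision; the abstract does not specify the rule used.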
Related papers
- Rectify Evaluation Preference: Improving LLMs' Critique on Math Reasoning via Perplexity-aware Reinforcement Learning [34.43632129774481]
In this paper, we quantify and investigate the potential reason -- imbalanced evaluation preference. Motivated by the analysis of the reason, a novel perplexity-aware reinforcement learning algorithm is proposed to rectify the evaluation preference.
arXiv Detail & Related papers (2025-11-13T13:37:45Z)
- FINEREASON: Evaluating and Improving LLMs' Deliberate Reasoning through Reflective Puzzle Solving [90.88021670297664]
FINEREASON is a logic-puzzle benchmark for evaluation of large language models' reasoning capabilities. We introduce two tasks: state checking, and state transition, for a comprehensive evaluation of how models assess the current situation and plan the next move. We show that models trained on our state checking and transition data demonstrate gains in math reasoning by up to 5.1% on GSM8K.
arXiv Detail & Related papers (2025-02-27T16:23:25Z)
- Step-KTO: Optimizing Mathematical Reasoning through Stepwise Binary Feedback [94.25162866972077]
Step-KTO is a training framework that combines process-level and outcome-level binary feedback. Our experiments show that Step-KTO significantly improves both final answer accuracy and the quality of intermediate reasoning steps.
arXiv Detail & Related papers (2025-01-18T15:38:03Z)
- Critical-Questions-of-Thought: Steering LLM reasoning with Argumentative Querying [0.3659498819753633]
State-of-the-art Large Language models (LLMs) continue to struggle when performing logical and mathematical reasoning. This paper makes use of the notion of critical questions from the literature on argumentation theory, focusing in particular on Toulmin's model of argumentation. We show that employing these critical questions can improve the reasoning capabilities of LLMs.
arXiv Detail & Related papers (2024-12-19T18:51:30Z)
- MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs [55.20845457594977]
Large language models (LLMs) have shown increasing capability in problem-solving and decision-making. We present a process-based benchmark MR-Ben that demands a meta-reasoning skill. Our meta-reasoning paradigm is especially suited for system-2 slow thinking.
arXiv Detail & Related papers (2024-06-20T03:50:23Z)
- Evaluating Mathematical Reasoning Beyond Accuracy [50.09931172314218]
We introduce ReasonEval, a new methodology for evaluating the quality of reasoning steps. We show that ReasonEval consistently outperforms baseline methods in the meta-evaluation datasets. We observe that ReasonEval can play a significant role in data selection.
arXiv Detail & Related papers (2024-04-08T17:18:04Z)
- OVM, Outcome-supervised Value Models for Planning in Mathematical Reasoning [15.59540726867483]
We argue that in guided decoding, assessing the potential of an incomplete reasoning path can be more advantageous than simply ensuring per-step correctness. Inspired by the finding that outcome supervision for guided decoding essentially acts as a value model, we propose the Outcome-supervised Value Model (OVM). Our experiments on two multi-step mathematical reasoning datasets, GSM8K and Game of 24, demonstrate the superior performance of the OVM model.
arXiv Detail & Related papers (2023-11-16T09:56:28Z)
- Self-Evaluation Guided Beam Search for Reasoning [61.523627290397556]
We introduce a stepwise self-evaluation mechanism to guide and calibrate the reasoning process of Large Language Models (LLMs). We propose a decoding algorithm integrating the self-evaluation guidance via beam search. Our approach surpasses the corresponding Codex-backboned baselines in few-shot accuracy by 6.34%, 9.56%, and 5.46% on GSM8K, AQuA, and StrategyQA, respectively.
arXiv Detail & Related papers (2023-05-01T02:37:59Z)