Evaluating Step-by-step Reasoning Traces: A Survey
- URL: http://arxiv.org/abs/2502.12289v2
- Date: Sat, 24 May 2025 21:09:17 GMT
- Title: Evaluating Step-by-step Reasoning Traces: A Survey
- Authors: Jinu Lee, Julia Hockenmaier
- Abstract summary: Step-by-step reasoning is widely used to enhance the reasoning ability of large language models (LLMs) in complex problems. Existing evaluation practices are highly inconsistent, resulting in fragmented progress across evaluator design and benchmark development. This survey proposes a taxonomy of evaluation criteria with four top-level categories (factuality, validity, coherence, and utility).
- Score: 3.895864050325129
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Step-by-step reasoning is widely used to enhance the reasoning ability of large language models (LLMs) in complex problems. Evaluating the quality of reasoning traces is crucial for understanding and improving LLM reasoning. However, existing evaluation practices are highly inconsistent, resulting in fragmented progress across evaluator design and benchmark development. To address this gap, this survey provides a comprehensive overview of step-by-step reasoning evaluation, proposing a taxonomy of evaluation criteria with four top-level categories (factuality, validity, coherence, and utility). Based on the taxonomy, we review different evaluator implementations and recent findings, leading to promising directions for future research.
Related papers
- Garbage In, Reasoning Out? Why Benchmark Scores are Unreliable and What to Do About It [1.6261897792391753]
We conduct a systematic audit of three widely used reasoning benchmarks, SocialIQa, FauxPas-EAI, and ToMi. We uncover pervasive flaws in both benchmark items and evaluation methodology.
arXiv Detail & Related papers (2025-06-30T13:57:28Z)
- Evaluation Hallucination in Multi-Round Incomplete Information Lateral-Driven Reasoning Tasks [18.613353004764885]
This study reveals novel insights into the limitations of existing methods. We propose a refined set of evaluation standards, including inspection of reasoning paths, diversified assessment metrics, and comparative analyses with human performance.
arXiv Detail & Related papers (2025-05-28T15:17:34Z)
- From Rankings to Insights: Evaluation Should Shift Focus from Leaderboard to Feedback [36.68929551237421]
We introduce Feedbacker, an evaluation framework that provides comprehensive and fine-grained results. Our project homepage and dataset are available at https://liudan193.io/Feedbacker.
arXiv Detail & Related papers (2025-05-10T16:52:40Z)
- MEQA: A Meta-Evaluation Framework for Question & Answer LLM Benchmarks [0.0]
We propose MEQA, a framework for the meta-evaluation of question and answer (QA) benchmarks.
We demonstrate this approach on cybersecurity benchmarks, using human and LLM evaluators.
We motivate our choice of test domain by AI models' dual nature as powerful defensive tools and security threats.
arXiv Detail & Related papers (2025-04-18T19:01:53Z)
- MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs [97.94579295913606]
Multimodal Large Language Models (MLLMs) have garnered increased attention from both industry and academia. In the development process, evaluation is critical since it provides intuitive feedback and guidance on improving models. This work aims to offer researchers an easy grasp of how to effectively evaluate MLLMs according to different needs and to inspire better evaluation methods.
arXiv Detail & Related papers (2024-11-22T18:59:54Z)
- CARMO: Dynamic Criteria Generation for Context-Aware Reward Modelling [27.86204841898399]
Reward modeling in large language models is susceptible to reward hacking. We propose Context-Aware Reward Modeling (CARMO) to mitigate this problem. We establish a new state-of-the-art performance in zero-shot settings for generative models, achieving a 2.1% improvement on Reward Bench.
arXiv Detail & Related papers (2024-10-28T21:18:49Z)
- Weak-eval-Strong: Evaluating and Eliciting Lateral Thinking of LLMs with Situation Puzzles [20.18736445118689]
We introduce SPLAT, a benchmark leveraging Situation Puzzles to evaluate and elicit lateral thinking of Large Language Models (LLMs).
This benchmark, containing 975 graded situation puzzles across three difficulty levels, employs a new multi-turn player-judge framework instead of the traditional model-based evaluation.
Experiments demonstrate that a robust evaluation model, such as WizardLM-2, closely matches human judgements in both intermediate question-answering and final scenario accuracy.
arXiv Detail & Related papers (2024-10-09T10:09:11Z)
- A Critical Look at Meta-evaluating Summarisation Evaluation Metrics [11.541368732416506]
We argue that the time is ripe to build more diverse benchmarks that enable the development of more robust evaluation metrics.
We call for research focusing on user-centric quality dimensions that consider the generated summary's communicative goal.
arXiv Detail & Related papers (2024-09-29T01:30:13Z)
- Ranking evaluation metrics from a group-theoretic perspective [5.333192842860574]
We show instances resulting in inconsistent evaluations, sources of potential mistrust in commonly used metrics.
Our analysis sheds light on ranking evaluation metrics, highlighting that inconsistent evaluations should not be seen as a source of mistrust.
arXiv Detail & Related papers (2024-08-14T09:06:58Z)
- A Critical Review of Causal Reasoning Benchmarks for Large Language Models [2.1311710788645617]
We present a comprehensive overview of LLM benchmarks for causality.
We derive a set of criteria that a useful benchmark or set of benchmarks should aim to satisfy.
arXiv Detail & Related papers (2024-07-10T20:11:51Z)
- A Systematic Survey and Critical Review on Evaluating Large Language Models: Challenges, Limitations, and Recommendations [35.12731651234186]
Large Language Models (LLMs) have recently gained significant attention due to their remarkable capabilities.
We systematically review the primary challenges and limitations causing these inconsistencies and unreliable evaluations.
Based on our critical review, we present our perspectives and recommendations to ensure LLM evaluations are reproducible, reliable, and robust.
arXiv Detail & Related papers (2024-07-04T17:15:37Z)
- Evaluating Mathematical Reasoning Beyond Accuracy [50.09931172314218]
We introduce ReasonEval, a new methodology for evaluating the quality of reasoning steps. We show that ReasonEval consistently outperforms baseline methods in the meta-evaluation datasets. We observe that ReasonEval can play a significant role in data selection.
arXiv Detail & Related papers (2024-04-08T17:18:04Z)
- CheckEval: Robust Evaluation Framework using Large Language Model via Checklist [6.713203569074019]
We introduce CheckEval, a novel evaluation framework using Large Language Models.
CheckEval addresses the challenges of ambiguity and inconsistency in current evaluation methods.
arXiv Detail & Related papers (2024-03-27T17:20:39Z)
- HD-Eval: Aligning Large Language Model Evaluators Through Hierarchical Criteria Decomposition [92.17397504834825]
HD-Eval is a framework that iteratively aligns large language model evaluators with human preference.
HD-Eval inherits the essence from the evaluation mindset of human experts and enhances the alignment of LLM-based evaluators.
Extensive experiments on three evaluation domains demonstrate the superiority of HD-Eval in further aligning state-of-the-art evaluators.
arXiv Detail & Related papers (2024-02-24T08:01:32Z)
- Learning Planning-based Reasoning by Trajectories Collection and Process Reward Synthesizing [61.98556945939045]
We propose a framework to learn planning-based reasoning through Direct Preference Optimization (DPO) on collected trajectories.
Our results on challenging logical reasoning benchmarks demonstrate the effectiveness of our learning framework.
arXiv Detail & Related papers (2024-02-01T15:18:33Z)
- F-Eval: Assessing Fundamental Abilities with Refined Evaluation Methods [102.98899881389211]
We propose F-Eval, a bilingual benchmark for evaluating fundamental abilities, including expression, commonsense, and logic.
For reference-free subjective tasks, we devise new evaluation methods, serving as alternatives to scoring by API models.
arXiv Detail & Related papers (2024-01-26T13:55:32Z)
- Leveraging Large Language Models for NLG Evaluation: Advances and Challenges [57.88520765782177]
Large Language Models (LLMs) have opened new avenues for assessing generated content quality, e.g., coherence, creativity, and context relevance.
We propose a coherent taxonomy for organizing existing LLM-based evaluation metrics, offering a structured framework to understand and compare these methods.
By discussing unresolved challenges, including bias, robustness, domain-specificity, and unified evaluation, this paper seeks to offer insights to researchers and advocate for fairer and more advanced NLG evaluation techniques.
arXiv Detail & Related papers (2024-01-13T15:59:09Z)
- Post Turing: Mapping the landscape of LLM Evaluation [22.517544562890663]
This paper traces the historical trajectory of Large Language Model (LLM) evaluation, from the foundational questions posed by Alan Turing to the modern era of AI research.
We emphasize the pressing need for a unified evaluation system, given the broader societal implications of these models.
This work serves as a call for the AI community to collaboratively address the challenges of LLM evaluation, ensuring their reliability, fairness, and societal benefit.
arXiv Detail & Related papers (2023-11-03T17:24:50Z)
- Don't Make Your LLM an Evaluation Benchmark Cheater [142.24553056600627]
Large language models (LLMs) have greatly advanced the frontiers of artificial intelligence, attaining remarkable improvement in model capacity.
To assess the model performance, a typical approach is to construct evaluation benchmarks for measuring the ability level of LLMs.
We discuss the potential risk and impact of inappropriately using evaluation benchmarks and misleadingly interpreting the evaluation results.
arXiv Detail & Related papers (2023-11-03T14:59:54Z)
- KPEval: Towards Fine-Grained Semantic-Based Keyphrase Evaluation [69.57018875757622]
We propose KPEval, a comprehensive evaluation framework consisting of four critical aspects: reference agreement, faithfulness, diversity, and utility.
Using KPEval, we re-evaluate 23 keyphrase systems and discover that established model comparison results have blind spots.
arXiv Detail & Related papers (2023-03-27T17:45:38Z)
- GO FIGURE: A Meta Evaluation of Factuality in Summarization [131.1087461486504]
We introduce GO FIGURE, a meta-evaluation framework for evaluating factuality evaluation metrics.
Our benchmark analysis on ten factuality metrics reveals that our framework provides a robust and efficient evaluation.
It also reveals that while QA metrics generally improve over standard metrics that measure factuality across domains, performance is highly dependent on the way in which questions are generated.
arXiv Detail & Related papers (2020-10-24T08:30:20Z)
This list is automatically generated from the titles and abstracts of the papers on this site.