Hop, Skip, and Overthink: Diagnosing Why Reasoning Models Fumble during Multi-Hop Analysis
- URL: http://arxiv.org/abs/2508.04699v1
- Date: Wed, 06 Aug 2025 17:58:36 GMT
- Title: Hop, Skip, and Overthink: Diagnosing Why Reasoning Models Fumble during Multi-Hop Analysis
- Authors: Anushka Yadav, Isha Nalawade, Srujana Pillarichety, Yashwanth Babu, Reshmi Ghosh, Samyadeep Basu, Wenlong Zhao, Ali Nasaeh, Sriram Balasubramanian, Soundararajan Srinivasan
- Abstract summary: Reasoning models and their integration into practical AI chatbots have led to breakthroughs in solving advanced math, deep search, and extractive question answering problems. Yet, a complete understanding of why these models hallucinate more than general-purpose language models is missing. In this study, we systematically explore reasoning failures of contemporary language models on multi-hop question answering tasks.
- Score: 3.711555701154055
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The emergence of reasoning models and their integration into practical AI chatbots has led to breakthroughs in solving advanced math, deep search, and extractive question answering problems that require a complex and multi-step thought process. Yet, a complete understanding of why these models hallucinate more than general-purpose language models is missing. In this investigative study, we systematically explore reasoning failures of contemporary language models on multi-hop question answering tasks. We introduce a novel, nuanced error categorization framework that examines failures across three critical dimensions: the diversity and uniqueness of source documents involved ("hops"), completeness in capturing relevant information ("coverage"), and cognitive inefficiency ("overthinking"). Through rigorous human annotation, supported by complementary automated metrics, our exploration uncovers intricate error patterns often hidden by accuracy-centric evaluations. This investigative approach provides deeper insights into the cognitive limitations of current models and offers actionable guidance toward enhancing reasoning fidelity, transparency, and robustness in future language modeling efforts.
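The three dimensions are defined only informally in the abstract. As a minimal sketch of how an annotation record for this taxonomy might be structured (all field names, types, and thresholds below are illustrative assumptions, not the paper's actual schema):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MultiHopErrorAnnotation:
    """Hypothetical annotation record for the three error dimensions
    named in the abstract; the schema is illustrative, not the paper's."""
    question_id: str
    gold_documents: frozenset[str]   # source documents ("hops") the answer requires
    cited_documents: frozenset[str]  # documents the model's trace actually draws on
    gold_facts: frozenset[str]       # supporting facts needed for the answer
    captured_facts: frozenset[str]   # supporting facts present in the trace
    reasoning_tokens: int            # length of the model's reasoning trace

    @property
    def skipped_hop(self) -> bool:
        """True if some required source document was never consulted."""
        return not self.gold_documents <= self.cited_documents

    @property
    def coverage(self) -> float:
        """Fraction of gold supporting facts captured by the trace."""
        if not self.gold_facts:
            return 1.0
        return len(self.gold_facts & self.captured_facts) / len(self.gold_facts)

    def overthinks(self, token_budget: int = 1024) -> bool:
        """Crude proxy for cognitive inefficiency: the trace runs far past
        a nominal budget; the paper's actual criterion may differ."""
        return self.reasoning_tokens > token_budget
```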
Related papers
- HumanOmniV2: From Understanding to Omni-Modal Reasoning with Context [26.506057678587176]
Insufficient context understanding occurs when a model misinterprets the multimodal context, resulting in incorrect answers. The shortcut problem occurs when the model overlooks crucial clues in the multimodal inputs, directly addressing the query without considering the multimodal information. We introduce an omni-modal reasoning benchmark, IntentBench, aimed at evaluating models' understanding of complex human intentions and emotions.
arXiv Detail & Related papers (2025-06-26T14:01:03Z)
- Socratic-MCTS: Test-Time Visual Reasoning by Asking the Right Questions [100.41062461003389]
We show that framing reasoning as a search process helps the model "connect the dots" between fragmented knowledge and produce extended reasoning traces in non-reasoning models. We evaluate our method across three benchmarks and observe consistent improvements.
arXiv Detail & Related papers (2025-06-10T15:51:16Z)
- Self-Critique Guided Iterative Reasoning for Multi-hop Question Answering [24.446222685949227]
Large language models (LLMs) face challenges in knowledge-intensive multi-hop reasoning. We propose Self-Critique Guided Iterative Reasoning (SiGIR), which uses self-critique feedback to guide the iterative reasoning process.
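The summary only names the loop; a generic sketch of self-critique-guided iteration (the prompt wording, the APPROVE stopping rule, and the `llm` callable are assumptions, not SiGIR's actual design):

```python
from typing import Callable

def self_critique_loop(question: str, llm: Callable[[str], str], max_rounds: int = 4) -> str:
    """Answer, critique, and revise until the critic approves or the
    round budget runs out. A generic sketch, not SiGIR's exact method."""
    answer = llm(f"Answer step by step: {question}")
    for _ in range(max_rounds):
        critique = llm(
            f"Question: {question}\nAnswer: {answer}\n"
            "Critique this answer. Reply APPROVE if every hop is supported; "
            "otherwise state which evidence or reasoning step is missing."
        )
        if critique.strip().startswith("APPROVE"):
            break
        answer = llm(
            f"Question: {question}\nPrevious answer: {answer}\n"
            f"Critique: {critique}\nRevise the answer to address the critique."
        )
    return answer
```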
arXiv Detail & Related papers (2025-05-25T12:10:24Z)
- Embodied-Reasoner: Synergizing Visual Search, Reasoning, and Action for Embodied Interactive Tasks [42.022527376404476]
Embodied-Reasoner is a model that extends o1-style reasoning to interactive embodied search tasks. We synthesize 9.3k coherent Observation-Thought-Action trajectories containing 64k interactive images and 90k diverse thinking processes. We develop a three-stage training pipeline that progressively enhances the model's capabilities.
arXiv Detail & Related papers (2025-03-27T17:00:51Z)
- Large Language Models and Mathematical Reasoning Failures [1.6114012813668932]
This paper investigates the mathematical reasoning capabilities of large language models (LLMs) using 50 newly constructed high-school-level word problems. We rigorously analyze both final answers and solution steps to identify reasoning failures. We find that while newer models (e.g., o3-mini, deepseek-r1) achieve higher accuracy, all models exhibit errors in spatial reasoning, strategic planning, and arithmetic.
arXiv Detail & Related papers (2025-02-17T09:07:32Z)
- Causality can systematically address the monsters under the bench(marks) [64.36592889550431]
Benchmarks are plagued by various biases, artifacts, or leakage. Models may behave unreliably due to poorly explored failure modes. Causality offers an ideal framework to systematically address these challenges.
arXiv Detail & Related papers (2025-02-07T17:01:37Z)
- The Superalignment of Superhuman Intelligence with Large Language Models [63.96120398355404]
We discuss the concept of superalignment from the learning perspective to answer this question. We highlight key research problems in superalignment, namely weak-to-strong generalization, scalable oversight, and evaluation. We present a conceptual framework for superalignment consisting of three modules: an attacker that generates adversarial queries to expose the weaknesses of a learner model; a learner that refines itself from scalable feedback produced by a critic model together with minimal human expert input; and a critic that generates critiques or explanations for a given query-response pair, with the goal of improving the learner through criticism.
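The three-module decomposition is concrete enough to state as interfaces; a minimal sketch treating each module as a black-box component (all method names are illustrative; the paper presents this only as a conceptual framework):

```python
from typing import Protocol

class Learner(Protocol):
    def respond(self, query: str) -> str: ...
    def refine(self, query: str, response: str, critique: str) -> None:
        """Update from scalable critic feedback plus minimal human input."""

class Attacker(Protocol):
    def generate_queries(self, learner: Learner, n: int) -> list[str]:
        """Produce adversarial queries that try to expose learner weaknesses."""

class Critic(Protocol):
    def criticize(self, query: str, response: str) -> str:
        """Explain what is wrong (or right) about a query-response pair."""

def superalignment_round(attacker: Attacker, learner: Learner, critic: Critic) -> None:
    # One iteration of the attack -> respond -> critique -> refine loop.
    for query in attacker.generate_queries(learner, n=8):
        response = learner.respond(query)
        critique = critic.criticize(query, response)
        learner.refine(query, response, critique)
```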
arXiv Detail & Related papers (2024-12-15T10:34:06Z)
- Visual-O1: Understanding Ambiguous Instructions via Multi-modal Multi-turn Chain-of-thoughts Reasoning [53.45295657891099]
This paper proposes Visual-O1, a multi-modal multi-turn chain-of-thought reasoning framework.
It simulates human multi-modal multi-turn reasoning, providing instantial experience for highly intelligent models.
Our work highlights the potential of artificial intelligence to work like humans in real-world scenarios with uncertainty and ambiguity.
arXiv Detail & Related papers (2024-10-04T11:18:41Z)
- Conceptual and Unbiased Reasoning in Language Models [98.90677711523645]
We propose a novel conceptualization framework that forces models to perform conceptual reasoning on abstract questions.
We show that existing large language models fall short on conceptual reasoning, dropping 9% to 28% on various benchmarks.
We then discuss how models can improve since high-level abstract reasoning is key to unbiased and generalizable decision-making.
arXiv Detail & Related papers (2024-03-30T00:53:53Z)
- Re-Reading Improves Reasoning in Large Language Models [87.46256176508376]
We introduce a simple, yet general and effective prompting method, Re2, to enhance the reasoning capabilities of off-the-shelf Large Language Models (LLMs).
Unlike most thought-eliciting prompting methods, such as Chain-of-Thought (CoT), Re2 shifts the focus to the input by processing questions twice, thereby enhancing the understanding process.
We evaluate Re2 on extensive reasoning benchmarks across 14 datasets, spanning 112 experiments, to validate its effectiveness and generality.
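The mechanism is simple enough to show directly; a sketch of the prompt construction (the exact template wording below is an assumption; the paper specifies its own):

```python
def re2_prompt(question: str) -> str:
    """Re-Reading (Re2): present the question twice so the model re-reads
    the input before reasoning. Template wording is illustrative."""
    return (
        f"Q: {question}\n"
        f"Read the question again: {question}\n"
        "A: Let's think step by step."
    )
```

Because Re2 modifies only the input side, it can in principle be combined with answer-side prompting methods such as Chain-of-Thought.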
arXiv Detail & Related papers (2023-09-12T14:36:23Z)
- Piecing Together Clues: A Benchmark for Evaluating the Detective Skills of Large Language Models [44.42887452269389]
Detectives frequently engage in information detection and reasoning simultaneously when making decisions across various cases.
We introduce DetectBench, a reading comprehension dataset designed to assess a model's joint ability in key information detection and multi-hop reasoning.
To enhance models' detective skills, we propose the Detective Thinking Framework, which encourages models to identify all possible clues within the context before reasoning, as sketched below.
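As described, the framework is a two-stage prompting scheme: surface the clues first, then reason over them. A generic sketch (the prompt wording and the `llm` callable are assumptions, not the paper's exact prompts):

```python
def detective_thinking(context: str, question: str, llm) -> str:
    """Stage 1: enumerate all candidate clues in the context.
    Stage 2: reason over the enumerated clues to answer.
    A generic sketch of the idea, not the paper's exact prompts."""
    clues = llm(
        f"Context: {context}\n"
        f"List every clue in the context that could bear on the question "
        f"'{question}', one per line."
    )
    return llm(
        f"Context: {context}\nClues:\n{clues}\n"
        f"Using the clues above, answer step by step: {question}"
    )
```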
arXiv Detail & Related papers (2023-07-11T08:45:46Z)
- Causal Reasoning Meets Visual Representation Learning: A Prospective Study [117.08431221482638]
Lack of interpretability, robustness, and out-of-distribution generalization is becoming a central challenge for existing visual models.
Inspired by the strong inference ability of human-level agents, researchers have in recent years devoted great effort to developing causal reasoning paradigms.
This paper aims to provide a comprehensive overview of this emerging field, attract attention, encourage discussion, and bring to the forefront the urgency of developing novel causal reasoning methods.
arXiv Detail & Related papers (2022-04-26T02:22:28Z)
This list is automatically generated from the titles and abstracts of the papers on this site.