MIRAGE: Multi-hop Reasoning with Ambiguity Evaluation for Illusory Questions
- URL: http://arxiv.org/abs/2509.22750v1
- Date: Fri, 26 Sep 2025 07:31:01 GMT
- Title: MIRAGE: Multi-hop Reasoning with Ambiguity Evaluation for Illusory Questions
- Authors: Jeonghyun Park, Ingeol Baek, Seunghyun Yoon, Haeun Jang, Aparna Garimella, Akriti Jain, Nedim Lipka, Hwanhee Lee
- Abstract summary: Real-world Multi-hop Question Answering (QA) often involves ambiguity that is inseparable from the reasoning process itself. This ambiguity creates a distinct challenge, where multiple reasoning paths emerge from a single question. We introduce MultI-hop Reasoning with AmbiGuity Evaluation for Illusory Questions (MIRAGE) to analyze and evaluate this challenging intersection.
- Score: 25.695038634265
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Real-world Multi-hop Question Answering (QA) often involves ambiguity that is inseparable from the reasoning process itself. This ambiguity creates a distinct challenge, where multiple reasoning paths emerge from a single question, each requiring independent resolution. Since each sub-question is ambiguous, the model must resolve ambiguity at every step. Thus, answering a single question requires handling multiple layers of ambiguity throughout the reasoning chain. We find that current Large Language Models (LLMs) struggle in this setting, typically exploring wrong reasoning paths and producing incomplete answers. To facilitate research on multi-hop ambiguity, we introduce MultI-hop Reasoning with AmbiGuity Evaluation for Illusory Questions (MIRAGE), a benchmark designed to analyze and evaluate this challenging intersection of ambiguity interpretation and multi-hop reasoning. MIRAGE contains 1,142 high-quality examples of ambiguous multi-hop questions, categorized under a taxonomy of syntactic, general, and semantic ambiguity, and curated through a rigorous multi-LLM verification pipeline. Our experiments reveal that even state-of-the-art models struggle on MIRAGE, confirming that resolving ambiguity combined with multi-step inference is a distinct and significant challenge. To establish a robust baseline, we propose CLarifying Ambiguity with a Reasoning and InstructiON (CLARION), a multi-agent framework that significantly outperforms existing approaches on MIRAGE, paving the way for more adaptive and robust reasoning systems.
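The combinatorial effect the abstract describes can be illustrated with a toy sketch (the question, interpretations, and hop structure below are hypothetical examples, not drawn from MIRAGE): when every hop of a question admits several readings, the candidate reasoning paths grow as the Cartesian product of the per-hop interpretations, so a model must resolve ambiguity at each step to avoid exploring wrong paths.

```python
from itertools import product

# Hypothetical two-hop question whose entity mention "Paris" is ambiguous;
# each hop lists the interpretations a model would need to choose between.
hops = [
    ["Paris (city)", "Paris (mythological figure)"],  # hop 1: resolve the entity
    ["capital of France", "son of Priam"],            # hop 2: resolve the relation
]

def candidate_paths(interpretations_per_hop):
    """Enumerate every reasoning path implied by per-hop ambiguity."""
    return [list(path) for path in product(*interpretations_per_hop)]

paths = candidate_paths(hops)
print(len(paths))  # 2 x 2 = 4 candidate reasoning paths
```

The point of the sketch is only the growth rate: with k interpretations at each of n hops, a model faces k^n candidate chains, which is why layered ambiguity is harder than either ambiguity resolution or multi-hop reasoning alone.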
Related papers
- Beware of Reasoning Overconfidence: Pitfalls in the Reasoning Process for Multi-solution Tasks [54.31998314008198]
Large Language Models (LLMs) excel in reasoning tasks requiring a single correct answer, but they perform poorly in multi-solution tasks. We attribute this limitation to reasoning overconfidence: a tendency to express undue certainty in an incomplete solution set. We propose the cognitive-rigidity hypothesis, which posits that overconfidence arises when the reasoning process prematurely converges on a narrow set of thought paths.
arXiv Detail & Related papers (2025-12-01T14:35:06Z)
- More Bias, Less Bias: BiasPrompting for Enhanced Multiple-Choice Question Answering [53.09478307383865]
We introduce BiasPrompting, a novel inference framework for large language models (LLMs). It guides LLMs to generate and critically evaluate reasoning across all plausible answer options before reaching a final prediction. It demonstrates significant improvements on five widely used multiple-choice question answering benchmarks.
arXiv Detail & Related papers (2025-11-25T09:01:08Z)
- DTKG: Dual-Track Knowledge Graph-Verified Reasoning Framework for Multi-Hop QA [8.598540768292809]
Multi-hop reasoning for question answering plays a critical role in retrieval-augmented generation. We propose DTKG, a novel dual-track KG verification and reasoning framework.
arXiv Detail & Related papers (2025-10-18T02:19:11Z)
- MMReason: An Open-Ended Multi-Modal Multi-Step Reasoning Benchmark for MLLMs Toward AGI [59.196131618912005]
Reasoning plays a crucial role in advancing Multimodal Large Language Models (MLLMs). Existing MLLM benchmarks often fall short in precisely and comprehensively evaluating long-chain reasoning abilities. We introduce MMReason, a new benchmark designed to precisely and comprehensively evaluate MLLM long-chain reasoning capability.
arXiv Detail & Related papers (2025-06-30T07:14:38Z)
- Beyond the Last Answer: Your Reasoning Trace Uncovers More than You Think [51.0691253204425]
We analyze intermediate reasoning steps, termed subthoughts, to ask whether the final answer reliably represents the model's optimal conclusion. Our approach segments a reasoning trace into sequential subthoughts based on linguistic cues. We find that aggregating the answers from these subthoughts by selecting the most frequent one (the mode) often yields significantly higher accuracy than relying solely on the answer derived from the original complete trace.
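The segment-then-vote idea described above can be sketched in a few lines. This is an illustrative approximation, not the paper's actual pipeline: the cue words in `segment_trace`, the helper names, and the toy answers are all assumptions.

```python
import re
from collections import Counter

def segment_trace(trace):
    """Split a reasoning trace into subthoughts on simple linguistic cues
    (illustrative cue list; a real system would use a learned or richer set)."""
    parts = re.split(r"\b(?:Wait|Alternatively|So),?\s*", trace)
    return [p.strip() for p in parts if p.strip()]

def aggregate_by_mode(answers):
    """Return the most frequent answer extracted across subthought completions."""
    return Counter(answers).most_common(1)[0][0]

trace = "x must be 2. Wait, maybe it is 3. Alternatively, checking again, 2."
subthoughts = segment_trace(trace)          # 3 subthoughts for this toy trace
answers = ["2", "3", "2"]                   # hypothetical answer per subthought
print(aggregate_by_mode(answers))           # prints 2
```

The design choice mirrors self-consistency voting: even if the final segment of a trace ends on a wrong answer, the mode over all segments can recover the answer the model reached most often.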
arXiv Detail & Related papers (2025-04-29T12:39:07Z)
- Why Reasoning Matters? A Survey of Advancements in Multimodal Reasoning (v1) [66.51642638034822]
Reasoning is central to human intelligence, enabling structured problem-solving across diverse tasks. Recent advances in large language models (LLMs) have greatly enhanced their reasoning abilities in arithmetic, commonsense, and symbolic domains. This paper offers a concise yet insightful overview of reasoning techniques in both textual and multimodal LLMs.
arXiv Detail & Related papers (2025-04-04T04:04:56Z)
- Empowering LLMs with Logical Reasoning: A Comprehensive Survey [49.91445266392609]
Large language models (LLMs) have achieved remarkable successes on various tasks. Recent studies have found, however, that the logical reasoning abilities of LLMs still face significant challenges.
arXiv Detail & Related papers (2025-02-21T18:20:35Z)
- An Entailment Tree Generation Approach for Multimodal Multi-Hop Question Answering with Mixture-of-Experts and Iterative Feedback Mechanism [14.479060028732803]
We argue that current methods of multi-modal multi-hop question answering still mainly face two challenges. First, retrieved evidence containing a large amount of redundant information leads to a significant drop in performance. Second, a reasoning process without interpretable steps makes it difficult for the model to discover the logical errors that arise when handling complex questions.
arXiv Detail & Related papers (2024-12-08T05:47:55Z)
- FLARE: Faithful Logic-Aided Reasoning and Exploration [47.46564769245296]
We introduce a novel approach for traversing the problem space using task decompositions. We use Large Language Models to plan a solution and to soft-formalise the query into facts and predicates using logic programming code. Our method allows us to compute the faithfulness of the reasoning process w.r.t. the generated code and to analyse the steps of the multi-hop search without relying on external solvers.
arXiv Detail & Related papers (2024-10-14T19:39:11Z)
- BeamAggR: Beam Aggregation Reasoning over Multi-source Knowledge for Multi-hop Question Answering [29.442468366125986]
We propose BeamAggR, a reasoning framework for knowledge-intensive multi-hop QA.
We parse complex questions into trees, which include atom and composite questions, followed by bottom-up reasoning.
For atomic questions, the LLM conducts reasoning on multi-source knowledge to get answer candidates.
For composite questions, the LLM combines beam candidates, explores multiple reasoning paths through probabilistic aggregation, and prioritizes the most promising trajectory.
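The combine-then-prune step for composite questions can be illustrated with a minimal sketch of beam-style probabilistic aggregation. This is written in the spirit of the description above, not from BeamAggR's actual code; the sub-questions, candidate answers, and scores are hypothetical.

```python
import heapq
from itertools import product

def aggregate_beams(candidate_sets, k=2):
    """Combine scored answer candidates from each sub-question into composite
    trajectories, multiply their scores, and keep only the k best (the beam)."""
    trajectories = []
    for combo in product(*candidate_sets):
        answers = tuple(answer for answer, _ in combo)
        score = 1.0
        for _, prob in combo:
            score *= prob  # probabilistic aggregation: product of candidate scores
        trajectories.append((score, answers))
    return heapq.nlargest(k, trajectories)

# Hypothetical atomic sub-questions with scored answer candidates.
sub_q1 = [("Einstein", 0.7), ("Bohr", 0.3)]   # "Which physicist ...?"
sub_q2 = [("1921", 0.6), ("1922", 0.4)]       # "In which year ...?"
print(aggregate_beams([sub_q1, sub_q2], k=2))  # top trajectory: Einstein / 1921
```

Keeping a beam rather than a single best candidate at each composite node is the key design choice: a locally second-best answer can still belong to the globally most promising trajectory.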
arXiv Detail & Related papers (2024-06-28T10:53:48Z)
- Open-ended Commonsense Reasoning with Unrestricted Answer Scope [47.14397700770702]
Open-ended Commonsense Reasoning is defined as solving a commonsense question without providing either 1) a short list of answer candidates or 2) a pre-defined answer scope.
In this work, we leverage pre-trained language models to iteratively retrieve reasoning paths on the external knowledge base.
The reasoning paths can help to identify the most precise answer to the commonsense question.
arXiv Detail & Related papers (2023-10-18T02:45:54Z)
- Faithful Reasoning Using Large Language Models [12.132449274592668]
We show how LMs can be made to perform faithful multi-step reasoning via a process whose causal structure mirrors the underlying logical structure of the problem.
Our approach works by chaining together reasoning steps, where each step results from calls to two fine-tuned LMs.
We demonstrate the effectiveness of our model on multi-step logical deduction and scientific question-answering, showing that it outperforms baselines on final answer accuracy.
arXiv Detail & Related papers (2022-08-30T13:44:41Z)
- Locate Then Ask: Interpretable Stepwise Reasoning for Multi-hop Question Answering [71.49131159045811]
Multi-hop reasoning requires aggregating multiple documents to answer a complex question.
Existing methods usually decompose the multi-hop question into simpler single-hop questions.
We propose an interpretable stepwise reasoning framework to incorporate both single-hop supporting sentence identification and single-hop question generation.
arXiv Detail & Related papers (2022-08-22T13:24:25Z)
- GMH: A General Multi-hop Reasoning Model for KG Completion [37.01406934111068]
Current models typically perform short-distance reasoning.
Long-distance reasoning is also vital, as it can connect superficially unrelated entities.
We propose a general model that resolves these issues with three modules.
arXiv Detail & Related papers (2020-10-15T09:30:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.