Related papers: Lessons from Studying Two-Hop Latent Reasoning

URL: http://arxiv.org/abs/2411.16353v3
Date: Sat, 06 Sep 2025 13:57:19 GMT
Title: Lessons from Studying Two-Hop Latent Reasoning
Authors: Mikita Balesni, Tomek Korbak, Owain Evans
Abstract summary: We introduce a controlled setting for investigating two-hop reasoning in large language models. We test two-hop reasoning over synthetic facts. We observe a nuanced picture: Models fail to compose two synthetic facts, but can succeed when one fact is synthetic and the other is natural.
Score: 8.154468580021792
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large language models can use chain-of-thought (CoT) to externalize reasoning, potentially enabling oversight of capable LLM agents. Prior work has shown that models struggle at two-hop question-answering without CoT. This capability is so basic that if it was a fundamental limitation, it would imply that many complex agentic tasks would similarly require CoT. We investigate LLM latent reasoning capabilities using two-hop question answering as a case study. Previous work on the gap between latent and externalized two-hop reasoning produced mixed evidence with inconclusive results. In this paper, we introduce a controlled setting for investigating two-hop reasoning in LLMs, where a positive result provides definitive evidence for latent reasoning. We fine-tune LLMs (including Llama 3 8B and GPT-4o) on synthetic facts and test two-hop reasoning over these facts. By using synthetic facts, we rule out memorization and reasoning shortcuts as explanations for two-hop performance. We observe a nuanced picture: Models fail to compose two synthetic facts, but can succeed when one fact is synthetic and the other is natural. These results demonstrate that LLMs are undeniably capable of latent two-hop reasoning, although it remains unclear how this ability scales with model size. Finally, we highlight a lesson for researchers studying LLM reasoning: when drawing conclusions about LLM latent reasoning, one must be careful to avoid both spurious successes (that stem from memorization and reasoning shortcuts) and spurious failures (that may stem from artificial experimental setups, divorced from training setups of frontier LLMs).
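The synthetic-fact setup described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the entity names, the "spouse" and "born in" relation templates, and the helper names build_two_hop_data and to_training_sentences are hypothetical and are not taken from the paper's data or code. The sketch only shows the general idea of building fictional fact chains whose two-hop answer never appears in a single training sentence.

```python
import random


def build_two_hop_data(n_chains, seed=0):
    """Build n_chains fictional chains: subject -> bridge -> answer.

    Hypothetical illustration only; entity names are made up so that
    no fact could have been memorized from pretraining data.
    """
    rng = random.Random(seed)
    subjects = [f"Subject{i}" for i in range(n_chains)]
    bridges = [f"Bridge{i}" for i in range(n_chains)]
    answers = [f"Answer{i}" for i in range(n_chains)]
    # Shuffle so that indices carry no information about the pairing.
    rng.shuffle(bridges)
    rng.shuffle(answers)
    first_hop = dict(zip(subjects, bridges))   # subject -> bridge entity
    second_hop = dict(zip(bridges, answers))   # bridge entity -> answer
    # A two-hop question asks for the answer given only the subject.
    questions = [(s, second_hop[first_hop[s]]) for s in subjects]
    return first_hop, second_hop, questions


def to_training_sentences(first_hop, second_hop):
    """Render each fact as its own sentence (assumed templates)."""
    sentences = [f"The spouse of {s} is {b}." for s, b in first_hop.items()]
    sentences += [f"{b} was born in {a}." for b, a in second_hop.items()]
    return sentences


first_hop, second_hop, questions = build_two_hop_data(3)
train = to_training_sentences(first_hop, second_hop)
```

Because each training sentence states exactly one fact, a model that answers a question such as "Where was the spouse of Subject0 born?" without chain-of-thought would have to compose the two facts internally, which is the kind of evidence the abstract refers to.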

Related papers

Reason-KE++: Aligning the Process, Not Just the Outcome, for Faithful LLM Knowledge Editing [63.96040994220329]
We find that SFT-based methods, e.g., Reason-KE, suffer from a "faithfulness gap". This gap enables the LLM's powerful parametric priors to override new contextual facts. We propose Reason-KE++, an SFT+RL framework that instills process-level faithfulness.
arXiv Detail & Related papers (2025-11-16T15:49:01Z)
Unveiling Causal Reasoning in Large Language Models: Reality or Mirage? [62.17959154852391]
Causal reasoning capability is critical in advancing large language models toward strong artificial intelligence. We show that large language models (LLMs) are only capable of performing shallow (level-1) causal reasoning. We propose G2-Reasoner, a method that incorporates general knowledge and goal-oriented prompts into LLMs' causal reasoning processes.
arXiv Detail & Related papers (2025-06-26T13:11:01Z)
Answer-Centric or Reasoning-Driven? Uncovering the Latent Memory Anchor in LLMs [28.556628696390767]
Large Language Models (LLMs) demonstrate impressive reasoning capabilities. Evidence suggests much of their success stems from memorized answer-reasoning patterns rather than genuine inference. We propose a five-level answer-visibility prompt framework that systematically manipulates answer cues and probes model behavior through indirect, behavioral analysis.
arXiv Detail & Related papers (2025-06-21T08:15:45Z)
Have Large Language Models Learned to Reason? A Characterization via 3-SAT Phase Transition [11.422434149376478]
Large Language Models (LLMs) have been touted as AI models possessing advanced reasoning abilities. In theory, autoregressive LLMs with Chain-of-Thought (CoT) can perform more serial computations to solve complex reasoning tasks. Recent studies suggest that, despite this capacity, LLMs do not truly learn to reason but instead fit on statistical features.
arXiv Detail & Related papers (2025-04-04T20:57:36Z)
How Do LLMs Perform Two-Hop Reasoning in Context? [76.79936191530784]
Two-hop reasoning refers to the process of inferring a conclusion by making two logical steps. Despite recent progress in large language models (LLMs), we surprisingly find that they can fail at solving simple two-hop reasoning problems. We train a 3-layer Transformer from scratch on a synthetic two-hop reasoning task and reverse-engineer its internal information flow.
arXiv Detail & Related papers (2025-02-19T17:46:30Z)
LLMs can implicitly learn from mistakes in-context [15.818061010632249]
We investigate whether Large Language Models (LLMs) can learn from mistakes in mathematical reasoning tasks when explanations are not provided. Surprisingly, we find that LLMs perform better, on average, when rationales are eliminated from the context. This approach also substantially outperforms chain-of-thought prompting in our evaluations.
arXiv Detail & Related papers (2025-02-12T16:31:21Z)
LLMs Can Easily Learn to Reason from Demonstrations Structure, not content, is what matters! [53.84130385074551]
Large reasoning models (LRMs) tackle complex reasoning problems by following long chains of thought (Long CoT). We find that a large language model (LLM) can effectively learn Long CoT reasoning through data-efficient supervised fine-tuning (SFT) and parameter-efficient low-rank adaptation (LoRA). With just 17k long CoT training samples, the Qwen2.5-32B-Instruct model achieves significant improvements on a wide range of math and coding benchmarks.
arXiv Detail & Related papers (2025-02-11T08:48:48Z)
Failure Modes of LLMs for Causal Reasoning on Narratives [51.19592551510628]
We investigate the causal reasoning abilities of large language models (LLMs) through the representative problem of inferring causal relationships from narratives. We find that even state-of-the-art language models rely on unreliable shortcuts, both in terms of the narrative presentation and their parametric knowledge.
arXiv Detail & Related papers (2024-10-31T12:48:58Z)
On Memorization of Large Language Models in Logical Reasoning [70.94164038947078]
Large language models (LLMs) achieve good performance on challenging reasoning benchmarks, yet could also make basic reasoning mistakes. One hypothesis is that the increasingly high and nearly saturated performance could be due to the memorization of similar problems. We show that fine-tuning leads to heavy memorization, but it also consistently improves generalization performance.
arXiv Detail & Related papers (2024-10-30T15:31:54Z)
Not All LLM Reasoners Are Created Equal [58.236453890457476]
We study the depth of grade-school math problem-solving capabilities of LLMs. We evaluate their performance on pairs of existing math word problems together.
arXiv Detail & Related papers (2024-10-02T17:01:10Z)
Seemingly Plausible Distractors in Multi-Hop Reasoning: Are Large Language Models Attentive Readers? [6.525065859315515]
We investigate whether Large Language Models (LLMs) are prone to exploiting simplifying cues in multi-hop reasoning benchmarks. Motivated by this finding, we propose a challenging multi-hop reasoning benchmark, by generating seemingly plausible multi-hop reasoning chains. We find that their multi-hop reasoning performance is affected, as indicated by up to 45% relative decrease in F1 score when presented with such seemingly plausible alternatives.
arXiv Detail & Related papers (2024-09-08T19:22:58Z)
Can Large Language Models Reason? A Characterization via 3-SAT [11.422434149376478]
Large Language Models (LLMs) have been touted as AI models possessing advanced reasoning abilities. Recent works have shown that LLMs often bypass true reasoning using shortcuts, sparking skepticism. We propose an experimental protocol centered on 3-SAT -- the NP-complete problem lying at the core of logical reasoning and constraint satisfaction tasks.
arXiv Detail & Related papers (2024-08-13T21:54:10Z)
WikiContradict: A Benchmark for Evaluating LLMs on Real-World Knowledge Conflicts from Wikipedia [59.96425443250666]
Retrieval-augmented generation (RAG) has emerged as a promising solution to mitigate the limitations of large language models (LLMs). In this work, we conduct a comprehensive evaluation of LLM-generated answers to questions based on contradictory passages from Wikipedia. We benchmark a diverse range of both closed and open-source LLMs under different QA scenarios, including RAG with a single passage, and RAG with 2 contradictory passages.
arXiv Detail & Related papers (2024-06-19T20:13:42Z)
A Comprehensive Evaluation on Event Reasoning of Large Language Models [68.28851233753856]
How well LLMs accomplish event reasoning on various relations and reasoning paradigms remains unknown. We introduce a novel benchmark EV2 for EValuation of EVent reasoning. We find that LLMs have abilities to accomplish event reasoning but their performances are far from satisfactory.
arXiv Detail & Related papers (2024-04-26T16:28:34Z)
LogicBench: Towards Systematic Evaluation of Logical Reasoning Ability of Large Language Models [52.03659714625452]
Recently developed large language models (LLMs) have been shown to perform remarkably well on a wide range of language understanding tasks. But, can they really "reason" over the natural language? This question has been receiving significant research attention and many reasoning skills such as commonsense, numerical, and qualitative have been studied.
arXiv Detail & Related papers (2024-04-23T21:08:49Z)
Can LLMs Learn from Previous Mistakes? Investigating LLMs' Errors to Boost for Reasoning [34.34977150518316]
CoTErrorSet, a new benchmark with 609,432 questions, each designed with both correct and error references. Self-rethinking prompting guides LLMs to rethink whether they have made similar previous mistakes. Mistake tuning involves finetuning models in both correct and incorrect reasoning domains.
arXiv Detail & Related papers (2024-03-29T08:30:34Z)
Do Large Language Models Latently Perform Multi-Hop Reasoning? [33.41309859079347]
We study whether Large Language Models (LLMs) latently perform multi-hop reasoning with complex prompts such as "The mother of the singer of 'Superstition' is". We find strong evidence of latent multi-hop reasoning for the prompts of certain relation types, with the reasoning pathway used in more than 80% of the prompts.
arXiv Detail & Related papers (2024-02-26T18:57:54Z)
GTBench: Uncovering the Strategic Reasoning Limitations of LLMs via Game-Theoretic Evaluations [87.99872683336395]
Large Language Models (LLMs) are integrated into critical real-world applications. This paper evaluates LLMs' reasoning abilities in competitive environments. We first propose GTBench, a language-driven environment composing 10 widely recognized tasks.
arXiv Detail & Related papers (2024-02-19T18:23:36Z)
Direct Evaluation of Chain-of-Thought in Multi-hop Reasoning with Knowledge Graphs [52.42505579545893]
Large language models (LLMs) demonstrate strong reasoning abilities when prompted to generate chain-of-thought explanations alongside answers. We propose a novel discriminative and generative CoT evaluation paradigm to assess LLMs' knowledge of reasoning and the accuracy of the generated CoT.
arXiv Detail & Related papers (2024-02-17T05:22:56Z)
Assessing Step-by-Step Reasoning against Lexical Negation: A Case Study on Syllogism [19.590120229602103]
Large language models (LLMs) take advantage of step-by-step reasoning instructions, e.g., chain-of-thought (CoT) prompting. In this study, we inspect the step-by-step reasoning ability of LLMs with a focus on negation.
arXiv Detail & Related papers (2023-10-23T12:40:41Z)
Towards a Mechanistic Interpretation of Multi-Step Reasoning Capabilities of Language Models [107.07851578154242]
Language models (LMs) have strong multi-step (i.e., procedural) reasoning capabilities. It is unclear whether LMs perform tasks by cheating with answers memorized from pretraining corpus, or, via a multi-step reasoning mechanism. We show that MechanisticProbe is able to detect the information of the reasoning tree from the model's attentions for most examples.
arXiv Detail & Related papers (2023-10-23T01:47:29Z)
Can ChatGPT Defend its Belief in Truth? Evaluating LLM Reasoning via Debate [19.887103433032774]
Large language models (LLMs) have shown impressive performance in complex reasoning tasks. This work explores testing LLMs' reasoning by engaging with them in a debate-like conversation. We find that despite their impressive performance, LLMs like ChatGPT cannot maintain their beliefs in truth for a significant portion of examples.
arXiv Detail & Related papers (2023-05-22T15:47:31Z)
Locate Then Ask: Interpretable Stepwise Reasoning for Multi-hop Question Answering [71.49131159045811]
Multi-hop reasoning requires aggregating multiple documents to answer a complex question. Existing methods usually decompose the multi-hop question into simpler single-hop questions. We propose an interpretable stepwise reasoning framework to incorporate both single-hop supporting sentence identification and single-hop question generation.
arXiv Detail & Related papers (2022-08-22T13:24:25Z)