Do Large Language Models Latently Perform Multi-Hop Reasoning?
- URL: http://arxiv.org/abs/2402.16837v1
- Date: Mon, 26 Feb 2024 18:57:54 GMT
- Title: Do Large Language Models Latently Perform Multi-Hop Reasoning?
- Authors: Sohee Yang, Elena Gribovskaya, Nora Kassner, Mor Geva, Sebastian Riedel
- Abstract summary: We study whether Large Language Models (LLMs) latently perform multi-hop reasoning with complex prompts such as "The mother of the singer of 'Superstition' is".
We find strong evidence of latent multi-hop reasoning for the prompts of certain relation types, with the reasoning pathway used in more than 80% of the prompts.
- Score: 33.41309859079347
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study whether Large Language Models (LLMs) latently perform multi-hop
reasoning with complex prompts such as "The mother of the singer of
'Superstition' is". We look for evidence of a latent reasoning pathway where an
LLM (1) latently identifies "the singer of 'Superstition'" as Stevie Wonder,
the bridge entity, and (2) uses its knowledge of Stevie Wonder's mother to
complete the prompt. We analyze these two hops individually and consider their
co-occurrence as indicative of latent multi-hop reasoning. For the first hop,
we test if changing the prompt to indirectly mention the bridge entity instead
of any other entity increases the LLM's internal recall of the bridge entity.
For the second hop, we test if increasing this recall causes the LLM to better
utilize what it knows about the bridge entity. We find strong evidence of
latent multi-hop reasoning for the prompts of certain relation types, with the
reasoning pathway used in more than 80% of the prompts. However, the
utilization is highly contextual, varying across different types of prompts.
Also, on average, the evidence for the second hop and the full multi-hop
traversal is rather moderate and only substantial for the first hop. Moreover,
we find a clear scaling trend with increasing model size for the first hop of
reasoning but not for the second hop. Our experimental findings suggest
potential challenges and opportunities for future development and applications
of LLMs.
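A minimal sketch of the first-hop test described above, assuming a logit-lens-style readout: project an intermediate hidden state through the model's unembedding and check how much probability lands on the bridge entity's first token. The model (gpt2 as a small stand-in), the layer index, and the control prompt are illustrative assumptions, not the paper's exact entity recall measurement.

```python
# Minimal first-hop probe (illustrative sketch; not the paper's exact method).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # small stand-in; the paper studies much larger LLMs
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def bridge_recall(prompt: str, bridge_first_word: str, layer: int = 6) -> float:
    """Logit-lens readout: project the hidden state at the last prompt
    position through the final layer norm and unembedding, and return the
    probability assigned to the bridge entity's first token."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    hidden = out.hidden_states[layer][0, -1]                 # (hidden_dim,)
    logits = model.lm_head(model.transformer.ln_f(hidden))   # (vocab_size,)
    probs = torch.softmax(logits, dim=-1)
    bridge_id = tokenizer.encode(" " + bridge_first_word)[0]
    return probs[bridge_id].item()

# Does indirectly mentioning the bridge entity ("the singer of
# 'Superstition'" -> Stevie Wonder) raise its internal recall relative to
# a prompt that points at a different entity?
desc = bridge_recall("The mother of the singer of 'Superstition' is", "Stevie")
ctrl = bridge_recall("The mother of the singer of 'Thriller' is", "Stevie")
print(f"bridge recall: descriptive={desc:.4f} control={ctrl:.4f}")
```

The second hop would then be tested by intervening to increase this recall (for instance, patching the hidden state toward the bridge entity's representation) and checking whether the model better uses what it knows about the bridge entity to complete the prompt.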
Related papers
- Layer-Order Inversion: Rethinking Latent Multi-Hop Reasoning in Large Language Models [26.603700269575025]
We show that bridge entities are computed sequentially across layers before later-hop answers.
We propose a framework that models multi-hop reasoning as broad recall in shallow layers followed by selective extraction in deeper attention layers.
arXiv Detail & Related papers (2026-01-07T03:13:03Z)
- How Do LLMs Perform Two-Hop Reasoning in Context? [76.79936191530784]
Two-hop reasoning refers to the process of inferring a conclusion by making two logical steps.
Despite recent progress in large language models (LLMs), we surprisingly find that they can fail at solving simple two-hop reasoning problems.
We train a 3-layer Transformer from scratch on a synthetic two-hop reasoning task and reverse-engineer its internal information flow.
arXiv Detail & Related papers (2025-02-19T17:46:30Z)
- The Two-Hop Curse: LLMs trained on A->B, B->C fail to learn A-->C [1.8177391253202122]
This paper introduces a controlled setting for investigating two-hop reasoning in LLMs.
We find that models can perform latent reasoning when facts appear together during training or in the prompt, but fail to compose facts that are learned only separately.
We call this complete failure to compose separately learned facts the Two-Hop Curse.
arXiv Detail & Related papers (2024-11-25T13:04:28Z)
- Towards Interpreting Language Models: A Case Study in Multi-Hop Reasoning [0.0]
Language models (LMs) struggle to perform multi-hop reasoning consistently.
We propose an approach to pinpoint and rectify multi-hop reasoning failures through targeted memory injections on LM attention heads.
arXiv Detail & Related papers (2024-11-06T16:30:26Z)
- Seemingly Plausible Distractors in Multi-Hop Reasoning: Are Large Language Models Attentive Readers? [6.525065859315515]
We investigate whether Large Language Models (LLMs) are prone to exploiting simplifying cues in multi-hop reasoning benchmarks.
Motivated by this finding, we propose a challenging multi-hop reasoning benchmark by generating seemingly plausible multi-hop reasoning chains.
We find that their ability to perform multi-hop reasoning is affected, as indicated by a relative decrease of up to 45% in F1 score when they are presented with such seemingly plausible alternatives.
arXiv Detail & Related papers (2024-09-08T19:22:58Z)
- LogicBench: Towards Systematic Evaluation of Logical Reasoning Ability of Large Language Models [52.03659714625452]
Recently developed large language models (LLMs) have been shown to perform remarkably well on a wide range of language understanding tasks.
But can they really "reason" over natural language?
This question has been receiving significant research attention, and many reasoning skills, such as commonsense, numerical, and qualitative reasoning, have been studied.
arXiv Detail & Related papers (2024-04-23T21:08:49Z)
- Direct Evaluation of Chain-of-Thought in Multi-hop Reasoning with Knowledge Graphs [52.42505579545893]
Large language models (LLMs) demonstrate strong reasoning abilities when prompted to generate chain-of-thought explanations alongside answers.
We propose a novel discriminative and generative CoT evaluation paradigm to assess LLMs' knowledge of reasoning and the accuracy of the generated CoT.
arXiv Detail & Related papers (2024-02-17T05:22:56Z)
- MultiHop-RAG: Benchmarking Retrieval-Augmented Generation for Multi-Hop Queries [22.4349439498591]
Retrieval-augmented generation (RAG) augments large language models (LLMs) by retrieving relevant knowledge.
Existing RAG systems are inadequate in answering multi-hop queries, which require retrieving and reasoning over multiple pieces of supporting evidence.
We develop a novel dataset, MultiHop-RAG, which consists of a knowledge base, a large collection of multi-hop queries, their ground-truth answers, and the associated supporting evidence.
arXiv Detail & Related papers (2024-01-27T11:41:48Z)
- Deceptive Semantic Shortcuts on Reasoning Chains: How Far Can Models Go without Hallucination? [73.454943870226]
This work studies a specific type of hallucination induced by semantic associations.
To quantify this phenomenon, we propose a novel probing method and benchmark called EureQA.
arXiv Detail & Related papers (2023-11-16T09:27:36Z)
- Towards a Mechanistic Interpretation of Multi-Step Reasoning Capabilities of Language Models [107.07851578154242]
Language models (LMs) have strong multi-step (i.e., procedural) reasoning capabilities.
It is unclear whether LMs perform these tasks by cheating with answers memorized from the pretraining corpus or via a multi-step reasoning mechanism.
We show that our probing method, MechanisticProbe, is able to recover the reasoning tree from the model's attentions for most examples.
arXiv Detail & Related papers (2023-10-23T01:47:29Z)
- Locate Then Ask: Interpretable Stepwise Reasoning for Multi-hop Question Answering [71.49131159045811]
Multi-hop reasoning requires aggregating multiple documents to answer a complex question.
Existing methods usually decompose the multi-hop question into simpler single-hop questions.
We propose an interpretable stepwise reasoning framework to incorporate both single-hop supporting sentence identification and single-hop question generation.
arXiv Detail & Related papers (2022-08-22T13:24:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.