Hopping Too Late: Exploring the Limitations of Large Language Models on Multi-Hop Queries
- URL: http://arxiv.org/abs/2406.12775v2
- Date: Mon, 14 Oct 2024 09:55:12 GMT
- Title: Hopping Too Late: Exploring the Limitations of Large Language Models on Multi-Hop Queries
- Authors: Eden Biran, Daniela Gottesman, Sohee Yang, Mor Geva, Amir Globerson
- Abstract summary: We study how large language models (LLMs) solve complex multi-step problems.
Understanding how the latent step is computed internally is key to understanding the overall computation.
We propose a novel "back-patching" analysis method whereby a hidden representation from a later layer is patched back to an earlier layer.
- Score: 39.438904598467154
- Abstract: Large language models (LLMs) can solve complex multi-step problems, but little is known about how these computations are implemented internally. Motivated by this, we study how LLMs answer multi-hop queries such as "The spouse of the performer of Imagine is". These queries require two information extraction steps: a latent one for resolving the first hop ("the performer of Imagine") into the bridge entity (John Lennon), and another for resolving the second hop ("the spouse of John Lennon") into the target entity (Yoko Ono). Understanding how the latent step is computed internally is key to understanding the overall computation. By carefully analyzing the internal computations of transformer-based LLMs, we discover that the bridge entity is resolved in the early layers of the model. Then, only after this resolution, the two-hop query is solved in the later layers. Because the second hop commences in later layers, there could be cases where these layers no longer encode the necessary knowledge for correctly predicting the answer. Motivated by this, we propose a novel "back-patching" analysis method whereby a hidden representation from a later layer is patched back to an earlier layer. We find that in up to 66% of previously incorrect cases there exists a back-patch that results in the correct generation of the answer, showing that the later layers indeed sometimes lack the needed functionality. Overall, our methods and findings open further opportunities for understanding and improving latent reasoning in transformer-based LLMs.
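The core of the proposed analysis is straightforward to prototype with forward hooks: capture the last-token hidden state at a later layer, substitute it into an earlier layer on a second forward pass, and read off the new top prediction. The sketch below assumes a GPT-2-style Hugging Face model; the layer pair and the query are illustrative choices, not the paper's exact models or search procedure.

```python
# Minimal sketch of "back-patching": record the hidden state of the last
# token at a later layer, then re-run the model with that vector patched
# into an earlier layer. Assumes a GPT-2-style model from Hugging Face
# transformers; illustrates the general idea, not the paper's exact setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

query = 'The spouse of the performer of "Imagine" is'
inputs = tok(query, return_tensors="pt")

src_layer, dst_layer = 9, 3  # illustrative: patch layer 9 back into layer 3
captured = {}

def capture_hook(module, args, output):
    # output[0] is the block's hidden states: (batch, seq, hidden)
    captured["h"] = output[0][:, -1, :].detach().clone()

def patch_hook(module, args, output):
    hidden = output[0].clone()
    hidden[:, -1, :] = captured["h"]  # overwrite last-token representation
    return (hidden,) + output[1:]

# Pass 1: record the later layer's last-token hidden state.
h1 = model.transformer.h[src_layer].register_forward_hook(capture_hook)
with torch.no_grad():
    model(**inputs)
h1.remove()

# Pass 2: patch it back into the earlier layer and read the prediction.
h2 = model.transformer.h[dst_layer].register_forward_hook(patch_hook)
with torch.no_grad():
    logits = model(**inputs).logits
h2.remove()

print(tok.decode(logits[0, -1].argmax().item()))
```

In the paper's terms, a back-patch "succeeds" when the patched run ranks the correct target entity first while the unpatched run does not; sweeping (src_layer, dst_layer) pairs is what yields the reported up-to-66% recovery rate.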
Related papers
- GenSco: Can Question Decomposition based Passage Alignment improve Question Answering? [1.5776201492893507]
"GenSco" is a novel approach of selecting passages based on the predicted decomposition of the multi-hop questions.
We evaluate on three broadly established multi-hop question answering datasets.
arXiv Detail & Related papers (2024-07-14T15:25:08Z)
- Generate-then-Ground in Retrieval-Augmented Generation for Multi-hop Question Answering [45.82437926569949]
Multi-Hop Question Answering tasks present a significant challenge for large language models.
We introduce a novel generate-then-ground (GenGround) framework to solve a multi-hop question.
arXiv Detail & Related papers (2024-06-21T06:26:38Z)
- Reasoning on Efficient Knowledge Paths: Knowledge Graph Guides Large Language Model for Domain Question Answering [18.94220625114711]
Large language models (LLMs) perform surprisingly well and outperform human experts on many tasks.
This paper integrates and optimizes a pipeline for selecting reasoning paths from a knowledge graph (KG) based on an LLM.
We also propose a simple and effective subgraph retrieval method based on chain of thought (CoT) and PageRank; a generic sketch of the PageRank idea appears after this list.
arXiv Detail & Related papers (2024-04-16T08:28:16Z)
- Look Before You Leap: A Universal Emergent Decomposition of Retrieval Tasks in Language Models [58.57279229066477]
We study how language models (LMs) solve retrieval tasks in diverse situations.
We introduce ORION, a collection of structured retrieval tasks spanning six domains.
We find that LMs internally decompose retrieval tasks in a modular way.
arXiv Detail & Related papers (2023-12-13T18:36:43Z)
- Is Bigger and Deeper Always Better? Probing LLaMA Across Scales and Layers [73.28459749681879]
This paper focuses on LLaMA, a prominent open-source foundational model in natural language processing.
Instead of assessing LLaMA through its generative output, we design multiple-choice tasks to probe its intrinsic understanding.
We uncover several key and uncommon findings through these probing tasks.
arXiv Detail & Related papers (2023-12-07T14:50:41Z)
- Towards a Mechanistic Interpretation of Multi-Step Reasoning Capabilities of Language Models [107.07851578154242]
Language models (LMs) have strong multi-step (i.e., procedural) reasoning capabilities.
It is unclear whether LMs answer such tasks by recalling answers memorized from the pretraining corpus or via a genuine multi-step reasoning mechanism.
We show that the proposed MechanisticProbe can recover the reasoning tree from the model's attention patterns for most examples.
arXiv Detail & Related papers (2023-10-23T01:47:29Z)
- LaGR-SEQ: Language-Guided Reinforcement Learning with Sample-Efficient Querying [71.86163159193327]
Large language models (LLMs) have recently demonstrated their impressive ability to provide context-aware responses via text.
This ability could potentially be used to predict plausible solutions in sequential decision making tasks pertaining to pattern completion.
We introduce LaGR, which uses this predictive ability of LLMs to propose solutions to tasks that have been partially completed by a primary reinforcement learning (RL) agent.
arXiv Detail & Related papers (2023-08-21T02:07:35Z)
- Modeling Multi-hop Question Answering as Single Sequence Prediction [88.72621430714985]
We propose a simple generative approach (PathFid) that extends the task beyond just answer generation.
PathFid explicitly models the reasoning process to resolve the answer for multi-hop questions.
Our experiments demonstrate that PathFid leads to strong performance gains on two multi-hop QA datasets.
arXiv Detail & Related papers (2022-05-18T21:57:59Z)
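As a rough illustration of the PageRank-based subgraph retrieval idea from the "Reasoning on Efficient Knowledge Paths" entry above, the sketch below seeds personalized PageRank at the question entities of a toy knowledge graph and keeps the top-scoring nodes. It is a generic sketch with hypothetical data, not that paper's implementation.

```python
# Generic sketch of PageRank-based subgraph retrieval over a knowledge graph:
# seed personalized PageRank at the entities mentioned in the question and
# keep the highest-scoring nodes as the retrieved subgraph. Hypothetical
# toy graph; not the implementation from the paper cited above.
import networkx as nx

# Toy knowledge graph (edges carry relation labels).
kg = nx.DiGraph()
kg.add_edge("Imagine", "John Lennon", relation="performer")
kg.add_edge("John Lennon", "Yoko Ono", relation="spouse")
kg.add_edge("John Lennon", "The Beatles", relation="member_of")
kg.add_edge("The Beatles", "Liverpool", relation="origin")

def retrieve_subgraph(graph, question_entities, top_k=3):
    # Personalized PageRank: restart probability mass only on seed entities.
    personalization = {n: (1.0 if n in question_entities else 0.0) for n in graph}
    scores = nx.pagerank(graph, alpha=0.85, personalization=personalization)
    top_nodes = sorted(scores, key=scores.get, reverse=True)[:top_k]
    return graph.subgraph(top_nodes)

sub = retrieve_subgraph(kg, {"Imagine"})
for u, v, data in sub.edges(data=True):
    print(f"{u} --{data['relation']}--> {v}")
```

The retrieved subgraph (here, the path from "Imagine" through "John Lennon" to "Yoko Ono") is what would then be handed to the LLM as a compact reasoning context.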
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.