Retrieval Heads are Dynamic
- URL: http://arxiv.org/abs/2602.11162v1
- Date: Wed, 07 Jan 2026 02:29:24 GMT
- Title: Retrieval Heads are Dynamic
- Authors: Yuping Lin, Zitao Li, Yue Xing, Pengfei He, Yingqian Cui, Yaliang Li, Bolin Ding, Jingren Zhou, Jiliang Tang
- Abstract summary: Recent studies have identified "retrieval heads" in Large Language Models (LLMs). In this paper, we investigate retrieval heads from a dynamic perspective.
- Score: 101.60087217027949
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Recent studies have identified "retrieval heads" in Large Language Models (LLMs) responsible for extracting information from input contexts. However, prior works largely rely on static statistics aggregated across datasets, identifying heads that perform retrieval on average. This perspective overlooks the fine-grained temporal dynamics of autoregressive generation. In this paper, we investigate retrieval heads from a dynamic perspective. Through extensive analysis, we establish three core claims: (1) Dynamism: Retrieval heads vary dynamically across timesteps; (2) Irreplaceability: Dynamic retrieval heads are specific to each timestep and cannot be effectively replaced by static retrieval heads; and (3) Correlation: The model's hidden state encodes a predictive signal for future retrieval head patterns, indicating an internal planning mechanism. We validate these findings on the Needle-in-a-Haystack task and a multi-hop QA task, and quantify the difference in utility between dynamic and static retrieval heads in a Dynamic Retrieval-Augmented Generation framework. Our study provides new insights into the internal mechanisms of LLMs.
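The dynamism claim can be illustrated with a short sketch: recompute the set of active retrieval heads at every decoding step and measure how much the set shifts between steps. This is a minimal illustration, assuming access to per-step attention maps and using the standard copy-alignment criterion (a head counts as retrieving when its top-attended context token is the one being copied); the tensor layout and the names `attentions` and `copy_positions` are assumptions, not the paper's released code.

```python
# Minimal sketch: per-timestep retrieval-head identification.
# Assumes `attentions[t]` is a [num_layers, num_heads, context_len] tensor of
# attention from the current query position over the input context at decode
# step t, and `copy_positions[t]` is the context index the model copies at
# step t (None when the generated token does not come from the context).
import torch

def retrieval_heads_per_step(attentions, copy_positions):
    """For each timestep, return the set of (layer, head) pairs whose
    top-attended context token is exactly the token being copied."""
    head_sets = []
    for attn, pos in zip(attentions, copy_positions):
        if pos is None:                      # nothing copied at this step
            head_sets.append(set())
            continue
        top = attn.argmax(dim=-1)            # [num_layers, num_heads]
        layers, heads = (top == pos).nonzero(as_tuple=True)
        head_sets.append(set(zip(layers.tolist(), heads.tolist())))
    return head_sets

def jaccard(a: set, b: set) -> float:
    """Overlap between two timesteps' retrieval-head sets."""
    return len(a & b) / max(len(a | b), 1)
```

Persistently low Jaccard overlap between the head sets of successive timesteps would be the kind of evidence the dynamism claim rests on.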
Related papers
- From Interpretability to Performance: Optimizing Retrieval Heads for Long-Context Language Models [19.62954865335739]
This work investigates whether retrieval heads can be leveraged to enhance the long-context capabilities of LLMs. We propose RetMask, a method that generates training signals by contrasting normal model outputs with those from an ablated variant in which the retrieval heads are masked. Experiments across three model families reveal that the effectiveness depends on retrieval head organization.
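A minimal sketch of the contrast described above, assuming per-head attention outputs are available before the output projection; the function names and the KL form of the signal are illustrative choices, not RetMask's actual implementation.

```python
# Hedged sketch: ablate selected heads and contrast the two output
# distributions. `ablate_heads` and `contrast_signal` are placeholders.
import torch
import torch.nn.functional as F

def ablate_heads(per_head_out: torch.Tensor, head_mask: torch.Tensor):
    """Zero out selected heads' contributions before the output projection.
    per_head_out: [batch, seq_len, num_heads, head_dim]
    head_mask:    [num_heads] bool, True = keep, False = ablate."""
    return per_head_out * head_mask.view(1, 1, -1, 1)

def contrast_signal(logits_full, logits_ablated):
    """KL(full || ablated) over next-token distributions: large values mark
    positions where the masked retrieval heads actually mattered."""
    p = F.log_softmax(logits_full, dim=-1)
    q = F.log_softmax(logits_ablated, dim=-1)
    return F.kl_div(q, p, log_target=True, reduction="none").sum(-1)
```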
arXiv Detail & Related papers (2026-01-16T06:31:08Z)
- Unifying Tree Search Algorithm and Reward Design for LLM Reasoning: A Survey [92.71325249013535]
Deliberative tree search is a cornerstone of Large Language Model (LLM) research. This paper introduces a unified framework that deconstructs search algorithms into three core components.
arXiv Detail & Related papers (2025-10-11T03:29:18Z)
- Learning Interpretable Hierarchical Dynamical Systems Models from Time Series Data [6.3128614613706295]
We introduce a hierarchical framework that makes it possible to harvest group-level (multi-domain) information while retaining single-domain characteristics. In addition to faithfully reconstructing all individual dynamical regimes, our unsupervised methodology discovers common low-dimensional feature spaces.
arXiv Detail & Related papers (2024-10-07T07:54:53Z)
- Retrieval Head Mechanistically Explains Long-Context Factuality [56.78951509492645]
We show that a special type of attention head, which we dub retrieval heads, is largely responsible for retrieving information.
We show that retrieval heads strongly influence chain-of-thought (CoT) reasoning, where the model needs to frequently refer back to the question and previously generated context.
We believe our insights will foster future research on reducing hallucination, improving reasoning, and compressing the KV cache.
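The detection recipe behind this line of work can be sketched compactly: a head's retrieval score is the fraction of copying steps on which its top attention lands exactly on the token being copied from context. The shapes below mirror the sketch under the main abstract and are assumptions, not the paper's code.

```python
# Hedged sketch: aggregate (static) retrieval score per attention head.
import torch

def retrieval_score(attentions, copy_positions):
    """attentions: list over decode steps of [num_layers, num_heads, ctx_len];
    copy_positions: context index copied at each step, or None.
    Returns a [num_layers, num_heads] score in [0, 1]."""
    hits = torch.zeros_like(attentions[0][..., 0])
    total = 0
    for attn, pos in zip(attentions, copy_positions):
        if pos is None:                          # token not copied: skip
            continue
        hits += (attn.argmax(dim=-1) == pos).float()
        total += 1
    return hits / max(total, 1)
```

Thresholding this score yields the static retrieval-head set that the main paper contrasts with its per-timestep sets.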
arXiv Detail & Related papers (2024-04-24T00:24:03Z)
- Exploring the Practicality of Generative Retrieval on Dynamic Corpora [41.223804434693875]
In this paper, we focus on Generative Retrieval (GR), which applies autoregressive language models to IR problems.
Our results on the StreamingQA benchmark demonstrate that GR is more adaptable to evolving knowledge (4-11%), robust in learning knowledge with temporal information, and efficient in terms of FLOPs (x6), indexing time (x6), and storage footprint (x4).
Our paper highlights the potential of GR for future use in practical IR systems within dynamic environments.
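Generative retrieval can be illustrated with a small scoring loop: an autoregressive LM assigns a likelihood to each candidate document identifier given the query, and the most likely identifier wins. A HuggingFace-style causal LM is assumed; the prompt format and exhaustive scoring over candidates are simplifications (real GR systems decode identifiers directly, often with constrained beam search).

```python
# Hedged sketch: rank document identifiers by sequence log-likelihood.
import torch

@torch.no_grad()
def rank_docids(model, tokenizer, query: str, candidate_docids: list[str]):
    scores = {}
    for docid in candidate_docids:
        text = f"query: {query} docid: {docid}"   # illustrative prompt format
        ids = tokenizer(text, return_tensors="pt").input_ids
        loss = model(ids, labels=ids).loss        # mean NLL, query tokens included
        scores[docid] = -loss.item()              # higher = more likely
    return sorted(scores, key=scores.get, reverse=True)
```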
arXiv Detail & Related papers (2023-05-27T16:05:00Z)
- PAD-Net: An Efficient Framework for Dynamic Networks [72.85480289152719]
Common practice in implementing dynamic networks is to convert the given static layers into fully dynamic ones.
We propose a partially dynamic network, namely PAD-Net, to transform the redundant dynamic parameters into static ones.
Our method is comprehensively supported by large-scale experiments with two typical advanced dynamic architectures.
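The "partially dynamic" idea can be sketched as a layer where only a slice of the output units uses input-conditioned weights while the rest stays static. The expert bank, softmax router, and fixed 25% dynamic ratio below are illustrative; PAD-Net itself learns which parameters to keep dynamic rather than fixing a slice up front.

```python
# Hedged sketch: a linear layer that is only partially dynamic.
import torch
import torch.nn as nn

class PartiallyDynamicLinear(nn.Module):
    def __init__(self, d_in, d_out, n_experts=4, dynamic_ratio=0.25):
        super().__init__()
        self.d_dyn = int(d_out * dynamic_ratio)
        self.static = nn.Linear(d_in, d_out - self.d_dyn)      # static slice
        # Expert weight bank for the dynamic slice, mixed per input.
        self.experts = nn.Parameter(torch.randn(n_experts, self.d_dyn, d_in) * 0.02)
        self.router = nn.Linear(d_in, n_experts)

    def forward(self, x):                                      # x: [batch, d_in]
        coef = torch.softmax(self.router(x), dim=-1)           # [batch, n_experts]
        w = torch.einsum("be,eoi->boi", coef, self.experts)    # per-sample weights
        dyn = torch.einsum("boi,bi->bo", w, x)                 # dynamic outputs
        return torch.cat([self.static(x), dyn], dim=-1)        # [batch, d_out]
```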
arXiv Detail & Related papers (2022-11-10T12:42:43Z)
- Quantifying and Learning Static vs. Dynamic Information in Deep Spatiotemporal Networks [29.47784194895489]
Action recognition, automatic video object segmentation (AVOS) and video instance segmentation (VIS) are studied.
Most examined models are biased toward static information.
Some datasets that are assumed to be biased toward dynamics are actually biased toward static information.
arXiv Detail & Related papers (2022-11-03T13:17:53Z)
- A Deeper Dive Into What Deep Spatiotemporal Networks Encode: Quantifying Static vs. Dynamic Information [34.595367958746856]
We analyse two widely studied tasks, action recognition and video object segmentation.
Most examined models are biased toward static information.
Certain two-stream architectures with cross-connections show a better balance between the static and dynamic information captured.
arXiv Detail & Related papers (2022-06-06T18:39:37Z)
- Pessimism meets VCG: Learning Dynamic Mechanism Design via Offline Reinforcement Learning [114.36124979578896]
We design a dynamic mechanism using offline reinforcement learning algorithms.
Our algorithm is based on the pessimism principle and only requires a mild assumption on the coverage of the offline data set.
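The pessimism principle the summary cites can be reduced to a one-line sketch: penalize each offline value estimate by an uncertainty term, so actions poorly covered by the dataset look worse. The square-root penalty and beta below are generic illustrations, not the paper's construction.

```python
# Hedged sketch: lower-confidence-bound (pessimistic) value estimate.
import math

def pessimistic_value(mean_reward: float, n_samples: int, beta: float = 1.0):
    """Fewer offline samples -> larger penalty -> more pessimism."""
    return mean_reward - beta / math.sqrt(max(n_samples, 1))

# An action seen 4 times with mean reward 1.0 scores 0.5, while one seen
# 100 times with mean 0.8 scores 0.7 and is preferred despite the lower mean.
```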
arXiv Detail & Related papers (2022-05-05T05:44:26Z)
- Static-Dynamic Co-Teaching for Class-Incremental 3D Object Detection [71.18882803642526]
Deep learning approaches have shown remarkable performance in the 3D object detection task.
They suffer from a catastrophic performance drop when incrementally learning new classes without revisiting the old data.
This "catastrophic forgetting" phenomenon impedes the deployment of 3D object detection approaches in real-world scenarios.
We present the first solution, SDCoT, a novel static-dynamic co-teaching method.
arXiv Detail & Related papers (2021-12-14T09:03:41Z)
- Variational Predictive Routing with Nested Subjective Timescales [1.6114012813668934]
We present Variational Predictive Routing (VPR), a neural inference system that organizes latent video features in a temporal hierarchy.
We show that VPR is able to detect event boundaries, disentangle temporal features, adapt to the dynamics hierarchy of the data, and produce accurate time-agnostic rollouts of the future.
arXiv Detail & Related papers (2021-10-21T16:12:59Z)
- Compositional Attention: Disentangling Search and Retrieval [66.7108739597771]
Multi-head, key-value attention is the backbone of the Transformer model and its variants.
Standard attention heads learn a rigid mapping between search and retrieval.
We propose a novel attention mechanism, called Compositional Attention, that replaces the standard head structure.
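The disentangling idea can be sketched as follows: S "search" heads each compute an attention pattern, R "retrieval" value projections are shared across them, and every search head softly selects which retrieval to use. The static selection logits below are a deliberate simplification (the published mechanism computes the selection from the retrieved values themselves), and all dimensions are illustrative.

```python
# Hedged sketch of search/retrieval disentangling in attention.
import torch
import torch.nn as nn

class CompositionalAttentionSketch(nn.Module):
    def __init__(self, d_model, n_search=4, n_retrieval=4):
        super().__init__()
        self.S, self.R, self.dh = n_search, n_retrieval, d_model // n_search
        self.q = nn.Linear(d_model, n_search * self.dh)
        self.k = nn.Linear(d_model, n_search * self.dh)
        self.v = nn.Linear(d_model, n_retrieval * self.dh)
        # Simplification: static search->retrieval selection logits.
        self.select = nn.Parameter(torch.randn(n_search, n_retrieval))
        self.out = nn.Linear(n_search * self.dh, d_model)

    def forward(self, x):                                   # x: [B, T, d_model]
        B, T, _ = x.shape
        q = self.q(x).view(B, T, self.S, self.dh).transpose(1, 2)
        k = self.k(x).view(B, T, self.S, self.dh).transpose(1, 2)
        v = self.v(x).view(B, T, self.R, self.dh).transpose(1, 2)
        attn = torch.softmax(q @ k.transpose(-1, -2) / self.dh**0.5, dim=-1)
        # Every search pattern retrieves from every retrieval projection...
        retrieved = torch.einsum("bstu,brud->bsrtd", attn, v)
        # ...then each search head softly picks a mix of retrievals.
        mix = torch.softmax(self.select, dim=-1)            # [S, R]
        out = torch.einsum("sr,bsrtd->bstd", mix, retrieved)
        return self.out(out.transpose(1, 2).reshape(B, T, -1))
```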
arXiv Detail & Related papers (2021-10-18T15:47:38Z)