Fugu-MT Paper Translation (Abstract): Does RoPE Prevent or Degrade Retrieval Heads? A Mechanistic Analysis Across Model Families

Paper overview: Does RoPE Prevent or Degrade Retrieval Heads? A Mechanistic Analysis Across Model Families

arXiv URL: http://arxiv.org/abs/2606.21249v1
Date: Fri, 19 Jun 2026 09:23:59 GMT
Status: Translation complete
Last updated in system: 2026-06-26 07:05:01.104882
Title: Does RoPE Prevent or Degrade Retrieval Heads? A Mechanistic Analysis Across Model Families
Authors: Cengizhan Bayram
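Abstract summary: Retrieval heads copy information from earlier context to the current position. Rotary position embeddings rotate queries and keys by frequencies that decay with a base hyperparameter theta. We test four open-weight 7-8B models spanning multi-head and grouped-query attention and a 100x range of theta.
Reference score (site-computed attention metric): 0.0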
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Retrieval heads, attention heads that copy information from earlier context to the current position, have been proposed as the mechanistic substrate for long-context recall. Rotary position embeddings (RoPE) rotate queries and keys by frequencies decaying with a base hyperparameter theta, and a natural hypothesis is that this rotation either prevents retrieval heads from forming or degrades their function. We test both across four open-weight 7-8B models spanning multi-head and grouped-query attention and a 100x range of theta, using paired-seed needle-in-a-haystack tests, layer-clustered permutation, and causal head-masking. (i) Retrieval heads are causally necessary: masking the 87 detected heads in OLMo-2 collapses recall from 1.00 to 0.00, while masking matched random heads has no effect; this replicates in Qwen. (ii) Higher theta does not reduce retrieval-head count (LLaMA-3.1 at theta=500K has 47 heads vs LLaMA-2 at theta=10K with 42), refuting the prevention hypothesis. (iii) The norm-utility relation is family-specific and significant in opposite directions (Qwen d=-0.49, OLMo d=+0.50, both significant; LLaMA null); since OLMo and LLaMA-3.1 share theta=500K yet differ, the effect is not theta-driven. (iv) Building on Chiang and Yogatama (2025), a controlled patch shows that zeroing the lowest-frequency RoPE dimensions of retrieval heads degrades recall dose-dependently (1.00 to 0.18 when 32 of 128 dimensions are zeroed, vs 0.98 for random dimensions); the effect is head-specific and task-specific. The causal variable is RoPE frequency, not norm-utility. The direction holds in all five models patched (OLMo-2, Qwen2.5-7B/14B, Gemma-2, Mistral) across four lineages and two scales. We do not claim cross-model magnitude. Code and a paired-seed harness are released.
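The frequency-zeroing patch in point (iv) can be illustrated with a minimal sketch. It assumes the standard RoPE schedule, in which rotation pair i has inverse frequency theta^(-2i/d), and an interleaved layout in which pair i occupies vector positions 2i and 2i+1. The function names are hypothetical, and the abstract does not say which layout was used, whether the patch was applied to queries, keys, or both, or how "32 of 128 dimensions" maps onto pairs; here it is read as the 16 lowest-frequency pairs. This is not the released harness.

```python
def rope_inv_freq(head_dim, theta):
    """Inverse frequency of each 2-D rotation pair: theta ** (-2i / head_dim)."""
    return [theta ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]


def lowest_frequency_pairs(head_dim, theta, n_dims):
    """Indices of the rotation pairs with the smallest frequencies.

    n_dims counts vector dimensions, so n_dims // 2 pairs are selected
    (an assumption about how the paper counts dimensions).
    """
    inv_freq = rope_inv_freq(head_dim, theta)
    order = sorted(range(len(inv_freq)), key=lambda i: inv_freq[i])
    return sorted(order[: n_dims // 2])


def zero_pairs(vec, pairs):
    """Return a copy of vec with both components of each selected pair zeroed.

    Assumes the interleaved layout (2i, 2i+1); a half-split layout would
    instead pair positions i and i + head_dim // 2.
    """
    out = list(vec)
    for i in pairs:
        out[2 * i] = 0.0
        out[2 * i + 1] = 0.0
    return out


# Example: a 128-dimensional head at theta = 500K, zeroing 32 dimensions.
pairs = lowest_frequency_pairs(128, 500_000.0, 32)
patched = zero_pairs([1.0] * 128, pairs)
```

Because the inverse frequencies decrease monotonically with the pair index, the lowest-frequency pairs are always the last ones under this schedule, whatever the value of theta. A random-dimension control like the one the abstract reports could be sketched by passing a random sample of pair indices to `zero_pairs` instead.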
