Fugu-MT paper translation (abstract): Scaling Reasoning Hop Exposes Weaknesses: Demystifying and Improving Hop Generalization in Large Language Models

Paper overview: Scaling Reasoning Hop Exposes Weaknesses: Demystifying and Improving Hop Generalization in Large Language Models

arxiv url: http://arxiv.org/abs/2601.21214v1
Date: Thu, 29 Jan 2026 03:24:32 GMT
Status: translation complete
Last updated in system: 2026-01-30 16:22:49.54411
Title: Scaling Reasoning Hop Exposes Weaknesses: Demystifying and Improving Hop Generalization in Large Language Models
Authors: Zhaoyi Li, Jiatong Li, Gangwei Jiang, Linqi Song, Defu Lian, Ying Wei
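Abstract summary: Chain-of-thought (CoT) reasoning has become the standard paradigm for enabling Large Language Models (LLMs) to solve complex problems. Recent studies reveal a sharp performance drop in reasoning hop generalization scenarios. The authors propose test-time correction of reasoning, a lightweight intervention method that dynamically identifies and deactivates ep heads in the reasoning process.
Reference score (attention score computed independently by this site): 66.36240676392502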
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Chain-of-thought (CoT) reasoning has become the standard paradigm for enabling Large Language Models (LLMs) to solve complex problems. However, recent studies reveal a sharp performance drop in reasoning hop generalization scenarios, where the required number of reasoning steps exceeds training distributions while the underlying algorithm remains unchanged. The internal mechanisms driving this failure remain poorly understood. In this work, we conduct a systematic study on tasks from multiple domains, and find that errors concentrate at token positions of a few critical error types, rather than being uniformly distributed. Closer inspection reveals that these token-level erroneous predictions stem from internal competition mechanisms: certain attention heads, termed erroneous processing heads (ep heads), tip the balance by amplifying incorrect reasoning trajectories while suppressing correct ones. Notably, removing individual ep heads during inference can often restore the correct predictions. Motivated by these insights, we propose test-time correction of reasoning, a lightweight intervention method that dynamically identifies and deactivates ep heads in the reasoning process. Extensive experiments across different tasks and LLMs show that it consistently improves reasoning hop generalization, highlighting both its effectiveness and potential.
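The competition described in the abstract, where one head tips the balance toward a wrong token and removing it restores the correct prediction, can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the per-head logit contributions in `HEADS`, the token names, and the helper functions `predict` and `find_restoring_heads` are hypothetical and are not taken from the paper. In particular, the search below uses the known correct answer, which is not available at test time, so it does not represent how the proposed method identifies ep heads.

```python
from typing import AbstractSet, Dict, List

# Hypothetical per-head contributions to the logits of two candidate tokens.
# Head 2 plays the role of an "ep head": it boosts "wrong" and suppresses "correct".
HEADS: List[Dict[str, float]] = [
    {"correct": 2.0, "wrong": 0.5},
    {"correct": 1.0, "wrong": 0.5},
    {"correct": -1.5, "wrong": 3.0},
]


def predict(head_logits: List[Dict[str, float]],
            disabled: AbstractSet[int] = frozenset()) -> str:
    """Sum the contributions of all enabled heads and return the top token.

    Assumes at least one head stays enabled.
    """
    totals: Dict[str, float] = {}
    for idx, contrib in enumerate(head_logits):
        if idx in disabled:
            continue  # a deactivated head contributes nothing
        for token, value in contrib.items():
            totals[token] = totals.get(token, 0.0) + value
    return max(totals, key=lambda token: totals[token])


def find_restoring_heads(head_logits: List[Dict[str, float]],
                         correct: str) -> List[int]:
    """Return heads whose individual removal flips the prediction to `correct`.

    This is an oracle search (it needs the right answer), used only to
    illustrate single-head ablation.
    """
    if predict(head_logits) == correct:
        return []
    return [i for i in range(len(head_logits))
            if predict(head_logits, {i}) == correct]


if __name__ == "__main__":
    print(predict(HEADS))                          # all heads active
    print(find_restoring_heads(HEADS, "correct"))  # heads worth deactivating
    print(predict(HEADS, disabled={2}))            # prediction after ablation
```

With all heads active the summed logits favor "wrong" (4.0 versus 1.5); disabling head 2 alone reverses this (3.0 versus 1.0), while disabling head 0 or head 1 does not.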
