Fugu-MT Paper Translation (Summary): FocuSFT: Bilevel Optimization for Dilution-Aware Long-Context Fine-Tuning

Paper summary: FocuSFT: Bilevel Optimization for Dilution-Aware Long-Context Fine-Tuning

arxiv url: http://arxiv.org/abs/2605.09932v1
Date: Mon, 11 May 2026 03:30:35 GMT
Status: Translation complete
System update date: 2026-05-12 23:28:50.49605
Title: FocuSFT: Bilevel Optimization for Dilution-Aware Long-Context Fine-Tuning
Title (reference translation): FocuSFT: Bilevel Optimization for Dilution-Aware Long-Context Fine-Tuning
Authors: Zehua Pei, Hui-Ling Zhen, Xianzhi Yu, Sinno Jialin Pan, Mingxuan Yuan, Bei Yu
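Abstract summary: FocuSFT is a bilevel optimization framework for long-context supervised fine-tuning of large language models. It applies bidirectional attention over context tokens while preserving causal masking for responses. Attention analysis shows that FocuSFT reduces attention sink mass by 529$\times$ and triples context engagement during training.
Reference score (independently computed attention score): 46.87750193423974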
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large language models can now process increasingly long inputs, yet their ability to effectively use information spread across long contexts remains limited. We trace this gap to how attention budget is spent during supervised fine-tuning (SFT) on long sequences: positional biases and attention sinks cause the model to allocate most of its attention to positionally privileged tokens rather than semantically relevant content. This training-time attention dilution (the starvation of content tokens in the attention distribution) weakens the gradient signal, limiting the model's ability to learn robust long-context capabilities. We introduce FocuSFT, a bilevel optimization framework that addresses this problem at training time. An inner loop adapts lightweight fast-weight parameters on the training context to form a parametric memory that concentrates attention on relevant content, and the outer loop performs SFT conditioned on this sharpened representation. Both loops apply bidirectional attention over context tokens while preserving causal masking for responses, reducing the causal asymmetry that gives rise to attention sinks and aligning inner-outer behavior. On BABILong, FocuSFT improves accuracy by up to +14pp across 4K--32K context lengths; on RULER, it raises CWE aggregation from 72.9\% to 81.1\% at 16K; and on GPQA with agentic tool use, it yields a 24\% relative gain in pass@1. Attention analysis shows that FocuSFT reduces attention sink mass by 529$\times$ and triples context engagement during training. Code: https://github.com/JarvisPei/FocuSFT
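The masking scheme described in the abstract (bidirectional attention over context tokens, causal masking for responses) can be illustrated with a small sketch. This is not taken from the FocuSFT code; it assumes the sequence is laid out as context tokens followed by response tokens, and the function name `build_attention_mask` and its parameters are hypothetical names chosen for illustration.

```python
import numpy as np


def build_attention_mask(context_len: int, response_len: int) -> np.ndarray:
    """Return a boolean (T, T) mask; mask[i, j] is True if query i may attend to key j.

    Assumption (not from the paper): tokens are ordered as
    [context tokens..., response tokens...].
    """
    total = context_len + response_len
    # Start from the standard causal (lower-triangular) mask.
    mask = np.tril(np.ones((total, total), dtype=bool))
    # Context tokens attend to every other context token (bidirectional block).
    mask[:context_len, :context_len] = True
    # Response rows keep the causal pattern: full context plus earlier response tokens.
    return mask


if __name__ == "__main__":
    # Tiny example: 3 context tokens followed by 2 response tokens.
    print(build_attention_mask(context_len=3, response_len=2).astype(int))
```

In this sketch the first context token is no longer the only position visible to every query inside the context block, which is one plausible reading of how the abstract's "causal asymmetry" is reduced; the actual implementation may differ.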
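Abstract (reference translation): Large language models can now process increasingly long inputs, but their ability to make effective use of information spread across long contexts remains limited. The authors trace this gap to how the attention budget is spent during supervised fine-tuning (SFT) on long sequences: positional biases and attention sinks cause the model to allocate most of its attention to positionally privileged tokens rather than to semantically relevant content. This training-time attention dilution (the starvation of content tokens in the attention distribution) weakens the gradient signal and limits the model's ability to learn robust long-context capabilities. FocuSFT is a bilevel optimization framework that addresses this problem at training time. The inner loop adapts lightweight fast-weight parameters on the training context to form a parametric memory that concentrates attention on relevant content, and the outer loop performs SFT conditioned on this sharpened representation. Both loops apply bidirectional attention over context tokens while preserving causal masking for responses, which reduces the causal asymmetry that gives rise to attention sinks and aligns inner and outer behavior. On BABILong, FocuSFT improves accuracy by up to +14pp across 4K--32K context lengths; on RULER, it raises CWE aggregation from 72.9\% to 81.1\% at 16K; and on GPQA with agentic tool use, it yields a 24\% relative gain in pass@1. Attention analysis shows that FocuSFT reduces attention sink mass by 529$\times$ and triples context engagement during training. Code: https://github.com/JarvisPei/FocuSFT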
