Fugu-MT paper translation (abstract): Attention Saturation and Gradient Suppression at Inflection Layers: Diagnosing and Mitigating Bottlenecks in Transformer Adaptation

Paper overview: Attention Saturation and Gradient Suppression at Inflection Layers: Diagnosing and Mitigating Bottlenecks in Transformer Adaptation

arxiv url: http://arxiv.org/abs/2511.00797v1
Date: Sun, 02 Nov 2025 04:32:41 GMT
Status: translation complete
Last updated in system: 2025-11-05 16:37:26.938578
Title: Attention Saturation and Gradient Suppression at Inflection Layers: Diagnosing and Mitigating Bottlenecks in Transformer Adaptation
Title (reference translation): Attention Saturation and Gradient Suppression at Inflection Layers: Diagnosing and Mitigating Bottlenecks in Transformer Adaptation
Authors: Wang Zixian
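Abstract summary: Pre-trained Transformers often exhibit over-confidence in source patterns and have difficulty forming new target-domain patterns during fine-tuning. We formalize the mechanism by which output saturation leads to gradient suppression, using standard cross-entropy and softmax analysis. We propose a diagnose-first, inject-light fine-tuning strategy that selectively inserts LoRA adapters at inflection layers to restore suppressed backward signals.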
Reference score (independently computed interest level): 0.0
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Pre-trained Transformers often exhibit over-confidence in source patterns and difficulty in forming new target-domain patterns during fine-tuning. We formalize the mechanism of output saturation leading to gradient suppression through standard cross-entropy and softmax analysis, showing that gradient suppression at inflection layers confines adaptation to high-level recombination of existing features while preventing low-level reconstruction. We introduce a set of layer-wise diagnostic metrics -- attention entropy (saturation proxy), activation gradient norm, parameter gradient norm, and Delta-CKA under a shared PCA basis -- to identify inflection layers characterized by both low attention entropy and steep gradient decay. Building on these findings, we propose a diagnose-first, inject-light fine-tuning strategy: selectively inserting LoRA adapters at inflection layers to restore suppressed backward signals with minimal parameter overhead. Experiments on BERT-base transfer from SST-2 to Rotten Tomatoes under under-trained and over-trained source regimes reveal that over-trained initialization benefits from inflection-layer LoRA injection, while under-trained initialization suffers performance degradation. When base features are strong, unblocking inflection layers facilitates high-level compositional adaptation; when base features are weak, full-pathway unblocking is required for low-level reconstruction, as supported by joint analysis of layer-wise activation gradients and Delta-CKA dynamics.
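The link between output saturation and gradient suppression mentioned in the abstract can be illustrated with the textbook softmax and cross-entropy gradients. This is the standard result, not the paper's own derivation (which is not reproduced on this page); the symbols z, p, y, s, a, and K are chosen here for illustration only.

```latex
% Softmax over logits z in R^K, cross-entropy against a one-hot target y
\[
p_k = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}, \qquad
\mathcal{L} = -\sum_{k=1}^{K} y_k \log p_k
\]

% Gradient of the loss with respect to the logits
\[
\frac{\partial \mathcal{L}}{\partial z_k} = p_k - y_k
\]

% Output saturation: if the prediction is confident in the labelled class c,
% every component of the gradient shrinks toward zero
\[
p_c \to 1 \;\Longrightarrow\; \lVert \nabla_z \mathcal{L} \rVert \to 0
\]

% The same effect inside attention: for weights a = softmax(s),
% the Jacobian vanishes as a approaches a one-hot (zero-entropy) distribution
\[
\frac{\partial a_i}{\partial s_j} = a_i \left( \delta_{ij} - a_j \right)
\]
```

In both cases a near one-hot softmax output passes almost no signal backward, which is consistent with the abstract's use of low attention entropy as a saturation proxy.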
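Abstract (reference translation): Pre-trained Transformers often show over-confidence in source patterns and have difficulty forming new target-domain patterns during fine-tuning. Using standard cross-entropy and softmax analysis, we formalize the output-saturation mechanism that leads to gradient suppression, and show that gradient suppression at inflection layers confines adaptation to high-level recombination of existing features while preventing low-level reconstruction. We analyze attention entropy (a saturation proxy), activation gradient norm, parameter gradient norm, and Delta-CKA under a shared PCA basis to identify inflection layers characterized by both low attention entropy and steep gradient decay. We therefore propose a diagnose-first, inject-light fine-tuning method that selectively inserts LoRA adapters at inflection layers to restore suppressed backward signals with minimal parameter overhead. Experiments on BERT-base transfer from SST-2 to Rotten Tomatoes under under-trained and over-trained source regimes show that over-trained initialization benefits from inflection-layer LoRA injection, whereas under-trained initialization suffers performance degradation. When base features are strong, unblocking inflection layers facilitates high-level compositional adaptation; when base features are weak, full-pathway unblocking is required for low-level reconstruction, as supported by joint analysis of layer-wise activation gradients and Delta-CKA dynamics.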
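As a rough illustration of the attention-entropy diagnostic named in the abstract, the sketch below computes the mean Shannon entropy of attention rows with NumPy and flags layers that combine low entropy with a sharp drop in gradient norm. The function names, the thresholds, and the flagging rule (gradient-norm ratio between a layer and the one above it) are assumptions made for this example; the page does not give the paper's exact criterion.

```python
import numpy as np


def softmax(logits, axis=-1):
    """Numerically stable softmax along `axis`."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def attention_entropy(attn, eps=1e-12):
    """Mean Shannon entropy (in nats) over all attention rows.

    `attn` has shape (..., num_queries, num_keys); each row sums to 1.
    Low values mean the attention is close to one-hot (saturated).
    """
    attn = np.asarray(attn, dtype=np.float64)
    row_entropy = -(attn * np.log(attn + eps)).sum(axis=-1)
    return float(row_entropy.mean())


def flag_inflection_layers(entropies, grad_norms, entropy_thresh, decay_thresh):
    """Hypothetical rule, not the paper's: flag layer i when its attention
    entropy is below `entropy_thresh` AND its gradient norm is less than
    `decay_thresh` times the gradient norm of layer i + 1 (the layer the
    backward signal arrives from)."""
    flagged = []
    for i in range(len(entropies) - 1):
        ratio = grad_norms[i] / max(grad_norms[i + 1], 1e-12)
        if entropies[i] < entropy_thresh and ratio < decay_thresh:
            flagged.append(i)
    return flagged


# Toy usage: sharper logits give lower attention entropy.
scores = np.array([[0.0, 1.0, 2.0, 3.0]])
soft_entropy = attention_entropy(softmax(scores))
sharp_entropy = attention_entropy(softmax(10.0 * scores))

# Made-up per-layer statistics for a 4-layer model (bottom to top).
layers = flag_inflection_layers(
    entropies=[2.0, 0.3, 1.8, 1.9],
    grad_norms=[0.01, 0.02, 1.0, 1.1],
    entropy_thresh=0.5,
    decay_thresh=0.1,
)
```

In practice the per-layer entropies and gradient norms would come from a forward and backward pass over the model being fine-tuned; the flagged indices would then be the candidate layers for LoRA insertion.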


Related papers