Fugu-MT paper translation (abstract): Delayed Attention Training Improves Length Generalization in Transformer--RNN Hybrids

Paper abstract: Delayed Attention Training Improves Length Generalization in Transformer--RNN Hybrids

arxiv url: http://arxiv.org/abs/2510.00258v1
Date: Tue, 30 Sep 2025 20:31:14 GMT
Status: Translation complete
Last updated in system: 2025-10-03 16:59:20.245857
Title: Delayed Attention Training Improves Length Generalization in Transformer--RNN Hybrids
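Title (reference translation): Delayed attention training improves length generalization in Transformer-RNN hybrids
Authors: Buu Phan, Reza Ebrahimi, Sanjay Haresh, Roland Memisevic
Abstract summary: We study length generalization in sequence models on a composite problem involving both state tracking and associative recall. Recurrent networks handle state tracking well but struggle with recall. We propose a simple yet effective training strategy, delaying the training of the attention layers, that mitigates the attention component's reliance on shortcut solutions and significantly improves length generalization performance.
Reference score (independently computed attention score): 8.159215234052573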
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We study length generalization in sequence models on a composite problem involving both state tracking and associative recall. Prior work finds that recurrent networks handle state tracking well but struggle with recall, whereas Transformers excel at recall yet fail to extend state-tracking capabilities to longer sequences. Motivated by the complementary strengths of these architectures, we construct hybrid models integrating recurrent and attention-based components, and train them on the combined task to evaluate whether both capabilities can be preserved. Our results reveal that, in such hybrids, the Transformer component tends to exploit shortcut solutions, leading to poor length generalization. We identify this shortcut reliance as a key obstacle and propose a simple yet effective training strategy -- delaying the training of the attention layers -- that mitigates this effect and significantly improves length generalization performance. Our experiments show that this approach enables hybrid models to achieve near-perfect accuracy ($>90\%$) on hybrid sequences three times longer than those used during training.
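Abstract (reference translation): This work examines length generalization in sequence models on a composite problem that involves both state tracking and associative recall. Earlier work reports that recurrent networks handle state tracking well but struggle with recall, while Transformers are strong at recall but do not extend their state-tracking ability to longer sequences. Building on the complementary strengths of these architectures, the authors construct hybrid models that integrate recurrent and attention-based components and train them on the combined task to assess whether both capabilities can be retained. The results show that the Transformer component in such hybrids tends to exploit shortcut solutions, which leads to poor length generalization. The authors identify this shortcut reliance as a key obstacle and propose a simple yet effective training strategy, delaying the training of the attention layers, that mitigates the effect and significantly improves length generalization. In the experiments, hybrid models trained this way reach near-perfect accuracy ($>90\%$) on sequences three times longer than those used during training.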
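
The abstract does not say how the delay is implemented, so the following is only a minimal sketch of one plausible reading: attention parameters stay frozen for an initial number of steps while the recurrent parameters train, and are unfrozen afterwards. The class name `DelayedAttentionSchedule`, the `delay_steps` hyperparameter, the `attention.` name prefix, and the toy scalar "parameters" are all assumptions made for illustration and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class DelayedAttentionSchedule:
    """Hypothetical schedule: attention parameters are frozen before delay_steps."""

    delay_steps: int                      # assumed hyperparameter, not from the paper
    attention_prefix: str = "attention."  # assumed naming convention

    def is_trainable(self, param_name: str, step: int) -> bool:
        # Attention parameters only become trainable once the delay has elapsed.
        if param_name.startswith(self.attention_prefix):
            return step >= self.delay_steps
        # Everything else (e.g. the recurrent component) trains from step 0.
        return True


def sgd_step(params, grads, schedule, step, lr=0.1):
    """Apply plain SGD, skipping parameters the schedule marks as frozen."""
    return {
        name: (value - lr * grads[name]) if schedule.is_trainable(name, step) else value
        for name, value in params.items()
    }


# Toy usage with scalar stand-ins for real weight tensors and constant gradients.
schedule = DelayedAttentionSchedule(delay_steps=2)
params = {"recurrent.w": 1.0, "attention.w": 1.0}
grads = {"recurrent.w": 0.5, "attention.w": 0.5}
for step in range(3):
    params = sgd_step(params, grads, schedule, step)
```

In this toy run the recurrent weight is updated on all three steps, while the attention weight is updated only on the last one. In a real framework the same idea would more likely be expressed by toggling gradient tracking on the attention layers or by giving them a zero learning rate during the delay period.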

Related papers