Fugu-MT Paper Translation (Abstract): Mamba Modulation: On the Length Generalization of Mamba

Paper summary: Mamba Modulation: On the Length Generalization of Mamba

arxiv url: http://arxiv.org/abs/2509.19633v1
Date: Tue, 23 Sep 2025 22:46:19 GMT
Status: Translation complete
Last updated in system: 2025-09-25 20:53:19.631681
Title: Mamba Modulation: On the Length Generalization of Mamba
Authors: Peng Lu, Jerry Huang, Qiuhao Zeng, Xinyu Wang, Boxing Wang, Philippe Langlais, Yufei Cui
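Abstract summary: Mamba is a leading architecture among state-space language models. Its performance deteriorates significantly when it is applied to contexts longer than those seen during pre-training. This paper proposes applying spectrum scaling to pre-trained Mamba models to achieve robust long-context generalization.
Reference score (independently computed attention score): 23.205469047706703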
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The quadratic complexity of the attention mechanism in Transformer models has motivated the development of alternative architectures with sub-quadratic scaling, such as state-space models. Among these, Mamba has emerged as a leading architecture, achieving state-of-the-art results across a range of language modeling tasks. However, Mamba's performance significantly deteriorates when applied to contexts longer than those seen during pre-training, revealing a sharp sensitivity to context length extension. Through detailed analysis, we attribute this limitation to the out-of-distribution behaviour of its state-space dynamics, particularly within the parameterization of the state transition matrix $\mathbf{A}$. Unlike recent works which attribute this sensitivity to the vanished accumulation of discretization time steps, $\exp(-\sum_{t=1}^N\Delta_t)$, we establish a connection between state convergence behavior as the input length approaches infinity and the spectrum of the transition matrix $\mathbf{A}$, offering a well-founded explanation of its role in length extension. Next, to overcome this challenge, we propose an approach that applies spectrum scaling to pre-trained Mamba models to enable robust long-context generalization by selectively modulating the spectrum of $\mathbf{A}$ matrices in each layer. We show that this can significantly improve performance in settings where simply modulating $\Delta_t$ fails, validating our insights and providing avenues for better length generalization of state-space models with structured transition matrices.
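The relationship between the accumulated decay term $\exp(-\sum_{t=1}^N\Delta_t)$ and the spectrum of $\mathbf{A}$ can be illustrated with a small numerical sketch. This is not the paper's method: it assumes a diagonal $\mathbf{A}$ with negative real eigenvalues, constant step sizes $\Delta_t$, and a single hypothetical global scaling factor (`scale`), whereas the abstract describes selective, per-layer modulation. All function names and values below are made up for illustration.

```python
import numpy as np

def cumulative_decay(a, deltas):
    """Factor multiplying the initial state after all steps, per channel.

    For diagonal A, prod_t exp(delta_t * a) = exp(a * sum_t delta_t).
    """
    return np.exp(a * np.sum(deltas))

def scale_spectrum(a, scale):
    """Hypothetical modulation: multiply every eigenvalue of A by `scale`."""
    return scale * a

# Assumed toy values: three channels with negative (stable) eigenvalues.
a = np.array([-1.0, -0.5, -0.1])
train_deltas = np.full(100, 0.05)    # "training-length" sequence
long_deltas = np.full(1000, 0.05)    # sequence 10x longer

decay_train = cumulative_decay(a, train_deltas)
decay_long = cumulative_decay(a, long_deltas)

# Shrinking the spectrum by train_len / long_len brings the decay over the
# long sequence back to the value seen at training length (in this toy case).
scale = len(train_deltas) / len(long_deltas)
decay_scaled = cumulative_decay(scale_spectrum(a, scale), long_deltas)

print(decay_train)
print(decay_long)
print(decay_scaled)
```

In this toy setting the decay depends only on the product of $\Delta_t$ and the eigenvalues, so the sketch shows why the memory of early inputs vanishes on longer sequences and how a smaller spectrum counteracts it. It does not capture the settings, reported in the abstract, where modulating $\Delta_t$ alone fails.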
