Fugu-MT Paper Translation (Abstract): Priming: Hybrid State Space Models From Pre-trained Transformers

Paper summary: Priming: Hybrid State Space Models From Pre-trained Transformers

arxiv url: http://arxiv.org/abs/2605.08301v1
Date: Fri, 08 May 2026 11:43:51 GMT
Status: Translation complete
Last updated in system: 2026-05-12 23:28:49.548385
Title: Priming: Hybrid State Space Models From Pre-trained Transformers
Authors: Aditya Chattopadhyay, Elvis Nunez, Prannay Kaul, Benjamin Bowman, Evan Becker, Luca Zancato, David Thomas, Wei Xia, Stefano Soatto
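Abstract summary: Priming is a method that turns Hybrid architecture design from a pre-training problem into a knowledge transfer problem. The authors evaluate Gated KalmaNet (GKA), Gated DeltaNet (GDN), and Mamba-2, and show that their hierarchy, GKA > GDN > Mamba-2, directly predicts downstream performance on long-context reasoning tasks. Their Hybrid GKA 32B improves over its source, Qwen3-32B, by +3.8 average reasoning points while staying within 1% of a Transformer post-trained on the same data.
Reference score (attention score computed independently by this site): 45.220597209553866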
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Hybrid State-Space models combine Attention with recurrent State-Space Model (SSM) layers, balancing eidetic memory from Attention with compressed fading memory from SSMs. This yields smaller Key-Value caches and faster decoding than Transformers, along with a richer architectural design space. Exploring that design space at scale has so far required training from scratch, a barrier that has kept most large-model Hybrid research within a narrow range of architectures. We introduce Priming, a method that turns Hybrid architecture design from a pre-training problem into a knowledge transfer one. Priming initializes a Hybrid model from a pre-trained Transformer and, through short alignment and post-training phases, recovers downstream quality using less than 0.5% of the source model's pre-training token budget. Priming is agnostic to the source Transformer family (e.g., Qwen, Llama, Mistral), model class (dense or Mixture-of-Experts), and model scale. Priming enables us to run the first controlled comparison of SSM layer types at scale under identical conditions. We evaluate Gated KalmaNet (GKA), Gated DeltaNet (GDN), and Mamba-2, and show that their expressiveness hierarchy, GKA > GDN > Mamba-2, directly predicts downstream performance on long-context reasoning tasks. We scale Priming to 8B/32B reasoning models with native 128K contexts. Our Hybrid GKA 32B improves over its source Qwen3-32B by +3.8 average reasoning points, while staying within 1% of a Transformer post-trained on the same data and enabling up to 2.3x higher decode throughput. To foster research on Hybrid architectures, we release a model zoo of primed Hybrid models for long-context reasoning and instruction following, together with the Priming training and inference code (Sequence Parallelism algorithms for long-context training, optimized GKA kernels, and vLLM serving plugin), all under Apache 2.0 License.
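
The abstract's claim that Hybrid models yield smaller Key-Value caches can be illustrated with back-of-the-envelope arithmetic. The following is a minimal sketch, not the paper's method or code: every function name and every configuration value (64 layers, 8 KV heads, head dimension 128, SSM state dimension 128, 16 retained attention layers, 2 bytes per element) is a hypothetical assumption chosen for illustration. It only shows that an attention layer's cache grows linearly with context length, whereas a recurrent SSM layer keeps a fixed-size state.

```python
# Illustrative cache-size arithmetic. All sizes below are assumptions,
# not values taken from the paper or from any specific model.

def kv_cache_bytes(num_attn_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_elem=2):
    """Keys plus values for every attention layer; linear in seq_len."""
    return 2 * num_attn_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem


def ssm_state_bytes(num_ssm_layers, num_heads, head_dim, state_dim,
                    bytes_per_elem=2):
    """One fixed-size recurrent state per SSM layer; independent of seq_len."""
    return num_ssm_layers * num_heads * head_dim * state_dim * bytes_per_elem


def hybrid_cache_bytes(total_layers, attn_layers, seq_len,
                       num_kv_heads=8, head_dim=128,
                       ssm_heads=8, state_dim=128):
    """Total decode-time memory when the non-attention layers are SSM layers."""
    ssm_layers = total_layers - attn_layers
    return (kv_cache_bytes(attn_layers, num_kv_heads, head_dim, seq_len)
            + ssm_state_bytes(ssm_layers, ssm_heads, head_dim, state_dim))


SEQ_LEN = 128 * 1024  # a 128K-token context

# Hypothetical 64-layer model: all attention vs. 16 attention + 48 SSM layers.
transformer_bytes = hybrid_cache_bytes(64, 64, SEQ_LEN)
hybrid_bytes = hybrid_cache_bytes(64, 16, SEQ_LEN)

print(f"all-attention cache: {transformer_bytes / 2**30:.2f} GiB")
print(f"hybrid cache:        {hybrid_bytes / 2**30:.2f} GiB")
```

Under these assumed sizes the SSM state contributes only a few MiB per sequence, so the cache footprint is dominated by however many attention layers are kept. The actual layer ratio and state sizes used by the primed models are not stated in the abstract.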
