Fugu-MT paper translation (abstract): Attention to Mamba: A Recipe for Cross-Architecture Distillation

Paper overview: Attention to Mamba: A Recipe for Cross-Architecture Distillation

arxiv url: http://arxiv.org/abs/2604.14191v1
Date: Wed, 01 Apr 2026 09:23:08 GMT
Status: Translation complete
Last updated in system: 2026-04-19 19:09:11.709553
Title: Attention to Mamba: A Recipe for Cross-Architecture Distillation
Authors: Abhinav Moudgil, Ningyuan Huang, Eeshan Gunesh Dhekane, Pau Rodríguez, Luca Zappella, Federico Danieli
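Abstract summary: State Space Models (SSMs) such as Mamba have become a popular alternative to Transformer models. First, knowledge is distilled from a traditional Transformer into a linearized version of Attention, using an adaptation of the kernel trick. Overall, the distilled Mamba model preserves the original Pythia-1B Transformer performance on downstream tasks, maintaining a perplexity of 14.11, close to the teacher's 13.86.
Reference score (interest level computed by this site): 15.145728951213997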
License: http://creativecommons.org/licenses/by/4.0/
Abstract: State Space Models (SSMs) such as Mamba have become a popular alternative to Transformer models, due to their reduced memory consumption and higher throughput at generation compared to their Attention-based counterparts. On the other hand, the community has built up a considerable body of knowledge on how to train Transformers, and many pretrained Transformer models are readily available. To facilitate the adoption of SSMs while leveraging existing pretrained Transformers, we aim to identify an effective recipe to distill an Attention-based model into a Mamba-like architecture. Prior work on cross-architecture distillation, however, has shown that a naïve distillation procedure from Transformers to Mamba fails to preserve the original teacher performance, a limitation often overcome with hybrid solutions combining Attention and SSM blocks. The key argument of our work is that, by equipping Mamba with a principled initialization, we can recover an overall better recipe for cross-architectural distillation. To this end, we propose a principled two-stage approach: first, we distill knowledge from a traditional Transformer into a linearized version of Attention, using an adaptation of the kernel trick. Then, we distill the linearized version into an adapted Mamba model that does not use any Attention block. Overall, the distilled Mamba model is able to preserve the original Pythia-1B Transformer performance in downstream tasks, maintaining a perplexity of 14.11, close to the teacher's 13.86. To show the efficacy of our recipe, we conduct thorough ablations at 1B scale with 10B tokens, varying the sequence mixer architecture; a scaling analysis on model sizes and total distillation tokens; and a sensitivity analysis on token allocation between stages.
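The abstract does not spell out the linearized Attention used in the first stage. As a rough illustration only, the generic kernel-trick construction replaces the softmax weight between a query and a key with a dot product of feature maps, phi(q)·phi(k), which lets causal attention be computed with a fixed-size running state, much like an SSM. The sketch below assumes the feature map elu(x) + 1 and the helper names shown; none of these are taken from the paper, and the paper's own adaptation may differ.

```python
import math

def feature_map(x):
    # elu(x) + 1, a positive feature map. This choice is an assumption;
    # the abstract does not say which feature map the authors use.
    return [xi + 1.0 if xi > 0 else math.exp(xi) for xi in x]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def causal_linear_attention_quadratic(Q, K, V):
    """O(T^2) form: explicit kernel weights over all past positions."""
    d_v = len(V[0])
    out = []
    for t in range(len(Q)):
        phi_q = feature_map(Q[t])
        weights = [dot(phi_q, feature_map(K[s])) for s in range(t + 1)]
        norm = sum(weights)
        out.append([sum(w * V[s][j] for s, w in enumerate(weights)) / norm
                    for j in range(d_v)])
    return out

def causal_linear_attention_recurrent(Q, K, V):
    """O(T) form: carry a fixed-size state (S, z) across time steps."""
    d_k, d_v = len(K[0]), len(V[0])
    S = [[0.0] * d_v for _ in range(d_k)]  # running sum of phi(k) v^T
    z = [0.0] * d_k                        # running sum of phi(k)
    out = []
    for q, k, v in zip(Q, K, V):
        phi_k = feature_map(k)
        for i in range(d_k):
            z[i] += phi_k[i]
            for j in range(d_v):
                S[i][j] += phi_k[i] * v[j]
        phi_q = feature_map(q)
        norm = dot(phi_q, z)
        out.append([sum(phi_q[i] * S[i][j] for i in range(d_k)) / norm
                    for j in range(d_v)])
    return out

# Toy inputs (hypothetical values): 3 time steps, key and value dimension 2.
Q = [[0.5, -1.0], [1.5, 0.2], [-0.3, 0.8]]
K = [[0.1, 0.4], [-0.7, 1.0], [0.9, -0.2]]
V = [[1.0, 0.0], [0.0, 1.0], [2.0, -1.0]]

quad = causal_linear_attention_quadratic(Q, K, V)
rec = causal_linear_attention_recurrent(Q, K, V)
max_diff = max(abs(a - b) for ra, rb in zip(quad, rec) for a, b in zip(ra, rb))
print(max_diff)  # the two forms should agree up to floating-point rounding
```

The recurrent form keeps only the state (S, z), whose size is independent of sequence length. That is the property that makes a linearized Attention layer a plausible stepping stone toward a Mamba-like model, although the abstract does not describe how the second-stage initialization is actually carried out.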
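To put the two reported perplexities in perspective, the short sketch below converts them into a relative gap and a per-token loss gap. It assumes the standard definition of perplexity as the exponential of the mean negative log-likelihood in nats; the abstract does not state the evaluation dataset, and the function name is hypothetical.

```python
import math

def perplexity(token_nlls):
    """Perplexity as exp of the mean per-token negative log-likelihood (nats)."""
    return math.exp(sum(token_nlls) / len(token_nlls))

teacher_ppl = 13.86  # Pythia-1B teacher, as reported in the abstract
student_ppl = 14.11  # distilled Mamba student, as reported in the abstract

# Relative increase in perplexity from teacher to student.
relative_gap = (student_ppl - teacher_ppl) / teacher_ppl

# Equivalent difference in mean cross-entropy loss, in nats per token.
nll_gap_nats = math.log(student_ppl) - math.log(teacher_ppl)

print(f"relative perplexity gap: {relative_gap:.2%}")
print(f"per-token loss gap: {nll_gap_nats:.4f} nats")
```

Under that definition the student's perplexity is roughly 1.8% above the teacher's, which corresponds to a difference of about 0.018 nats in mean per-token loss.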