Fugu-MT Paper Translation (Abstract): Multi-Mixer Models: Flexible Sequence Modeling with Shared Representations

Paper overview: Multi-Mixer Models: Flexible Sequence Modeling with Shared Representations

arxiv url: http://arxiv.org/abs/2605.28769v1
Date: Wed, 27 May 2026 17:26:09 GMT
Status: Translation complete
Last updated in system: 2026-05-28 17:38:56.248441
Title: Multi-Mixer Models: Flexible Sequence Modeling with Shared Representations
Title (reference translation): Multi-Mixer Models: Flexible Sequence Modeling with Shared Representations
Authors: Kevin Y. Li, Asher Trockman, Ananda Theertha Suresh, Ziteng Sun,
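Abstract summary: We propose Oryx, a hybrid model that can switch between different mixers within a sequence, for example to generate efficiently. Oryx ties at least 90% of its parameters across mixers, enabling attention and recurrent modes to operate over shared internal representations. We validate the design with Mamba-2 and Gated DeltaNet variants, with models up to 1.4B.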
Reference score (independently computed measure of interest): 22.554254134162225
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Softmax attention is the cornerstone of modern large language models, but its memory scales linearly and compute quadratically with sequence length. Linear recurrent models, such as linear attention and state space models, have become widely studied as alternatives to attention due to their linear compute and constant memory. While these sub-quadratic token mixing methods, or mixers, achieve promising efficiency gains and competitive results on a wide range of benchmarks, current linear recurrent models still lag behind on tasks that require long-context retrieval or in-context learning. A growing body of work studies hybrid architectures that attempt to mitigate these trade-offs by statically interleaving or merging attention and recurrent blocks. In this work, we explore a new axis of developing hybrid models: across the token sequence. We propose Oryx, a hybrid model that can, throughout a sequence, flexibly switch between different mixers, for example quadratic attention for rich context utilization and linear recurrences for efficient generation. Oryx ties at least 90% of its parameters across mixers, enabling attention and recurrent modes to operate over shared internal representations. We validate our design with Mamba-2 and Gated DeltaNet variants, up to 1.4B models. Under fixed token budgets and a mixed-training strategy, Oryx achieves comparable or better performance than its single-mixer baselines. At the 1.4B scale, all instances of Oryx outperform their respective baselines by at least 0.7 percentage points on averaged language modeling tasks. On retrieval tasks, Oryx achieves performance comparable to the Transformer baseline even when processing only a tiny fraction (<10%) of the tokens in attention mode. These results suggest that attention and linear recurrent models can share internal representations, and motivate sequence-axis hybridization as a promising direction.
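Abstract (reference translation): Softmax attention is the foundation of modern large language models, but its memory grows linearly and its compute grows quadratically with sequence length. Linear recurrent models, such as linear attention and state space models, are widely studied as alternatives to attention because of their linear compute and constant memory. These sub-quadratic token mixing methods, or mixers, deliver promising efficiency gains and competitive results on a wide range of benchmarks, yet current linear recurrent models still lag behind on tasks that require long-context retrieval or in-context learning. A growing body of work studies hybrid architectures that try to mitigate these trade-offs by statically interleaving or merging attention and recurrent blocks. This work explores a new axis for building hybrid models: along the token sequence. We propose Oryx, a hybrid model that can flexibly switch between different mixers throughout a sequence, for example quadratic attention for rich context utilization and linear recurrences for efficient generation. Oryx ties at least 90% of its parameters across mixers, enabling attention and recurrent modes to operate over shared internal representations. We validate the design with Mamba-2 and Gated DeltaNet variants, with models up to 1.4B. Under fixed token budgets and a mixed-training strategy, Oryx achieves performance comparable to or better than its single-mixer baselines. At the 1.4B scale, every instance of Oryx outperforms its respective baseline by at least 0.7 percentage points on averaged language modeling tasks. On retrieval tasks, Oryx achieves performance comparable to the Transformer baseline even when only a small fraction (<10%) of the tokens are processed in attention mode. These results suggest that attention and linear recurrent models can share internal representations, and they motivate sequence-axis hybridization as a promising direction.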
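The idea of switching mixers along the sequence while sharing parameters can be illustrated with a toy sketch. This is not the Oryx implementation, which the abstract does not specify: the function names (`run`, `feature`, `matvec`), the width `D`, the `elu + 1` feature map, and the choice of plain linear attention as the recurrent mixer (rather than Mamba-2 or Gated DeltaNet) are all assumptions made for illustration. The sketch only shows one set of shared query/key/value projections feeding either a causal softmax attention mixer or a linear recurrence, selected per token.

```python
import math
import random

D = 4  # toy model width (hypothetical)


def matvec(W, x):
    """Multiply a D x D matrix (list of rows) by a length-D vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def feature(v):
    """Positive feature map elu(v) + 1 for linear attention (an assumption)."""
    return [vi + 1.0 if vi > 0 else math.exp(vi) for vi in v]


def run(tokens, modes, Wq, Wk, Wv):
    """Mix a sequence causally; modes[t] is 'attn' or 'rec' for token t."""
    keys, values = [], []                # KV cache: grows with sequence length
    S = [[0.0] * D for _ in range(D)]    # recurrent state: fixed D x D
    z = [0.0] * D                        # recurrent normalizer: fixed length D
    outputs = []
    for x, mode in zip(tokens, modes):
        # Shared projections: both mixers read the same q, k, v.
        q, k, v = matvec(Wq, x), matvec(Wk, x), matvec(Wv, x)
        # Simplification: this toy updates BOTH memories at every step so that
        # either mode can be chosen for any token. A real system would
        # presumably avoid paying for both.
        keys.append(k)
        values.append(v)
        fk = feature(k)
        for i in range(D):
            z[i] += fk[i]
            for j in range(D):
                S[i][j] += fk[i] * v[j]
        if mode == "attn":
            # Causal softmax attention over all cached keys (cost grows with t).
            scores = [sum(a * b for a, b in zip(q, kk)) / math.sqrt(D) for kk in keys]
            m = max(scores)
            w = [math.exp(s - m) for s in scores]
            total = sum(w)
            out = [sum(wi * vv[j] for wi, vv in zip(w, values)) / total for j in range(D)]
        else:
            # Linear attention read-out from the fixed-size state (constant cost).
            fq = feature(q)
            denom = sum(a * b for a, b in zip(fq, z))
            out = [sum(fq[i] * S[i][j] for i in range(D)) / denom for j in range(D)]
        outputs.append(out)
    return outputs


random.seed(0)
Wq, Wk, Wv = ([[random.gauss(0, 1) for _ in range(D)] for _ in range(D)] for _ in range(3))
tokens = [[random.gauss(0, 1) for _ in range(D)] for _ in range(8)]
modes = ["attn"] * 2 + ["rec"] * 6   # attention for the prefix, recurrence after
outputs = run(tokens, modes, Wq, Wk, Wv)
print(len(outputs), len(outputs[0]))
```

In this toy, the only parameters are the shared projections, so the two modes differ solely in how they aggregate past keys and values. How Oryx actually ties parameters, handles state when switching modes, or schedules modes during training is not described in the abstract and is not modeled here.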
