Fugu-MT Paper Translation (Abstract): Olmo Hybrid: From Theory to Practice and Back

Paper summary: Olmo Hybrid: From Theory to Practice and Back

arxiv url: http://arxiv.org/abs/2604.03444v1
Date: Fri, 03 Apr 2026 20:36:34 GMT
Status: Translation complete
Last updated in system: 2026-04-07 15:49:18.582481
Title: Olmo Hybrid: From Theory to Practice and Back
Authors: William Merrill, Yanhong Li, Tyler Romero, Anej Svete, Caia Costello, Pradeep Dasigi, Dirk Groeneveld, David Heineman, Bailey Kuehl, Nathan Lambert, Jacob Morrison, Luca Soldaini, Finbarr Timbers, Pete Walsh, Noah A. Smith, Hannaneh Hajishirzi, Ashish Sabharwal
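Abstract summary: The paper shows that hybrid models do not merely inherit the expressivity of transformers and linear RNNs, but can express tasks beyond both. Olmo Hybrid also outperforms Olmo 3 across pretraining and mid-training evaluations. These results suggest that hybrid models mixing attention and recurrent layers are a powerful extension to the language modeling paradigm.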
Reference score (independently computed attention score): 108.39077753720733
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent work has demonstrated the potential of non-transformer language models, especially linear recurrent neural networks (RNNs) and hybrid models that mix recurrence and attention. Yet there is no consensus on whether the potential benefits of these new architectures justify the risk and effort of scaling them up. To address this, we provide evidence for the advantages of hybrid models over pure transformers on several fronts. First, theoretically, we show that hybrid models do not merely inherit the expressivity of transformers and linear RNNs, but can express tasks beyond both, such as code execution. Putting this theory to practice, we train Olmo Hybrid, a 7B-parameter model largely comparable to Olmo 3 7B but with the sliding window layers replaced by Gated DeltaNet layers. We show that Olmo Hybrid outperforms Olmo 3 across standard pretraining and mid-training evaluations, demonstrating the benefit of hybrid models in a controlled, large-scale setting. We find that the hybrid model scales significantly more efficiently than the transformer, explaining its higher performance. However, it is unclear why greater expressivity on specific formal problems should result in better scaling or superior performance on downstream tasks unrelated to those problems. To explain this apparent gap, we return to theory and argue why increased expressivity should translate to better scaling efficiency, completing the loop. Overall, our results suggest that hybrid models mixing attention and recurrent layers are a powerful extension to the language modeling paradigm: not merely to reduce memory during inference, but as a fundamental way to obtain more expressive models that scale better during pretraining.

Related papers