Fugu-MT Paper Translation (Abstract): Hybrid Linear Attention Done Right: Efficient Distillation and Effective Architectures for Extremely Long Contexts

Paper overview: Hybrid Linear Attention Done Right: Efficient Distillation and Effective Architectures for Extremely Long Contexts

arxiv url: http://arxiv.org/abs/2601.22156v1
Date: Thu, 29 Jan 2026 18:59:53 GMT
Status: Translation complete
Last updated in system: 2026-01-30 16:22:50.111591
Title: Hybrid Linear Attention Done Right: Efficient Distillation and Effective Architectures for Extremely Long Contexts
Authors: Yingfa Chen, Zhen Leng Thai, Zihan Zhou, Zhu Zhang, Xingyu Shen, Shuo Wang, Chaojun Xiao, Xu Han, Zhiyuan Liu
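Abstract summary: This paper presents HALO, a pipeline for distilling Transformer models into RNN-attention hybrid models. It also presents HypeNet, a hybrid architecture that achieves superior length generalization through a novel position encoding scheme. The conversion requires just 2.3B tokens, less than 0.01% of the pre-training data.
Reference score (attention score computed independently by the site): 27.8245634187787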
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Hybrid Transformer architectures, which combine softmax attention blocks and recurrent neural networks (RNNs), have shown a desirable performance-throughput tradeoff for long-context modeling, but their adoption and studies are hindered by the prohibitive cost of large-scale pre-training from scratch. Some recent studies have shown that pre-trained softmax attention blocks can be converted into RNN blocks through parameter transfer and knowledge distillation. However, these transfer methods require substantial amounts of training data (more than 10B tokens), and the resulting hybrid models also exhibit poor long-context performance, which is the scenario where hybrid models enjoy significant inference speedups over Transformer-based models. In this paper, we present HALO (Hybrid Attention via Layer Optimization), a pipeline for distilling Transformer models into RNN-attention hybrid models. We then present HypeNet, a hybrid architecture with superior length generalization enabled by a novel position encoding scheme (named HyPE) and various architectural modifications. We convert the Qwen3 series into HypeNet using HALO, achieving performance comparable to the original Transformer models while enjoying superior long-context performance and efficiency. The conversion requires just 2.3B tokens, less than 0.01% of their pre-training data.
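The "less than 0.01%" figure in the abstract can be sanity-checked with a few lines of arithmetic. The sketch below is only illustrative: the abstract does not state the size of the pre-training corpus, so the 36-trillion-token value is an assumption taken from the publicly reported Qwen3 pre-training scale, and the function name is hypothetical.

```python
# Sanity check of the "less than 0.01% of pre-training data" claim.
# ASSUMPTION: the pre-training corpus is about 36 trillion tokens (the
# publicly reported figure for Qwen3); the abstract does not give this number.
PRETRAIN_TOKENS = 36e12

# Token budgets quoted in the abstract.
HALO_TOKENS = 2.3e9    # tokens needed by the HALO conversion
PRIOR_TOKENS = 10e9    # lower bound quoted for earlier transfer methods


def percent_of_pretraining(tokens: float, pretrain_tokens: float) -> float:
    """Return `tokens` as a percentage of `pretrain_tokens`."""
    return 100.0 * tokens / pretrain_tokens


halo_pct = percent_of_pretraining(HALO_TOKENS, PRETRAIN_TOKENS)
prior_pct = percent_of_pretraining(PRIOR_TOKENS, PRETRAIN_TOKENS)

print(f"HALO conversion:   {halo_pct:.4f}% of pre-training data")
print(f"Prior methods (>): {prior_pct:.4f}% of pre-training data")
```

Under that assumption the conversion budget comes to roughly 0.0064% of the pre-training data, consistent with the abstract's bound, while the 10B-token budget of earlier methods would already exceed 0.01%.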
