Fugu-MT Paper Translation (Abstract): Long-Context Aware Upcycling: A New Frontier for Hybrid LLM Scaling

Paper summary: Long-Context Aware Upcycling: A New Frontier for Hybrid LLM Scaling

arxiv url: http://arxiv.org/abs/2604.24715v1
Date: Mon, 27 Apr 2026 17:23:37 GMT
Status: Translation complete
Last updated in system: 2026-04-28 17:12:08.266245
Title: Long-Context Aware Upcycling: A New Frontier for Hybrid LLM Scaling
Authors: Parsa Ashrafi Fashi, Utkarsh Saxena, Mehdi Rezagholizadeh, Aref Jafari, Akash Haridas, Mingyu Yang, Vansh Bhatia, Guihong Li, Vikram Appia, Emad Barsoum
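Abstract summary: HyLo extends usable context length by up to 32x through efficient post-training. HyLo delivers consistently strong short- and long-context performance. At similar scale, HyLo-Qwen-1.7B, trained on only 10B tokens, outperforms JetNemotron on GSM8K, Lm-Harness common sense reasoning, and RULER-64K.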
Reference score (independently computed attention score): 25.551309705184234
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Hybrid sequence models that combine efficient Transformer components with linear sequence modeling blocks are a promising alternative to pure Transformers, but most are still pretrained from scratch and therefore fail to reuse existing Transformer checkpoints. We study upcycling as a practical path to convert pretrained Transformer LLMs into hybrid architectures while preserving short-context quality and improving long-context capability. We call our solution HyLo (HYbrid LOng-context): a long-context upcycling recipe that combines architectural adaptation with efficient Transformer blocks, Multi-Head Latent Attention (MLA), and linear blocks (Mamba2 or Gated DeltaNet), together with staged long-context training and teacher-guided distillation for stable optimization. HyLo extends usable context length by up to 32x through efficient post-training and reduces KV-cache memory by more than 90%, enabling up to 2M-token prefill and decoding in our vLLM inference stack, while comparable Llama baselines run out of memory beyond 64K context. Across 1B- and 3B-scale settings (Llama- and Qwen-based variants), HyLo delivers consistently strong short- and long-context performance and significantly outperforms state-of-the-art upcycled hybrid baselines on long-context evaluations such as RULER. Notably, at similar scale, HyLo-Qwen-1.7B trained on only 10B tokens significantly outperforms JetNemotron (trained on 400B tokens) on GSM8K, Lm-Harness common sense reasoning and RULER-64K.
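To give a rough sense of why KV-cache size dominates at long context, the sketch below computes the cache footprint of a standard full-attention Transformer and then applies a 90% reduction. The layer count, KV-head count, head dimension, and 2-byte element size are hypothetical values chosen for illustration; they are not taken from the paper, and the function kv_cache_bytes is an illustrative helper, not part of vLLM or of HyLo. The 90% figure is used as a flat scaling factor, which is a simplification of how a hybrid with MLA and linear blocks actually saves memory.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_elem=2, batch_size=1):
    """Bytes needed to cache keys and values in full-attention layers."""
    # The leading factor 2 accounts for storing both a key and a value tensor.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * bytes_per_elem * batch_size)


# Hypothetical model dimensions, for illustration only (not from the paper).
LAYERS, KV_HEADS, HEAD_DIM = 28, 8, 128
GIB = 2 ** 30

if __name__ == "__main__":
    for ctx in (64 * 1024, 2 * 1024 * 1024):
        full = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, ctx)
        # The abstract reports a reduction of "more than 90%", so the
        # remaining 10% used here is an upper bound under that claim.
        reduced = full * 0.10
        print(f"context={ctx:>8} tokens  "
              f"full={full / GIB:7.1f} GiB  reduced<={reduced / GIB:6.1f} GiB")
```

The cache grows linearly with sequence length, so moving from 64K to 2M tokens multiplies it by 32 under these assumptions. Linear blocks such as Mamba2 keep a fixed-size state regardless of sequence length, and MLA caches a compressed latent rather than full per-head keys and values, which is where savings of this kind can come from.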


List of related papers