Fugu-MT Paper Translation (Summary): Stacked from One: Multi-Scale Self-Injection for Context Window Extension

Paper summary: Stacked from One: Multi-Scale Self-Injection for Context Window Extension

arxiv url: http://arxiv.org/abs/2603.04759v1
Date: Thu, 05 Mar 2026 03:16:16 GMT
Status: Translation complete
Last updated in system: 2026-03-06 22:06:11.050245
Title: Stacked from One: Multi-Scale Self-Injection for Context Window Extension
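Authors: Wei Han, Pan Zhou, Shuicheng Yan
Abstract summary: SharedLLM is a novel framework based on multi-grained context compression and query-aware information acquisition. It achieves performance superior or comparable to strong baselines.
Reference score (independently computed attention metric): 69.24689919827817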
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: The limited context window of contemporary large language models (LLMs) remains a primary bottleneck for their broader application across diverse domains. Although continual pre-training on long-context data offers a straightforward solution, it incurs prohibitive data acquisition and computational costs. To address this challenge, we propose SharedLLM, a novel framework based on multi-grained context compression and query-aware information acquisition. SharedLLM comprises two stacked short-context LLMs: a lower model serving as a compressor and an upper model acting as a decoder. The lower model compresses long inputs into compact, multi-grained representations, which are then forwarded to the upper model for context-aware processing. To maximize efficiency, this information transfer occurs exclusively at the lowest layers, bypassing lengthy forward passes and redundant cross-attention operations. This entire process, wherein the upper and lower models are derived from the same underlying LLM layers, is termed self-injection. To support this architecture, a specialized tree-based data structure enables the efficient encoding and query-aware retrieval of contextual information. Despite being trained on sequences of only 8K tokens, SharedLLM effectively generalizes to inputs exceeding 128K tokens. Across a comprehensive suite of long-context modeling and understanding benchmarks, SharedLLM achieves performance superior or comparable to strong baselines, striking an optimal balance between efficiency and accuracy. Furthermore, these design choices allow SharedLLM to substantially reduce the memory footprint and yield notable inference speedups (2× over streaming and 3× over encoder-decoder architectures).
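The abstract mentions a tree-based structure for multi-grained encoding and query-aware retrieval but gives no details. The following is only an illustrative sketch of one way such a structure could work, not the authors' implementation: a binary tree is built over a token sequence, and at each level the search descends into the half that looks more relevant to the query while keeping the other half as a single coarse node. The names `Node`, `build_tree`, `relevance`, and `select_nodes` are hypothetical, and the word-overlap score is a stand-in assumption for whatever relevance measure the paper actually uses.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    tokens: List[str]   # span of the context covered by this node
    start: int          # offset of the span in the original sequence
    depth: int          # deeper nodes cover shorter (finer-grained) spans
    children: List["Node"] = field(default_factory=list)


def build_tree(tokens: List[str], max_depth: int, start: int = 0, depth: int = 0) -> Node:
    """Recursively halve the span until max_depth is reached."""
    node = Node(tokens, start, depth)
    if depth < max_depth and len(tokens) > 1:
        mid = len(tokens) // 2
        node.children = [
            build_tree(tokens[:mid], max_depth, start, depth + 1),
            build_tree(tokens[mid:], max_depth, start + mid, depth + 1),
        ]
    return node


def relevance(node: Node, query: List[str]) -> int:
    """Toy relevance score (assumption): number of span tokens found in the query."""
    query_set = set(query)
    return sum(1 for tok in node.tokens if tok in query_set)


def select_nodes(root: Node, query: List[str]) -> List[Node]:
    """Keep the less relevant sibling coarse and descend into the more relevant one."""
    selected = []
    node = root
    while node.children:
        left, right = node.children
        if relevance(left, query) >= relevance(right, query):
            coarse, fine = right, left
        else:
            coarse, fine = left, right
        selected.append(coarse)
        node = fine
    selected.append(node)
    # Return spans in their original order; together they cover the whole input.
    return sorted(selected, key=lambda n: n.start)


context = "a b c d e f g h".split()
tree = build_tree(context, max_depth=2)
chosen = select_nodes(tree, query=["g"])
spans = [(n.start, len(n.tokens), n.depth) for n in chosen]
```

In this toy run the half that does not contain the query token stays as one four-token node, while the half that does is split into two-token nodes. A real system would presumably compress coarse nodes more aggressively than fine ones; that step is omitted here.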

Related papers