Fugu-MT Paper Translation (Summary): ILRe: Intermediate Layer Retrieval for Context Compression in Causal Language Models

Paper summary: ILRe: Intermediate Layer Retrieval for Context Compression in Causal Language Models

arxiv url: http://arxiv.org/abs/2508.17892v1
Date: Mon, 25 Aug 2025 10:59:02 GMT
Status: Translation complete
Last updated in system: 2025-08-26 18:43:45.740762
Title: ILRe: Intermediate Layer Retrieval for Context Compression in Causal Language Models
Title (reference translation): ILRe: Intermediate-layer retrieval for context compression in causal language models
Authors: Manlai Liang, Mandi Liu, Jiangzhou Ji, Huaijun Li, Haobo Yang, Yaohan He, Jinlong Li
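Abstract summary: We introduce a novel context compression pipeline called Intermediate Layer Retrieval (ILRe). ILRe determines one intermediate decoder layer offline, encodes context by streaming chunked prefill only up to that layer, and recalls tokens by the attention scores between the input query and the full key cache in that layer. Without additional post-training or operator development, ILRe can process a single 1M-token request in less than half a minute.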
Reference score (independently computed attention level): 4.951427498576812
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large Language Models (LLMs) have demonstrated success across many benchmarks. However, they still exhibit limitations in long-context scenarios, primarily due to their short effective context length, quadratic computational complexity, and high memory overhead when processing lengthy inputs. To mitigate these issues, we introduce a novel context compression pipeline, called Intermediate Layer Retrieval (ILRe), which determines one intermediate decoder layer offline, encodes context by streaming chunked prefill only up to that layer, and recalls tokens by the attention scores between the input query and the full key cache in that specified layer. In particular, we propose a multi-pooling-kernel allocation strategy in the token recall process to maintain the completeness of semantics. Our approach not only reduces the prefilling complexity from $O(L^2)$ to $O(L)$, but also achieves performance comparable to or better than the full context in long-context scenarios. Without additional post-training or operator development, ILRe can process a single $1M$-token request in less than half a minute (speedup $\approx 180\times$) and achieves a score of $\approx 79.8$ on the RULER-$1M$ benchmark with the Llama-3.1-UltraLong-8B-1M-Instruct model on a Huawei Ascend 910B NPU.
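Abstract (reference translation): Large language models (LLMs) have succeeded on many benchmarks. In long-context scenarios, however, they remain limited, mainly by a short effective context length, quadratic computational complexity, and high memory overhead when processing long inputs. To mitigate these issues, we introduce a new context compression pipeline called Intermediate Layer Retrieval (ILRe), which determines one intermediate decoder layer offline, encodes the context by streaming chunked prefill only up to that layer, and recalls tokens by the attention scores between the input query and the full key cache in that layer. In particular, we propose a strategy for allocating multiple pooling kernels in the token recall process to preserve semantic completeness. Our approach not only reduces prefill complexity from $O(L^2)$ to $O(L)$ but also achieves performance comparable to or better than the full context in long-context scenarios. Without additional post-training or operator development, ILRe can process a single $1M$-token request in less than half a minute (speedup $\approx 180\times$) and scores $\approx 79.8$ on the RULER-$1M$ benchmark with the Llama-3.1-UltraLong-8B-1M-Instruct model on a Huawei Ascend 910B NPU.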
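
The token recall step described in the abstract can be illustrated with a small pure-Python sketch. It is not the authors' implementation: the abstract does not specify how the multi-pooling kernels are allocated, so the equal split of the token budget across kernel sizes, the single query vector, and the function names `attention_scores`, `max_pool`, and `recall_tokens` are all assumptions made for illustration. The sketch also leaves out the offline layer selection and the chunked prefill; it only shows how attention scores between a query and a key cache could be smoothed with several pooling windows and then used to pick which context tokens to keep.

```python
import math


def attention_scores(query, keys):
    """Softmax of scaled dot products between one query vector and every cached key."""
    scale = 1.0 / math.sqrt(len(query))
    logits = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(x - peak) for x in logits]
    total = sum(weights)
    return [w / total for w in weights]


def max_pool(scores, kernel):
    """Centered 1-D max pooling with stride 1; the output length equals the input length."""
    half = kernel // 2
    n = len(scores)
    return [max(scores[max(0, i - half):min(n, i + half + 1)]) for i in range(n)]


def recall_tokens(query, keys, budget, kernels=(1, 3)):
    """Select up to `budget` token positions.

    Assumption (not from the paper): the budget is split equally across the
    pooling kernel sizes, and each kernel contributes its top-scoring positions.
    """
    scores = attention_scores(query, keys)
    share = budget // len(kernels)
    selected = set()
    for kernel in kernels:
        pooled = max_pool(scores, kernel)
        ranked = sorted(range(len(pooled)), key=lambda i: pooled[i], reverse=True)
        selected.update(ranked[:share])
    return sorted(selected)  # keep the original token order


# Hypothetical toy key cache with six 2-dimensional keys and one query vector.
keys = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0]]
query = [4.0, 0.0]
picked = recall_tokens(query, keys, budget=4)
print(picked)
```

In this toy run the kernel of size 1 keeps the two individually highest-scoring positions, while the kernel of size 3 also pulls in a neighbour of the top position. That is the intuition behind pooling here: wider windows recall the tokens surrounding a strong match, so the recalled text stays semantically complete.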

Related papers