Fugu-MT Paper Translation (Summary): READER: Retrieval-Assisted Drafter for Efficient LLM Inference

Paper summary: READER: Retrieval-Assisted Drafter for Efficient LLM Inference

arxiv url: http://arxiv.org/abs/2508.09072v2
Date: Sat, 27 Sep 2025 20:13:25 GMT
Status: Translation complete
Updated in system: 2025-09-30 22:32:18.759981
Title: READER: Retrieval-Assisted Drafter for Efficient LLM Inference
Authors: Maxim Divilkovskiy, Vitaly Malygin, Sergey Zlobin, Stanislav Ilyushin, Sultan Isali, Vasily Kalugin, Nuriza Aitassova, Fei Yi, Weidi Zeng
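Abstract summary: Autoregressive language models instantiate a factorized likelihood over token sequences, yet their strictly sequential decoding process imposes an intrinsic lower bound on inference latency. This bottleneck has emerged as a central obstacle to the scalable deployment of large-scale generative models. The paper presents READER, a speculative decoding framework that bypasses the training of an auxiliary draft model.
Reference score (attention score computed independently by the site): 0.0386965802948046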
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Autoregressive Language Models instantiate a factorized likelihood over token sequences, yet their strictly sequential decoding process imposes an intrinsic lower bound on inference latency. This bottleneck has emerged as a central obstacle to the scalable deployment of large-scale generative models. Existing acceleration techniques partially mitigate token-level latency by relying on auxiliary draft models or introducing an additional training phase, but fail to address the dominant memory and communication costs. We present READER, a provably lossless speculative decoding framework that bypasses the training of the auxiliary draft model. READER formalizes speculative decoding as a stochastic tree construction problem and exploits the empirical redundancy structure of natural language to generate high-probability candidate continuations. Our method revisits the problem of constructing draft trees, establishing substantial statistical improvements over stochastic draft-tree methods and providing a complexity-theoretic analysis that characterizes the optimality frontier of speculative decoding under bounded computation and memory resources. Beyond the single-sequence regime traditionally considered in prior work, we introduce a memory-optimal key-value cache-serving strategy that guarantees amortized sublinear overhead in the batch dimension, allowing READER to scale to realistic inference workloads. Comprehensive experiments demonstrate up to 6.13x wall-clock speedup on single-prompt inference and up to 5.92x on batched inference, consistently surpassing prior speculative decoding baselines, while preserving exact output equivalence, with even more pronounced gains in retrieval-augmented generation pipelines. Our results close a key gap between theoretical parallelism limits and practical LLM inference, suggesting a new standard for efficient deployment.
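The abstract describes drafting candidate continuations from the redundancy of natural language and then verifying them losslessly with the target model. The following is a minimal toy sketch of that general idea only, not READER's algorithm: it drafts a single chain (no draft tree) by looking up the longest matching context suffix in a reference token list, and verifies under greedy decoding. All names here (`retrieve_draft`, `speculative_decode`, `toy_target`, `PHRASE`) are hypothetical, and the per-position calls to the target stand in for what a real system would do in one batched forward pass.

```python
from typing import Callable, List, Sequence, Tuple

Token = str
TargetFn = Callable[[Sequence[Token]], Token]


def retrieve_draft(context: Sequence[Token], corpus: Sequence[Token],
                   max_ngram: int = 3, draft_len: int = 4) -> List[Token]:
    """Propose a continuation by matching the longest context suffix in the corpus."""
    for n in range(min(max_ngram, len(context)), 0, -1):
        suffix = list(context[-n:])
        # Scan from the end so the most recent occurrence wins.
        for start in range(len(corpus) - n, -1, -1):
            if list(corpus[start:start + n]) == suffix:
                draft = list(corpus[start + n:start + n + draft_len])
                if draft:
                    return draft
    return []  # no match: fall back to ordinary one-token decoding


def greedy_decode(target: TargetFn, prompt: Sequence[Token],
                  num_tokens: int) -> List[Token]:
    """Baseline: one target call per generated token."""
    out = list(prompt)
    for _ in range(num_tokens):
        out.append(target(out))
    return out


def speculative_decode(target: TargetFn, prompt: Sequence[Token],
                       corpus: Sequence[Token], num_tokens: int,
                       draft_len: int = 4) -> Tuple[List[Token], int]:
    """Greedy speculative decoding with a retrieval drafter (toy, single chain)."""
    out = list(prompt)
    goal = len(out) + num_tokens
    verify_steps = 0
    while len(out) < goal:
        draft = retrieve_draft(out, corpus, draft_len=draft_len)
        verify_steps += 1  # stands in for one batched verification pass
        accepted: List[Token] = []
        for tok in draft:
            # Keep a draft token only if the target would have produced it.
            if target(out + accepted) != tok:
                break
            accepted.append(tok)
        # The target always contributes one token (a correction or a bonus token).
        accepted.append(target(out + accepted))
        out.extend(accepted)
    return out[:goal], verify_steps


# Hypothetical deterministic "target model": cycles through a fixed phrase.
PHRASE = "the cat sat on the mat".split()


def toy_target(context: Sequence[Token]) -> Token:
    return PHRASE[len(context) % len(PHRASE)]


corpus = PHRASE * 3
spec_out, steps = speculative_decode(toy_target, ["the"], corpus, num_tokens=12)
base_out = greedy_decode(toy_target, ["the"], num_tokens=12)
```

Because every accepted token is checked against the target's own greedy choice, `spec_out` matches `base_out` exactly, while `steps` counts far fewer verification rounds than generated tokens. Sampling-based (non-greedy) lossless verification requires a different acceptance rule and is not shown here.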


Related papers