Fugu-MT Paper Translation (Abstract): READER: Retrieval-Assisted Drafter for Efficient LLM Inference

Paper overview: READER: Retrieval-Assisted Drafter for Efficient LLM Inference

arxiv url: http://arxiv.org/abs/2508.09072v1
Date: Tue, 12 Aug 2025 16:47:48 GMT
Status: Translation complete
Last updated in system: 2025-08-13 21:07:34.510768
Title: READER: Retrieval-Assisted Drafter for Efficient LLM Inference
Authors: Maxim Divilkovskiy, Vitaly Malygin, Sergey Zlobin, Sultan Isali, Vasily Kalugin, Stanislav Ilyushin, Nuriza Aitassova, Yi Fei, Zeng Weidi
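Abstract summary: Large Language Models (LLMs) generate tokens autoregressively, with each token depending on the preceding context. This paper introduces READER, a novel speculative decoding method that enhances model-based approaches by leveraging self-repetitions in the text. The algorithm expands the speculative decoding tree using tokens obtained through statistical search.
Reference score (attention score computed by this site): 0.45606683918876856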
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large Language Models (LLMs) generate tokens autoregressively, with each token depending on the preceding context. This sequential nature makes the inference process inherently difficult to accelerate, posing a significant challenge for efficient deployment. In recent years, various methods have been proposed to address this issue, with the most effective approaches often involving the training of additional draft models. In this paper, we introduce READER (Retrieval-Assisted Drafter for Efficient LLM Inference), a novel lossless speculative decoding method that enhances model-based approaches by leveraging self-repetitions in the text. Our algorithm expands the speculative decoding tree using tokens obtained through statistical search. This work focuses on large batch sizes (>= 8), an underexplored yet important area for industrial applications. We also analyze the key-value (KV) cache size during speculative decoding and propose an optimization to improve performance for large batches. As a result, READER outperforms existing speculative decoding methods. Notably, READER requires no additional training and can reuse pre-trained speculator models, increasing the speedup by over 40%. Our method demonstrates particularly strong performance on search-based tasks, such as retrieval-augmented generation, where we achieve more than 10x speedup.
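Illustrative sketch (not from the paper): the abstract does not specify how READER's statistical search or tree expansion works, so the Python below swaps in a much simpler, well-known stand-in, n-gram lookup over the existing context, to show how self-repetition can supply draft tokens and how greedy verification keeps the output lossless. The names `retrieve_draft`, `verify_greedy`, and `next_token`, and the defaults `ngram=2` and `max_draft=4`, are all hypothetical. A real system would verify the whole draft in one batched forward pass of the target model; the loop here is sequential only for readability.

```python
from typing import Callable, List, Sequence


def retrieve_draft(tokens: Sequence[int], ngram: int = 2, max_draft: int = 4) -> List[int]:
    """Propose draft tokens by locating an earlier copy of the trailing n-gram.

    Assumption: plain n-gram lookup, a stand-in for the paper's statistical search.
    """
    if len(tokens) <= ngram:
        return []
    suffix = list(tokens[-ngram:])
    # Scan backwards from the most recent position that is not the suffix itself.
    for start in range(len(tokens) - ngram - 1, -1, -1):
        if list(tokens[start:start + ngram]) == suffix:
            follow = start + ngram
            # The tokens that followed the earlier occurrence become the draft.
            return list(tokens[follow:follow + max_draft])
    return []


def verify_greedy(
    context: Sequence[int],
    draft: Sequence[int],
    next_token: Callable[[List[int]], int],
) -> List[int]:
    """Accept the longest draft prefix that matches the target model's greedy choices.

    The result equals what greedy decoding alone would have produced, which is
    the sense in which this toy verification is lossless.
    """
    accepted: List[int] = []
    for tok in draft:
        expected = next_token(list(context) + accepted)
        if tok != expected:
            # First mismatch: keep the target model's token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(tok)
    # Every draft token matched, so one extra target token comes for free.
    accepted.append(next_token(list(context) + accepted))
    return accepted


if __name__ == "__main__":
    # Toy "target model" (hypothetical): cycles through 1, 2, 3, 4, 1, 2, ...
    toy_model = lambda seq: (seq[-1] % 4) + 1
    context = [1, 2, 3, 4, 1, 2]
    draft = retrieve_draft(context)
    print(draft, verify_greedy(context, draft, toy_model))
```

In this toy setup the context repeats itself, so every retrieved draft token is accepted and several tokens are emitted per verification step. On text with little repetition the draft is empty or rejected early, and decoding falls back to one token per step.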

Related papers