Fugu-MT Paper Translation (Summary): RACER: Retrieval-Augmented Contextual Rapid Speculative Decoding

Paper summary: RACER: Retrieval-Augmented Contextual Rapid Speculative Decoding

arxiv url: http://arxiv.org/abs/2604.14885v1
Date: Thu, 16 Apr 2026 11:23:55 GMT
Status: Translation complete
System update date: 2026-04-17 21:29:31.867025
Title: RACER: Retrieval-Augmented Contextual Rapid Speculative Decoding
Authors: Zihong Zhang, Zuchao Li, Lefei Zhang, Ping Wang, Hai Zhao
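Abstract summary: Autoregressive decoding in large language models (LLMs) generates one token per step, causing high inference latency. We propose RACER, a lightweight, training-free method that integrates retrieved exact patterns with logit-driven future cues. In experiments on Spec-Bench, HumanEval, and MGSM-ZH, RACER consistently accelerates inference, achieving more than $2\times$ speedup over autoregressive decoding.
Reference score (independently computed attention score): 80.12789199134511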
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Autoregressive decoding in Large Language Models (LLMs) generates one token per step, causing high inference latency. Speculative decoding (SD) mitigates this through a guess-and-verify strategy, but existing training-free variants face trade-offs: retrieval-based drafts break when no exact match exists, while logits-based drafts lack structural guidance. We propose $\textbf{RACER}$ ($\textbf{R}$etrieval-$\textbf{A}$ugmented $\textbf{C}$ont$\textbf{e}$xtual $\textbf{R}$apid Speculative Decoding), a lightweight and training-free method that integrates retrieved exact patterns with logit-driven future cues. This unification supplies both reliable anchors and flexible extrapolation, yielding richer speculative drafts. Experiments on Spec-Bench, HumanEval, and MGSM-ZH demonstrate that RACER consistently accelerates inference, achieving more than $2\times$ speedup over autoregressive decoding, and outperforms prior training-free methods, offering a scalable, plug-and-play solution for efficient LLM decoding. Our source code is available at https://github.com/hkr04/RACER.
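Abstract (reference translation): Autoregressive decoding in large language models (LLMs) generates one token per step, causing high inference latency. Speculative decoding (SD) mitigates this through a guess-and-verify strategy, but existing training-free variants face trade-offs. We propose $\textbf{R}$etrieval-$\textbf{A}$ugmented $\textbf{C}$ont$\textbf{e}$xtual $\textbf{R}$apid Speculative Decoding, a lightweight, training-free method that integrates retrieved exact patterns with logit-driven future cues. This unification provides both reliable anchors and flexible extrapolation, yielding richer speculative drafts. In experiments on Spec-Bench, HumanEval, and MGSM-ZH, RACER consistently accelerates inference, achieving more than $2\times$ speedup over autoregressive decoding. Our source code is available at https://github.com/hkr04/RACER.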
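
The guess-and-verify strategy named in the abstract can be illustrated with a minimal sketch. This is not the RACER algorithm and is not taken from the authors' repository: it shows only a generic retrieval-based draft (copy what followed the most recent exact n-gram match in the context) with greedy verification, and it omits the logit-driven cues entirely. All names here (`retrieve_draft`, `speculative_decode`, `toy_target`) are hypothetical, and `toy_target` is a deterministic stand-in for a real LLM.

```python
from typing import Callable, List, Sequence, Tuple

Token = int
NextTokenFn = Callable[[Sequence[Token]], Token]


def retrieve_draft(context: Sequence[Token], ngram: int = 2,
                   max_draft: int = 4) -> List[Token]:
    """Copy the tokens that followed the most recent earlier occurrence of
    the last `ngram` tokens. Returns [] when no exact match exists."""
    if len(context) <= ngram:
        return []
    suffix = list(context[-ngram:])
    # Scan right to left, skipping the suffix's own position.
    for start in range(len(context) - ngram - 1, -1, -1):
        if list(context[start:start + ngram]) == suffix:
            return list(context[start + ngram:start + ngram + max_draft])
    return []


def autoregressive_decode(target_next: NextTokenFn, prompt: Sequence[Token],
                          num_new: int) -> List[Token]:
    """Baseline: one target call per generated token."""
    tokens = list(prompt)
    for _ in range(num_new):
        tokens.append(target_next(tokens))
    return tokens


def speculative_decode(target_next: NextTokenFn, prompt: Sequence[Token],
                       num_new: int) -> Tuple[List[Token], int]:
    """Greedy guess-and-verify. Returns the tokens and the number of
    verification passes. A real LLM would score every draft position in one
    forward pass; this toy calls `target_next` once per position instead."""
    tokens = list(prompt)
    goal = len(tokens) + num_new
    passes = 0
    while len(tokens) < goal:
        draft = retrieve_draft(tokens)
        passes += 1
        accepted: List[Token] = []
        for guess in draft:
            if guess != target_next(tokens + accepted):
                break  # first mismatch: discard the rest of the draft
            accepted.append(guess)
        # The target always contributes one token: the correction after a
        # mismatch, or a bonus token when the whole draft was accepted.
        accepted.append(target_next(tokens + accepted))
        tokens.extend(accepted)
    return tokens[:goal], passes


PATTERN = [5, 7, 9, 11]  # hypothetical repeating output of the toy target


def toy_target(context: Sequence[Token]) -> Token:
    """Toy stand-in for an LLM's greedy next-token choice."""
    return PATTERN[len(context) % len(PATTERN)]


if __name__ == "__main__":
    prompt = [5, 7, 9, 11, 5, 7]
    baseline = autoregressive_decode(toy_target, prompt, 20)
    spec, passes = speculative_decode(toy_target, prompt, 20)
    print(spec == baseline, passes)
```

Because every accepted token equals the target's own greedy choice, the output matches plain autoregressive decoding; the saving comes only from needing fewer verification passes. When `retrieve_draft` finds no match it returns an empty draft and the loop falls back to one token per pass, which is the failure mode of retrieval-only drafts that the abstract describes.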


Related papers