Fugu-MT Paper Translation (Summary): ConFu: Contemplate the Future for Better Speculative Sampling

Paper summary: ConFu: Contemplate the Future for Better Speculative Sampling

arxiv url: http://arxiv.org/abs/2603.08899v1
Date: Mon, 09 Mar 2026 20:11:06 GMT
Status: Translation complete
Last updated (in system): 2026-03-11 15:25:23.809406
Title: ConFu: Contemplate the Future for Better Speculative Sampling
Authors: Zongyue Qin, Raghavv Goel, Mukul Gagrani, Risheek Garrepalli, Mingu Lee, Yizhou Sun,
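Abstract summary: ConFu (Contemplate the Future) is a novel speculative decoding framework that enables the draft model to anticipate the future direction of generation. We believe our work is the first to bridge speculative decoding with continuous reasoning tokens, offering a new direction for accelerating LLM inference.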
Reference score (attention score computed by the site): 40.48053935426729
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Speculative decoding has emerged as a powerful approach to accelerate large language model (LLM) inference by employing lightweight draft models to propose candidate tokens that are subsequently verified by the target model. The effectiveness of this paradigm critically depends on the quality of the draft model. While recent advances such as the EAGLE series achieve state-of-the-art speedup, existing draft models remain limited by error accumulation: they condition only on the current prefix, causing their predictions to drift from the target model over steps. In this work, we propose ConFu (Contemplate the Future), a novel speculative decoding framework that enables draft models to anticipate the future direction of generation. ConFu introduces (i) contemplate tokens and soft prompts that allow the draft model to leverage future-oriented signals from the target model at negligible cost, (ii) a dynamic contemplate token mechanism with MoE to enable context-aware future prediction, and (iii) a training framework with anchor token sampling and future prediction replication that learns robust future prediction. Experiments demonstrate that ConFu improves token acceptance rates and generation speed over EAGLE-3 by 8–11% across various downstream tasks with Llama-3 3B and 8B models. We believe our work is the first to bridge speculative decoding with continuous reasoning tokens, offering a new direction for accelerating LLM inference.
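The abstract describes the draft-then-verify loop only in words. As a rough illustration, the sketch below implements a chain (non-tree) version of the standard speculative sampling acceptance rule from earlier work (Leviathan et al., 2023; Chen et al., 2023), i.e. the kind of target-model verification the abstract refers to. It does not model ConFu's contemplate tokens, soft prompts, or MoE, and EAGLE-style systems typically verify token trees, which this sketch does not cover. The toy dictionaries stand in for model outputs, and all function names (`sample`, `verify_draft`, `expected_tokens_per_step`) are hypothetical. The last helper assumes each draft token is accepted independently with a fixed rate (a simplification) to show why a higher acceptance rate means fewer target-model calls per generated token.

```python
import random
from typing import Dict, List

Dist = Dict[str, float]  # toy stand-in for a model's next-token distribution


def sample(dist: Dist, rng: random.Random) -> str:
    """Draw one token from an (unnormalized) weight dictionary."""
    items = [(tok, w) for tok, w in dist.items() if w > 0]
    r = rng.random() * sum(w for _, w in items)
    for tok, w in items:
        r -= w
        if r < 0:
            return tok
    return items[-1][0]  # guard against floating-point round-off


def verify_draft(draft: List[str], q_dists: List[Dist],
                 p_dists: List[Dist], rng: random.Random) -> List[str]:
    """Verify drafted tokens against the target model (hypothetical helper).

    q_dists[i]: draft-model distribution that produced draft[i].
    p_dists[i]: target-model distribution at the same position; p_dists has
    one extra entry, used for a bonus token when every draft is accepted.
    """
    out: List[str] = []
    for i, x in enumerate(draft):
        p, q = p_dists[i], q_dists[i]
        # Accept x with probability min(1, p(x) / q(x)).
        if rng.random() < min(1.0, p.get(x, 0.0) / q[x]):
            out.append(x)
            continue
        # On rejection, resample from the normalized residual max(0, p - q)
        # and discard the remaining draft tokens.
        residual = {t: max(0.0, pt - q.get(t, 0.0)) for t, pt in p.items()}
        out.append(sample(residual, rng))
        return out
    # All drafts accepted: the extra target distribution yields one more token.
    out.append(sample(p_dists[len(draft)], rng))
    return out


def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Mean tokens emitted per target call with gamma drafts, assuming each
    draft is accepted independently with probability alpha."""
    if alpha >= 1.0:
        return float(gamma + 1)
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)


if __name__ == "__main__":
    rng = random.Random(0)
    q = {"a": 0.8, "b": 0.2}  # draft model over-predicts "a"
    p = {"a": 0.3, "b": 0.7}  # target model prefers "b"
    draft = [sample(q, rng) for _ in range(3)]
    print(verify_draft(draft, [q] * 3, [p] * 4, rng))
    print(expected_tokens_per_step(0.8, 4))
```

Because a rejected position is resampled from max(0, p - q), each emitted token follows the target distribution p, so under this rule the draft model affects speed rather than output quality. This is why work such as ConFu focuses on making draft predictions agree more often with the target model.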


Related papers