Fugu-MT paper translation (abstract): Fail Fast, Win Big: Rethinking the Drafting Strategy in Speculative Decoding via Diffusion LLMs

Paper overview: Fail Fast, Win Big: Rethinking the Drafting Strategy in Speculative Decoding via Diffusion LLMs

arxiv url: http://arxiv.org/abs/2512.20573v1
Date: Tue, 23 Dec 2025 18:16:58 GMT
Status: Translation complete
Last updated in system: 2025-12-24 19:17:49.96409
Title: Fail Fast, Win Big: Rethinking the Drafting Strategy in Speculative Decoding via Diffusion LLMs
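Title (reference translation): Fail Fast, Win Big: Rethinking the Drafting Strategy in Speculative Decoding via Diffusion LLMs
Authors: Rui Pan, Zhuofu Chen, Ravi Netravali
Abstract summary: We show that the speed dLLMs gain from parallel decoding drastically lowers the risk of costly rejections. We present FailFast, a dLLM-based speculative decoding framework.
Reference score (independently computed attention score): 8.881949061263784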
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Diffusion Large Language Models (dLLMs) offer fast, parallel token generation, but their standalone use is plagued by an inherent efficiency-quality tradeoff. We show that, if carefully applied, the attributes of dLLMs can actually be a strength for drafters in speculative decoding with autoregressive (AR) verifiers. Our core insight is that dLLM's speed from parallel decoding drastically lowers the risk of costly rejections, providing a practical mechanism to effectively realize the (elusive) lengthy drafts that lead to large speedups with speculative decoding. We present FailFast, a dLLM-based speculative decoding framework that realizes this approach by dynamically adapting its speculation length. It "fails fast" by spending minimal compute in hard-to-speculate regions to shrink speculation latency and "wins big" by aggressively extending draft lengths in easier regions to reduce verification latency (in many cases, speculating and accepting 70 tokens at a time!). Without any fine-tuning, FailFast delivers lossless acceleration of AR LLMs and achieves up to 4.9$\times$ speedup over vanilla decoding, 1.7$\times$ over the best naive dLLM drafter, and 1.4$\times$ over EAGLE-3 across diverse models and workloads. We open-source FailFast at https://github.com/ruipeterpan/failfast.
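Abstract (reference translation): Diffusion Large Language Models (dLLMs) offer fast, parallel token generation, but their standalone use suffers from an inherent efficiency-quality tradeoff. We show that, when applied carefully, the properties of dLLMs can in fact be a strength for drafters in speculative decoding with autoregressive (AR) verifiers. Our core insight is that the speed dLLMs gain from parallel decoding greatly reduces the risk of costly rejections, providing a practical mechanism for realizing the (elusive) long drafts that lead to large speedups in speculative decoding. We present FailFast, a dLLM-based speculative decoding framework that realizes this approach by dynamically adapting its speculation length. It "fails fast" by spending minimal compute in hard-to-speculate regions to shrink speculation latency, and "wins big" by aggressively extending draft lengths in easier regions to reduce verification latency (in many cases, speculating and accepting 70 tokens at a time). Without any fine-tuning, FailFast delivers lossless acceleration of AR LLMs, achieving up to 4.9$\times$ speedup over vanilla decoding, 1.7$\times$ over the best naive dLLM drafter, and 1.4$\times$ over EAGLE-3 across diverse models and workloads. FailFast is open-sourced at https://github.com/ruipeterpan/failfast.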
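
Illustrative sketch (not from the paper): the abstract describes adapting the speculation length so that little is spent where drafts are likely to be rejected and long drafts are used where they are likely to be accepted. The toy Python below shows one way such a loop could look under greedy decoding. It is not the FailFast implementation: the doubling/reset rule for the draft length, the function names `target_next`, `draft_block`, and `speculative_decode`, and the integer "tokens" are all hypothetical stand-ins chosen for illustration. The verifier is called once per position here for clarity, whereas a real AR verifier would score the whole draft in one parallel pass, and the usual bonus token after a fully accepted draft is omitted.

```python
from typing import Callable, List, Tuple


def target_next(ctx: List[int]) -> int:
    """Toy stand-in for a greedy AR verifier: a deterministic next token."""
    return (7 * ctx[-1] + len(ctx)) % 11


def draft_block(ctx: List[int], k: int) -> List[int]:
    """Toy stand-in for a drafter that proposes k tokens at once.

    It agrees with the verifier except at "hard" positions (assumed here to
    be context lengths divisible by 9), where it proposes a wrong token.
    """
    cur, block = list(ctx), []
    for _ in range(k):
        tok = target_next(cur)
        if len(cur) % 9 == 0:
            tok = (tok + 1) % 11  # injected drafting error
        cur.append(tok)
        block.append(tok)
    return block


def speculative_decode(
    verify_next: Callable[[List[int]], int],
    propose: Callable[[List[int], int], List[int]],
    prompt: List[int],
    max_new: int,
    min_len: int = 2,
    max_len: int = 16,
) -> Tuple[List[int], int]:
    """Greedy speculative decoding with a hypothetical adaptive draft length.

    Returns the full token sequence and the number of draft/verify rounds.
    """
    out = list(prompt)
    draft_len = min_len
    rounds = 0
    while len(out) - len(prompt) < max_new:
        remaining = max_new - (len(out) - len(prompt))
        draft = propose(out, min(draft_len, remaining))
        rounds += 1
        rejected = False
        for tok in draft:
            expected = verify_next(out)
            if tok != expected:
                out.append(expected)  # keep the verifier's token instead
                rejected = True
                break
            out.append(tok)
        # Assumed heuristic: shrink to the minimum after a rejection
        # ("fail fast"), double after a fully accepted draft ("win big").
        draft_len = min_len if rejected else min(max_len, draft_len * 2)
    return out, rounds
```

Because every emitted token is either accepted by the verifier or replaced with the verifier's own choice, the output of `speculative_decode(target_next, draft_block, prompt, n)` matches plain greedy decoding with `target_next`, while using fewer rounds than tokens whenever drafts are accepted.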

Related papers