Fugu-MT Paper Translation (Abstract): Orchestrating Dual-Boundaries: An Arithmetic Intensity Inspired Acceleration Framework for Diffusion Language Models

Paper overview: Orchestrating Dual-Boundaries: An Arithmetic Intensity Inspired Acceleration Framework for Diffusion Language Models

arxiv url: http://arxiv.org/abs/2511.21759v1
Date: Mon, 24 Nov 2025 13:36:54 GMT
Status: Translation complete
System update date: 2025-12-01 19:47:55.217149
Title: Orchestrating Dual-Boundaries: An Arithmetic Intensity Inspired Acceleration Framework for Diffusion Language Models
Authors: Linye Wei, Wenjue Chen, Pingzhi Tang, Xiaotian Guo, Le Ye, Runsheng Wang, Meng Li
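Abstract summary: ODB-dLLM is a framework that orchestrates dual boundaries to accelerate dLLM inference. We show that ODB-dLLM achieves speedups of 46-162x over the baseline dLLM and 2.63-6.30x over Fast-dLLM.
Reference score (independently computed attention measure): 8.516574616235427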
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Diffusion-based large language models (dLLMs) have recently gained significant attention for their exceptional performance and inherent potential for parallel decoding. Existing frameworks further enhance their inference efficiency by enabling KV caching. However, their bidirectional attention mechanism necessitates periodic cache refreshes that interleave prefill and decoding phases, both contributing substantial inference cost and constraining achievable speedup. Inspired by the heterogeneous arithmetic intensity of the prefill and decoding phases, we propose ODB-dLLM, a framework that orchestrates dual-boundaries to accelerate dLLM inference. In the prefill phase, we find that the predefined fixed response length introduces heavy yet redundant computational overhead, which affects efficiency. To alleviate this, ODB-dLLM incorporates an adaptive length prediction mechanism that progressively reduces prefill overhead and unnecessary computation. In the decoding phase, we analyze the computational characteristics of dLLMs and propose a dLLM-specific jump-share speculative decoding method to enhance efficiency by reducing the number of decoding iterations. Experimental results demonstrate that ODB-dLLM achieves 46-162x and 2.63-6.30x speedups over the baseline dLLM and Fast-dLLM, respectively, while simultaneously mitigating the accuracy degradation in existing acceleration frameworks.
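The abstract's reference to the "heterogeneous arithmetic intensity" of the prefill and decoding phases can be made concrete with a rough estimate. The sketch below is not taken from the paper. It assumes a single square linear layer, 2-byte (fp16) elements, and token counts chosen purely for illustration, and the function name `linear_layer_intensity` is hypothetical. It counts FLOPs per byte moved for a pass over many tokens (prefill-like) and a pass over a small block of tokens (decode-like).

```python
# Illustrative estimate only; every size below is an assumption,
# not a value reported in the paper.

def linear_layer_intensity(num_tokens: int, d_model: int, bytes_per_elem: int = 2) -> float:
    """FLOPs per byte for Y = X @ W, with X: (num_tokens, d_model), W: (d_model, d_model)."""
    # One multiply and one add per (token, input feature, output feature) triple.
    flops = 2 * num_tokens * d_model * d_model
    bytes_moved = bytes_per_elem * (
        num_tokens * d_model      # read activations X
        + d_model * d_model       # read weights W
        + num_tokens * d_model    # write outputs Y
    )
    return flops / bytes_moved

D_MODEL = 4096  # assumed hidden size

prefill = linear_layer_intensity(num_tokens=1024, d_model=D_MODEL)  # many tokens per pass
decode = linear_layer_intensity(num_tokens=32, d_model=D_MODEL)     # few tokens per pass

print(f"prefill-like pass: {prefill:.1f} FLOPs/byte")
print(f"decode-like pass:  {decode:.1f} FLOPs/byte")
```

Under these assumptions the prefill-like pass performs far more arithmetic per byte of memory traffic than the decode-like pass, because the weight matrix is read once regardless of how many tokens share it. This is the usual sense in which one phase is described as compute-bound and the other as memory-bound. How ODB-dLLM exploits this difference is described only at a high level in the abstract.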

Related papers