Fugu-MT Paper Translation (Abstract): BlockBatch: Multi-Scale Consensus Decoding for Efficient Diffusion Language Model Inference

Paper abstract: BlockBatch: Multi-Scale Consensus Decoding for Efficient Diffusion Language Model Inference

arxiv url: http://arxiv.org/abs/2605.29233v1
Date: Thu, 28 May 2026 01:48:29 GMT
Status: Translation complete
Last updated in system: 2026-05-30 02:45:55.583585
Title: BlockBatch: Multi-Scale Consensus Decoding for Efficient Diffusion Language Model Inference
Authors: Xiaoyou Wu, Cheng-Jhih Shih, Binfei Ji, Yong Liu, Yingyan Lin
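Abstract summary: Diffusion language models generate text by iteratively denoising multiple token positions in parallel. Small blocks preserve local conditioning but require many denoising steps, whereas large blocks expose more parallelism but can make premature commitments and accumulate cache error. We propose BlockBatch, a training-free online inference framework that executes multiple block-size branches for the same request inside a batched forward pass.
Reference score (independently computed attention score): 8.885225944160021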
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Diffusion language models (dLLMs) generate text by iteratively denoising multiple token positions in parallel, offering an attractive alternative to strictly autoregressive decoding. In practice, however, block-wise dLLM inference exposes a difficult granularity trade-off: small blocks preserve local conditioning but require many denoising steps, whereas large blocks expose more parallelism but can make premature commitments and accumulate cache error. Existing acceleration methods typically choose a single block size per request, leaving the complementarity among block sizes unused. We show that block size itself is a useful branching dimension. Different block sizes induce related but non-identical KV-cache trajectories: branches often share an initial prefix, bifurcate at semantically decisive positions, and later agree on syntactically lightweight tokens. Motivated by this structure, we propose BlockBatch, a training-free online inference framework that executes multiple block-size branches for the same request inside a batched forward pass. BlockBatch coordinates these branches through confidence-gated token merging, leader-based synchronization, and periodic full-sequence refreshes that re-anchor local block updates to a globally consistent KV state. Across 3 representative dLLMs and 4 datasets, BlockBatch reduces denoising NFEs by 26.6% on average and achieves a 1.33× average end-to-end speedup over Fast-dLLM while preserving accuracy. These results identify block-size diversity as a practical and previously underexplored axis for branch-parallel dLLM inference.
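As a rough sanity check on how the two headline numbers in the abstract relate, the snippet below compares the reported speedup with the speedup that the NFE reduction alone would imply. It is a back-of-the-envelope sketch only: it assumes that wall-clock time is exactly proportional to the NFE count and that both averages are taken against the same baseline, neither of which the abstract states. The variable names are illustrative.

```python
# Figures quoted in the abstract.
nfe_reduction = 0.266      # average reduction in denoising NFEs
reported_speedup = 1.33    # average end-to-end speedup over Fast-dLLM

# Assumption (not from the abstract): runtime scales linearly with NFEs,
# so removing a fraction r of the NFEs would give a speedup of 1 / (1 - r).
ideal_speedup = 1.0 / (1.0 - nfe_reduction)

# Fraction of that idealized speedup that the reported figure attains.
attained_fraction = reported_speedup / ideal_speedup

print(f"ideal speedup under the linear assumption: {ideal_speedup:.3f}x")
print(f"reported speedup as a fraction of ideal:   {attained_fraction:.3f}")
```

Under these assumptions the reported speedup sits slightly below the idealized figure, which would be consistent with some per-step overhead from running several branches in one batch. Averages of ratios do not compose exactly, so this should be read as an order-of-magnitude check rather than a result from the paper.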
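To make the idea of confidence-gated token merging more concrete, here is a minimal sketch of one plausible merging rule. The abstract names the mechanism but does not specify it, so everything below is an assumption for illustration: the `BranchProposal` type, the `merge_confidence_gated` function, the `0.9` threshold, and the rule itself (per position, commit the most confident branch proposal if it clears the threshold). It is not the authors' implementation.

```python
from dataclasses import dataclass
from typing import List, Optional

MASK = None  # hypothetical marker for a position that is still masked


@dataclass
class BranchProposal:
    """One block-size branch's current guess for a span of positions (hypothetical type)."""
    block_size: int                # block size this branch decodes with
    tokens: List[Optional[int]]    # proposed token id per position, MASK if undecided
    confidences: List[float]       # confidence per position, in [0, 1]


def merge_confidence_gated(proposals: List[BranchProposal],
                           threshold: float = 0.9) -> List[Optional[int]]:
    """Assumed rule: per position, commit the most confident proposal
    whose confidence is at least `threshold`; otherwise keep the position masked.
    On an exact tie, the later branch in `proposals` wins."""
    if not proposals:
        return []
    length = len(proposals[0].tokens)
    merged: List[Optional[int]] = [MASK] * length
    for pos in range(length):
        best_token, best_conf = MASK, threshold
        for p in proposals:
            token, conf = p.tokens[pos], p.confidences[pos]
            if token is not MASK and conf >= best_conf:
                best_token, best_conf = token, conf
        merged[pos] = best_token
    return merged


# Toy example: a small-block branch that has decoded two positions with high
# confidence, and a large-block branch that has guessed all four positions.
small = BranchProposal(4, [11, 12, MASK, MASK], [0.99, 0.95, 0.0, 0.0])
large = BranchProposal(8, [11, 40, 13, 14], [0.97, 0.60, 0.92, 0.50])

merged = merge_confidence_gated([small, large])
print(merged)  # [11, 12, 13, None]
```

In this toy case the small-block branch wins the contested second position, the large-block branch contributes the third position early, and the fourth position stays masked because no branch is confident enough. The leader-based synchronization and periodic full-sequence refreshes mentioned in the abstract are not modeled here.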

Related papers