Fugu-MT Paper Translation (Abstract): FOCUS: DLLMs Know How to Tame Their Compute Bound

Paper summary: FOCUS: DLLMs Know How to Tame Their Compute Bound

arxiv url: http://arxiv.org/abs/2601.23278v1
Date: Fri, 30 Jan 2026 18:52:06 GMT
Status: translation complete
System update date: 2026-02-02 18:28:15.622671
Title: FOCUS: DLLMs Know How to Tame Their Compute Bound
Title (reference translation): FOCUS: DLLMs Know How to Handle Their Compute Bound
Authors: Kaihua Liang, Xin Tan, An Zhong, Hong Xu, Marco Canini
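Abstract summary: FOCUS is an inference system for Diffusion Large Language Models (DLLMs). It focuses computation on decodable tokens and evicts non-decodable tokens on the fly. It achieves up to 3.52$\times$ throughput improvement over the production-grade engine LMDeploy.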
Reference score (independently computed interest score): 10.298643186738799
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Diffusion Large Language Models (DLLMs) offer a compelling alternative to Auto-Regressive models, but their deployment is constrained by high decoding cost. In this work, we identify a key inefficiency in DLLM decoding: while computation is parallelized over token blocks, only a small subset of tokens is decodable at each diffusion step, causing most compute to be wasted on non-decodable tokens. We further observe a strong correlation between attention-derived token importance and token-wise decoding probability. Based on this insight, we propose FOCUS -- an inference system designed for DLLMs. By dynamically focusing computation on decodable tokens and evicting non-decodable ones on-the-fly, FOCUS increases the effective batch size, alleviating compute limitations and enabling scalable throughput. Empirical evaluations demonstrate that FOCUS achieves up to 3.52$\times$ throughput improvement over the production-grade engine LMDeploy, while preserving or improving generation quality across multiple benchmarks. The FOCUS system is publicly available on GitHub: https://github.com/sands-lab/FOCUS.
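Abstract (reference translation): Diffusion Large Language Models (DLLMs) offer an attractive alternative to Auto-Regressive models, but their deployment is constrained by high decoding cost. In this work, we identify an important inefficiency in DLLM decoding: although computation is parallelized over token blocks, only a small subset of tokens is decodable at each diffusion step, so most computation is wasted on non-decodable tokens. We further observe a strong correlation between attention-derived token importance and token-wise decoding probability. Based on this insight, we propose FOCUS, an inference system designed for DLLMs. By dynamically focusing computation on decodable tokens and evicting non-decodable tokens on the fly, FOCUS increases the effective batch size, alleviates compute limitations, and enables scalable throughput. Empirical evaluations show that FOCUS achieves up to 3.52$\times$ throughput improvement over the production-grade engine LMDeploy while maintaining or improving generation quality across multiple benchmarks. The FOCUS system is publicly available on GitHub: https://github.com/sands-lab/FOCUS.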
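The idea in the abstract of keeping only likely-decodable tokens and thereby fitting more sequences into each step can be illustrated with a toy sketch. This is not the FOCUS implementation: the function names `select_focus_tokens` and `effective_batch_size`, the `keep_ratio` parameter, and the fixed per-step token budget are all hypothetical and chosen only to make the arithmetic concrete.

```python
import math


def select_focus_tokens(importance, keep_ratio=0.25):
    """Return the sorted indices of the highest-importance tokens in a block.

    `importance` stands in for an attention-derived score per token
    (an assumption; the paper's actual scoring rule is not given here).
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    k = max(1, math.ceil(len(importance) * keep_ratio))
    # Rank token positions by descending importance, then keep the top k.
    ranked = sorted(range(len(importance)),
                    key=lambda i: importance[i], reverse=True)
    return sorted(ranked[:k])


def effective_batch_size(token_budget, block_size, keep_ratio):
    """Sequences that fit in a fixed per-step token budget (toy cost model)."""
    kept = max(1, math.ceil(block_size * keep_ratio))
    return token_budget // kept


scores = [0.02, 0.40, 0.05, 0.30, 0.01, 0.03, 0.15, 0.04]
kept = select_focus_tokens(scores, keep_ratio=0.25)
dense = effective_batch_size(256, 32, keep_ratio=1.0)
focused = effective_batch_size(256, 32, keep_ratio=0.25)
print(kept, dense, focused)
```

Under this toy model, evicting three quarters of each 32-token block lets four times as many sequences share the same 256-token budget. Real throughput gains depend on the engine and workload, and the abstract reports up to 3.52$\times$ over LMDeploy.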

Related papers