Fugu-MT paper translation (abstract): dInfer: An Efficient Inference Framework for Diffusion Language Models

Paper abstract: dInfer: An Efficient Inference Framework for Diffusion Language Models

arxiv url: http://arxiv.org/abs/2510.08666v2
Date: Mon, 13 Oct 2025 10:39:59 GMT
Status: translation complete
Last updated in system: 2025-10-14 15:48:09.834475
Title: dInfer: An Efficient Inference Framework for Diffusion Language Models
Title (reference translation): dInfer: an efficient inference framework for diffusion language models
Authors: Yuxin Ma, Lun Du, Lanning Wei, Kun Chen, Qian Xu, Kangyu Wang, Guofeng Feng, Guoshan Lu, Lin Liu, Xiaojing Qi, Xinyuan Zhang, Zhen Tao, Haibo Feng, Ziyun Jiang, Ying Xu, Zenan Huang, Yihong Zhuang, Haokai Xu, Jiaqi Hu, Zhenzhong Lan, Junbo Zhao, Jianguo Li, Da Zheng
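Abstract summary: Diffusion-based large language models (dLLMs) have emerged as a promising alternative to autoregressive (AR) LLMs. This paper presents dInfer, an efficient and extensible framework for dLLM inference.
Reference score (independently computed attention score): 54.80918957287927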
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Diffusion-based large language models (dLLMs) have emerged as a promising alternative to autoregressive (AR) LLMs, leveraging denoising-based generation to enable inherent parallelism. Although more and more open-source dLLMs are being released, their widespread adoption remains constrained by the lack of a standardized and efficient inference framework. We present dInfer, an efficient and extensible framework for dLLM inference. dInfer decomposes the inference pipeline into four modular components (model, diffusion iteration manager, decoding strategy, and KV-cache manager) and integrates novel algorithms for each component alongside system-level optimizations. Through this combination of algorithmic innovations and system enhancements, dInfer achieves substantial efficiency gains without compromising output quality on LLaDA-MoE. At batch size 1, it surpasses 1,100 tokens per second on HumanEval and averages over 800 tokens per second across six benchmarks on $8\times$ H800 GPUs. Compared to prior systems, dInfer delivers a $10\times$ speedup over Fast-dLLM while maintaining similar model performance. Even compared to QWen2.5-3B, an AR model with a comparable number of activated parameters and comparable performance that is highly optimized with the latest vLLM inference engine, dInfer still delivers a $2$-$3\times$ speedup. The implementation of dInfer is open-sourced at https://github.com/inclusionAI/dInfer.
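Abstract (reference translation): Diffusion-based large language models (dLLMs) have emerged as a promising alternative to autoregressive (AR) LLMs, using denoising-based generation to achieve inherent parallelism. More and more open-source dLLMs are appearing, but the lack of a standardized, efficient inference framework limits their wide adoption. This paper presents dInfer, an efficient and extensible framework for dLLM inference. dInfer decomposes the inference pipeline into four modular components (model, diffusion iteration manager, decoding strategy, and KV-cache manager) and integrates new algorithms for each component together with system-level optimizations. Through this combination of algorithmic innovation and system enhancement, dInfer achieves substantial efficiency gains without compromising output quality on LLaDA-MoE. At batch size 1, it exceeds 1,100 tokens per second on HumanEval and averages over 800 tokens per second across six benchmarks on $8\times$ H800 GPUs. Compared to prior systems, dInfer delivers a $10\times$ speedup over Fast-dLLM while maintaining similar model performance. Even compared to QWen2.5-3B, an AR model with a comparable number of activated parameters and comparable performance that is highly optimized with the latest vLLM inference engine, dInfer still delivers a $2$-$3\times$ speedup. The implementation of dInfer is open-sourced at https://github.com/inclusionAI/dInfer.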
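
The four-component decomposition named in the abstract can be pictured with a small toy sketch. Everything below is a hypothetical illustration: the class and function names (`toy_model`, `BlockIterationManager`, `ThresholdDecoder`, `PrefixCache`, `generate`) are invented for this example and are not dInfer's API, and the block-wise iteration and confidence-threshold decoding shown here are generic stand-ins rather than the algorithms the paper introduces.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional, Tuple

MASK = None  # marks a position that has not been decoded yet
TARGET = ["def", "add", "(", "a", ",", "b", ")", ":"]


def toy_model(seq: List[Optional[str]]) -> List[Tuple[str, float]]:
    """Stand-in for a dLLM forward pass: one (token, confidence) per position.

    Assumption for the demo: even positions, and positions whose left
    neighbour is already decoded, are predicted with high confidence.
    """
    preds = []
    for i in range(len(seq)):
        easy = i % 2 == 0 or seq[i - 1] is not MASK
        preds.append((TARGET[i], 0.95 if easy else 0.6))
    return preds


@dataclass
class BlockIterationManager:
    """Hypothetical iteration manager: visits fixed-size blocks left to right."""
    block_size: int

    def blocks(self, length: int) -> Iterator[range]:
        for start in range(0, length, self.block_size):
            yield range(start, min(start + self.block_size, length))


@dataclass
class ThresholdDecoder:
    """Hypothetical decoding strategy: commit every masked position whose
    confidence clears the threshold, or the single best one if none does."""
    threshold: float

    def select(self, preds: List[Tuple[str, float]], masked: List[int]) -> List[int]:
        chosen = [i for i in masked if preds[i][1] >= self.threshold]
        return chosen or [max(masked, key=lambda i: preds[i][1])]


@dataclass
class PrefixCache:
    """Toy stand-in for a KV-cache manager: it only records finished blocks.
    A real manager would let the model reuse their key/value states."""
    finished: List[range] = field(default_factory=list)

    def commit(self, block: range) -> None:
        self.finished.append(block)


def generate(model, length, manager, decoder, cache):
    """Denoise a fully masked sequence block by block; return (tokens, steps)."""
    seq: List[Optional[str]] = [MASK] * length
    steps = 0
    for block in manager.blocks(length):
        while any(seq[i] is MASK for i in block):
            preds = model(seq)  # one forward pass per denoising step
            masked = [i for i in block if seq[i] is MASK]
            for i in decoder.select(preds, masked):
                seq[i] = preds[i][0]  # several tokens may be committed at once
            steps += 1
        cache.commit(block)
    return seq, steps


if __name__ == "__main__":
    tokens, steps = generate(
        toy_model, len(TARGET), BlockIterationManager(4), ThresholdDecoder(0.9), PrefixCache()
    )
    print(tokens, steps)
```

With these toy settings each denoising step commits two tokens, so the eight-token sequence is finished in four forward passes instead of the eight an autoregressive loop would need. This illustrates only the source of parallelism the abstract refers to; it says nothing about dInfer's measured throughput.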

Related papers