Fugu-MT paper translation (abstract): TIDE: Efficient and Lossless MoE Diffusion LLM Inference with I/O-aware Expert Offload

Paper abstract: TIDE: Efficient and Lossless MoE Diffusion LLM Inference with I/O-aware Expert Offload

arxiv url: http://arxiv.org/abs/2605.20179v1
Date: Tue, 19 May 2026 17:59:08 GMT
Status: translation complete
Last updated in system: 2026-05-20 15:03:09.578879
Title: TIDE: Efficient and Lossless MoE Diffusion LLM Inference with I/O-aware Expert Offload
Authors: Zhiben Chen, Youpeng Zhao, Yang Sui, Jun Wang, Yuzhang Shang,
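Abstract summary: Diffusion Large Language Models (dLLMs) offer better hardware utilization and bidirectional context through parallel block-level decoding. Existing AR-based methods often incur either prohibitive I/O overhead or significant compute bottlenecks. This paper proposes TIDE, a novel resource-efficient inference system that leverages the temporal stability of expert activations.
Reference score (independently computed attention measure): 28.278474158271894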
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Diffusion Large Language Models (dLLMs) have emerged as a competitive alternative to autoregressive (AR) models, offering better hardware utilization and bidirectional context through parallel block-level decoding. However, as dLLMs continue to scale up with mixture-of-experts (MoE) architectures, their deployment on resource-constrained devices remains an open challenge. Existing AR-based methods often incur either prohibitive I/O overhead or significant compute bottlenecks. In this work, we propose TIDE, a novel resource-efficient inference system that leverages the temporal stability of expert activations during the diffusion process within the block. Specifically, we introduce an interval-based expert refresh strategy that updates the expert placement in an I/O-aware fashion. To ensure optimal performance, we formulate the inference scheduling as a mathematical programming problem, solving for the optimal interval that minimizes I/O traffic and CPU computation. Most importantly, TIDE is a lossless optimization that requires no model training, providing a "free lunch" acceleration for dLLM inference. In a single GPU-CPU system, we demonstrate that TIDE achieves up to 1.4$\times$ and 1.5$\times$ throughput improvements over prior baselines on LLaDA2.0-mini and LLaDA2.0-flash models, respectively.
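The interval-based refresh idea in the abstract can be illustrated with a toy model. The sketch below is not TIDE's actual formulation: the cost model, the function names `schedule_cost` and `best_interval`, and the parameters `gpu_slots`, `io_cost`, and `cpu_cost` are all assumptions made for illustration. It assumes that GPU-resident experts are refreshed every `interval` diffusion steps, that each newly loaded expert costs one I/O transfer, and that any activated expert not resident on the GPU is computed on the CPU instead. A brute-force search stands in for the mathematical program mentioned in the abstract.

```python
from typing import List, Set


def schedule_cost(activations: List[Set[int]], interval: int, gpu_slots: int,
                  io_cost: float, cpu_cost: float) -> float:
    """Toy cost of one block when GPU-resident experts are refreshed every `interval` steps.

    activations[t] is the set of expert ids activated at diffusion step t.
    The cost model is hypothetical and only meant to show the trade-off.
    """
    resident: Set[int] = set()
    total = 0.0
    for step, active in enumerate(activations):
        if step % interval == 0:
            # Refresh: keep the experts active at this step, up to the GPU budget.
            wanted = set(sorted(active)[:gpu_slots])
            # Pay I/O only for experts that were not already resident.
            total += io_cost * len(wanted - resident)
            resident = wanted
        # Active experts that are not GPU-resident are computed on the CPU.
        total += cpu_cost * len(active - resident)
    return total


def best_interval(activations: List[Set[int]], gpu_slots: int,
                  io_cost: float, cpu_cost: float) -> int:
    """Brute-force search over refresh intervals; ties go to the smallest interval."""
    candidates = range(1, len(activations) + 1)
    return min(candidates,
               key=lambda k: schedule_cost(activations, k, gpu_slots, io_cost, cpu_cost))
```

In this sketch every activated expert is still evaluated, either on the GPU or on the CPU, so the schedule changes only where the computation runs and not its result, which is the sense in which such a scheme can be lossless. When activations are stable across steps, longer intervals avoid repeated transfers at the price of occasional CPU computation.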
