Fugu-MT Paper Translation (Summary): A Comprehensive Study on Visual Token Redundancy for Discrete Diffusion-based Multimodal Large Language Models

Paper summary: A Comprehensive Study on Visual Token Redundancy for Discrete Diffusion-based Multimodal Large Language Models

arxiv url: http://arxiv.org/abs/2511.15098v1
Date: Wed, 19 Nov 2025 04:13:36 GMT
Status: Translation complete
System update date: 2025-11-20 15:51:28.6288
Title: A Comprehensive Study on Visual Token Redundancy for Discrete Diffusion-based Multimodal Large Language Models
Title (reference translation): A Comprehensive Study of Visual Token Redundancy for Discrete Diffusion-based Multimodal Large Language Models
Authors: Duo Li, Zuhao Yang, Xiaoqin Zhang, Ling Shao, Shijian Lu
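Abstract summary: We examine how visual token redundancy evolves across different dMLLM architectures and tasks. The study finds that visual redundancy emerges only in from-scratch dMLLMs while handling long-answer tasks. Layer-skipping is promising for accelerating AR-to-diffusion dMLLMs, whereas progressive or late-step pruning is more effective for from-scratch dMLLMs.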
Reference score (independently computed attention score): 85.30893355216486
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Discrete diffusion-based multimodal large language models (dMLLMs) have emerged as a promising alternative to autoregressive MLLMs thanks to their advantages in parallel decoding and bidirectional context modeling, but most existing dMLLMs incur significant computational overhead during inference due to the full-sequence attention computation in each denoising step. Pioneer studies attempt to resolve this issue from a modality-agnostic perspective via key-value cache optimization or efficient sampling but most of them overlook modality-specific visual token redundancy. In this work, we conduct a comprehensive study on how visual token redundancy evolves with different dMLLM architectures and tasks and how visual token pruning affects dMLLM responses and efficiency. Specifically, our study reveals that visual redundancy emerges only in from-scratch dMLLMs while handling long-answer tasks. In addition, we validate that visual token pruning introduces non-negligible information loss in dMLLMs and only from-scratch dMLLMs can recover the lost information progressively during late denoising steps. Furthermore, our study shows that layer-skipping is promising for accelerating AR-to-diffusion dMLLMs, whereas progressive or late-step pruning is more effective for from-scratch dMLLMs. Overall, this work offers a new perspective on efficiency optimization for dMLLMs, greatly advancing their applicability across various multimodal understanding tasks.
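Abstract (reference translation): Discrete diffusion-based multimodal large language models (dMLLMs) have emerged as a promising alternative to autoregressive MLLMs thanks to their advantages in parallel decoding and bidirectional context modeling, but most existing dMLLMs incur significant computational overhead during inference because full-sequence attention is computed in every denoising step. Pioneering studies try to resolve this issue from a modality-agnostic perspective through key-value cache optimization or efficient sampling, but most of them overlook modality-specific visual token redundancy. This work presents a comprehensive study of how visual token redundancy evolves across different dMLLM architectures and tasks, and of how visual token pruning affects dMLLM responses and efficiency. Specifically, the study reveals that visual redundancy emerges only in from-scratch dMLLMs while handling long-answer tasks. It also validates that visual token pruning introduces non-negligible information loss in dMLLMs, and that only from-scratch dMLLMs can progressively recover the lost information during late denoising steps. Furthermore, the study shows that layer-skipping is promising for accelerating AR-to-diffusion dMLLMs, whereas progressive or late-step pruning is more effective for from-scratch dMLLMs. Overall, this work offers a new perspective on efficiency optimization for dMLLMs, greatly advancing their applicability across various multimodal understanding tasks.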

Related papers