Fugu-MT Paper Translation (Abstract): dMLLM-TTS: Self-Verified and Efficient Test-Time Scaling for Diffusion Multi-Modal Large Language Models

Paper overview: dMLLM-TTS: Self-Verified and Efficient Test-Time Scaling for Diffusion Multi-Modal Large Language Models

arxiv url: http://arxiv.org/abs/2512.19433v1
Date: Mon, 22 Dec 2025 14:31:58 GMT
Status: Translation complete
Last updated in system: 2025-12-23 18:54:32.789422
Title: dMLLM-TTS: Self-Verified and Efficient Test-Time Scaling for Diffusion Multi-Modal Large Language Models
Authors: Yi Xin, Siqi Luo, Qi Qin, Haoxing Chen, Kaiwen Zhu, Zhiwei Zhang, Yangfan He, Rongchao Zhang, Jinbin Bai, Shuo Cao, Bin Fu, Junjun He, Yihao Liu, Yuewen Cao, Xiaohong Liu
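Abstract summary: Diffusion Multi-modal Large Language Models (dMLLMs) have recently emerged as a novel architecture unifying image generation and understanding. The proposed dMLLM-TTS is a novel framework that operates on two complementary scaling axes to unlock their full generative potential. The framework substantially improves generation quality while achieving up to 6x greater efficiency than linear search.
Reference score (attention score computed independently by this site): 40.03969764207708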
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Diffusion Multi-modal Large Language Models (dMLLMs) have recently emerged as a novel architecture unifying image generation and understanding. However, developing effective and efficient Test-Time Scaling (TTS) methods to unlock their full generative potential remains an underexplored challenge. To address this, we propose dMLLM-TTS, a novel framework operating on two complementary scaling axes: (1) trajectory exploration scaling to enhance the diversity of generated hypotheses, and (2) iterative refinement scaling for stable generation. Conventional TTS approaches typically perform linear search across these two dimensions, incurring a substantial computational cost of O(NT) and requiring an external verifier for best-of-N selection. To overcome these limitations, we propose two innovations. First, we design an efficient hierarchical search algorithm with O(N+T) complexity that adaptively expands and prunes sampling trajectories. Second, we introduce a self-verified feedback mechanism that leverages the dMLLMs' intrinsic image understanding capabilities to assess text-image alignment, eliminating the need for an external verifier. Extensive experiments on the GenEval benchmark across three representative dMLLMs (namely Lumina-DiMOO, MMaDA, and Muddit) show that our framework substantially improves generation quality while achieving up to 6x greater efficiency than linear search. Project page: https://github.com/Alpha-VLLM/Lumina-DiMOO.
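The gap between O(NT) and O(N+T) can be illustrated with a small cost model. The sketch below is not the paper's algorithm: it assumes N is the number of sampled trajectories and T the number of refinement steps, and it swaps in a simple halving schedule (prune the lower-scoring half after every step) as one hypothetical way a prune-as-you-go search can reach O(N+T). The function names are made up for illustration, only denoising steps are counted, and verifier calls are ignored.

```python
def linear_search_cost(n: int, t: int) -> int:
    """Denoising steps when all n trajectories run the full t steps (best-of-N)."""
    return n * t


def halving_search_cost(n: int, t: int) -> int:
    """Denoising steps under a hypothetical halving schedule (an assumption,
    not the schedule used in the paper).

    Every surviving trajectory advances one step, then the lower-scoring half
    is pruned. Once a single trajectory is left, it finishes the remaining
    steps on its own.
    """
    alive = n
    cost = 0
    for _ in range(t):
        cost += alive               # one step for each surviving trajectory
        alive = max(1, alive // 2)  # keep the better half, never fewer than one
    return cost


if __name__ == "__main__":
    n, t = 8, 16
    print("linear search steps: ", linear_search_cost(n, t))
    print("halving search steps:", halving_search_cost(n, t))
```

Under this schedule the pool sizes form a geometric series (n + n/2 + n/4 + ...), which sums to less than 2n, and the single survivor adds at most t further steps, so the total stays below 2n + t. The actual speedup reported in the abstract depends on the authors' own expansion and pruning rules, which this model does not reproduce.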
