Fugu-MT Paper Translation (Abstract): UniGen: Enhanced Training & Test-Time Strategies for Unified Multimodal Understanding and Generation

Paper abstract: UniGen: Enhanced Training & Test-Time Strategies for Unified Multimodal Understanding and Generation

arxiv url: http://arxiv.org/abs/2505.14682v1
Date: Tue, 20 May 2025 17:59:26 GMT
Status: Translation complete
Last updated in system: 2025-05-21 14:49:53.671794
Title: UniGen: Enhanced Training & Test-Time Strategies for Unified Multimodal Understanding and Generation
Title (reference translation): UniGen: Enhanced Training and Test-Time Strategies for Unified Multimodal Understanding and Generation
Authors: Rui Tian, Mingfei Gao, Mingze Xu, Jiaming Hu, Jiasen Lu, Zuxuan Wu, Yinfei Yang, Afshin Dehghan
Abstract summary: We introduce UniGen, a unified multimodal large language model (MLLM) capable of image understanding and generation. We study the full training pipeline of UniGen from a data-centric perspective, including multi-stage pre-training, supervised fine-tuning, and direct preference optimization. We then propose a new Chain-of-Thought Verification (CoT-V) strategy for test-time scaling, which significantly improves UniGen's image generation quality.
Reference score (independently computed attention score): 52.12029029338604
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We introduce UniGen, a unified multimodal large language model (MLLM) capable of image understanding and generation. We study the full training pipeline of UniGen from a data-centric perspective, including multi-stage pre-training, supervised fine-tuning, and direct preference optimization. More importantly, we propose a new Chain-of-Thought Verification (CoT-V) strategy for test-time scaling, which significantly boosts UniGen's image generation quality using a simple Best-of-N test-time strategy. Specifically, CoT-V enables UniGen to act as both image generator and verifier at test time, assessing the semantic alignment between a text prompt and its generated image in a step-by-step CoT manner. Trained entirely on open-source datasets across all stages, UniGen achieves state-of-the-art performance on a range of image understanding and generation benchmarks, with a final score of 0.78 on GenEval and 85.19 on DPG-Bench. Through extensive ablation studies, our work provides actionable insights and addresses key challenges in the full life cycle of building unified MLLMs, contributing meaningful directions to the future research.
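Abstract (reference translation): We introduce UniGen, a unified multimodal large language model (MLLM) capable of image understanding and generation. We study the full training pipeline of UniGen from a data-centric perspective, including multi-stage pre-training, supervised fine-tuning, and direct preference optimization. In addition, we propose a new Chain-of-Thought Verification (CoT-V) strategy for test-time scaling. Specifically, CoT-V lets UniGen act as both image generator and verifier at test time, assessing the semantic alignment between a text prompt and its generated image in a step-by-step CoT manner. Trained on open-source datasets across all stages, UniGen achieves state-of-the-art performance on a range of image understanding and generation benchmarks, with final scores of 0.78 on GenEval and 85.19 on DPG-Bench. Through extensive ablation studies, our work provides actionable insights, addresses key challenges across the full life cycle of building unified MLLMs, and contributes meaningful directions for future research.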
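
The abstract describes CoT-V as a Best-of-N strategy in which the same model generates N candidate images and then verifies each one against the prompt step by step. The following is a minimal sketch of that selection loop only, not the paper's implementation. The names `best_of_n`, `toy_generate`, and `toy_verify` are hypothetical, an "image" is stood in for by a set of words, and the verifier's per-word check is an assumed placeholder for the model's chain-of-thought assessment.

```python
from typing import Callable, FrozenSet, Tuple, TypeVar

Image = TypeVar("Image")


def best_of_n(
    prompt: str,
    generate: Callable[[str, int], Image],
    verify: Callable[[str, Image], float],
    n: int = 4,
) -> Tuple[Image, float]:
    """Generate n candidates and return the one the verifier scores highest."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates = [generate(prompt, seed) for seed in range(n)]
    scores = [verify(prompt, image) for image in candidates]
    # max() returns the first index among ties, so earlier seeds win ties.
    best = max(range(n), key=scores.__getitem__)
    return candidates[best], scores[best]


def toy_generate(prompt: str, seed: int) -> FrozenSet[str]:
    """Hypothetical generator: the 'image' is the prompt with some words dropped."""
    words = prompt.split()
    return frozenset(w for i, w in enumerate(words) if (i + seed) % 3 != 0)


def toy_verify(prompt: str, image: FrozenSet[str]) -> float:
    """Hypothetical verifier: check each prompt word in turn, return the pass rate."""
    checks = [word in image for word in prompt.split()]
    return sum(checks) / len(checks)


if __name__ == "__main__":
    image, score = best_of_n("red cube blue sphere", toy_generate, toy_verify, n=4)
    print(sorted(image), score)
```

In a real system, `generate` and `verify` would both call the same unified model, which is the point the abstract makes about UniGen acting as generator and verifier at test time.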


Related papers