Fugu-MT Paper Translation (Summary): Learning to Generate via Understanding: Understanding-Driven Intrinsic Rewarding for Unified Multimodal Models

Paper summary: Learning to Generate via Understanding: Understanding-Driven Intrinsic Rewarding for Unified Multimodal Models

arxiv url: http://arxiv.org/abs/2603.06043v1
Date: Fri, 06 Mar 2026 08:56:14 GMT
Status: Translation complete
Last updated in system: 2026-03-09 13:17:45.394444
Title: Learning to Generate via Understanding: Understanding-Driven Intrinsic Rewarding for Unified Multimodal Models
Title (reference translation): Learning to Generate Through Understanding: Understanding-Driven Intrinsic Rewards for Unified Multimodal Models
Authors: Jiadong Pan, Liang Li, Yuxin Peng, Yu-Ming Tang, Shuohuan Wang, Yu Sun, Hua Wu, Qingming Huang, Haifeng Wang
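Abstract summary: Unified multimodal models (UMMs) have made remarkable progress in integrating visual understanding and generation. This paper proposes GvU, a token-level intrinsic text-image alignment reward mechanism that lets a UMM act simultaneously as teacher and student. The proposed method is shown to substantially improve UMM generation and to strengthen fine-grained visual understanding.
Reference score (independently computed attention score): 98.8608163448532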
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recently, unified multimodal models (UMMs) have made remarkable progress in integrating visual understanding and generation, demonstrating strong potential for complex text-to-image (T2I) tasks. Despite their theoretical promise, a persistent capability gap exists: UMMs typically exhibit superior visual understanding but comparatively weaker generative capabilities. This discrepancy arises largely from the intrinsic decoupling between the understanding and generation processes. While a UMM can accurately interpret fine-grained visual details, it often struggles to produce semantically coherent images from complex textual prompts. To address this challenge, we explore UMMs' internal understanding capability to enhance generation quality. We propose a token-level intrinsic text-image alignment reward mechanism, GvU, enabling the UMM to act simultaneously as teacher and student: it evaluates its own outputs using the understanding branch to guide the generations accordingly. Building upon this, we design a self-supervised reinforcement learning framework, allowing UMMs to iteratively improve their generation quality through understanding-based intrinsic reward signals--without reliance on external supervision. Experimental results show that our method substantially boosts UMMs' generation, which in turn strengthens their fine-grained visual understanding, narrowing the capability gap between UMMs' visual understanding and generation.
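Abstract (reference translation): In recent years, unified multimodal models (UMMs) have made remarkable progress in integrating visual understanding and generation, showing strong potential for complex text-to-image (T2I) tasks. UMMs generally exhibit stronger visual understanding but comparatively weaker generative capabilities. This discrepancy arises mainly from the intrinsic decoupling between the understanding and generation processes. Although a UMM can accurately interpret fine-grained visual details, it often struggles to generate semantically coherent images from complex text prompts. To address this challenge, we explore the internal understanding capability of UMMs to improve generation quality. We propose GvU, a token-level intrinsic text-image alignment reward mechanism in which the UMM acts simultaneously as teacher and student. Building on this, we design a self-supervised reinforcement learning framework that lets UMMs iteratively improve generation quality through understanding-based intrinsic reward signals, without relying on external supervision. Experimental results show that the method substantially improves UMM generation, which in turn narrows the gap between UMMs' visual understanding and generation.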
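
The abstract describes the reward only at a high level, so the following is a minimal sketch of one way a token-level intrinsic alignment reward could be computed. It is not the authors' implementation. It assumes the understanding branch can score each prompt token given a generated image and return a log-probability. The function names, the use of `exp` to map log-probabilities to rewards, the mean aggregation, and the mean-centered baseline are all illustrative assumptions, and the numbers are made up.

```python
import math
from statistics import mean


def token_alignment_rewards(token_logprobs):
    """Map per-token log-probabilities to rewards in (0, 1].

    Assumption: each entry is the log-probability that the understanding
    branch assigns to one prompt token, conditioned on the generated image.
    """
    return [math.exp(lp) for lp in token_logprobs]


def sequence_reward(token_logprobs):
    """Average the token-level rewards into one score per generated image."""
    return mean(token_alignment_rewards(token_logprobs))


def centered_advantages(scores):
    """Subtract the group mean so above-average samples get positive weight.

    Assumption: a simple mean baseline, a common variance-reduction choice
    in policy-gradient methods, not something stated in the abstract.
    """
    baseline = mean(scores)
    return [s - baseline for s in scores]


# Hypothetical log-probabilities for three prompt tokens, scored against
# two candidate images generated from the same prompt.
candidates = {
    "image_a": [-0.1, -0.2, -0.1],  # prompt tokens well supported by the image
    "image_b": [-1.5, -0.9, -2.0],  # prompt tokens poorly supported
}

scores = {name: sequence_reward(lps) for name, lps in candidates.items()}
advantages = dict(zip(scores, centered_advantages(list(scores.values()))))
```

In this sketch the same model would both produce the images (student) and supply the log-probabilities (teacher), so no external reward model or labels are needed. The resulting advantages could then weight a policy-gradient update of the generation branch.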
