Fugu-MT Paper Translation (Abstract): Free Lunch for Unified Multimodal Models: Enhancing Generation via Reflective Rectification with Inherent Understanding

Paper overview: Free Lunch for Unified Multimodal Models: Enhancing Generation via Reflective Rectification with Inherent Understanding

arxiv url: http://arxiv.org/abs/2604.13540v1
Date: Wed, 15 Apr 2026 06:41:56 GMT
Status: Translation complete
Last updated in system: 2026-04-16 20:38:32.417891
Title: Free Lunch for Unified Multimodal Models: Enhancing Generation via Reflective Rectification with Inherent Understanding
Authors: Yibo Jiang, Tao Wu, Rui Jiang, Yehao Lu, Chaoxiang Cai, Zequn Qin, Xi Li
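Abstract summary: Unified Multimodal Models (UMMs) aim to integrate visual understanding and generation within a single structure. UMMs exhibit a notable capability mismatch, in which their understanding capability significantly outperforms their generation capability. To address this, we propose UniRect-CoT.
Reference score (independently computed attention score): 20.397510070808238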
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Unified Multimodal Models (UMMs) aim to integrate visual understanding and generation within a single structure. However, these models exhibit a notable capability mismatch, where their understanding capability significantly outperforms their generation. This mismatch indicates that the model's rich internal knowledge, while effective for understanding tasks, remains underactivated during generation. To address this, we draw inspiration from the human "Thinking-While-Drawing" paradigm, where humans continuously reflect to activate their knowledge and rectify intermediate results. In this paper, we propose UniRect-CoT, a training-free unified rectification chain-of-thought framework. Our approach unlocks the "free lunch" hidden in the UMM's powerful inherent understanding to continuously reflect, activating its internal knowledge and rectifying intermediate results during generation. We regard the diffusion denoising process in UMMs as an intrinsic visual reasoning process and align the intermediate results with the target instruction understood by the model, serving as a self-supervisory signal to rectify UMM generation. Extensive experiments demonstrate that UniRect-CoT can be easily integrated into existing UMMs, significantly enhancing generation quality across diverse complex tasks.
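
The abstract does not describe how the alignment signal is computed, so the following is only a toy illustration of the general idea of rectifying intermediate denoising results with a self-supervisory alignment signal. It is not the UniRect-CoT algorithm. Every name here (`denoise_step`, `alignment_gradient`, `generate`, the `prior` vector, the quadratic score, and all numeric values) is a hypothetical stand-in: `target` plays the role of the instruction as understood by the model, and `prior` plays the role of what the generator would produce on its own.

```python
import random


def denoise_step(x, prior, rate=0.3):
    """Hypothetical stand-in for one denoising step.

    Pulls the sample toward `prior`, i.e. what the generator alone would draw.
    """
    return [xi + rate * (pi - xi) for xi, pi in zip(x, prior)]


def alignment_gradient(x, target):
    """Gradient of the toy alignment score -||x - target||^2 with respect to x."""
    return [-2.0 * (xi - ti) for xi, ti in zip(x, target)]


def generate(target, prior, steps=20, guidance=0.0, seed=0):
    """Run the toy denoising loop, optionally rectifying each intermediate result."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]  # start from noise
    for _ in range(steps):
        x = denoise_step(x, prior)
        if guidance > 0.0:
            # Rectification (assumed form): nudge the intermediate result
            # up the gradient of the alignment score.
            grad = alignment_gradient(x, target)
            x = [xi + guidance * gi for xi, gi in zip(x, grad)]
    return x


def squared_error(x, target):
    """Squared distance between a sample and the target."""
    return sum((xi - ti) ** 2 for xi, ti in zip(x, target))


target = [1.0, -2.0, 0.5]   # toy "understood instruction"
prior = [0.6, -1.2, 0.3]    # toy "unaided generation" that misses the target

plain = generate(target, prior)
rectified = generate(target, prior, guidance=0.2)

print("error without rectification:", squared_error(plain, target))
print("error with rectification:   ", squared_error(rectified, target))
```

In this toy setting no parameters are trained; the correction is applied only at sampling time, which is the sense in which the sketch mirrors the "training-free" framing of the abstract.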

Related papers