Fugu-MT paper translation (abstract): Can Understanding and Generation Truly Benefit Together -- or Just Coexist?

Paper overview: Can Understanding and Generation Truly Benefit Together -- or Just Coexist?

arxiv url: http://arxiv.org/abs/2509.09666v1
Date: Thu, 11 Sep 2025 17:57:59 GMT
Status: Translation complete
Last updated in system: 2025-09-12 16:52:24.505692
Title: Can Understanding and Generation Truly Benefit Together -- or Just Coexist?
Authors: Zhiyuan Yan, Kaiqing Lin, Zongjian Li, Junyan Ye, Hui Han, Zhendong Wang, Hao Liu, Bin Lin, Hao Li, Xue Xu, Xinyan Xiao, Jingdong Wang, Haifeng Wang, Li Yuan
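Abstract summary: This paper proposes an insightful paradigm in which an encoder (I2T) compresses images into text and a decoder (T2I) reconstructs images from that text. Using reconstruction fidelity as a unified training objective, the work enforces a coherent bidirectional information flow between the understanding and generation processes.
Reference score (independently computed attention measure): 69.38946823657592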
License: http://creativecommons.org/publicdomain/zero/1.0/
Abstract: In this paper, we introduce an insightful paradigm through the Auto-Encoder lens: understanding as the encoder (I2T) that compresses images into text, and generation as the decoder (T2I) that reconstructs images from that text. Using reconstruction fidelity as the unified training objective, we enforce a coherent bidirectional information flow between the understanding and generation processes, bringing mutual gains. To implement this, we propose UAE, a novel framework for unified multimodal learning. We begin by pre-training the decoder with large-scale long-context image captions to capture fine-grained semantic and complex spatial relationships. We then propose Unified-GRPO via reinforcement learning (RL), which covers three stages: (1) a cold-start phase to gently initialize both encoder and decoder with a semantic reconstruction loss; (2) Generation for Understanding, where the encoder is trained to generate informative captions that maximize the decoder's reconstruction quality, enhancing its visual understanding; (3) Understanding for Generation, where the decoder is refined to reconstruct from these captions, forcing it to leverage every detail and improving its long-context instruction following and generation fidelity. For evaluation, we introduce Unified-Bench, the first benchmark tailored to assess the degree of unification of unified multimodal models (UMMs). A surprising "aha moment" arises within the multimodal learning domain: as RL progresses, the encoder autonomously produces more descriptive captions, while the decoder simultaneously demonstrates a profound ability to understand these intricate descriptions, resulting in reconstructions of striking fidelity.
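
To illustrate the idea of using reconstruction fidelity as a reward, here is a minimal Python sketch of the "Generation for Understanding" stage as the abstract describes it: several candidate captions are each decoded back into an image, each reconstruction is scored against the original, and the scores are standardized within the group, as GRPO-style methods do. The abstract does not specify the similarity metric or the reward shaping, so cosine similarity over embedding vectors is an assumption here, and the function names `cosine_similarity` and `group_relative_advantages` are hypothetical. The short lists stand in for image embeddings that a real system would obtain from a vision model.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors (assumed reward metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as carrying no similarity signal
    return dot / (norm_a * norm_b)


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Standardize rewards within one sampled group: (r - mean) / (std + eps)."""
    mean = sum(rewards) / len(rewards)
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(variance)
    return [(r - mean) / (std + eps) for r in rewards]


# Toy stand-ins: the embedding of the original image, and embeddings of three
# images reconstructed from three different candidate captions.
original = [1.0, 0.0, 0.0]
reconstructions = [
    [1.0, 0.0, 0.0],  # caption preserved the image content
    [0.0, 1.0, 0.0],  # caption lost the image content
    [1.0, 1.0, 0.0],  # caption preserved part of it
]

rewards = [cosine_similarity(original, r) for r in reconstructions]
advantages = group_relative_advantages(rewards)

for reward, advantage in zip(rewards, advantages):
    print(f"reward={reward:.3f} advantage={advantage:+.3f}")
```

Captions whose reconstructions land closer to the original receive positive advantages and would be reinforced; the others receive negative advantages. The policy update itself is omitted from this sketch.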

Related papers