Fugu-MT Paper Translation (Abstract): Unified Multimodal Model as Auto-Encoder

Paper summary: Unified Multimodal Model as Auto-Encoder

arxiv url: http://arxiv.org/abs/2509.09666v2
Date: Mon, 29 Sep 2025 20:35:25 GMT
Status: Translation complete
Last updated in system: 2025-10-01 14:44:59.808019
Title: Unified Multimodal Model as Auto-Encoder
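Title (reference translation): Unified Multimodal Model as an Auto-Encoder
Authors: Zhiyuan Yan, Kaiqing Lin, Zongjian Li, Junyan Ye, Hui Han, Zhendong Wang, Hao Liu, Bin Lin, Hao Li, Xue Xu, Xinyan Xiao, Jingdong Wang, Haifeng Wang, Li Yuan
Abstract summary: This paper introduces a paradigm that regards understanding as an encoder (I2T) that compresses images into text, and generation as a decoder (T2I) that reconstructs images from that text. The empirical results suggest that understanding substantially enhances generation (verified on GenEval), while generation in turn strengthens fine-grained visual perception.
Reference score (independently computed attention score): 69.38946823657592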
License: http://creativecommons.org/publicdomain/zero/1.0/
Abstract: The pursuit of unified multimodal models (UMMs) has long been hindered by a fundamental schism between multimodal understanding and generation. Current approaches typically disentangle the two and treat them as separate endeavors with disjoint objectives, missing the mutual benefits. We argue that true unification requires more than just merging two tasks. It requires a unified, foundational objective that intrinsically links them. In this paper, we introduce an insightful paradigm through the Auto-Encoder lens, i.e., regarding understanding as the encoder (I2T) that compresses images into text, and generation as the decoder (T2I) that reconstructs images from that text. To implement this, we propose UAE, where we begin by pre-training the decoder with the proposed 700k long-context image-caption pairs to direct it to "understand" the fine-grained and complex semantics from the text. We then propose Unified-GRPO via reinforcement learning (RL) to unify the two, which covers two complementary stages: (1) Generation for Understanding, where the encoder is trained to generate informative captions that maximize the decoder's reconstruction quality, enhancing its visual perception; (2) Understanding for Generation, where the decoder is refined to reconstruct from these captions, forcing it to leverage every detail and improving its long-context instruction following and generation fidelity. Our empirical results suggest that understanding can largely enhance generation (verified on GenEval), while generation, in turn, notably strengthens fine-grained visual perception like small object and color recognition (verified on MMT-Bench). This bidirectional improvement reveals a deep synergy: under the unified reconstruction objective, generation and understanding can mutually benefit each other, moving closer to truly unified multimodal intelligence.
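Abstract (reference translation): The pursuit of unified multimodal models (UMMs) has long been held back by a fundamental split between multimodal understanding and generation. Current approaches usually separate the two and treat them as independent efforts with disjoint objectives, so they miss the benefits each could bring to the other. We argue that true unification takes more than merging two tasks: it needs a unified, foundational objective that links them intrinsically. In this paper, we introduce a paradigm based on the Auto-Encoder view, in which understanding is the encoder (I2T) that compresses images into text and generation is the decoder (T2I) that reconstructs images from that text. To implement it, we propose UAE, in which the decoder is first pre-trained on 700k long-context image-caption pairs so that it learns to "understand" fine-grained and complex semantics from text. We then propose Unified-GRPO, based on reinforcement learning (RL), to unify the two. It covers two complementary stages: (1) Generation for Understanding, where the encoder is trained to generate informative captions that maximize the decoder's reconstruction quality, which improves its visual perception; and (2) Understanding for Generation, where the decoder is refined to reconstruct from these captions, which forces it to use every detail and improves its long-context instruction following and generation fidelity. Our empirical results suggest that understanding substantially enhances generation (verified on GenEval), while generation in turn notably strengthens fine-grained visual perception such as small-object and color recognition (verified on MMT-Bench). This bidirectional improvement points to a deep synergy: under the unified reconstruction objective, generation and understanding benefit each other, moving closer to truly unified multimodal intelligence.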
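
The "Generation for Understanding" stage described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: it only shows how a reconstruction-based reward could be turned into the group-relative advantages that GRPO-style training uses. The cosine-similarity reward, the `decode_and_embed` callable (a stand-in for "generate an image from the caption, then embed it"), and the toy embedding are all assumptions made for illustration; the abstract does not specify them.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def reconstruction_rewards(
    image_embedding: Vector,
    captions: List[str],
    decode_and_embed: Callable[[str], Vector],
) -> List[float]:
    """Score each candidate caption by how well its reconstruction matches.

    ASSUMPTION: the reward is the cosine similarity between the embedding of
    the original image and the embedding of the image reconstructed from the
    caption. `decode_and_embed` is a hypothetical helper, not a real API.
    """
    return [
        cosine_similarity(image_embedding, decode_and_embed(caption))
        for caption in captions
    ]


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    mean = sum(rewards) / len(rewards)
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(variance)
    return [(r - mean) / (std + eps) for r in rewards]


def toy_decode_and_embed(caption: str) -> List[float]:
    """Toy stand-in (ASSUMPTION): count three attribute words in the caption."""
    words = caption.lower().split()
    return [float(words.count(w)) for w in ("red", "small", "cat")]


# One group of candidate captions sampled for the same image.
image_embedding = [1.0, 1.0, 1.0]  # toy embedding of "a small red cat"
captions = ["a cat", "a red cat", "a small red cat"]

rewards = reconstruction_rewards(image_embedding, captions, toy_decode_and_embed)
advantages = group_relative_advantages(rewards)

for caption, reward, advantage in zip(captions, rewards, advantages):
    print(f"{caption!r}: reward={reward:.3f}, advantage={advantage:+.3f}")
```

In this toy setup the most detailed caption gets the highest reward and therefore the largest advantage, which matches the intuition in the abstract that the encoder is pushed toward informative captions. In the "Understanding for Generation" stage the roles would be swapped: the captions are held fixed and the decoder is the policy being refined.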

Related papers