Fugu-MT Paper Translation (Summary): Semantic Generative Tuning for Unified Multimodal Models

Paper summary: Semantic Generative Tuning for Unified Multimodal Models

arxiv url: http://arxiv.org/abs/2605.18714v1
Date: Mon, 18 May 2026 17:46:46 GMT
Status: Translation complete
Last updated in system: 2026-05-19 17:57:50.211072
Title: Semantic Generative Tuning for Unified Multimodal Models
Authors: Songsong Yu, Yuxin Chen, Ying Shan, Yanwei Li
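Abstract summary: Unified multimodal models (UMMs) strive to consolidate visual understanding and visual generation within a single architecture. Prevailing training paradigms optimize understanding through sparse text signals and generation through dense pixel objectives, independently of each other. This work is the first systematic study of generative post-training, formulating hierarchical visual tasks as generative proxies to bridge this isolation in UMMs.
Reference score (independently computed attention level): 62.18894352635965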
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Unified multimodal models (UMMs) strive to consolidate visual understanding and visual generation within a single architecture. However, prevailing training paradigms independently optimize understanding via sparse text signals and generation through dense pixel objectives. Such a decoupled strategy yields misaligned representation spaces, isolating visual understanding from generation and hindering their mutual reinforcement. This work presents the first systematic investigation into generative post-training, where we formulate hierarchical visual tasks as generative proxies to bridge the isolation in UMMs. Our empirical investigation reveals that high-level semantic tasks, particularly image segmentation, serve as optimal proxies. Unlike low-level tasks that distract models with texture details, segmentation provides structural semantics that significantly enhance both vision-centric perception and generative layout fidelity. Building upon these insights, we introduce Semantic Generative Tuning (SGT), a novel paradigm that leverages segmentation as a generative proxy to align and synergize multimodal capabilities. Mechanistic analyses further demonstrate that SGT fundamentally improves feature linear separability and optimizes visual-textual attention allocation patterns. Extensive evaluations show that SGT consistently improves both multimodal comprehension and generative fidelity across mainstream benchmarks. Our code is available at https://song2yu.github.io/SGT/.

Related papers