Fugu-MT Paper Translation (Abstract): Representation Forcing for Bottleneck-Free Unified Multimodal Models

Paper summary: Representation Forcing for Bottleneck-Free Unified Multimodal Models

arxiv url: http://arxiv.org/abs/2605.31604v2
Date: Wed, 03 Jun 2026 10:27:19 GMT
Status: Translation complete
Last updated in system: 2026-06-05 07:07:40.435913
Title: Representation Forcing for Bottleneck-Free Unified Multimodal Models
Title (reference translation): Representation Forcing for Bottleneck-Free Unified Multimodal Models
Authors: Yuqing Wang, Zhijie Lin, Ceyuan Yang, Yang Zhao, Fei Xiao, Hao He, Qi Zhao, Zihan Ding, Fuyun Wang, Shuai Wang, Youliang Zhang, Haoqi Fan, Xihui Liu
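Abstract summary: Unified multimodal models (UMMs) aim to handle perception and generation in a single model. Existing UMMs rely on a frozen, separately pretrained VAE for image generation, which imposes a structural bottleneck. This paper proposes Representation Forcing (RF), a technique that closes the quality gap left by removing the VAE by making representation prediction a native capability of the model.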
Reference score (independently computed attention score): 76.99907273945493
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Unified multimodal models (UMMs) aim to handle perception and generation in a single model. Yet existing UMMs still rely on a frozen, separately pretrained VAE for image generation, imposing a structural bottleneck. Naively removing it introduces a quality gap, as the model must learn both high-level structure and low-level details from raw pixels. In this paper, we propose Representation Forcing (RF), a technique that closes this gap by making representation prediction a native capability of the model. Concretely, RF forces the decoder to autoregressively predict visual representations as intermediate tokens before pixels; these tokens then stay in context to guide pixel diffusion within the same backbone. By turning representations from perception outputs into generation targets, RF eliminates the need for any external generative latent space. We find that RF benefits both understanding and generation. On image generation, our pixel-space model with RF matches state-of-the-art VAE-based unified models. On image understanding, pixel-space RF generally outperforms its VAE-based variant. Together, these results offer an effective step toward end-to-end, bottleneck-free UMMs.
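Abstract (reference translation): Unified multimodal models (UMMs) aim to handle perception and generation in a single model. However, existing UMMs still depend on a frozen, separately pretrained VAE for image generation, which imposes a structural bottleneck. Naively removing the VAE opens a quality gap, because the model must then learn both high-level structure and low-level detail from raw pixels. This paper proposes Representation Forcing (RF), a technique that closes this gap by making representation prediction a native capability of the model. Specifically, RF makes the decoder autoregressively predict visual representations as intermediate tokens before pixels, and these tokens remain in context to guide pixel diffusion within the same backbone. By turning representations from perception outputs into generation targets, RF removes the need for an external generative latent space. RF benefits both understanding and generation. In image generation, the pixel-space model with RF matches state-of-the-art VAE-based unified models. In image understanding, pixel-space RF generally outperforms its VAE-based variant. Together, these results are an effective step toward end-to-end, bottleneck-free UMMs.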
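
The two-phase ordering described in the abstract (representation tokens predicted autoregressively first, then pixel denoising with those tokens still in context) can be illustrated with a toy control-flow sketch. This is not the authors' implementation: `toy_backbone`, `generate`, and every parameter name are hypothetical, the "backbone" is just a mean over context vectors, and the "denoiser" is a plain interpolation toward that mean rather than a real diffusion model. Only the order of operations reflects the abstract.

```python
import random
from typing import List, Sequence, Tuple

Vector = List[float]


def toy_backbone(context: Sequence[Vector]) -> Vector:
    """Hypothetical stand-in for the shared backbone: mean of the context."""
    dim = len(context[0])
    return [sum(v[i] for v in context) / len(context) for i in range(dim)]


def generate(
    prompt: Sequence[Vector],
    num_rep_tokens: int,
    num_patches: int,
    denoise_steps: int,
    seed: int = 0,
) -> Tuple[List[Vector], List[Vector]]:
    """Toy two-phase generation: representation tokens first, pixels second."""
    rng = random.Random(seed)
    dim = len(prompt[0])
    context: List[Vector] = [list(v) for v in prompt]

    # Phase 1: predict representation tokens one at a time (autoregressive).
    rep_tokens: List[Vector] = []
    for _ in range(num_rep_tokens):
        next_token = toy_backbone(context)
        rep_tokens.append(next_token)
        context.append(next_token)  # the token stays in context

    # Phase 2: start pixel patches from noise and refine them, conditioned on
    # the context that now includes the representation tokens.
    patches = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_patches)]
    for step in range(denoise_steps):
        guide = toy_backbone(context)  # same backbone, same context
        alpha = 1.0 / (denoise_steps - step)  # reaches 1.0 on the last step
        patches = [
            [p + alpha * (g - p) for p, g in zip(patch, guide)]
            for patch in patches
        ]
    return rep_tokens, patches


if __name__ == "__main__":
    reps, pixels = generate(
        [[1.0, 0.0], [0.0, 1.0]], num_rep_tokens=3, num_patches=4, denoise_steps=5
    )
    print(len(reps), len(pixels))
```

The point of the sketch is that both phases call the same function on the same growing context, so no separate latent encoder or decoder appears anywhere in the loop.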

Related papers