Fugu-MT Paper Translation (Abstract): Cheers: Decoupling Patch Details from Semantic Representations Enables Unified Multimodal Comprehension and Generation

Paper overview: Cheers: Decoupling Patch Details from Semantic Representations Enables Unified Multimodal Comprehension and Generation

arxiv url: http://arxiv.org/abs/2603.12793v1
Date: Fri, 13 Mar 2026 08:55:27 GMT
Status: Translation complete
Last updated in system: 2026-03-16 17:38:12.003746
Title: Cheers: Decoupling Patch Details from Semantic Representations Enables Unified Multimodal Comprehension and Generation
Title (reference translation): Cheers: Decoupling patch details from semantic representations enables unified multimodal comprehension and generation
Authors: Yichen Zhang, Da Peng, Zonghao Guo, Zijian Zhang, Xuesong Yang, Tong Sun, Shichu Sun, Yidan Zhang, Yanghao Li, Haiyan Zhao, Wang Xu, Qi Shi, Yangang Sun, Chi Chen, Shuo Wang, Yukun Yan, Xu Han, Qiang Ma, Wei Ke, Liang Wang, Zhiyuan Liu, Maosong Sun
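Abstract summary: Cheers is a unified multimodal model that decouples patch-level details from semantic representations. Cheers matches or surpasses advanced UMMs in both visual understanding and generation.
Reference score (independently computed attention score): 66.53544128707817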
License: http://creativecommons.org/licenses/by/4.0/
Abstract: A recent cutting-edge topic in multimodal modeling is to unify visual comprehension and generation within a single model. However, the two tasks demand mismatched decoding regimes and visual representations, making it non-trivial to jointly optimize within a shared feature space. In this work, we present Cheers, a unified multimodal model that decouples patch-level details from semantic representations, thereby stabilizing semantics for multimodal understanding and improving fidelity for image generation via gated detail residuals. Cheers includes three key components: (i) a unified vision tokenizer that encodes and compresses image latent states into semantic tokens for efficient LLM conditioning, (ii) an LLM-based Transformer that unifies autoregressive decoding for text generation and diffusion decoding for image generation, and (iii) a cascaded flow matching head that decodes visual semantics first and then injects semantically gated detail residuals from the vision tokenizer to refine high-frequency content. Experiments on popular benchmarks demonstrate that Cheers matches or surpasses advanced UMMs in both visual understanding and generation. Cheers also achieves 4x token compression, enabling more efficient high-resolution image encoding and generation. Notably, Cheers outperforms the Tar-1.5B on the popular benchmarks GenEval and MMBench, while requiring only 20% of the training cost, indicating effective and efficient (i.e., 4x token compression) unified multimodal modeling. We will release all code and data for future research.
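Abstract (reference translation): A recent cutting-edge topic in multimodal modeling is unifying visual comprehension and generation within a single model. However, the two tasks call for mismatched decoding schemes and visual representations, so jointly optimizing them within a shared feature space is not straightforward. In this work, we propose Cheers, a unified multimodal model that decouples patch-level details from semantic representations, which stabilizes semantics for multimodal understanding and improves image-generation fidelity through gated detail residuals. Cheers comprises three key components: (i) a unified vision tokenizer that encodes and compresses image latent states into semantic tokens for efficient LLM conditioning; (ii) an LLM-based Transformer that unifies autoregressive decoding for text generation with diffusion decoding for image generation; and (iii) a cascaded flow matching head that first decodes visual semantics and then injects semantically gated detail residuals from the vision tokenizer to refine high-frequency content. Experiments on popular benchmarks show that Cheers matches or surpasses advanced UMMs in both visual understanding and generation. Cheers also achieves 4x token compression, enabling more efficient high-resolution image encoding and generation. Notably, Cheers outperforms Tar-1.5B on the popular benchmarks GenEval and MMBench while requiring only 20% of the training cost, indicating effective and efficient (i.e., 4x token compression) unified multimodal modeling. All code and data will be released for future research.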
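Illustrative note: the abstract mentions "semantically gated detail residuals" and "4x token compression" without giving formulas. The following is only a minimal sketch of what such a gate could look like, assuming an element-wise sigmoid gate and assuming the compression simply divides the patch count by four. The function names `gated_detail_residual` and `compressed_token_count`, and the idea that the gate logits are supplied directly, are hypothetical and not taken from the paper.

```python
import math


def sigmoid(x: float) -> float:
    """Standard logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_detail_residual(semantic, detail, gate_logits):
    """Add detail features to semantic features, scaled by a gate in (0, 1).

    Assumption: in a real model the gate logits would be predicted from the
    semantic features; here they are passed in directly for illustration.
    """
    return [s + sigmoid(g) * d for s, d, g in zip(semantic, detail, gate_logits)]


def compressed_token_count(height_patches: int, width_patches: int, factor: int = 4) -> int:
    """Token count after compressing a patch grid by `factor` (assumed 4x)."""
    return (height_patches * width_patches) // factor


semantic = [0.5, -1.0, 2.0]
detail = [0.2, 0.4, -0.6]

# A strongly negative logit closes the gate: output stays close to the semantics.
closed = gated_detail_residual(semantic, detail, [-50.0] * 3)
# A strongly positive logit opens the gate: the full detail residual is added.
opened = gated_detail_residual(semantic, detail, [50.0] * 3)
```

Under these assumptions, a closed gate leaves the semantic features essentially untouched, while an open gate adds the full detail residual, which is one plausible reading of how detail could be injected without disturbing the semantics used for understanding.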

Related papers