Fugu-MT Paper Translation (Abstract): Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal Understanding and Generation

Paper overview: Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal Understanding and Generation

arxiv url: http://arxiv.org/abs/2604.24763v1
Date: Mon, 27 Apr 2026 17:59:56 GMT
Status: Translation complete
Last updated in system: 2026-04-28 17:12:08.364142
Title: Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal Understanding and Generation
Authors: Zhiheng Liu, Weiming Ren, Xiaoke Huang, Shoufa Chen, Tianhong Li, Mengzhao Chen, Yatai Ji, Sen He, Jonas Schult, Belinda Zeng, Tao Xiang, Wenhu Chen, Ping Luo, Luke Zettlemoyer, Yuren Cong
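Abstract summary: Tuna-2 is a native unified multimodal model that performs visual understanding and generation based on pixel embeddings. Experiments show that Tuna-2 achieves state-of-the-art performance on multimodal benchmarks.
Reference score (site-computed attention score): 108.23557570345356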
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Unified multimodal models typically rely on pretrained vision encoders and use separate visual representations for understanding and generation, creating misalignment between the two tasks and preventing fully end-to-end optimization from raw pixels. We introduce Tuna-2, a native unified multimodal model that performs visual understanding and generation directly based on pixel embeddings. Tuna-2 drastically simplifies the model architecture by employing simple patch embedding layers to encode visual input, completely discarding the modular vision encoder designs such as the VAE or the representation encoder. Experiments show that Tuna-2 achieves state-of-the-art performance in multimodal benchmarks, demonstrating that unified pixel-space modelling can fully compete with latent-space approaches for high-quality image generation. Moreover, while the encoder-based variant converges faster in early pretraining, Tuna-2's encoder-free design achieves stronger multimodal understanding at scale, particularly on tasks requiring fine-grained visual perception. These results show that pretrained vision encoders are not necessary for multimodal modelling, and end-to-end pixel-space learning offers a scalable path toward stronger visual representations for both generation and perception.
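The abstract says visual input is encoded with "simple patch embedding layers" rather than a pretrained encoder. As a rough illustration of what a patch embedding layer generally does, the sketch below splits an image into non-overlapping patches, flattens each one, and applies a single linear projection. This is a generic, pure-Python sketch and not the authors' implementation: the function names `patchify` and `embed_patches`, the image size, the patch size, and the embedding width are all hypothetical values chosen for illustration.

```python
import random


def patchify(image, patch):
    """Split an H x W x C image (nested lists) into flattened, non-overlapping patches."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            # Flatten one patch in row-major order, channels innermost.
            flat = [
                channel
                for row in image[top:top + patch]
                for pixel in row[left:left + patch]
                for channel in pixel
            ]
            patches.append(flat)
    return patches


def embed_patches(patches, weight, bias):
    """Apply one linear layer to every patch: token = weight @ patch + bias."""
    return [
        [sum(w * x for w, x in zip(w_row, p)) + b for w_row, b in zip(weight, bias)]
        for p in patches
    ]


# Hypothetical sizes, for illustration only (not taken from the paper).
H, W, C, PATCH, DIM = 8, 8, 3, 4, 16
rng = random.Random(0)
image = [[[rng.random() for _ in range(C)] for _ in range(W)] for _ in range(H)]
weight = [[rng.gauss(0.0, 0.02) for _ in range(PATCH * PATCH * C)] for _ in range(DIM)]
bias = [0.0] * DIM

tokens = embed_patches(patchify(image, PATCH), weight, bias)
print(len(tokens), len(tokens[0]))  # prints: 4 16
```

Each 4 x 4 x 3 patch becomes one 16-dimensional token, so the 8 x 8 image yields four tokens. In practice such a layer is often written as a strided convolution, which computes the same linear map per patch.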


Related papers