Fugu-MT paper translation (abstract): Pisces: An Auto-regressive Foundation Model for Image Understanding and Generation

Paper summary: Pisces: An Auto-regressive Foundation Model for Image Understanding and Generation

arxiv url: http://arxiv.org/abs/2506.10395v1
Date: Thu, 12 Jun 2025 06:37:34 GMT
Status: translation complete
Last updated in system: 2025-06-13 15:37:22.607239
Title: Pisces: An Auto-regressive Foundation Model for Image Understanding and Generation
Title (reference translation): Pisces: An Auto-regressive Foundation Model for Image Understanding and Generation
Authors: Zhiyang Xu, Jiuhai Chen, Zhaojiang Lin, Xichen Pan, Lifu Huang, Tianyi Zhou, Madian Khabsa, Qifan Wang, Di Jin, Michihiro Yasunaga, Lili Yu, Xi Victoria Lin, Shaoliang Nie
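Abstract summary: A key challenge in developing unified models lies in the differences between the visual features needed for image understanding and those needed for generation. This paper introduces Pisces, an auto-regressive multimodal foundation model that addresses this challenge. Combined with meticulous data curation, pretraining, and finetuning, Pisces achieves competitive performance in both image understanding and image generation.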
Reference score (attention score computed independently by this site): 81.92275347127833
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent advances in large language models (LLMs) have enabled multimodal foundation models to tackle both image understanding and generation within a unified framework. Despite these gains, unified models often underperform compared to specialized models in either task. A key challenge in developing unified models lies in the inherent differences between the visual features needed for image understanding versus generation, as well as the distinct training processes required for each modality. In this work, we introduce Pisces, an auto-regressive multimodal foundation model that addresses this challenge through a novel decoupled visual encoding architecture and tailored training techniques optimized for multimodal generation. Combined with meticulous data curation, pretraining, and finetuning, Pisces achieves competitive performance in both image understanding and image generation. We evaluate Pisces on over 20 public benchmarks for image understanding, where it demonstrates strong performance across a wide range of tasks. Additionally, on GenEval, a widely adopted benchmark for image generation, Pisces exhibits robust generative capabilities. Our extensive analysis reveals the synergistic relationship between image understanding and generation, and the benefits of using separate visual encoders, advancing the field of unified multimodal models.
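Abstract (reference translation): Recent advances in large language models (LLMs) have made it possible to tackle both image understanding and generation within a unified framework. Despite these gains, unified models often underperform specialized models on either task. A key challenge in developing unified models lies in the differences between the visual features needed for image understanding and those needed for generation, as well as the separate training processes required for each modality. This paper introduces Pisces, an auto-regressive multimodal foundation model that addresses this challenge. Combined with meticulous data curation, pretraining, and finetuning, Pisces achieves competitive performance in both image understanding and image generation. Pisces is evaluated on more than 20 public benchmarks for image understanding, where it shows strong performance across a wide range of tasks. In addition, on GenEval, a widely adopted benchmark for image generation, Pisces shows robust generative capabilities. The analysis reveals the synergistic relationship between image understanding and generation and the benefits of using separate visual encoders, advancing the field of unified multimodal models.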
