Fugu-MT Paper Translation (Abstract): Image Generators are Generalist Vision Learners

Paper summary: Image Generators are Generalist Vision Learners

arxiv url: http://arxiv.org/abs/2604.20329v1
Date: Wed, 22 Apr 2026 08:23:48 GMT
Status: Translation complete
Last updated in system: 2026-04-23 15:36:11.041202
Title: Image Generators are Generalist Vision Learners
Title (reference translation): Image generators are generalist vision learners
Authors: Valentin Gabeur, Shangbang Long, Songyou Peng, Paul Voigtlaender, Shuyang Sun, Yanan Bao, Karen Truong, Zhicheng Wang, Wenlei Zhou, Jonathan T. Barron, Kyle Genova, Nithish Kannen, Sherry Ben, Yandong Li, Mandy Guo, Suhas Yogin, Yiming Gu, Huizhong Chen, Oliver Wang, Saining Xie, Howard Zhou, Kaiming He, Thomas Funkhouser, Jean-Baptiste Alayrac, Radu Soricut
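Abstract summary: The authors show that image generation training plays a role for vision similar to the role LLM pretraining plays for language understanding and reasoning. They introduce Vision Banana, a generalist model built by instruction-tuning Nano Banana Pro (NBP) on a mixture of its original training data and a small amount of vision task data. The model achieves SOTA results on a variety of vision tasks involving both 2D and 3D understanding, beating or rivaling zero-shot domain specialists.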
Reference score (independently computed attention score): 71.11980587450198
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent works show that image and video generators exhibit zero-shot visual understanding behaviors, in a way reminiscent of how LLMs develop emergent capabilities of language understanding and reasoning from generative pretraining. While it has long been conjectured that the ability to create visual content implies an ability to understand it, there has been limited evidence that generative vision models have developed strong understanding capabilities. In this work, we demonstrate that image generation training serves a role similar to LLM pretraining, and lets models learn powerful and general visual representations that enable SOTA performance on various vision tasks. We introduce Vision Banana, a generalist model built by instruction-tuning Nano Banana Pro (NBP) on a mixture of its original training data alongside a small amount of vision task data. By parameterizing the output space of vision tasks as RGB images, we seamlessly reframe perception as image generation. Our generalist model, Vision Banana, achieves SOTA results on a variety of vision tasks involving both 2D and 3D understanding, beating or rivaling zero-shot domain-specialists, including Segment Anything Model 3 on segmentation tasks, and the Depth Anything series on metric depth estimation. We show that these results can be achieved with lightweight instruction-tuning without sacrificing the base model's image generation capabilities. The superior results suggest that image generation pretraining is a generalist vision learner. It also shows that image generation serves as a unified and universal interface for vision tasks, similar to text generation's role in language understanding and reasoning. We could be witnessing a major paradigm shift for computer vision, where generative vision pretraining takes a central role in building Foundational Vision Models for both generation and understanding.
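Abstract (reference translation): Recent work shows that image and video generators exhibit zero-shot visual understanding behaviors, reminiscent of how LLMs develop emergent language understanding and reasoning capabilities from generative pretraining. Although it has long been conjectured that the ability to create visual content implies the ability to understand it, evidence that generative vision models have developed strong understanding capabilities has been limited. In this work, we demonstrate that image generation training plays a role similar to LLM pretraining, letting models learn powerful and general visual representations that enable SOTA performance on various vision tasks. We introduce Vision Banana, a generalist model built by instruction-tuning Nano Banana Pro (NBP) on a mixture of its original training data and a small amount of vision task data. By parameterizing the output space of vision tasks as RGB images, we reframe perception as image generation. Our generalist model, Vision Banana, achieves SOTA results on a variety of vision tasks involving both 2D and 3D understanding, beating or rivaling zero-shot domain specialists, including Segment Anything Model 3 on segmentation tasks and the Depth Anything series on metric depth estimation. We show that these results can be achieved with lightweight instruction-tuning, without sacrificing the base model's image generation capabilities. The superior results suggest that image generation pretraining is a generalist vision learner. They also show that image generation serves as a unified and universal interface for vision tasks, much as text generation does for language understanding and reasoning. We could be witnessing a major paradigm shift in computer vision, in which generative vision pretraining takes a central role in building foundational vision models for both generation and understanding.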
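
The abstract says vision task outputs are parameterized as RGB images, but it does not give the encoding. As a rough illustration only, the sketch below shows one way a segmentation mask could be written as an RGB image and read back. The palette, the function names `encode_mask` and `decode_image`, and the nearest-colour decoding rule are all assumptions made for this example, not the paper's method.

```python
# Hypothetical palette mapping class id -> RGB colour.
# These colours are an assumption for illustration, not taken from the paper.
PALETTE = {
    0: (0, 0, 0),      # background
    1: (255, 0, 0),    # class 1
    2: (0, 255, 0),    # class 2
    3: (0, 0, 255),    # class 3
}


def encode_mask(mask):
    """Turn a 2D grid of class ids into a 2D grid of RGB tuples."""
    return [[PALETTE[class_id] for class_id in row] for row in mask]


def decode_image(image):
    """Map each RGB pixel back to the class id of the nearest palette colour.

    Nearest-colour matching tolerates pixels that are slightly off,
    as a generated image would typically be.
    """
    def nearest(pixel):
        return min(
            PALETTE,
            key=lambda c: sum((a - b) ** 2 for a, b in zip(pixel, PALETTE[c])),
        )

    return [[nearest(pixel) for pixel in row] for row in image]


mask = [[0, 1], [2, 3]]
rgb = encode_mask(mask)

# Simulate imperfect generation by nudging every channel by 10 (clamped to 255).
noisy = [[tuple(min(255, ch + 10) for ch in px) for px in row] for row in rgb]

recovered = decode_image(noisy)
print(recovered == mask)
```

A fixed, well-separated palette keeps decoding unambiguous under small colour errors. Continuous outputs such as metric depth would need a different mapping from values to colours, which the abstract does not specify.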

Related papers