Uni-Perceiver v2: A Generalist Model for Large-Scale Vision and
Vision-Language Tasks
- URL: http://arxiv.org/abs/2211.09808v1
- Date: Thu, 17 Nov 2022 18:59:52 GMT
- Title: Uni-Perceiver v2: A Generalist Model for Large-Scale Vision and
Vision-Language Tasks
- Authors: Hao Li, Jinguo Zhu, Xiaohu Jiang, Xizhou Zhu, Hongsheng Li, Chun Yuan,
Xiaohua Wang, Yu Qiao, Xiaogang Wang, Wenhai Wang, Jifeng Dai
- Abstract summary: We propose Uni-Perceiver v2, which is the first generalist model capable of handling major large-scale vision and vision-language tasks.
Specifically, images are encoded as general region proposals, while texts are encoded via a Transformer-based language model.
Uni-Perceiver v2 achieves competitive performance on a broad range of vision and vision-language tasks.
- Score: 86.66733026149892
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite the remarkable success of foundation models, their task-specific
fine-tuning paradigm makes them inconsistent with the goal of general
perception modeling. The key to eliminating this inconsistency is to use
generalist models for general task modeling. However, existing attempts at
generalist models are inadequate in both versatility and performance. In this
paper, we propose Uni-Perceiver v2, which is the first generalist model capable
of handling major large-scale vision and vision-language tasks with competitive
performance. Specifically, images are encoded as general region proposals,
while texts are encoded via a Transformer-based language model. The encoded
representations are transformed by a task-agnostic decoder. Different tasks are
formulated as a unified maximum likelihood estimation problem. We further
propose an improved optimizer to ensure stable multi-task learning with an
unmixed sampling strategy, which is helpful for tasks requiring large
batch-size training. After being jointly trained on various tasks,
Uni-Perceiver v2 is capable of directly handling downstream tasks without any
task-specific adaptation. Results show that Uni-Perceiver v2 outperforms all
existing generalist models in both versatility and performance. Meanwhile,
compared with the commonly recognized strong baselines that require
task-specific fine-tuning, Uni-Perceiver v2 achieves competitive performance
on a broad range of vision and vision-language tasks.
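
The abstract outlines a concrete recipe: modality-specific encoders feed a shared task-agnostic decoder, every task is cast as maximum likelihood estimation over a candidate set, and each training iteration draws its batch from a single task (unmixed sampling). The PyTorch sketch below illustrates that flow under toy assumptions; the module sizes, the stand-in region-proposal encoder, and the three example tasks are illustrative, not the paper's implementation.

    # Minimal PyTorch sketch of the training recipe described in the abstract.
    # Module sizes, names, and the toy tasks are illustrative assumptions,
    # not the paper's actual implementation.
    import random
    import torch
    import torch.nn as nn

    D = 256  # shared representation width (assumed)

    class GeneralistModel(nn.Module):
        def __init__(self, vocab_size=1000):
            super().__init__()
            # Stand-in image encoder: the paper encodes images as "general
            # region proposals"; a linear map over region features plays
            # that role here.
            self.image_encoder = nn.Linear(2048, D)
            # Text encoder: a Transformer-based language model (toy-sized).
            self.text_embed = nn.Embedding(vocab_size, D)
            enc = nn.TransformerEncoderLayer(D, nhead=8, batch_first=True)
            self.text_encoder = nn.TransformerEncoder(enc, num_layers=2)
            # Task-agnostic decoder shared by every task.
            dec = nn.TransformerEncoderLayer(D, nhead=8, batch_first=True)
            self.decoder = nn.TransformerEncoder(dec, num_layers=2)
            # Unified head: each task scores a candidate set, so every task
            # reduces to maximum likelihood estimation over candidates.
            self.head = nn.Linear(D, vocab_size)

        def forward(self, image_feats=None, text_ids=None):
            parts = []
            if image_feats is not None:
                parts.append(self.image_encoder(image_feats))
            if text_ids is not None:
                parts.append(self.text_encoder(self.text_embed(text_ids)))
            x = self.decoder(torch.cat(parts, dim=1))  # (B, seq, D)
            return self.head(x.mean(dim=1))            # (B, vocab) logits

    def unified_mle_loss(logits, targets):
        # All tasks share one maximum-likelihood objective.
        return nn.functional.cross_entropy(logits, targets)

    def sample_batch(task, batch_size=16):
        # Unmixed sampling: a batch contains ONE task only, so tasks that
        # need large batches can be given them without mixing.
        img = torch.randn(batch_size, 8, 2048) if task != "retrieval" else None
        txt = torch.randint(0, 1000, (batch_size, 12)) if task != "detection" else None
        tgt = torch.randint(0, 1000, (batch_size,))
        return img, txt, tgt

    model = GeneralistModel()
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for step in range(10):
        task = random.choice(["detection", "retrieval", "captioning"])
        img, txt, tgt = sample_batch(task)
        loss = unified_mle_loss(model(img, txt), tgt)
        opt.zero_grad()
        loss.backward()
        opt.step()  # the paper pairs this loop with an improved optimizer

The point of the unmixed loop is that a retrieval-style task can spend the full batch budget on itself, at the cost of noisier cross-task gradients, which is what the paper's improved optimizer is meant to stabilize.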
Related papers
- Griffon-G: Bridging Vision-Language and Vision-Centric Tasks via Large Multimodal Models [27.45225442048711]
We introduce CCMD-8M, which overcomes the data barriers of unifying vision-centric and vision-language tasks.
We also present Griffon-G, a general large multimodal model that addresses both vision-centric and vision-language tasks within a single end-to-end paradigm.
arXiv Detail & Related papers (2024-10-21T16:30:29Z)
- GiT: Towards Generalist Vision Transformer through Universal Language Interface [94.33443158125186]
This paper proposes a simple yet effective framework, called GiT, that handles various vision tasks with only a vanilla ViT.
GiT is a multi-task visual model, jointly trained across five representative benchmarks without task-specific fine-tuning.
arXiv Detail & Related papers (2024-03-14T13:47:41Z)
- UnIVAL: Unified Model for Image, Video, Audio and Language Tasks [105.77733287326308]
The UnIVAL model goes beyond two modalities and unifies text, images, video, and audio into a single model.
Our model is efficiently pretrained on many tasks, based on task balancing and multimodal curriculum learning.
Thanks to the unified model, we propose a novel study on multimodal model merging via weight generalization.
arXiv Detail & Related papers (2023-07-30T09:48:36Z)
- An Efficient General-Purpose Modular Vision Model via Multi-Task Heterogeneous Training [79.78201886156513]
We present a model that can perform multiple vision tasks and can be adapted to other downstream tasks efficiently.
Our approach achieves comparable results to single-task state-of-the-art models and demonstrates strong generalization on downstream tasks.
arXiv Detail & Related papers (2023-06-29T17:59:57Z)
- Exposing and Addressing Cross-Task Inconsistency in Unified Vision-Language Models [80.23791222509644]
Inconsistent AI models are considered brittle and untrustworthy by human users.
We find that state-of-the-art vision-language models suffer from a surprisingly high degree of inconsistent behavior across tasks.
We propose a rank correlation-based auxiliary training objective, computed over large automatically created cross-task contrast sets.
arXiv Detail & Related papers (2023-03-28T16:57:12Z)
- Uni-Perceiver-MoE: Learning Sparse Generalist Models with Conditional MoEs [63.936622239286685]
We find that interference among different tasks and modalities is the main cause of the performance drop that generalist models often show relative to task-specific ones.
We introduce the Conditional Mixture-of-Experts (Conditional MoEs) to generalist models; a routing sketch follows this list.
Code and pre-trained generalist models shall be released.
arXiv Detail & Related papers (2022-06-09T17:59:59Z)
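
For context on the Conditional MoE idea mentioned above, the sketch below shows one generic form of attribute-conditioned routing, where the gate is conditioned on a per-token attribute such as a modality id. It is a dense, illustrative variant under assumed names and sizes, not the paper's exact gating; real MoE layers also dispatch tokens sparsely to top-k experts.

    # Generic sketch of a Conditional MoE layer: the gate sees only a
    # per-token attribute (here, a modality id), so tokens from different
    # modalities can be routed to different experts and interfere less.
    # All names, sizes, and the attribute choice are assumptions.
    import torch
    import torch.nn as nn

    class ConditionalMoE(nn.Module):
        def __init__(self, d_model=256, num_experts=4, num_attributes=2):
            super().__init__()
            self.experts = nn.ModuleList(
                nn.Sequential(nn.Linear(d_model, 4 * d_model),
                              nn.GELU(),
                              nn.Linear(4 * d_model, d_model))
                for _ in range(num_experts)
            )
            # The gate is conditioned on the attribute, not the hidden state.
            self.gate = nn.Embedding(num_attributes, num_experts)

        def forward(self, x, attr):
            # x: (B, T, d_model); attr: (B, T) integer attribute per token
            weights = torch.softmax(self.gate(attr), dim=-1)          # (B, T, E)
            outs = torch.stack([e(x) for e in self.experts], dim=-1)  # (B, T, D, E)
            return torch.einsum("btde,bte->btd", outs, weights)

    # Toy usage: 6 image tokens (attr 0) and 4 text tokens (attr 1).
    moe = ConditionalMoE()
    x = torch.randn(2, 10, 256)
    attr = torch.cat([torch.zeros(2, 6, dtype=torch.long),
                      torch.ones(2, 4, dtype=torch.long)], dim=1)
    y = moe(x, attr)  # (2, 10, 256)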
This list is automatically generated from the titles and abstracts of the papers on this site.