MiniGPT-v2: large language model as a unified interface for
vision-language multi-task learning
- URL: http://arxiv.org/abs/2310.09478v3
- Date: Tue, 7 Nov 2023 18:25:48 GMT
- Title: MiniGPT-v2: large language model as a unified interface for
vision-language multi-task learning
- Authors: Jun Chen, Deyao Zhu, Xiaoqian Shen, Xiang Li, Zechun Liu, Pengchuan
Zhang, Raghuraman Krishnamoorthi, Vikas Chandra, Yunyang Xiong, Mohamed
Elhoseiny
- Abstract summary: MiniGPT-v2 is a model that can be treated as a unified interface for better handling various vision-language tasks.
We propose using unique identifiers for different tasks when training the model.
Our results show that MiniGPT-v2 achieves strong performance on many visual question-answering and visual grounding benchmarks.
- Score: 65.60607895153692
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models have shown their remarkable capabilities as a general
interface for various language-related applications. Motivated by this, we
target to build a unified interface for completing many vision-language tasks
including image description, visual question answering, and visual grounding,
among others. The challenge is to use a single model for performing diverse
vision-language tasks effectively with simple multi-modal instructions. Towards
this objective, we introduce MiniGPT-v2, a model that can be treated as a
unified interface for better handling various vision-language tasks. We propose
using unique identifiers for different tasks when training the model. These
identifiers enable our model to better distinguish each task instruction
effortlessly and also improve the model learning efficiency for each task.
After the three-stage training, the experimental results show that MiniGPT-v2
achieves strong performance on many visual question-answering and visual
grounding benchmarks compared to other vision-language generalist models. Our
model and codes are available at https://minigpt-v2.github.io/
Related papers
- EVLM: An Efficient Vision-Language Model for Visual Understanding [18.794601813330715]
This paper proposes an efficient multi-modal language model to minimize computational costs.
Our model achieves competitive scores on public multi-modal benchmarks and performs well in tasks such as image captioning and video captioning.
arXiv Detail & Related papers (2024-07-19T10:09:51Z) - Visual Instruction Tuning towards General-Purpose Multimodal Model: A
Survey [59.95153883166705]
Traditional computer vision generally solves each single task independently by a dedicated model with the task instruction implicitly designed in the model architecture.
Visual Instruction Tuning (VIT) has been intensively studied recently, which finetunes a large vision model with language as task instructions.
This work aims to provide a systematic review of visual instruction tuning, covering (1) the background that presents computer vision task paradigms and the development of VIT; (2) the foundations of VIT that introduce commonly used network architectures, visual instruction tuning frameworks and objectives, and evaluation setups and tasks; and (3) the commonly used datasets in visual instruction tuning and evaluation.
arXiv Detail & Related papers (2023-12-27T14:54:37Z) - PaLI-X: On Scaling up a Multilingual Vision and Language Model [166.9837904115951]
We present the training recipe and results of scaling up PaLI-X, a multilingual vision and language model.
Our model achieves new levels of performance on a wide-range of varied and complex tasks.
We observe emerging capabilities, such as complex counting and multilingual object detection, tasks that are not explicitly in the training mix.
arXiv Detail & Related papers (2023-05-29T18:58:38Z) - Images in Language Space: Exploring the Suitability of Large Language
Models for Vision & Language Tasks [17.97052348690598]
Large language models have demonstrated robust performance on various language tasks using zero-shot or few-shot learning paradigms.
multimodal models that can additionally handle images as input have yet to catch up in size and generality with language-only models.
We make visual information accessible to the language model using separate verbalisation models.
arXiv Detail & Related papers (2023-05-23T07:50:36Z) - FAME-ViL: Multi-Tasking Vision-Language Model for Heterogeneous Fashion
Tasks [129.49630356651454]
We propose a novel FAshion-focused Multi-task Efficient learning method for Vision-and-Language tasks (FAME-ViL)
Our FAME-ViL can save 61.5% of parameters over alternatives, while significantly outperforming the conventional independently trained single-task models.
arXiv Detail & Related papers (2023-03-04T19:07:48Z) - Multitask Learning for Low Resource Spoken Language Understanding [26.106133114838215]
We train models on dual objectives with automatic speech recognition and intent classification or sentiment classification.
Our models, although being of modest size, show improvements over models trained end-to-end on intent classification.
We study the performance of the models in low-resource scenario by training the models with as few as one example per class.
arXiv Detail & Related papers (2022-11-24T16:38:17Z) - Zero Experience Required: Plug & Play Modular Transfer Learning for
Semantic Visual Navigation [97.17517060585875]
We present a unified approach to visual navigation using a novel modular transfer learning model.
Our model can effectively leverage its experience from one source task and apply it to multiple target tasks.
Our approach learns faster, generalizes better, and outperforms SoTA models by a significant margin.
arXiv Detail & Related papers (2022-02-05T00:07:21Z) - Unifying Vision-and-Language Tasks via Text Generation [81.3910771082967]
We propose a unified framework that learns different tasks in a single architecture.
Our models learn to generate labels in text based on the visual and textual inputs.
Our generative approach shows better generalization ability on answering questions that have rare answers.
arXiv Detail & Related papers (2021-02-04T17:59:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.