InstructBLIP: Towards General-purpose Vision-Language Models with
Instruction Tuning
- URL: http://arxiv.org/abs/2305.06500v2
- Date: Thu, 15 Jun 2023 08:00:18 GMT
- Title: InstructBLIP: Towards General-purpose Vision-Language Models with
Instruction Tuning
- Authors: Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi
Zhao, Weisheng Wang, Boyang Li, Pascale Fung, Steven Hoi
- Abstract summary: We study vision-language instruction tuning based on the pretrained BLIP-2 models.
InstructBLIP attains state-of-the-art zero-shot performance across all 13 held-out datasets.
Our models also lead to state-of-the-art performance when finetuned on individual downstream tasks.
- Score: 43.54069813039309
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large-scale pre-training and instruction tuning have been successful at
creating general-purpose language models with broad competence. However,
building general-purpose vision-language models is challenging due to the rich
input distributions and task diversity resulting from the additional visual
input. Although vision-language pretraining has been widely studied,
vision-language instruction tuning remains under-explored. In this paper, we
conduct a systematic and comprehensive study on vision-language instruction
tuning based on the pretrained BLIP-2 models. We gather 26 publicly available
datasets, covering a wide variety of tasks and capabilities, and transform them
into instruction tuning format. Additionally, we introduce an instruction-aware
Query Transformer, which extracts informative features tailored to the given
instruction. Trained on 13 held-in datasets, InstructBLIP attains
state-of-the-art zero-shot performance across all 13 held-out datasets,
substantially outperforming BLIP-2 and larger Flamingo models. Our models also
lead to state-of-the-art performance when finetuned on individual downstream
tasks (e.g., 90.7% accuracy on ScienceQA questions with image contexts).
Furthermore, we qualitatively demonstrate the advantages of InstructBLIP over
concurrent multimodal models. All InstructBLIP models are open-sourced at
https://github.com/salesforce/LAVIS/tree/main/projects/instructblip.
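To make the instruction-aware Query Transformer concrete, here is a minimal conceptual sketch in PyTorch. It is not the released implementation (which extends the BLIP-2 Q-Former); the module layout, layer counts, and dimensions below are illustrative assumptions. What it shows is the core idea from the abstract: the learnable queries self-attend jointly with the instruction tokens, so the features extracted from the frozen image encoder are conditioned on the instruction.

```python
import torch
import torch.nn as nn

class InstructionAwareQFormerSketch(nn.Module):
    """Conceptual sketch only (not the released implementation).
    Learnable queries are concatenated with instruction token embeddings
    so that self-attention conditions the queries on the instruction,
    while cross-attention reads features from a frozen image encoder."""

    def __init__(self, num_queries=32, d_model=768, n_heads=12, n_layers=2):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, d_model) * 0.02)
        self.self_blocks = nn.ModuleList(
            [nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
             for _ in range(n_layers)])
        self.cross_blocks = nn.ModuleList(
            [nn.MultiheadAttention(d_model, n_heads, batch_first=True)
             for _ in range(n_layers)])

    def forward(self, image_feats, instr_embeds):
        # image_feats:  (B, N_img, d_model) from a frozen vision encoder
        # instr_embeds: (B, N_txt, d_model) embeddings of the instruction tokens
        B = image_feats.size(0)
        q = self.queries.unsqueeze(0).expand(B, -1, -1)
        n_q = q.size(1)
        for self_blk, cross_blk in zip(self.self_blocks, self.cross_blocks):
            # Joint self-attention over [queries; instruction tokens]
            # makes the query features instruction-aware.
            x = self_blk(torch.cat([q, instr_embeds], dim=1))
            q = x[:, :n_q]
            # Only the queries cross-attend to the image features.
            attn_out, _ = cross_blk(q, image_feats, image_feats)
            q = q + attn_out
        # (B, num_queries, d_model): visual features passed on to the frozen LLM
        return q
```

In the paper's design, these instruction-conditioned query outputs are projected and fed to the frozen LLM together with the instruction text itself.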
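Since the checkpoints are released, a quick way to try the model is the Hugging Face Transformers port of InstructBLIP; a minimal zero-shot inference sketch follows. The checkpoint identifier and image path are assumptions, and the official code path remains the LAVIS repository linked above.

```python
# Minimal zero-shot inference sketch using the Hugging Face Transformers port.
# Checkpoint id and image path are assumptions; the official code lives in LAVIS.
import torch
from PIL import Image
from transformers import InstructBlipProcessor, InstructBlipForConditionalGeneration

model_id = "Salesforce/instructblip-vicuna-7b"  # assumed checkpoint identifier
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = InstructBlipProcessor.from_pretrained(model_id)
model = InstructBlipForConditionalGeneration.from_pretrained(model_id).to(device)

image = Image.open("example.jpg").convert("RGB")  # any local image
prompt = "What is unusual about this image?"      # free-form instruction

inputs = processor(images=image, text=prompt, return_tensors="pt").to(device)
output_ids = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0].strip())
```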
Related papers
- VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks [60.5257456681402]
We build universal embedding models capable of handling a wide range of downstream tasks.
Our contributions are twofold: (1) MMEB (Massive Multimodal Embedding Benchmark), which covers 4 meta-tasks (i.e. classification, visual question answering, multimodal retrieval, and visual grounding) and 36 datasets, including 20 training and 16 evaluation datasets, and (2) VLM2Vec (Vision-Language Model -> Vector), a contrastive training framework that converts any state-of-the-art vision-language model into an embedding model via training on MMEB.
arXiv Detail & Related papers (2024-10-07T16:14:05Z)
- MMInstruct: A High-Quality Multi-Modal Instruction Tuning Dataset with Extensive Diversity [80.02202386597138]
We construct a high-quality, diverse visual instruction tuning dataset MMInstruct, which consists of 973K instructions from 24 domains.
Our instruction generation engine enables semi-automatic, low-cost, and multi-domain instruction generation at a fraction of the cost of manual construction.
arXiv Detail & Related papers (2024-07-22T17:55:22Z)
- Generative Visual Instruction Tuning [11.727612242016871]
We propose to use automatically generated instruction-following data to improve the zero-shot capabilities of a large multimodal model.
We produce GenLLaVA, a Generative Large Language and Visual Assistant.
Our model demonstrates visual understanding capabilities superior to LLaVA and achieves results competitive with native multimodal models.
arXiv Detail & Related papers (2024-06-17T07:06:58Z)
- Reformulating Vision-Language Foundation Models and Datasets Towards Universal Multimodal Assistants [65.47222691674074]
The Muffin framework employs pre-trained vision-language models to act as providers of visual signals.
The UniMM-Chat dataset explores the complementarities of datasets to generate 1.1M high-quality and diverse multimodal instructions.
arXiv Detail & Related papers (2023-10-01T12:35:18Z)
- What Matters in Training a GPT4-Style Language Model with Multimodal Inputs? [24.676820488258336]
Large Language Models (LLMs) have displayed exceptional multi-modal capabilities in following open-ended instructions when given images.
These models rely on design choices such as network structures, training data, and training strategies.
This paper presents a systematic and comprehensive study, quantitatively and qualitatively, on training such models.
arXiv Detail & Related papers (2023-07-05T17:44:28Z)
- Mixture-of-Experts Meets Instruction Tuning: A Winning Combination for Large Language Models [125.91897197446379]
We find that MoE models benefit more from instruction tuning than dense models.
Our most powerful model, FLAN-MOE-32B, surpasses the performance of FLAN-PALM-62B on four benchmark tasks.
arXiv Detail & Related papers (2023-05-24T04:22:26Z)
- Otter: A Multi-Modal Model with In-Context Instruction Tuning [30.804061018682244]
We introduce instruction tuning into multi-modal models, motivated by the Flamingo model's upstream interleaved-format pretraining data.
We then introduce Otter, a multi-modal model based on OpenFlamingo (the open-sourced version of DeepMind's Flamingo), trained on the MIMIC-IT (Multi-Modal In-Context Instruction Tuning) dataset and showcasing improved instruction-following ability and in-context learning.
arXiv Detail & Related papers (2023-05-05T17:59:46Z)
- UViM: A Unified Modeling Approach for Vision with Learned Guiding Codes [91.24112204588353]
We introduce UViM, a unified approach capable of modeling a wide range of computer vision tasks.
In contrast to previous models, UViM has the same functional form for all tasks.
We demonstrate the effectiveness of UViM on three diverse and challenging vision tasks.
arXiv Detail & Related papers (2022-05-20T17:47:59Z)
This list is automatically generated from the titles and abstracts of the papers on this site.