SVIT: Scaling up Visual Instruction Tuning
- URL: http://arxiv.org/abs/2307.04087v3
- Date: Thu, 28 Dec 2023 16:01:36 GMT
- Title: SVIT: Scaling up Visual Instruction Tuning
- Authors: Bo Zhao, Boya Wu, Muyang He, Tiejun Huang
- Abstract summary: We build a dataset of 4.2 million visual instruction tuning samples, including 1.6M conversation question-answer (QA) pairs, 1.6M complex reasoning QA pairs, 1.0M referring QA pairs, and 106K detailed image descriptions.
Experiments verify that SVIT-v1.5, trained on the proposed dataset, outperforms state-of-the-art Multimodal Large Language Models on popular benchmarks.
- Score: 26.794950789335402
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Thanks to the emergence of foundation models, large language and vision
models are integrated to acquire multimodal abilities such as visual captioning,
question answering, etc. Although existing multimodal models demonstrate impressive
visual understanding and reasoning performance, their limits remain
largely under-explored due to the scarcity of high-quality instruction tuning
data. To push the limits of multimodal capability, we Scale up Visual
Instruction Tuning (SVIT) by constructing a dataset of 4.2 million visual
instruction tuning samples, including 1.6M conversation question-answer (QA) pairs,
1.6M complex reasoning QA pairs, 1.0M referring QA pairs, and 106K detailed
image descriptions. Beyond its volume, the proposed dataset also features
high quality and rich diversity, as it is generated by prompting GPT-4
with the abundant manual annotations of the images. We also propose a new data
recipe for selecting a subset with better diversity and balance, which elicits the
model's superior capabilities. Extensive experiments verify that SVIT-v1.5, trained on
the proposed dataset, outperforms state-of-the-art Multimodal Large Language
Models on popular benchmarks. The data and code are publicly available at
https://github.com/BAAI-DCAI/Visual-Instruction-Tuning.
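The abstract mentions a data recipe for selecting a subset with better diversity and balance but does not describe it, so the following is an illustration only: a minimal Python sketch of category-balanced, image-diverse sampling over the four SVIT data types listed above. The file name svit.json and the category/image_id fields are assumptions for the sketch, not the actual SVIT schema or the authors' recipe.

```python
import json
import random
from collections import defaultdict

# Hypothetical record format: each sample is assumed to carry a "category" field
# (conversation / complex_reasoning / referring_qa / detail_description) and an
# "image_id" field. The released SVIT data may use different names.

def balanced_subset(samples, per_category=50_000, seed=0):
    """Draw an equal number of samples from each instruction category,
    spreading the draw across distinct images to keep the subset diverse."""
    rng = random.Random(seed)
    by_category = defaultdict(list)
    for s in samples:
        by_category[s["category"]].append(s)

    subset = []
    for category, items in by_category.items():
        # Group by image so no single image dominates a category.
        by_image = defaultdict(list)
        for item in items:
            by_image[item["image_id"]].append(item)
        images = list(by_image)
        rng.shuffle(images)

        picked = []
        while images and len(picked) < per_category:
            # Round-robin over images, taking one sample per image per pass.
            for image_id in images:
                picked.append(by_image[image_id].pop())
                if len(picked) >= per_category:
                    break
            images = [i for i in images if by_image[i]]
        subset.extend(picked)

    rng.shuffle(subset)
    return subset

if __name__ == "__main__":
    with open("svit.json") as f:  # assumed local copy of the released data
        data = json.load(f)
    mix = balanced_subset(data, per_category=50_000)
    print(len(mix), "samples selected")
```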
Related papers
- VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks [60.5257456681402]
We build universal embedding models capable of handling a wide range of downstream tasks.
Our contributions are twofold: (1) MMEB (Massive Multimodal Embedding Benchmark), which covers 4 meta-tasks (i.e. classification, visual question answering, multimodal retrieval, and visual grounding) and 36 datasets, including 20 training and 16 evaluation datasets, and (2) VLM2Vec (Vision-Language Model -> Vector), a contrastive training framework that converts any state-of-the-art vision-language model into an embedding model via training on MMEB.
arXiv Detail & Related papers (2024-10-07T16:14:05Z) - MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning [74.34171839925114]
We present MM1.5, a new family of multimodal large language models (MLLMs).
Our models range from 1B to 30B parameters, encompassing both dense and mixture-of-experts (MoE) variants.
We provide detailed insights into the training processes and decisions that inform our final designs.
arXiv Detail & Related papers (2024-09-30T17:59:34Z) - MMInstruct: A High-Quality Multi-Modal Instruction Tuning Dataset with Extensive Diversity [80.02202386597138]
We construct MMInstruct, a high-quality and diverse visual instruction tuning dataset consisting of 973K instructions from 24 domains.
Our instruction generation engine enables semi-automatic, low-cost, and multi-domain instruction generation at a fraction of the cost of manual construction.
arXiv Detail & Related papers (2024-07-22T17:55:22Z) - Multi-modal Preference Alignment Remedies Degradation of Visual Instruction Tuning on Language Models [7.056824589733873]
Multi-modal large language models (MLLMs) are expected to support multi-turn queries that interleave image and text modalities in production.
Current MLLMs trained with visual question-answering datasets can suffer from degradation of their language capabilities.
We propose a distillation-based multi-modal alignment model with fine-grained annotations on a small dataset that restores and boosts the MLLM's language capability after visual instruction tuning.
arXiv Detail & Related papers (2024-02-16T18:42:08Z) - Reformulating Vision-Language Foundation Models and Datasets Towards
Universal Multimodal Assistants [65.47222691674074]
The Muffin framework employs pre-trained vision-language models to act as providers of visual signals.
The UniMM-Chat dataset explores the complementarities of datasets to generate 1.1M high-quality and diverse multimodal instructions.
arXiv Detail & Related papers (2023-10-01T12:35:18Z) - StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized
Image-Dialogue Data [129.92449761766025]
We propose a novel data collection methodology that synchronously synthesizes images and dialogues for visual instruction tuning.
This approach harnesses the power of generative models, marrying the abilities of ChatGPT and text-to-image generative models.
Our research includes comprehensive experiments conducted on various datasets.
arXiv Detail & Related papers (2023-08-20T12:43:52Z) - Enabling Multimodal Generation on CLIP via Vision-Language Knowledge
Distillation [79.72299298976525]
We propose to augment a vision-language pre-training model with a textual pre-trained language model (PLM) via vision-language knowledge distillation (VLKD).
Experiments show that the resulting model has strong zero-shot performance on multimodal generation tasks, such as open-ended visual question answering and image captioning.
The original textual language understanding and generation ability of the PLM is maintained after VLKD, which makes our model versatile for both multimodal and unimodal tasks.
arXiv Detail & Related papers (2022-03-12T09:33:37Z)
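For readers unfamiliar with the distillation step named in the last entry, the sketch below shows generic temperature-scaled knowledge distillation of a frozen teacher into a trainable student in PyTorch. It illustrates the general technique only; the concrete VLKD objectives (e.g., which vision-language and text representations are aligned) are not described in the summary above and are not reproduced here.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Generic temperature-scaled KL distillation loss.

    Illustrates knowledge distillation in the abstract sense used by
    VLKD-style methods; the actual VLKD training objectives are defined
    in the paper, not here.
    """
    t = temperature
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    # KL(teacher || student), scaled by T^2 as is conventional for distillation.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * (t * t)

# Toy usage with random tensors standing in for model outputs.
if __name__ == "__main__":
    torch.manual_seed(0)
    teacher_logits = torch.randn(8, 1000)                        # frozen teacher head
    student_logits = torch.randn(8, 1000, requires_grad=True)    # trainable student head
    loss = distillation_loss(student_logits, teacher_logits)
    loss.backward()
    print(float(loss))
```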