SVIT: Scaling up Visual Instruction Tuning
- URL: http://arxiv.org/abs/2307.04087v3
- Date: Thu, 28 Dec 2023 16:01:36 GMT
- Title: SVIT: Scaling up Visual Instruction Tuning
- Authors: Bo Zhao, Boya Wu, Muyang He, Tiejun Huang
- Abstract summary: We build a dataset of 4.2 million visual instruction tuning samples, including 1.6M conversation question-answer (QA) pairs, 1.6M complex reasoning QA pairs, 1.0M referring QA pairs, and 106K detailed image descriptions.
Experiments verify that SVIT-v1.5, trained on the proposed dataset, outperforms state-of-the-art Multimodal Large Language Models on popular benchmarks.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Thanks to the emergence of foundation models, large language and vision
models have been integrated to acquire multimodal abilities such as visual
captioning and question answering. Although existing multimodal models show
impressive performance in visual understanding and reasoning, their limits remain
largely under-explored due to the scarcity of high-quality instruction tuning
data. To push the limits of multimodal capability, we Scale up Visual
Instruction Tuning (SVIT) by constructing a dataset of 4.2 million visual
instruction tuning samples, including 1.6M conversation question-answer (QA) pairs,
1.6M complex reasoning QA pairs, 1.0M referring QA pairs, and 106K detailed
image descriptions. Beyond its volume, the proposed dataset also features
high quality and rich diversity, being generated by prompting GPT-4
with the abundant manual annotations of the images. We further propose a new data
recipe for selecting a subset with better diversity and balance, which elicits the
model's superior capabilities. Extensive experiments verify that SVIT-v1.5, trained on
the proposed dataset, outperforms state-of-the-art Multimodal Large Language
Models on popular benchmarks. The data and code are publicly available at
https://github.com/BAAI-DCAI/Visual-Instruction-Tuning.
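The abstract mentions a data recipe for selecting a subset with better diversity and balance, but does not spell it out here. As a hedged illustration only — not the paper's actual recipe — one common way to realize both goals is to give each task category an equal share (balance) and to pick diverse items within each category by greedy farthest-point sampling in an embedding space. All names below are hypothetical:

```python
import numpy as np

def select_subset(embeddings, categories, k, seed=0):
    """Pick ~k items that are balanced across task categories and
    diverse within each category. Illustrative sketch, not SVIT's recipe.

    embeddings: (n, d) array of instruction embeddings
    categories: length-n sequence of task-type labels
    k: target subset size
    """
    rng = np.random.default_rng(seed)
    categories = np.asarray(categories)
    cats = sorted(set(categories.tolist()))
    per_cat = max(1, k // len(cats))  # balance: equal share per task type
    chosen = []
    for cat in cats:
        idx = np.flatnonzero(categories == cat)
        # seed with a random item, then grow by farthest-point sampling
        picked = [int(idx[rng.integers(len(idx))])]
        while len(picked) < min(per_cat, len(idx)):
            cand = np.setdiff1d(idx, picked)
            # distance from each candidate to its nearest already-picked item
            dists = np.linalg.norm(
                embeddings[cand][:, None, :] - embeddings[picked][None, :, :],
                axis=-1,
            ).min(axis=1)
            # diversity: take the candidate farthest from the current set
            picked.append(int(cand[np.argmax(dists)]))
        chosen.extend(picked)
    return chosen
```

Farthest-point sampling is one standard stand-in for "diversity"; the actual SVIT recipe may weight categories or score items differently.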
Related papers
- MMInstruct: A High-Quality Multi-Modal Instruction Tuning Dataset with Extensive Diversity [80.02202386597138]
We construct a high-quality, diverse visual instruction tuning dataset MMInstruct, which consists of 973K instructions from 24 domains.
Our instruction generation engine enables semi-automatic, low-cost, and multi-domain instruction generation at a fraction of the cost of manual construction.
arXiv Detail & Related papers (2024-07-22T17:55:22Z) - Less is More: Data Value Estimation for Visual Instruction Tuning [127.38740043393527]
We propose a new data selection approach to eliminate redundancy within visual instruction data.
Experiments on LLaVA-1.5 show that our approach, using only about 7.5% of the data, achieves performance comparable to the full-data fine-tuned model.
arXiv Detail & Related papers (2024-03-14T16:47:25Z) - Multi-modal Auto-regressive Modeling via Visual Words [96.25078866446053]
We propose the concept of visual words, which maps the visual features to probability distributions over Large Multi-modal Models' vocabulary.
We further explore the distribution of visual features in the semantic space within LMM and the possibility of using text embeddings to represent visual information.
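The "visual words" idea above — mapping visual features to probability distributions over the LLM's vocabulary — can be sketched as a softmax over similarities between visual features and the vocabulary embedding matrix. This is an illustrative sketch under that assumption, not the paper's exact formulation; all names are hypothetical:

```python
import numpy as np

def visual_words(visual_feats, vocab_embed, temperature=1.0):
    """Map visual features (n, d) to probability distributions over a
    vocabulary whose embedding matrix is (V, d). Illustrative sketch.
    """
    # similarity of each visual feature to every vocabulary embedding
    logits = visual_feats @ vocab_embed.T / temperature   # (n, V)
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=-1, keepdims=True)              # rows sum to 1
```

Each row is then a "visual word": a soft assignment of a visual feature to tokens the LMM already understands.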
arXiv Detail & Related papers (2024-03-12T14:58:52Z) - Multi-modal preference alignment remedies regression of visual instruction tuning on language model [7.9311636400991485]
We propose a distillation-based multi-modal alignment model with fine-grained annotations on a small dataset to restore language capability after visual instruction tuning.
Our findings indicate that with DPO we are able to surpass the instruction-following capabilities of the base language model, achieving a 6.73 score on MT-Bench, compared to Vicuna's 6.57 and LLaVA's 5.99, despite the small data scale.
arXiv Detail & Related papers (2024-02-16T18:42:08Z) - Reformulating Vision-Language Foundation Models and Datasets Towards Universal Multimodal Assistants [65.47222691674074]
The Muffin framework employs pre-trained vision-language models as providers of visual signals.
The UniMM-Chat dataset exploits the complementarity of existing datasets to generate 1.1M high-quality and diverse multimodal instructions.
arXiv Detail & Related papers (2023-10-01T12:35:18Z) - StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data [129.92449761766025]
We propose a novel data collection methodology that synchronously synthesizes images and dialogues for visual instruction tuning.
This approach harnesses the power of generative models, marrying the abilities of ChatGPT and text-to-image generative models.
Our research includes comprehensive experiments conducted on various datasets.
arXiv Detail & Related papers (2023-08-20T12:43:52Z) - Enabling Multimodal Generation on CLIP via Vision-Language Knowledge Distillation [79.72299298976525]
We propose to augment a vision-language pre-training model with a textual pre-trained language model (PLM) via vision-language knowledge distillation (VLKD).
Experiments show that the resulting model has strong zero-shot performance on multimodal generation tasks, such as open-ended visual question answering and image captioning.
The original textual language understanding and generation ability of the PLM is maintained after VLKD, which makes our model versatile for both multimodal and unimodal tasks.
arXiv Detail & Related papers (2022-03-12T09:33:37Z)
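To make the VLKD idea above concrete, here is a minimal sketch of one plausible alignment objective: project PLM sentence features into the CLIP embedding space and penalize the cosine distance to the frozen CLIP features. The projection, feature names, and loss choice are assumptions for illustration, not the paper's actual training objective:

```python
import numpy as np

def vlkd_alignment_loss(plm_feats, clip_feats, proj):
    """Toy sketch of vision-language knowledge distillation:
    align projected PLM features (n, d_plm) with frozen CLIP text
    features (n, d_clip) via mean cosine distance. Illustrative only.
    """
    z = plm_feats @ proj                                   # project into CLIP space
    z = z / np.linalg.norm(z, axis=-1, keepdims=True)      # unit-normalize student
    t = clip_feats / np.linalg.norm(clip_feats, axis=-1, keepdims=True)
    # cosine distance per pair, averaged over the batch; 0 means perfect alignment
    return float(np.mean(1.0 - np.sum(z * t, axis=-1)))
```

In the real method the PLM keeps its text-only abilities while gaining multimodal ones; this sketch only shows the alignment direction of the distillation.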