MMInstruct: A High-Quality Multi-Modal Instruction Tuning Dataset with Extensive Diversity
- URL: http://arxiv.org/abs/2407.15838v2
- Date: Wed, 7 Aug 2024 09:34:25 GMT
- Title: MMInstruct: A High-Quality Multi-Modal Instruction Tuning Dataset with Extensive Diversity
- Authors: Yangzhou Liu, Yue Cao, Zhangwei Gao, Weiyun Wang, Zhe Chen, Wenhai Wang, Hao Tian, Lewei Lu, Xizhou Zhu, Tong Lu, Yu Qiao, Jifeng Dai
- Abstract summary: We construct a high-quality, diverse visual instruction tuning dataset MMInstruct, which consists of 973K instructions from 24 domains.
Our instruction generation engine enables semi-automatic, low-cost, and multi-domain instruction generation at 1/6 the cost of manual construction.
- Score: 80.02202386597138
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-language supervised fine-tuning effectively enhances the performance of Vision Large Language Models (VLLMs). However, existing visual instruction tuning datasets have the following limitations: (1) Instruction annotation quality: although existing VLLMs exhibit strong performance, instructions generated by those advanced VLLMs may still suffer from inaccuracies such as hallucinations. (2) Instruction and image diversity: the limited range of instruction types and the lack of diversity in image data may limit the model's ability to generate diverse outputs that better reflect real-world scenarios. To address these challenges, we construct MMInstruct, a high-quality, diverse visual instruction tuning dataset consisting of 973K instructions from 24 domains. There are four instruction types: Judgement, Multiple-Choice, Long Visual Question Answering, and Short Visual Question Answering. To construct MMInstruct, we propose an instruction generation data engine that leverages GPT-4V, GPT-3.5, and manual correction. Our instruction generation engine enables semi-automatic, low-cost, and multi-domain instruction generation at 1/6 the cost of manual construction. Through extensive validation and ablation experiments, we demonstrate that MMInstruct can significantly improve the performance of VLLMs; e.g., a model fine-tuned on MMInstruct achieves new state-of-the-art performance on 10 out of 12 benchmarks. The code and data are available at https://github.com/yuecao0119/MMInstruct.
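The data engine described in the abstract is a multi-stage pipeline: GPT-4V drafts image descriptions, GPT-3.5 expands them into questions of four types (Judgement, Multiple-Choice, Long VQA, Short VQA), and human annotators correct the drafts. The abstract does not include code, so the following is a minimal, hypothetical Python sketch of what one instruction record and a generation loop might look like; every name here (InstructionSample, build_samples, the stub helpers) is an illustrative assumption, not the authors' implementation.

```python
# Hypothetical sketch of an MMInstruct-style instruction record and a
# semi-automatic generation loop; the real engine (GPT-4V captioning,
# GPT-3.5 question expansion, manual review) is described in the paper.
from dataclasses import dataclass, field
from typing import List

INSTRUCTION_TYPES = (
    "judgement",        # yes/no verification questions
    "multiple_choice",  # a question plus candidate options
    "long_vqa",         # free-form, detailed answers
    "short_vqa",        # concise, factual answers
)

@dataclass
class InstructionSample:
    image_path: str
    domain: str                       # one of the 24 domains, e.g. "sports"
    instruction_type: str             # one of INSTRUCTION_TYPES
    question: str
    answer: str
    options: List[str] = field(default_factory=list)  # used by multiple_choice
    verified: bool = False            # flipped to True after manual correction

def build_samples(image_path, domain, caption_fn, expand_fn):
    """Draft QA pairs for one image; a later human pass sets `verified`."""
    caption = caption_fn(image_path)              # e.g. a GPT-4V description
    samples = []
    for itype in INSTRUCTION_TYPES:
        for question, answer, options in expand_fn(caption, domain, itype):
            samples.append(InstructionSample(
                image_path=image_path, domain=domain, instruction_type=itype,
                question=question, answer=answer, options=list(options)))
    return samples

if __name__ == "__main__":
    # Stub helpers so the sketch runs without any API access.
    demo_caption = lambda path: "A soccer player kicks a ball toward the goal."
    demo_expand = lambda cap, dom, itype: [("What sport is shown?", "Soccer", [])]
    for s in build_samples("demo.jpg", "sports", demo_caption, demo_expand):
        print(s.instruction_type, s.question, "->", s.answer)
```

The manual-correction stage corresponds to the verified flag in this sketch: model-generated drafts are reviewed and fixed rather than written from scratch, which is presumably how the engine reaches roughly 1/6 the cost of fully manual construction.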
Related papers
- Align$^2$LLaVA: Cascaded Human and Large Language Model Preference Alignment for Multi-modal Instruction Curation [56.75665429851673]
This paper introduces a novel instruction curation algorithm, derived from two unique perspectives, human and LLM preference alignment.
Experiments demonstrate that we can maintain or even improve model performance by compressing synthetic multimodal instructions by up to 90%.
arXiv Detail & Related papers (2024-09-27T08:20:59Z)
- MM-Instruct: Generated Visual Instructions for Large Multimodal Model Alignment [39.407235223184195]
MM-Instruct is a large-scale dataset of diverse and high-quality visual instruction data.
It is designed to enhance the instruction-following capabilities of large multimodal models.
arXiv Detail & Related papers (2024-06-28T08:25:27Z)
- Towards Robust Instruction Tuning on Multimodal Large Language Models [25.506776502317436]
In this work, we introduce an automatic instruction augmentation method named INSTRAUG in multimodal tasks.
Results on two popular multimodal instruction-following benchmarks show that INSTRAUG can significantly improve the alignment of multimodal large language models (MLLMs) across 12 multimodal tasks.
arXiv Detail & Related papers (2024-02-22T12:35:50Z)
- Vision-Flan: Scaling Human-Labeled Tasks in Visual Instruction Tuning [53.93074108238167]
We construct Vision-Flan, the most diverse publicly available visual instruction tuning dataset to date.
We propose a two-stage instruction tuning framework, in which VLMs are finetuned on Vision-Flan and further tuned on GPT-4 synthesized data.
We find this two-stage tuning framework significantly outperforms the traditional single-stage visual instruction tuning framework.
arXiv Detail & Related papers (2024-02-18T19:38:44Z)
- What Makes for Good Visual Instructions? Synthesizing Complex Visual Reasoning Instructions for Visual Instruction Tuning [115.19451843294154]
Visual instruction tuning is an essential approach to improving the zero-shot generalization capability of Multi-modal Large Language Models (MLLMs).
We propose a systematic approach to automatically creating high-quality complex visual reasoning instructions.
Our dataset consistently enhances the performance of all the compared MLLMs, e.g., improving the performance of MiniGPT-4 and BLIP-2 on MME-Cognition by 32.6% and 28.8%, respectively.
arXiv Detail & Related papers (2023-11-02T15:36:12Z)
- SVIT: Scaling up Visual Instruction Tuning [26.794950789335402]
We build a dataset of 4.2 million visual instruction tuning samples, including 1.6M conversation question-answer (QA) pairs, 1.6M complex reasoning QA pairs, 1.0M referring QA pairs, and 106K detailed image descriptions.
Experiments verify that SVIT-v1.5, trained on the proposed dataset, outperforms state-of-the-art Multimodal Large Language Models on popular benchmarks.
arXiv Detail & Related papers (2023-07-09T03:25:14Z)
- Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning [92.85265959892115]
This paper introduces the first large and diverse visual instruction tuning dataset, named Large-scale Robust Visual (LRV)-Instruction.
Our dataset comprises 400k visual instructions generated by GPT4, covering 16 vision-and-language tasks with open-ended instructions and answers.
To efficiently measure the hallucination generated by LMMs, we propose GPT4-Assisted Visual Instruction Evaluation (GAVIE), a stable approach to evaluate visual instruction tuning like human experts.
arXiv Detail & Related papers (2023-06-26T10:26:33Z)
- Dynosaur: A Dynamic Growth Paradigm for Instruction-Tuning Data Curation [92.2167864437497]
We propose Dynosaur, a dynamic growth paradigm for the automatic curation of instruction-tuning data.
Based on the metadata of existing datasets, we use LLMs to automatically construct instruction-tuning data by identifying relevant data fields and generating appropriate instructions.
By leveraging existing annotated datasets, Dynosaur offers several advantages: 1) it reduces the API cost for generating instructions; 2) it provides high-quality data for instruction tuning; and 3) it supports the continuous improvement of models by generating instruction-tuning data whenever a new annotated dataset becomes available (a minimal sketch of this metadata-driven construction appears after this list).
arXiv Detail & Related papers (2023-05-23T17:56:26Z)
- MultiInstruct: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning [24.741736629886564]
Instruction tuning is a new learning paradigm that fine-tunes pre-trained language models on tasks specified through instructions.
We introduce MULTIINSTRUCT, the first multimodal instruction tuning benchmark dataset.
We show strong zero-shot performance on various unseen multimodal tasks and the benefit of transfer learning from a text-only instruction dataset.
arXiv Detail & Related papers (2022-12-21T05:17:06Z)
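As referenced in the Dynosaur entry above, the following is a minimal, hypothetical Python sketch of metadata-driven curation: a language model (stubbed here as propose_task) reads a dataset's metadata, decides which fields serve as input and output, and writes an instruction; each existing annotated record is then converted into an instruction-tuning example. The names and the stub are illustrative assumptions, not the Dynosaur implementation.

```python
# Hypothetical sketch of Dynosaur-style curation: turn an existing annotated
# dataset (described by its metadata) into instruction-tuning examples.
from typing import Callable, Dict, List

def propose_task(metadata: Dict) -> Dict:
    """Stub for the LLM step: map dataset fields to an instruction template."""
    # A real system would prompt an LLM with `metadata` here.
    return {
        "instruction": f"Answer the question about the given text from {metadata['name']}.",
        "input_fields": ["context", "question"],
        "output_field": "answer",
    }

def curate(metadata: Dict, records: List[Dict],
           task_fn: Callable[[Dict], Dict] = propose_task) -> List[Dict]:
    """Convert raw annotated records into (instruction, input, output) examples."""
    task = task_fn(metadata)
    examples = []
    for rec in records:
        examples.append({
            "instruction": task["instruction"],
            "input": "\n".join(str(rec[f]) for f in task["input_fields"]),
            "output": str(rec[task["output_field"]]),
        })
    return examples

if __name__ == "__main__":
    meta = {"name": "toy_qa", "fields": ["context", "question", "answer"]}
    data = [{"context": "Paris is the capital of France.",
             "question": "What is the capital of France?",
             "answer": "Paris"}]
    for ex in curate(meta, data):
        print(ex["instruction"], "|", ex["input"].replace("\n", " "), "->", ex["output"])
```

Because the LLM is consulted once per dataset rather than once per record in this sketch, API cost grows with the number of datasets instead of the number of examples, which is consistent with the reduced API cost the Dynosaur abstract highlights.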
This list is automatically generated from the titles and abstracts of the papers in this site.