MM-Instruct: Generated Visual Instructions for Large Multimodal Model Alignment
- URL: http://arxiv.org/abs/2406.19736v1
- Date: Fri, 28 Jun 2024 08:25:27 GMT
- Title: MM-Instruct: Generated Visual Instructions for Large Multimodal Model Alignment
- Authors: Jihao Liu, Xin Huang, Jinliang Zheng, Boxiao Liu, Jia Wang, Osamu Yoshie, Yu Liu, Hongsheng Li
- Abstract summary: MM-Instruct is a large-scale dataset of diverse and high-quality visual instruction data.
It is designed to enhance the instruction-following capabilities of large multimodal models.
- Score: 39.407235223184195
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper introduces MM-Instruct, a large-scale dataset of diverse and high-quality visual instruction data designed to enhance the instruction-following capabilities of large multimodal models (LMMs). While existing visual instruction datasets often focus on question answering, they struggle to generalize to broader application scenarios such as creative writing, summarization, or image analysis. To address these limitations, we propose a novel approach to constructing MM-Instruct that leverages the strong instruction-following capabilities of existing LLMs to generate novel visual instruction data from large-scale but conventional image captioning datasets. MM-Instruct first leverages ChatGPT to automatically generate diverse instructions from a small set of seed instructions through augmentation and summarization. It then matches these instructions with images and uses an open-source large language model (LLM) to generate coherent answers for the instruction-image pairs. Throughout answer generation, the LLM is grounded in detailed text descriptions of the images to guarantee the alignment of the instruction data. Moreover, we introduce a benchmark based on the generated instruction data to evaluate the instruction-following capabilities of existing LMMs. We demonstrate the effectiveness of MM-Instruct by training a LLaVA-1.5 model on the generated data, denoted as LLaVA-Instruct, which exhibits significant improvements in instruction-following capabilities compared to LLaVA-1.5 models. The MM-Instruct dataset, benchmark, and pre-trained models are available at https://github.com/jihaonew/MM-Instruct.
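The abstract outlines a three-step generation pipeline: expand a small seed set of instructions with ChatGPT, match the new instructions to captioned images, and have an open-source LLM write answers grounded in the image descriptions. The sketch below is a minimal illustration of that flow, not the released implementation: the prompts, the keyword-overlap matcher, and the stubbed local LLM are illustrative assumptions.

```python
from openai import OpenAI  # assumed client for the "ChatGPT" step; any chat LLM would do

client = OpenAI()

def augment_instructions(seed_instructions, n_new=10):
    """Step 1: expand a small seed set into diverse instructions (augment + summarize)."""
    prompt = (
        "Here are example visual instructions:\n"
        + "\n".join(f"- {s}" for s in seed_instructions)
        + f"\nWrite {n_new} new, diverse instructions about an image, one per line."
    )
    reply = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    lines = reply.choices[0].message.content.splitlines()
    return [line.strip("- ").strip() for line in lines if line.strip()]

def match_instruction(instruction, captioned_images):
    """Step 2 (toy matcher): pick the image whose caption shares the most words
    with the instruction; the paper's actual matching strategy is not specified here."""
    def overlap(caption):
        return len(set(caption.lower().split()) & set(instruction.lower().split()))
    return max(captioned_images, key=lambda item: overlap(item["caption"]))

def generate_answer(instruction, caption, open_llm):
    """Step 3: an open-source LLM answers the instruction, grounded only in the
    detailed text description of the image (no pixels are shown to this LLM)."""
    grounded_prompt = (
        f"Image description: {caption}\n"
        f"Instruction: {instruction}\n"
        "Answer the instruction using only the description above."
    )
    return open_llm(grounded_prompt)  # e.g. a local transformers text-generation pipeline

# Toy usage with one captioned image and a stubbed open-source LLM.
images = [{"image_id": 1, "caption": "A golden retriever catches a frisbee on a sunny beach."}]
open_llm = lambda p: "(answer generated by a local LLM)"
for inst in augment_instructions(["Write a short story inspired by this image."], n_new=3):
    img = match_instruction(inst, images)
    print(inst, "->", generate_answer(inst, img["caption"], open_llm))
```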
Related papers
- Align$^2$LLaVA: Cascaded Human and Large Language Model Preference Alignment for Multi-modal Instruction Curation [56.75665429851673] (arXiv 2024-09-27)
This paper introduces a novel instruction curation algorithm derived from two unique perspectives: human preference alignment and LLM preference alignment.
Experiments demonstrate that we can maintain or even improve model performance by compressing synthetic multimodal instructions by up to 90%.
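The snippet above only names a cascade of human- and LLM-preference alignment; a minimal sketch of what such a two-stage curation filter could look like is given below. The scorers and thresholds are hypothetical placeholders, not Align$^2$LLaVA's actual components.

```python
def cascade_filter(instructions, human_reward, llm_judge,
                   human_threshold=0.5, llm_threshold=0.5):
    """Cascaded curation sketch: keep an instruction only if it first passes a
    human-preference-aligned scorer and then an LLM-preference scorer.
    Both scorers and both thresholds are hypothetical stand-ins."""
    kept = []
    for inst in instructions:
        if human_reward(inst) < human_threshold:
            continue  # stage 1: drop samples misaligned with human preference
        if llm_judge(inst) < llm_threshold:
            continue  # stage 2: drop samples misaligned with LLM preference
        kept.append(inst)
    return kept

# Toy usage: length and punctuation checks stand in for learned preference scorers.
data = ["Describe the image.", "asdf", "Summarize the scene in two sentences."]
kept = cascade_filter(data,
                      human_reward=lambda s: min(len(s) / 40, 1.0),
                      llm_judge=lambda s: 1.0 if s.endswith(".") else 0.0)
print(kept)  # a compressed subset, analogous to the large compression reported above
```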
- MMInstruct: A High-Quality Multi-Modal Instruction Tuning Dataset with Extensive Diversity [80.02202386597138] (arXiv 2024-07-22)
We construct a high-quality, diverse visual instruction tuning dataset MMInstruct, which consists of 973K instructions from 24 domains.
Our instruction generation engine enables semi-automatic, low-cost, and multi-domain instruction generation at a fraction of the cost of manual construction.
- Generative Visual Instruction Tuning [11.727612242016871] (arXiv 2024-06-17)
We propose to use automatically generated instruction-following data to improve the zero-shot capabilities of a large multimodal model.
We produce GenLLaVA, a Generative Large Language and Visual Assistant.
Our model demonstrates visual understanding capabilities superior to LLaVA and achieves results competitive with native multimodal models.
- CodecLM: Aligning Language Models with Tailored Synthetic Data [51.59223474427153] (arXiv 2024-04-08)
We introduce CodecLM, a framework for adaptively generating high-quality synthetic data for instruction-following abilities.
We first encode seed instructions into metadata, which are concise keywords generated on-the-fly to capture the target instruction distribution.
We also introduce Self-Rubrics and Contrastive Filtering during decoding to tailor data-efficient samples.
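Based only on this description, CodecLM's encode-decode idea could be sketched as follows; the prompts, the stubbed LLM calls, and the single judge standing in for Self-Rubrics and Contrastive Filtering are illustrative assumptions, not the paper's implementation.

```python
def encode_to_metadata(seed_instruction, keyword_llm):
    """Encode step: compress a seed instruction into concise keywords
    that capture the target instruction distribution."""
    return keyword_llm(f"List 2-3 keywords describing this instruction: {seed_instruction}")

def decode_from_metadata(metadata, gen_llm, judge, n_candidates=4):
    """Decode step: synthesize candidate instructions from the metadata, then keep
    the highest-scoring one. The judge is a stand-in for Self-Rubrics /
    Contrastive Filtering, whose exact criteria are not given in this snippet."""
    candidates = [gen_llm(f"Write an instruction for: {metadata} (variant {i})")
                  for i in range(n_candidates)]
    return max(candidates, key=judge)

# Toy usage with stubbed LLM calls and a trivial quality score.
keyword_llm = lambda p: "creative writing, image description"
gen_llm = lambda p: p.split(": ", 1)[1]
judge = len  # hypothetical quality score
meta = encode_to_metadata("Write a poem about the photo.", keyword_llm)
print(decode_from_metadata(meta, gen_llm, judge))
```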
- Towards Robust Instruction Tuning on Multimodal Large Language Models [25.506776502317436] (arXiv 2024-02-22)
In this work, we introduce INSTRAUG, an automatic instruction augmentation method for multimodal tasks.
Results on two popular multimodal instruction-following benchmarks show that INSTRAUG can significantly improve the alignment of multimodal large language models (MLLMs) across 12 multimodal tasks.
- Text as Images: Can Multimodal Large Language Models Follow Printed Instructions in Pixels? [158.96530466189986] (arXiv 2023-11-29)
Multimodal large language models (MLLMs) have shown promising instruction-following capabilities on vision-language tasks.
We investigate how well multimodal models can understand textual instructions provided in pixels, despite not being explicitly trained on such data during pretraining or fine-tuning.
We train v-MLLM, a generalizable model that is capable of robust instruction following with both text-modality and visual-modality instructions.
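The evaluation setting described here, with instructions delivered as pixels rather than tokens, can be reproduced in miniature by rasterizing the instruction text and stacking it with the content image. The sketch below assumes Pillow, and the layout details are made up for illustration; it is not the paper's rendering procedure.

```python
from PIL import Image, ImageDraw  # assumes Pillow is installed

def render_instruction_as_image(instruction, width=448, height=64):
    """Rasterize a textual instruction so it can be fed to an MLLM as pixels
    instead of tokens, mimicking the 'printed instructions' setting."""
    canvas = Image.new("RGB", (width, height), "white")
    ImageDraw.Draw(canvas).text((8, 8), instruction, fill="black")
    return canvas

def stack_with_image(instruction_img, content_img):
    """Place the rendered instruction above the content image so a single
    visual input carries both the task and the scene."""
    width = max(instruction_img.width, content_img.width)
    out = Image.new("RGB", (width, instruction_img.height + content_img.height), "white")
    out.paste(instruction_img, (0, 0))
    out.paste(content_img, (0, instruction_img.height))
    return out

# Toy usage: a gray placeholder stands in for a real photo.
prompt_img = render_instruction_as_image("Describe the objects in the image.")
combined = stack_with_image(prompt_img, Image.new("RGB", (448, 448), "gray"))
combined.save("pixel_instruction_example.png")
```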
- Position-Enhanced Visual Instruction Tuning for Multimodal Large Language Models [50.07056960586183] (arXiv 2023-08-25)
We propose Position-enhanced Visual Instruction Tuning (PVIT), which extends the functionality of Multimodal Large Language Models (MLLMs) by integrating region-level visual information.
This integration promotes a more detailed comprehension of images for the MLLM.
We present both quantitative experiments and qualitative analysis that demonstrate the superiority of the proposed model.