To See is to Believe: Prompting GPT-4V for Better Visual Instruction
Tuning
- URL: http://arxiv.org/abs/2311.07574v2
- Date: Wed, 29 Nov 2023 15:37:24 GMT
- Title: To See is to Believe: Prompting GPT-4V for Better Visual Instruction
Tuning
- Authors: Junke Wang, Lingchen Meng, Zejia Weng, Bo He, Zuxuan Wu, Yu-Gang Jiang
- Abstract summary: We introduce a fine-grained visual instruction dataset, LVIS-Instruct4V, which contains 220K visually aligned and context-aware instructions.
By simply replacing LLaVA-Instruct with our LVIS-Instruct4V, we achieve better results than LLaVA on the most challenging LMM benchmarks.
- Score: 82.34463739289892
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing visual instruction tuning methods typically prompt large language
models with textual descriptions to generate instruction-following data.
Despite the promising performance achieved, these descriptions are derived from
image annotations, which are oftentimes coarse-grained. Furthermore, because
they are produced without observing the entire visual context, the generated
instructions may even contradict the visual content. To address this
challenge, we introduce a fine-grained
visual instruction dataset, LVIS-Instruct4V, which contains 220K visually
aligned and context-aware instructions produced by prompting the powerful
GPT-4V with images from LVIS. Through experimental validation and case studies,
we demonstrate that high-quality visual instruction data can improve the
performance of LLaVA-1.5, a state-of-the-art large multimodal model, across a
wide spectrum of benchmarks by clear margins. Notably, by simply replacing
LLaVA-Instruct with our LVIS-Instruct4V, we achieve better results than LLaVA
on the most challenging LMM benchmarks, e.g., LLaVA$^w$ (76.7 vs. 70.7) and MM-Vet
(40.2 vs. 35.4). We release our data and model at
https://github.com/X2FD/LVIS-INSTRUCT4V.
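The core recipe above is to prompt GPT-4V directly with an image, rather than with text-only annotations, so that the generated instructions stay grounded in the full visual context. A minimal sketch of such a query with the OpenAI Python client is shown below; the prompt wording, model name, and output handling are illustrative assumptions, not the authors' released pipeline.
```python
# Minimal sketch: prompt GPT-4V with an LVIS image to elicit fine-grained,
# visually grounded instruction-answer pairs. Prompt text and model name are
# assumptions for illustration, not the paper's exact pipeline.
import base64
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

PROMPT = (
    "You are given an image. Generate several instruction-answer pairs that "
    "require close inspection of the image (fine-grained attributes, spatial "
    "relations, counts). Every answer must be verifiable from the image alone."
)

def generate_instructions(image_path: str) -> str:
    """Return GPT-4V-generated instruction-answer pairs for one image."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",  # any GPT-4V-capable model id
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": PROMPT},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
        max_tokens=1024,
    )
    return response.choices[0].message.content

# Example: iterate over LVIS images and collect the generated pairs.
# pairs = generate_instructions("lvis/images/000000123456.jpg")
```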
Related papers
- MMInstruct: A High-Quality Multi-Modal Instruction Tuning Dataset with Extensive Diversity [80.02202386597138]
We construct a high-quality, diverse visual instruction tuning dataset MMInstruct, which consists of 973K instructions from 24 domains.
Our instruction generation engine enables semi-automatic, low-cost, multi-domain instruction generation at a fraction of the cost of manual construction.
arXiv Detail & Related papers (2024-07-22T17:55:22Z)
- MM-Instruct: Generated Visual Instructions for Large Multimodal Model Alignment [39.407235223184195]
MM-Instruct is a large-scale dataset of diverse and high-quality visual instruction data.
It is designed to enhance the instruction-following capabilities of large multimodal models.
arXiv Detail & Related papers (2024-06-28T08:25:27Z)
- Vision-Flan: Scaling Human-Labeled Tasks in Visual Instruction Tuning [53.93074108238167]
We construct Vision-Flan, the most diverse publicly available visual instruction tuning dataset to date.
We propose a two-stage instruction tuning framework, in which VLMs are finetuned on Vision-Flan and further tuned on GPT-4 synthesized data.
We find this two-stage tuning framework significantly outperforms the traditional single-stage visual instruction tuning framework.
arXiv Detail & Related papers (2024-02-18T19:38:44Z)
- Prompt4Vis: Prompting Large Language Models with Example Mining and Schema Filtering for Tabular Data Visualization [13.425454489560376]
We introduce Prompt4Vis, a framework for generating data visualization queries from natural language.
In-context learning is introduced into the text-to-vis task for generating data visualization queries.
Prompt4Vis surpasses the state-of-the-art (SOTA) RGVisNet by approximately 35.9% and 71.3% on dev and test sets, respectively.
arXiv Detail & Related papers (2024-01-29T10:23:47Z)
- LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding [85.39419609430453]
This work enhances the current visual instruction tuning pipeline with text-rich images.
We first use publicly available OCR tools to collect results on 422K text-rich images from the LAION dataset.
We prompt text-only GPT-4 with recognized texts and image captions to generate 16K conversations, each containing question-answer pairs for text-rich images.
arXiv Detail & Related papers (2023-06-29T17:08:16Z)
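The LLaVAR recipe summarized above is text-only on the generation side: OCR results and a caption stand in for the image when prompting GPT-4. A hedged sketch of that pattern follows; the OCR library (pytesseract), prompt wording, and model name are illustrative choices, not necessarily what the paper used.
```python
# Hedged sketch of a LLaVAR-style pipeline: run OCR on a text-rich image, then
# prompt a text-only GPT-4 with the recognized text plus a caption to produce
# question-answer conversations. OCR tool, prompt, and model name are
# illustrative assumptions, not the paper's exact setup.
import pytesseract
from PIL import Image
from openai import OpenAI

client = OpenAI()

def text_rich_conversation(image_path: str, caption: str) -> str:
    """Generate QA pairs about the text in an image the model never sees."""
    ocr_text = pytesseract.image_to_string(Image.open(image_path))
    prompt = (
        "An image has the caption below and contains the OCR-recognized text "
        "below. Write question-answer pairs that require reading that text.\n"
        f"Caption: {caption}\nOCR text: {ocr_text}"
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=512,
    )
    return response.choices[0].message.content
```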
- Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning [92.85265959892115]
This paper introduces the first large and diverse visual instruction tuning dataset, named Large-scale Robust Visual (LRV)-Instruction.
Our dataset comprises 400k visual instructions generated by GPT4, covering 16 vision-and-language tasks with open-ended instructions and answers.
To efficiently measure the hallucination produced by LMMs, we propose GPT4-Assisted Visual Instruction Evaluation (GAVIE), a stable approach that evaluates visual instruction tuning in a manner comparable to human experts.
arXiv Detail & Related papers (2023-06-26T10:26:33Z)
- Visual Instruction Tuning [79.70923292053097]
We present the first attempt to use language-only GPT-4 to generate multimodal language-image instruction-following data.
By instruction tuning on such generated data, we introduce LLaVA: Large Language and Vision Assistant.
When fine-tuned on Science QA, the synergy of LLaVA and GPT-4 achieves a new state-of-the-art accuracy of 92.53%.
arXiv Detail & Related papers (2023-04-17T17:59:25Z)
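For contrast with prompting GPT-4V on the pixels directly, the original LLaVA recipe summarized above feeds a language-only GPT-4 a symbolic description of the image. A minimal sketch is below, assuming the symbolic view is a list of captions plus object bounding boxes; the helper, prompt, and model name are illustrative assumptions.
```python
# Hedged sketch of LLaVA-style generation with a language-only model: the image
# is represented symbolically (captions + object boxes) and GPT-4 writes an
# instruction-following conversation about it. Prompt wording and model name
# are assumptions for illustration.
from openai import OpenAI

client = OpenAI()

def llava_style_conversation(captions: list[str], boxes: list[str]) -> str:
    """Generate a multi-turn conversation about an image GPT-4 never sees."""
    context = ("Captions:\n" + "\n".join(captions) +
               "\nObjects (x1,y1,x2,y2):\n" + "\n".join(boxes))
    prompt = (
        "Using only the symbolic description of an image below, write a "
        "multi-turn conversation between a user asking about the image and an "
        "assistant answering as if it can see it.\n" + context
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=1024,
    )
    return response.choices[0].message.content

# Example call with hypothetical annotations:
# llava_style_conversation(["a dog catching a frisbee in a park"],
#                          ["dog: 120,40,360,300", "frisbee: 300,20,380,90"])
```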