What Makes for Good Visual Instructions? Synthesizing Complex Visual
Reasoning Instructions for Visual Instruction Tuning
- URL: http://arxiv.org/abs/2311.01487v1
- Date: Thu, 2 Nov 2023 15:36:12 GMT
- Title: What Makes for Good Visual Instructions? Synthesizing Complex Visual
Reasoning Instructions for Visual Instruction Tuning
- Authors: Yifan Du, Hangyu Guo, Kun Zhou, Wayne Xin Zhao, Jinpeng Wang, Chuyuan
Wang, Mingchen Cai, Ruihua Song, Ji-Rong Wen
- Abstract summary: Visual instruction tuning is an essential approach to improving the zero-shot generalization capability of Multi-modal Large Language Models (MLLMs).
We propose a systematic approach to automatically creating high-quality complex visual reasoning instructions.
Our dataset consistently enhances the performance of all the compared MLLMs, e.g., improving the performance of MiniGPT-4 and BLIP-2 on MME-Cognition by 32.6% and 28.8%, respectively.
- Score: 115.19451843294154
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual instruction tuning is an essential approach to improving the zero-shot
generalization capability of Multi-modal Large Language Models (MLLMs). A surge
of visual instruction datasets with various focuses and characteristics has
been proposed recently, enabling MLLMs to achieve surprising results on
evaluation benchmarks. To develop more capable MLLMs, in this paper, we aim to
investigate a more fundamental question: "what makes for good visual
instructions?" By conducting a comprehensive empirical study, we find that
instructions focused on complex visual reasoning tasks are particularly
effective in improving the performance of MLLMs on evaluation benchmarks.
Building upon this finding, we design a systematic approach to automatically
creating high-quality complex visual reasoning instructions. Our approach
employs a synthesis-complication-reformulation paradigm, leveraging multiple
stages to gradually increase the complexity of the instructions while
guaranteeing quality. Based on this approach, we create the synthetic visual
reasoning instruction dataset consisting of 32K examples, namely ComVint, and
fine-tune four MLLMs on it. Experimental results demonstrate that our dataset
consistently enhances the performance of all the compared MLLMs, e.g.,
improving the performance of MiniGPT-4 and BLIP-2 on MME-Cognition by 32.6% and
28.8%, respectively. Our code and data are publicly available at the link:
https://github.com/RUCAIBox/ComVint.
Related papers
- MIA-Bench: Towards Better Instruction Following Evaluation of Multimodal LLMs [47.94710556156627]
MIA-Bench is a benchmark designed to evaluate multimodal large language models (MLLMs) on their ability to strictly adhere to complex instructions.
Our benchmark comprises a diverse set of 400 image-prompt pairs, each crafted to challenge the models' compliance with layered instructions.
arXiv Detail & Related papers (2024-07-01T17:53:35Z)
- Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs [56.391404083287235]
We introduce Cambrian-1, a family of multimodal LLMs (MLLMs) designed with a vision-centric approach.
Our study uses LLMs and visual instruction tuning as an interface to evaluate various visual representations.
We provide model weights, code, supporting tools, datasets, and detailed instruction-tuning and evaluation recipes.
arXiv Detail & Related papers (2024-06-24T17:59:42Z)
- Optimizing and Testing Instruction-Following: Analyzing the Impact of Fine-Grained Instruction Variants on instruction-tuned LLMs [27.321629102942754]
We introduce an effective data augmentation technique that decomposes complex instructions into simpler sub-components, modifies these, and reconstructs them into new variants.
Our findings show that LLMs fine-tuned with DeMoRecon gain a significant performance boost on both our benchmark and commonly used instruction-following benchmarks.
arXiv Detail & Related papers (2024-06-17T08:08:11Z)
- AvaTaR: Optimizing LLM Agents for Tool-Assisted Knowledge Retrieval [93.96463520716759]
Large language model (LLM) agents have demonstrated impressive capability in utilizing external tools and knowledge to boost accuracy and reduce hallucinations.
Here, we introduce AvaTaR, a novel framework that optimizes an LLM agent to effectively use the provided tools and improve its performance on a given task/domain.
We find AvaTaR consistently outperforms state-of-the-art approaches across all four challenging tasks and exhibits strong generalization ability when applied to novel cases.
arXiv Detail & Related papers (2024-06-17T04:20:02Z)
- Less is More: Data Value Estimation for Visual Instruction Tuning [127.38740043393527]
We propose a new data selection approach to eliminate redundancy within visual instruction data.
Experiments on LLaVA-1.5 show that our approach, using only about 7.5% of the data, can achieve performance comparable to that of the model fine-tuned on the full data.
arXiv Detail & Related papers (2024-03-14T16:47:25Z)
- COCO is "ALL'' You Need for Visual Instruction Fine-tuning [39.438410070172125]
Visual instruction fine-tuning (IFT) is a vital process for aligning MLLMs' output with users' intentions.
Recent studies propose to construct visual IFT datasets through a multifaceted approach.
We establish a new IFT dataset, with images sourced from the COCO dataset along with more diverse instructions.
arXiv Detail & Related papers (2024-01-17T04:43:45Z)
- Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning [92.85265959892115]
This paper introduces the first large and diverse visual instruction tuning dataset, named Large-scale Robust Visual (LRV)-Instruction.
Our dataset comprises 400k visual instructions generated by GPT4, covering 16 vision-and-language tasks with open-ended instructions and answers.
To efficiently measure the hallucination generated by LMMs, we propose GPT4-Assisted Visual Instruction Evaluation (GAVIE), a stable approach to evaluate visual instruction tuning like human experts.
arXiv Detail & Related papers (2023-06-26T10:26:33Z)