COCO is "ALL" You Need for Visual Instruction Fine-tuning
- URL: http://arxiv.org/abs/2401.08968v1
- Date: Wed, 17 Jan 2024 04:43:45 GMT
- Title: COCO is "ALL" You Need for Visual Instruction Fine-tuning
- Authors: Xiaotian Han, Yiqi Wang, Bohan Zhai, Quanzeng You, Hongxia Yang
- Abstract summary: Visual instruction fine-tuning (IFT) is a vital process for aligning MLLMs' output with users' intentions.
Recent studies propose to construct visual IFT datasets through a multifaceted approach.
We establish a new IFT dataset, with images sourced from the COCO dataset along with more diverse instructions.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Multi-modal Large Language Models (MLLMs) are increasingly prominent in the
field of artificial intelligence. Visual instruction fine-tuning (IFT) is a
vital process for aligning MLLMs' output with users' intentions. High-quality
and diversified instruction following data is the key to this fine-tuning
process. Recent studies propose to construct visual IFT datasets through a
multifaceted approach: transforming existing datasets with rule-based
templates, employing GPT-4 for rewriting annotations, and utilizing GPT-4V for
visual dataset pseudo-labeling. LLaVA-1.5 adopted a similar approach and
constructed LLaVA-mix-665k, one of the simplest, most widely used, and most
effective IFT datasets today. Notably, when properly fine-tuned with this
dataset, MLLMs can achieve state-of-the-art performance on several benchmarks.
However, we noticed that models trained with this dataset often struggle to
follow user instructions properly in multi-round dialog. In addition,
traditional captioning and VQA evaluation benchmarks, with their closed-form evaluation
structure, are not fully equipped to assess the capabilities of modern
open-ended generative MLLMs. This problem is not unique to the LLaVA-mix-665k
dataset, but may be a potential issue in all IFT datasets constructed from
image captioning or VQA sources, though the extent of this issue may vary. We
argue that datasets with diverse and high-quality detailed instruction
following annotations are both essential and sufficient for MLLM IFT. In this work,
we establish a new IFT dataset, with images sourced from the COCO dataset along
with more diverse instructions. Our experiments show that when fine-tuned with
our proposed dataset, MLLMs achieve better performance on open-ended evaluation
benchmarks in both single-round and multi-round dialog settings.
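The abstract describes building IFT data by transforming existing datasets with rule-based templates. As a minimal sketch of what such a transformation can look like, the snippet below wraps a raw VQA record into a single-round instruction-following conversation; the template strings, field names, and record layout are illustrative assumptions, not the paper's actual templates or schema.

```python
import random

# Hypothetical templates; the actual templates behind LLaVA-mix-665k or the
# proposed COCO-based dataset are not reproduced here.
QA_TEMPLATES = [
    "Look at the image and answer: {question}",
    "Based on the picture, {question}",
    "{question} Answer briefly.",
]

def vqa_to_instruction(sample: dict) -> dict:
    """Wrap a raw (image, question, answer) VQA record into a
    single-round instruction-following conversation."""
    template = random.choice(QA_TEMPLATES)
    return {
        "image": sample["image"],
        "conversations": [
            {"from": "human", "value": template.format(question=sample["question"])},
            {"from": "gpt", "value": sample["answer"]},
        ],
    }

record = {
    "image": "COCO_train2014_000000000009.jpg",  # placeholder COCO filename
    "question": "How many zebras are visible?",
    "answer": "Two.",
}
ift = vqa_to_instruction(record)
print(ift["conversations"][0]["from"])  # the human turn comes first
```

Multi-round dialog data, which the paper argues is under-served by such sources, would append further human/gpt turn pairs to the same `conversations` list.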
Related papers
- Advancing Multimodal Large Language Models in Chart Question Answering with Visualization-Referenced Instruction Tuning [1.6570772838074355]
Multimodal large language models (MLLMs) exhibit great potential for chart question answering (CQA).
Recent efforts primarily focus on scaling up training datasets through data collection and synthesis.
We propose a visualization-referenced instruction tuning approach to guide the training dataset enhancement and model development.
arXiv Detail & Related papers (2024-07-29T17:04:34Z) - SELF-GUIDE: Better Task-Specific Instruction Following via Self-Synthetic Finetuning [70.21358720599821]
Large language models (LLMs) hold the promise of solving diverse tasks when provided with appropriate natural language prompts.
We propose SELF-GUIDE, a multi-stage mechanism in which we synthesize task-specific input-output pairs from the student LLM.
We report an absolute improvement of approximately 15% for classification tasks and 18% for generation tasks in the benchmark's metrics.
arXiv Detail & Related papers (2024-07-16T04:41:58Z) - DARG: Dynamic Evaluation of Large Language Models via Adaptive Reasoning Graph [70.79413606968814]
We introduce Dynamic Evaluation of LLMs via Adaptive Reasoning Graph Evolvement (DARG) to dynamically extend current benchmarks with controlled complexity and diversity.
Specifically, we first extract the reasoning graphs of data points in current benchmarks and then perturb the reasoning graphs to generate novel testing data.
Such newly generated test samples can have different levels of complexity while maintaining linguistic diversity similar to the original benchmarks.
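DARG's idea of extracting a reasoning structure and perturbing it to raise complexity can be illustrated with a toy sketch. Here a reasoning "graph" is simplified to a linear chain of arithmetic steps, and perturbation appends extra steps; the representation and the `perturb` function are assumptions for illustration, not DARG's actual algorithm.

```python
import random

def evaluate(start: int, steps: list) -> int:
    """Execute a linear chain of (op, operand) reasoning steps."""
    value = start
    for op, operand in steps:
        value = value + operand if op == "+" else value - operand
    return value

def perturb(steps: list, extra: int) -> list:
    """Deepen the reasoning chain by appending random extra steps,
    increasing complexity while the task format stays the same."""
    new_steps = list(steps)
    for _ in range(extra):
        new_steps.append((random.choice("+-"), random.randint(1, 9)))
    return new_steps

base = [("+", 3), ("-", 1)]      # original benchmark-style reasoning chain
harder = perturb(base, extra=2)  # perturbed variant with two more steps
print(evaluate(5, base))         # 5 + 3 - 1 = 7
```

A real reasoning graph would be a DAG with typed nodes, and the perturbed graph would be rendered back into natural-language questions; the key point is that the ground-truth answer stays computable after perturbation.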
arXiv Detail & Related papers (2024-06-25T04:27:53Z) - Genixer: Empowering Multimodal Large Language Models as a Powerful Data Generator [63.762209407570715]
Genixer is a comprehensive data generation pipeline consisting of four key steps.
Training LLaVA-1.5 on a synthetic VQA-like dataset enhances performance on 10 out of 12 multimodal benchmarks.
MLLMs trained with task-specific datasets can surpass GPT-4V in generating complex instruction tuning data.
arXiv Detail & Related papers (2023-12-11T09:44:41Z) - What Makes for Good Visual Instructions? Synthesizing Complex Visual Reasoning Instructions for Visual Instruction Tuning [115.19451843294154]
Visual instruction tuning is an essential approach to improving the zero-shot generalization capability of Multi-modal Large Language Models (MLLMs).
We propose a systematic approach to automatically creating high-quality complex visual reasoning instructions.
Our dataset consistently enhances the performance of all the compared MLLMs, e.g., improving the performance of MiniGPT-4 and BLIP-2 on MME-Cognition by 32.6% and 28.8%, respectively.
arXiv Detail & Related papers (2023-11-02T15:36:12Z) - SEED: Domain-Specific Data Curation With Large Language Models [22.54280367957015]
We present SEED, an LLM-as-compiler approach that automatically generates domain-specific data curation solutions via Large Language Models (LLMs).
SEED automatically selects from the four LLM-assisted modules and forms a hybrid execution pipeline that best fits the task at hand.
arXiv Detail & Related papers (2023-10-01T17:59:20Z) - MLLM-DataEngine: An Iterative Refinement Approach for MLLM [62.30753425449056]
We propose a novel closed-loop system that bridges data generation, model training, and evaluation.
Within each loop, the MLLM-DataEngine first analyzes the weaknesses of the model based on the evaluation results.
For targeting, we propose an Adaptive Bad-case Sampling module, which adjusts the ratio of different types of data.
For quality, we resort to GPT-4 to generate high-quality data with each given data type.
arXiv Detail & Related papers (2023-08-25T01:41:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.