MLLM-DataEngine: An Iterative Refinement Approach for MLLM
- URL: http://arxiv.org/abs/2308.13566v2
- Date: Mon, 11 Sep 2023 08:28:40 GMT
- Title: MLLM-DataEngine: An Iterative Refinement Approach for MLLM
- Authors: Zhiyuan Zhao, Linke Ouyang, Bin Wang, Siyuan Huang, Pan Zhang, Xiaoyi
Dong, Jiaqi Wang, Conghui He
- Abstract summary: We propose a novel closed-loop system that bridges data generation, model training, and evaluation.
Within each loop, the MLLM-DataEngine first analyze the weakness of the model based on the evaluation results.
For targeting, we propose an Adaptive Bad-case Sampling module, which adjusts the ratio of different types of data.
For quality, we resort to GPT-4 to generate high-quality data with each given data type.
- Score: 62.30753425449056
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite the great advance of Multimodal Large Language Models (MLLMs) in both
instruction dataset building and benchmarking, the independence of training and
evaluation makes current MLLMs hard to further improve their capability under
the guidance of evaluation results with a relatively low human cost. In this
paper, we propose MLLM-DataEngine, a novel closed-loop system that bridges data
generation, model training, and evaluation. Within each loop iteration, the
MLLM-DataEngine first analyze the weakness of the model based on the evaluation
results, then generate a proper incremental dataset for the next training
iteration and enhance the model capability iteratively. Compared with previous
data collection methods which are separate from the benchmarking, the data
generated by MLLM-DataEngine shows better targeting, quality, and correctness.
For targeting, we propose an Adaptive Bad-case Sampling module, which adjusts
the ratio of different types of data within each incremental dataset based on
the benchmarking results. For quality, we resort to GPT-4 to generate
high-quality data with each given data type. For correctness, prompt design is
critical for the data generation results. Rather than previous hand-crafted
prompt, we propose an Interactive Prompt Optimization strategy, which optimizes
the prompt with the multi-round interaction between human and GPT, and improve
the correctness of generated data greatly. Through extensive experiments, we
find our MLLM-DataEngine could boost the MLLM capability in a targeted and
automatic manner, with only a few human participation. We hope it could be a
general solution for the following MLLMs building. The MLLM-DataEngine has been
open-sourced and is now available at
https://github.com/opendatalab/MLLM-DataEngine.
Related papers
- Star-Agents: Automatic Data Optimization with LLM Agents for Instruction Tuning [71.2981957820888]
We propose a novel Star-Agents framework, which automates the enhancement of data quality across datasets.
The framework initially generates diverse instruction data with multiple LLM agents through a bespoke sampling method.
The generated data undergo a rigorous evaluation using a dual-model method that assesses both difficulty and quality.
arXiv Detail & Related papers (2024-11-21T02:30:53Z) - Align$^2$LLaVA: Cascaded Human and Large Language Model Preference Alignment for Multi-modal Instruction Curation [56.75665429851673]
This paper introduces a novel instruction curation algorithm, derived from two unique perspectives, human and LLM preference alignment.
Experiments demonstrate that we can maintain or even improve model performance by compressing synthetic multimodal instructions by up to 90%.
arXiv Detail & Related papers (2024-09-27T08:20:59Z) - Enhancing Discriminative Tasks by Guiding the Pre-trained Language Model with Large Language Model's Experience [4.814313782484443]
Large Language Models (LLMs) and pre-trained Language Models (LMs) have achieved impressive success on many software engineering tasks.
We use LLMs to generate domain-specific data, thereby improving the performance of pre-trained LMs on the target tasks.
arXiv Detail & Related papers (2024-08-16T06:37:59Z) - SELF-GUIDE: Better Task-Specific Instruction Following via Self-Synthetic Finetuning [70.21358720599821]
Large language models (LLMs) hold the promise of solving diverse tasks when provided with appropriate natural language prompts.
We propose SELF-GUIDE, a multi-stage mechanism in which we synthesize task-specific input-output pairs from the student LLM.
We report an absolute improvement of approximately 15% for classification tasks and 18% for generation tasks in the benchmark's metrics.
arXiv Detail & Related papers (2024-07-16T04:41:58Z) - UICoder: Finetuning Large Language Models to Generate User Interface Code through Automated Feedback [21.858896845159208]
Large language models (LLMs) struggle to consistently generate UI code that compiles and produces visually relevant designs.
Existing approaches to improve generation rely on expensive human feedback or distilling a proprietary model.
Our method starts with an existing LLM and iteratively produces improved models by self-generating a large synthetic dataset.
arXiv Detail & Related papers (2024-06-11T21:53:46Z) - COCO is "ALL'' You Need for Visual Instruction Fine-tuning [39.438410070172125]
Visual instruction fine-tuning (IFT) is a vital process for aligning MLLMs' output with user's intentions.
Recent studies propose to construct visual IFT datasets through a multifaceted approach.
We establish a new IFT dataset, with images sourced from the COCO dataset along with more diverse instructions.
arXiv Detail & Related papers (2024-01-17T04:43:45Z) - Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models [52.98743860365194]
We propose a new fine-tuning method called Self-Play fIne-tuNing (SPIN)
At the heart of SPIN lies a self-play mechanism, where the LLM refines its capability by playing against instances of itself.
This sheds light on the promise of self-play, enabling the achievement of human-level performance in LLMs without the need for expert opponents.
arXiv Detail & Related papers (2024-01-02T18:53:13Z) - SEED: Domain-Specific Data Curation With Large Language Models [22.54280367957015]
We present SEED, an LLM-as-compiler approach that automatically generates domain-specific data curation solutions via Large Language Models (LLMs)
SEED features an that automatically selects from the four LLM-assisted modules and forms a hybrid execution pipeline that best fits the task at hand.
arXiv Detail & Related papers (2023-10-01T17:59:20Z) - Data-Juicer: A One-Stop Data Processing System for Large Language Models [73.27731037450995]
A data recipe is a mixture of data from different sources for training Large Language Models (LLMs)
We build a new system named Data-Juicer, with which we can efficiently generate diverse data recipes.
The data recipes derived with Data-Juicer gain notable improvements on state-of-the-art LLMs.
arXiv Detail & Related papers (2023-09-05T08:22:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.