InstructionGPT-4: A 200-Instruction Paradigm for Fine-Tuning MiniGPT-4
- URL: http://arxiv.org/abs/2308.12067v2
- Date: Wed, 11 Oct 2023 14:49:26 GMT
- Title: InstructionGPT-4: A 200-Instruction Paradigm for Fine-Tuning MiniGPT-4
- Authors: Lai Wei, Zihao Jiang, Weiran Huang, Lichao Sun
- Abstract summary: We introduce InstructionGPT-4, which is fine-tuned on a small dataset comprising only 200 examples.
We propose several metrics to assess the quality of multimodal instruction data and, based on these metrics, present an effective and trainable data selector that automatically identifies and filters low-quality vision-language data.
Our findings demonstrate that a smaller amount of high-quality instruction-tuning data can efficiently enable multimodal large language models to generate better output.
- Score: 14.248735997950446
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal large language models are typically trained in two stages: first
pre-training on image-text pairs, and then fine-tuning using supervised
vision-language instruction data. Recent studies have shown that large language
models can achieve satisfactory results even with a limited amount of
high-quality instruction-following data. In this paper, we introduce
InstructionGPT-4, which is fine-tuned on a small dataset comprising only 200
examples, amounting to approximately 6% of the instruction-following data used
in the alignment dataset for MiniGPT-4. To achieve this, we first propose
several metrics to assess the quality of multimodal instruction data. Based on
these metrics, we present an effective and trainable data selector to
automatically identify and filter low-quality vision-language data. By
employing this method, InstructionGPT-4 outperforms the original MiniGPT-4 on
various evaluations. Overall, our findings demonstrate that a smaller amount of
high-quality instruction-tuning data can efficiently enable multimodal large
language models to generate better output. Our code is available at
https://github.com/waltonfuture/InstructionGPT-4.
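The paper's selection pipeline is not reproduced here, but its core idea (score each multimodal instruction example on several quality metrics, then keep only the top-scoring subset, e.g. 200 examples) can be sketched as follows. The metric names, fixed weights, and `select_top_k` helper are illustrative assumptions; the paper trains a data selector rather than using hand-set weights.

```python
# Minimal sketch of metric-based instruction-data selection.
# The metrics and weights below are illustrative placeholders; InstructionGPT-4
# learns its selector, whereas this sketch uses a fixed weighted sum.

def quality_score(example, weights):
    """Combine per-example quality metrics into a single scalar score."""
    return sum(weights[name] * example["metrics"][name] for name in weights)

def select_top_k(dataset, weights, k):
    """Rank examples by combined score and keep the k highest-quality ones."""
    ranked = sorted(dataset, key=lambda ex: quality_score(ex, weights),
                    reverse=True)
    return ranked[:k]

# Toy dataset: each example carries precomputed quality metrics
# (e.g. image-text relevance, text fluency), normalized to [0, 1].
dataset = [
    {"id": 0, "metrics": {"relevance": 0.9, "fluency": 0.8}},
    {"id": 1, "metrics": {"relevance": 0.2, "fluency": 0.9}},
    {"id": 2, "metrics": {"relevance": 0.7, "fluency": 0.7}},
]
weights = {"relevance": 0.6, "fluency": 0.4}

selected = select_top_k(dataset, weights, k=2)
print([ex["id"] for ex in selected])  # prints [0, 2]
```

In the paper's setting, `k` would be 200 and the dataset would be MiniGPT-4's alignment data; the fine-tuning stage then uses only the selected subset.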
Related papers
- MLAN: Language-Based Instruction Tuning Improves Zero-Shot Generalization of Multimodal Large Language Models [79.0546136194314]
We present a novel instruction tuning recipe to improve the zero-shot task generalization of multimodal large language models.
We evaluate the performance of the proposed approach on 9 unseen datasets across both language and vision modalities.
arXiv Detail & Related papers (2024-11-15T20:09:59Z)
- MMInstruct: A High-Quality Multi-Modal Instruction Tuning Dataset with Extensive Diversity [80.02202386597138]
We construct a high-quality, diverse visual instruction tuning dataset MMInstruct, which consists of 973K instructions from 24 domains.
Our instruction generation engine enables semi-automatic, low-cost, and multi-domain instruction generation at a fraction of the cost of manual construction.
arXiv Detail & Related papers (2024-07-22T17:55:22Z)
- Less is More: High-value Data Selection for Visual Instruction Tuning [127.38740043393527]
We propose a high-value data selection approach, TIVE, to eliminate redundancy within the visual instruction data and reduce the training cost.
Our approach using only about 15% data can achieve comparable average performance to the full-data fine-tuned model across eight benchmarks.
arXiv Detail & Related papers (2024-03-14T16:47:25Z)
- Towards Robust Instruction Tuning on Multimodal Large Language Models [25.506776502317436]
In this work, we introduce an automatic instruction augmentation method named INSTRAUG in multimodal tasks.
Results on two popular multimodal instruction-following benchmarks show that INSTRAUG can significantly improve the alignment of multimodal large language models (MLLMs) across 12 multimodal tasks.
arXiv Detail & Related papers (2024-02-22T12:35:50Z)
- Vision-Flan: Scaling Human-Labeled Tasks in Visual Instruction Tuning [53.93074108238167]
We construct Vision-Flan, the most diverse publicly available visual instruction tuning dataset to date.
We propose a two-stage instruction tuning framework, in which VLMs are finetuned on Vision-Flan and further tuned on GPT-4 synthesized data.
We find this two-stage tuning framework significantly outperforms the traditional single-stage visual instruction tuning framework.
arXiv Detail & Related papers (2024-02-18T19:38:44Z)
- Instruction Mining: Instruction Data Selection for Tuning Large Language Models [18.378654454336136]
InstructMining is designed for automatically selecting premium instruction-following data for finetuning large language models.
We show that InstructMining achieves state-of-the-art performance on two of the most popular benchmarks: LLM-as-a-judge and Huggingface OpenLLM leaderboard.
arXiv Detail & Related papers (2023-07-12T16:37:31Z)
- Visual Instruction Tuning [79.70923292053097]
We present the first attempt to use language-only GPT-4 to generate multimodal language-image instruction-following data.
By instruction tuning on such generated data, we introduce LLaVA: Large Language and Vision Assistant.
When fine-tuned on Science QA, the synergy of LLaVA and GPT-4 achieves a new state-of-the-art accuracy of 92.53%.
arXiv Detail & Related papers (2023-04-17T17:59:25Z)
- Instruction Tuning with GPT-4 [107.55078894215798]
We present the first attempt to use GPT-4 to generate instruction-following data for finetuning large language models.
Our early experiments on instruction-tuned LLaMA models show that the 52K English and Chinese instruction-following data generated by GPT-4 leads to superior zero-shot performance on new tasks.
arXiv Detail & Related papers (2023-04-06T17:58:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.