VIGC: Visual Instruction Generation and Correction
- URL: http://arxiv.org/abs/2308.12714v3
- Date: Sun, 4 Feb 2024 06:46:03 GMT
- Title: VIGC: Visual Instruction Generation and Correction
- Authors: Bin Wang, Fan Wu, Xiao Han, Jiahui Peng, Huaping Zhong, Pan Zhang,
Xiaoyi Dong, Weijia Li, Wei Li, Jiaqi Wang, Conghui He
- Abstract summary: The scarcity of high-quality instruction-tuning data for vision-language tasks remains a challenge.
Current leading approaches, such as LLaVA, rely on language-only GPT-4 to generate data.
This paper proposes the Visual Instruction Generation and Correction (VIGC) framework, which enables multimodal large language models to generate instruction-tuning data.
- Score: 47.477290387002284
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The integration of visual encoders and large language models (LLMs) has
driven recent progress in multimodal large language models (MLLMs). However,
the scarcity of high-quality instruction-tuning data for vision-language tasks
remains a challenge. Current leading approaches, such as LLaVA, rely on
language-only GPT-4 to generate data, which requires pre-annotated image
captions and detection bounding boxes and therefore struggles to capture
fine-grained image details. A practical solution to this problem would be to
use available multimodal large language models (MLLMs) to generate instruction
data for vision-language tasks. However, currently accessible MLLMs are not as
powerful as their LLM counterparts and tend to produce inadequate responses or
false information. To address this issue, this paper proposes the Visual
Instruction Generation and Correction (VIGC) framework, which enables
multimodal large language models to generate instruction-tuning data and
progressively enhance its quality on the fly. Specifically, Visual Instruction Generation (VIG)
guides the vision-language model to generate diverse instruction-tuning data.
To ensure generation quality, Visual Instruction Correction (VIC) adopts an
iterative update mechanism to correct any inaccuracies in data produced by VIG,
effectively reducing the risk of hallucination. Leveraging the diverse,
high-quality data generated by VIGC, we finetune mainstream models and validate
data quality based on various evaluations. Experimental results demonstrate
that VIGC not only compensates for the shortcomings of language-only data
generation methods, but also effectively enhances benchmark performance.
The models, datasets, and code are available at
https://opendatalab.github.io/VIGC.
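The abstract describes VIGC as a two-stage pipeline: VIG prompts an MLLM to draft instruction-answer pairs for an image, and VIC iteratively re-queries the model to revise the draft and reduce hallucination before the data is used for finetuning. The Python sketch below illustrates that generate-then-correct loop under stated assumptions; the `mllm_generate` callable, the prompt wording, and the fixed iteration count are hypothetical and not taken from the paper.

```python
# A minimal sketch of the generate-then-correct loop described in the abstract.
# `mllm_generate(image, prompt)` is a hypothetical stand-in for a query to a
# multimodal LLM; the prompts and the fixed number of correction rounds are
# illustrative assumptions, not the authors' implementation.
from typing import Callable, Dict


def vigc_generate(
    image: object,
    mllm_generate: Callable[[object, str], str],
    correction_rounds: int = 3,  # illustrative; VIC updates iteratively
) -> Dict[str, str]:
    # Visual Instruction Generation (VIG): draft a question-answer pair.
    question = mllm_generate(image, "Ask a detailed question about this image.")
    answer = mllm_generate(image, f"Question: {question}\nAnswer:")

    # Visual Instruction Correction (VIC): iteratively revise the answer,
    # asking the model to drop any claim not grounded in the image, which is
    # the mechanism the abstract credits with reducing hallucination.
    for _ in range(correction_rounds):
        answer = mllm_generate(
            image,
            f"Question: {question}\nDraft answer: {answer}\n"
            "Rewrite the draft answer, removing any detail that is not "
            "visible in the image.",
        )

    return {"question": question, "answer": answer}
```

In this reading, the corrected question-answer pairs would then be pooled into an instruction-tuning dataset and used to finetune mainstream models, as the abstract describes.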
Related papers
- Fine-Grained Verifiers: Preference Modeling as Next-token Prediction in Vision-Language Alignment [57.0121616203175]
We propose FiSAO, a novel self-alignment method that utilizes the model's own visual encoder as a fine-grained verifier to improve vision-language alignment.
By leveraging token-level feedback from the vision encoder, FiSAO significantly improves vision-language alignment, even surpassing traditional preference tuning methods that require additional data.
arXiv Detail & Related papers (2024-10-18T03:34:32Z) - Strategies for Improving NL-to-FOL Translation with LLMs: Data Generation, Incremental Fine-Tuning, and Verification [9.36179617282876]
We create a high-quality FOL-annotated subset of the ProofWriter dataset using GPT-4o.
Our results show state-of-the-art performance on the ProofWriter and ProntoQA datasets using ProofFOL on LLaMA-2 and Mistral models.
arXiv Detail & Related papers (2024-09-24T21:24:07Z) - Memory-Space Visual Prompting for Efficient Vision-Language Fine-Tuning [59.13366859237086]
Current solutions for efficiently constructing large vision-language (VL) models follow a two-step paradigm.
We consider visual prompts as additional knowledge that facilitates language models in addressing tasks associated with visual information.
We introduce a novel approach, wherein visual prompts are concatenated with the weights of the FFN for visual knowledge injection.
arXiv Detail & Related papers (2024-05-09T08:23:20Z) - Learning by Correction: Efficient Tuning Task for Zero-Shot Generative Vision-Language Reasoning [22.93684323791136]
Generative vision-language models (VLMs) have shown impressive performance in zero-shot vision-language tasks like image captioning and visual question answering.
We introduce Image-Conditioned Caption Correction (ICCC), a novel pre-training task designed to enhance VLMs' zero-shot performance without the need for labeled task data.
Experimental results on BLIP-2 and InstructBLIP demonstrate significant improvements in zero-shot image-text generation-based tasks through ICCC instruction tuning.
arXiv Detail & Related papers (2024-04-01T04:28:01Z) - Machine Vision Therapy: Multimodal Large Language Models Can Enhance Visual Robustness via Denoising In-Context Learning [67.0609518552321]
We propose to conduct Machine Vision Therapy, which aims to rectify the noisy predictions of vision models.
By fine-tuning on the denoised labels, model performance can be boosted in an unsupervised manner.
arXiv Detail & Related papers (2023-12-05T07:29:14Z) - Position-Enhanced Visual Instruction Tuning for Multimodal Large Language Models [50.07056960586183]
We propose Position-enhanced Visual Instruction Tuning (PVIT) to extend the functionality of Multimodal Large Language Models (MLLMs).
This integration promotes a more detailed comprehension of images for the MLLM.
We present both quantitative experiments and qualitative analysis that demonstrate the superiority of the proposed model.
arXiv Detail & Related papers (2023-08-25T15:33:47Z) - PiVe: Prompting with Iterative Verification Improving Graph-based Generative Capability of LLMs [28.33598529903845]
We show how a small language model could be trained to act as a verifier module for the output of a large language model.
We also show how the verifier module could apply iterative corrections offline for a more cost-effective solution to the text-to-graph generation task.
arXiv Detail & Related papers (2023-05-21T08:11:24Z) - Teaching Structured Vision&Language Concepts to Vision&Language Models [46.344585368641006]
We introduce the collective notion of Structured Vision&Language Concepts (SVLC), which includes object attributes, relations, and states that are present in the text and visible in the image.
We propose a more elegant data-driven approach for enhancing VL models' understanding of SVLCs.
arXiv Detail & Related papers (2022-11-21T18:54:10Z) - Enabling Multimodal Generation on CLIP via Vision-Language Knowledge Distillation [79.72299298976525]
We propose to augment a vision-language pre-training model with a textual pre-trained language model (PLM) via vision-language knowledge distillation (VLKD).
Experiments show that the resulting model has strong zero-shot performance on multimodal generation tasks, such as open-ended visual question answering and image captioning.
The original textual language understanding and generation ability of the PLM is maintained after VLKD, which makes our model versatile for both multimodal and unimodal tasks.
arXiv Detail & Related papers (2022-03-12T09:33:37Z)