Beyond Human Data: Aligning Multimodal Large Language Models by Iterative Self-Evolution
- URL: http://arxiv.org/abs/2412.15650v1
- Date: Fri, 20 Dec 2024 08:06:00 GMT
- Title: Beyond Human Data: Aligning Multimodal Large Language Models by Iterative Self-Evolution
- Authors: Wentao Tan, Qiong Cao, Yibing Zhan, Chao Xue, Changxing Ding
- Abstract summary: We propose a novel multimodal self-evolution framework that enables the model to autonomously generate high-quality questions and answers.
First, we implement an image-driven self-questioning mechanism, allowing the model to create and evaluate questions based on image content.
Second, we introduce an answer self-enhancement technique, starting with image captioning to improve answer quality.
- Score: 43.07899102255169
- Abstract: Human preference alignment can greatly enhance Multimodal Large Language Models (MLLMs), but collecting high-quality preference data is costly. A promising solution is the self-evolution strategy, where models are iteratively trained on data they generate. However, current techniques still rely on human- or GPT-annotated data and sometimes require additional models or ground truth answers. To address these issues, we propose a novel multimodal self-evolution framework that enables the model to autonomously generate high-quality questions and answers using only unannotated images. First, we implement an image-driven self-questioning mechanism, allowing the model to create and evaluate questions based on image content, regenerating them if they are irrelevant or unanswerable. This sets a strong foundation for answer generation. Second, we introduce an answer self-enhancement technique, starting with image captioning to improve answer quality. We also use corrupted images to generate rejected answers, forming distinct preference pairs for optimization. Finally, we incorporate an image content alignment loss function alongside Direct Preference Optimization (DPO) loss to reduce hallucinations, ensuring the model focuses on image content. Experiments show that our framework performs competitively with methods using external information, offering a more efficient and scalable approach to MLLMs.
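The abstract names a Direct Preference Optimization (DPO) loss computed over preference pairs (a chosen answer and a rejected answer generated from a corrupted image). As a rough illustration only, the standard DPO loss for a single pair can be sketched as below. The function names, the argument names, and the default `beta` are assumptions made for this sketch, not taken from the paper, and the image content alignment loss is not reproduced because the abstract does not define it.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value; the abstract gives no hyperparameters
) -> float:
    """Standard DPO loss for one (chosen, rejected) answer pair.

    Each argument is the summed log-probability of an answer under either
    the trainable policy or the frozen reference model.
    """
    # How much more the policy favors each answer than the reference does.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    # The loss shrinks as the policy prefers the chosen answer more strongly.
    margin = beta * (chosen_logratio - rejected_logratio)
    return -log_sigmoid(margin)


if __name__ == "__main__":
    # Hypothetical log-probabilities, for illustration only.
    print(dpo_loss(-4.0, -9.0, -5.0, -5.0))
```

When the policy and the reference agree on both answers, the margin is zero and the loss equals log 2; it falls below that value as the policy shifts probability toward the chosen answer.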
Related papers
- MIA-DPO: Multi-Image Augmented Direct Preference Optimization For Large Vision-Language Models [85.30735602813093]
Multi-Image Augmented Direct Preference Optimization (MIA-DPO) is a visual preference alignment approach that effectively handles multi-image inputs.
MIA-DPO mitigates the scarcity of diverse multi-image training data by extending single-image data with unrelated images arranged in grid collages or pic-in-pic formats.
arXiv Detail & Related papers (2024-10-23T07:56:48Z)
- MMAR: Towards Lossless Multi-Modal Auto-Regressive Probabilistic Modeling [64.09238330331195]
We propose a novel Multi-Modal Auto-Regressive (MMAR) probabilistic modeling framework.
Unlike discretization-based methods, MMAR takes in continuous-valued image tokens to avoid information loss.
We show that MMAR achieves superior performance compared with other joint multi-modal models.
arXiv Detail & Related papers (2024-10-14T17:57:18Z)
- Enhancing Large Vision Language Models with Self-Training on Image Comprehension [131.14381425260706]
We introduce Self-Training on Image (STIC), which emphasizes a self-training approach specifically for image comprehension.
First, the model self-constructs a preference for image descriptions using unlabeled images.
To further self-improve reasoning on the extracted visual information, we let the model reuse a small portion of existing instruction-tuning data.
arXiv Detail & Related papers (2024-05-30T05:53:49Z)
- Class-Conditional self-reward mechanism for improved Text-to-Image models [1.8434042562191815]
We build upon the concept of self-rewarding models and introduce its vision equivalent for Text-to-Image generative AI models.
This approach works by fine-tuning a diffusion model on a self-generated, self-judged dataset.
It has been evaluated to be at least 60% better than existing commercial and research Text-to-Image models.
arXiv Detail & Related papers (2024-05-22T09:28:43Z)
- Aligning Modalities in Vision Large Language Models via Preference Fine-tuning [67.62925151837675]
In this work, we frame the hallucination problem as an alignment issue and tackle it with preference tuning.
Specifically, we propose POVID to generate feedback data with AI models.
We use ground-truth instructions as the preferred response and a two-stage approach to generate dispreferred data.
In experiments across broad benchmarks, we show that we can not only reduce hallucinations but also improve model performance across standard benchmarks, outperforming prior approaches.
arXiv Detail & Related papers (2024-02-18T00:56:16Z)
- Self-Enhancement Improves Text-Image Retrieval in Foundation Visual-Language Models [33.008325765051865]
Cross-modal foundation models fail to focus on the key attributes required for domain-specific retrieval tasks.
We propose a self-enhancement framework, A3R, based on the CLIP-ViT/G-14, one of the largest cross-modal models.
arXiv Detail & Related papers (2023-06-11T14:25:38Z)
- InvGAN: Invertible GANs [88.58338626299837]
InvGAN, short for Invertible GAN, successfully embeds real images into the latent space of a high-quality generative model.
This allows us to perform image inpainting, merging, and online data augmentation.
arXiv Detail & Related papers (2021-12-08T21:39:00Z)