Aligning Modalities in Vision Large Language Models via Preference
Fine-tuning
- URL: http://arxiv.org/abs/2402.11411v1
- Date: Sun, 18 Feb 2024 00:56:16 GMT
- Title: Aligning Modalities in Vision Large Language Models via Preference
Fine-tuning
- Authors: Yiyang Zhou, Chenhang Cui, Rafael Rafailov, Chelsea Finn, Huaxiu Yao
- Abstract summary: In this work, we frame the hallucination problem as an alignment issue and tackle it with preference tuning.
Specifically, we propose POVID to generate feedback data with AI models.
We use ground-truth instructions as the preferred response and a two-stage approach to generate dispreferred data.
In experiments across broad benchmarks, we show that we can not only reduce hallucinations, but improve model performance across standard benchmarks, outperforming prior approaches.
- Score: 67.62925151837675
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Instruction-following Vision Large Language Models (VLLMs) have achieved
significant progress recently on a variety of tasks. These approaches merge
strong pre-trained vision models and large language models (LLMs). Since these
components are trained separately, the learned representations need to be
aligned with joint training on additional image-language pairs. This procedure
is not perfect and can cause the model to hallucinate - provide answers that do
not accurately reflect the image, even when the core LLM is highly factual and
the vision backbone has sufficiently complete representations. In this work, we
frame the hallucination problem as an alignment issue and tackle it with
preference tuning. Specifically, we propose POVID to generate feedback data
with AI models. We use ground-truth instructions as the preferred response and
a two-stage approach to generate dispreferred data. First, we prompt GPT-4V to
inject plausible hallucinations into the correct answer. Second, we distort the
image to trigger the inherent hallucination behavior of the VLLM. This automated
approach does not rely on human data generation or require a perfect expert,
which makes it easily scalable. Finally, both of these
generation strategies are integrated into an RLHF pipeline via Direct
Preference Optimization. In experiments across broad benchmarks, we show that
we can not only reduce hallucinations, but improve model performance across
standard benchmarks, outperforming prior approaches. Our data and code are
available at https://github.com/YiyangZhou/POVID.
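As a rough illustration of the pipeline described above, the sketch below builds a preference pair from the two dispreferred-data stages and computes a vanilla DPO loss in plain PyTorch. The helper names (build_preference_pair, hallucinate_fn, distort_fn, vllm_answer_fn) and the use of the standard DPO objective are assumptions for illustration only; the authors' actual implementation is in the repository linked above.

import torch
import torch.nn.functional as F


def build_preference_pair(image, instruction, ground_truth,
                          hallucinate_fn, distort_fn, vllm_answer_fn):
    # Preferred response: the ground-truth answer, as described in the abstract.
    preferred = ground_truth
    # Dispreferred, stage 1: plausible hallucinations injected into the correct
    # answer (the paper prompts GPT-4V for this; hallucinate_fn is a stand-in).
    injected = hallucinate_fn(instruction, ground_truth)
    # Dispreferred, stage 2: the VLLM's own answer on a distorted image, which
    # triggers its inherent hallucination behavior (distortion not specified here).
    triggered = vllm_answer_fn(distort_fn(image), instruction)
    return preferred, [injected, triggered]


def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Standard Direct Preference Optimization objective on sequence log-probs:
    # -log sigmoid(beta * ((log pi - log ref)_chosen - (log pi - log ref)_rejected))
    chosen_margin = policy_chosen_logps - ref_chosen_logps
    rejected_margin = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()


if __name__ == "__main__":
    # Dummy log-probabilities stand in for real model outputs, only to show shapes.
    base = torch.randn(4)
    print(dpo_loss(base + 1.0, base - 1.0, base, base).item())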
Related papers
- CHiP: Cross-modal Hierarchical Direct Preference Optimization for Multimodal LLMs [107.21334626890713]
Multimodal Large Language Models (MLLMs) still struggle with hallucinations despite their impressive capabilities.
We propose a Cross-modal Hierarchical Direct Preference Optimization (CHiP) to address these limitations.
We evaluate CHiP through both quantitative and qualitative analyses, with results across multiple benchmarks demonstrating its effectiveness in reducing hallucinations.
arXiv Detail & Related papers (2025-01-28T02:05:38Z)
- Multimodal Preference Data Synthetic Alignment with Reward Model [23.978820500281213]
We propose a new framework for generating synthetic data using a reward model as a proxy of human preference for effective multimodal alignment with DPO training.
Experimental results indicate that integrating selected synthetic data, such as that from generative and reward models, can effectively reduce reliance on human-annotated data.
arXiv Detail & Related papers (2024-12-23T09:29:40Z)
- Beyond Human Data: Aligning Multimodal Large Language Models by Iterative Self-Evolution [43.07899102255169]
We propose a novel multimodal self-evolution framework that enables the model to autonomously generate high-quality questions and answers.
First, we implement an image-driven self-questioning mechanism, allowing the model to create and evaluate questions based on image content.
Second, we introduce an answer self-enhancement technique, starting with image captioning to improve answer quality.
arXiv Detail & Related papers (2024-12-20T08:06:00Z)
- Looking Beyond Text: Reducing Language Bias in Large Vision-Language Models via Multimodal Dual-Attention and Soft-Image Guidance [67.26434607115392]
Large vision-language models (LVLMs) have achieved impressive results in various vision-language tasks.
LVLMs suffer from hallucinations caused by language bias, leading to diminished focus on images and ineffective visual comprehension.
We propose LACING to address the language bias of LVLMs with muLtimodal duAl-attention meChanIsm (MDA) aNd soft-image Guidance (IFG).
arXiv Detail & Related papers (2024-11-21T16:33:30Z)
- V-DPO: Mitigating Hallucination in Large Vision Language Models via Vision-Guided Direct Preference Optimization [21.248617886995103]
We propose Vision-guided Direct Preference Optimization (V-DPO) to enhance visual context learning at training time.
Our analysis indicates that V-DPO excels in learning from image-contrast preference data, demonstrating its superior ability to elicit and understand nuances of visual context.
arXiv Detail & Related papers (2024-11-05T01:24:37Z)
- VLFeedback: A Large-Scale AI Feedback Dataset for Large Vision-Language Models Alignment [55.7956150385255]
We investigate the efficacy of AI feedback to scale supervision for aligning vision-language models.
We introduce VLFeedback, the first large-scale vision-language feedback dataset.
We train Silkie, an LVLM fine-tuned via direct preference optimization on VLFeedback.
arXiv Detail & Related papers (2024-10-12T07:56:47Z)
- Calibrated Self-Rewarding Vision Language Models [27.686545023186852]
Large Vision-Language Models (LVLMs) have made substantial progress by integrating pre-trained large language models (LLMs) and vision models through instruction tuning.
LVLMs often exhibit the hallucination phenomenon, where generated text responses appear linguistically plausible but contradict the input image.
We propose the Calibrated Self-Rewarding (CSR) approach, which enables the model to self-improve by iteratively generating candidate responses, evaluating the reward for each response, and curating preference data for fine-tuning.
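As a purely illustrative sketch of the iterative loop this summary describes, one round might look like the following; the helper names (generate, reward, dpo_finetune) and the best-vs-worst pairing rule are assumptions, not the CSR authors' implementation.

def self_rewarding_round(model, dataset, generate, reward, dpo_finetune, k=4):
    # One round of a calibrated-self-rewarding-style loop: sample candidates,
    # score them with the model's own reward, curate preference pairs, fine-tune.
    pairs = []
    for image, prompt in dataset:
        candidates = [generate(model, image, prompt) for _ in range(k)]
        ranked = sorted(candidates, key=lambda r: reward(model, image, prompt, r))
        # Highest-reward candidate is treated as chosen, lowest as rejected.
        pairs.append((image, prompt, ranked[-1], ranked[0]))
    return dpo_finetune(model, pairs)  # CSR iterates rounds like this one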
arXiv Detail & Related papers (2024-05-23T14:30:33Z)
- VILA: On Pre-training for Visual Language Models [74.08039416548209]
We study the design options for VLM pre-training through step-by-step controllable comparisons.
We build VILA, a Visual Language model family that consistently outperforms the state-of-the-art models.
arXiv Detail & Related papers (2023-12-12T18:58:18Z)
- Aligning Large Multimodal Models with Factually Augmented RLHF [176.54751941088819]
Large Multimodal Models (LMMs) are built across modalities, and misalignment between the two modalities can result in "hallucination".
We adapt the Reinforcement Learning from Human Feedback (RLHF) from the text domain to the task of vision-language alignment.
We propose a new alignment algorithm called Factually Augmented RLHF that augments the reward model with additional factual information.
Our approach achieves a remarkable improvement on the LLaVA-Bench dataset, reaching 94% of the performance level of the text-only GPT-4.
arXiv Detail & Related papers (2023-09-25T20:59:33Z)
- ILLUME: Rationalizing Vision-Language Models through Human Interactions [18.701950647429]
We propose a tuning paradigm based on human interactions with machine-generated data.
Our ILLUME executes the following loop: Given an image-question-answer prompt, the VLM samples multiple candidate rationales, and a human critic provides feedback via preference selection.
This loop increases the training data and gradually carves out the VLM's rationalization capabilities that are aligned with human intent.
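A minimal sketch of this human-in-the-loop rationalization cycle is given below; all helper names (sample_rationales, human_select, finetune) are illustrative stand-ins rather than the ILLUME authors' API.

def illume_round(vlm, examples, sample_rationales, human_select, finetune, n=5):
    # One iteration: sample candidate rationales, let a human critic pick the
    # preferred ones, and fold the accepted data back into fine-tuning.
    accepted = []
    for image, question, answer in examples:
        rationales = sample_rationales(vlm, image, question, answer, n=n)
        chosen = human_select(rationales)  # preference selection by the critic
        accepted.extend((image, question, answer, r) for r in chosen)
    return finetune(vlm, accepted)  # grows the training data; the loop repeats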
arXiv Detail & Related papers (2022-08-17T11:41:43Z)