Silkie: Preference Distillation for Large Visual Language Models
- URL: http://arxiv.org/abs/2312.10665v1
- Date: Sun, 17 Dec 2023 09:44:27 GMT
- Title: Silkie: Preference Distillation for Large Visual Language Models
- Authors: Lei Li, Zhihui Xie, Mukai Li, Shunian Chen, Peiyi Wang, Liang Chen,
Yazheng Yang, Benyou Wang, Lingpeng Kong
- Abstract summary: This paper explores preference distillation for large vision language models (LVLMs).
We first build a vision-language feedback dataset utilizing AI annotation.
We adopt GPT-4V to assess the generated outputs regarding helpfulness, visual faithfulness, and ethical considerations.
The resulting model, Silkie, achieves 6.9% and 9.5% relative improvements on the MME benchmark in perception and cognition capabilities, respectively.
- Score: 56.10697821410489
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper explores preference distillation for large vision language models
(LVLMs), improving their ability to generate helpful and faithful responses
anchored to the visual context. We first build a vision-language feedback
(VLFeedback) dataset utilizing AI annotation. Specifically, responses are
generated by models sampled from 12 LVLMs, conditioned on multi-modal
instructions sourced from various datasets. We adopt GPT-4V to assess the
generated outputs regarding helpfulness, visual faithfulness, and ethical
considerations. Furthermore, the preference supervision is distilled into
Qwen-VL-Chat through the direct preference optimization (DPO) method. The
resulting model, Silkie, achieves 6.9% and 9.5% relative improvements on the MME
benchmark in perception and cognition capabilities, respectively.
Silkie also demonstrates reduced hallucination by setting a new
state-of-the-art score of 3.02 on the MMHal-Bench benchmark. Further analysis
shows that DPO with our VLFeedback dataset mainly boosts the fine-grained
perception and complex cognition abilities of LVLMs, leading to more
comprehensive improvements compared to human-annotated preference datasets.
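For concreteness, here is a minimal sketch of the recipe the abstract describes: aspect scores from the AI annotator are aggregated to form chosen/rejected pairs, which then feed the standard DPO objective. The data layout and function names are illustrative assumptions, not the authors' released code.

```python
import torch.nn.functional as F

def build_preference_pair(responses):
    """Aggregate AI-annotated aspect scores (helpfulness, visual
    faithfulness, ethics) and pair the best-rated response against the
    worst. Each item: {"text": str, "scores": {aspect: float}}."""
    ranked = sorted(
        responses,
        key=lambda r: sum(r["scores"].values()) / len(r["scores"]),
        reverse=True,
    )
    return ranked[0]["text"], ranked[-1]["text"]  # (chosen, rejected)

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO objective on summed response log-probs under the
    policy (pi_*) and a frozen reference model (ref_*)."""
    logits = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -F.logsigmoid(logits).mean()
```

In the paper's setup the policy would be Qwen-VL-Chat and the reference its frozen pre-DPO copy; beta controls how far the policy may drift from the reference.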
Related papers
- V-DPO: Mitigating Hallucination in Large Vision Language Models via Vision-Guided Direct Preference Optimization [21.248617886995103]
We propose Vision-guided Direct Preference Optimization (V-DPO) to enhance visual context learning at training time.
Our analysis indicates that V-DPO excels in learning from image-contrast preference data, demonstrating its superior ability to elicit and understand nuances of visual context.
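The summary leaves the construction of image-contrast preference data implicit; one plausible (hypothetical) reading is that the same answer is preferred under the true image and dispreferred under a contrasting one, so the preference signal can only be explained by the visual input:

```python
import random

def image_contrast_example(sample, distractor_images):
    """Hypothetical image-contrast preference example: the reference
    answer is 'chosen' with the real image and 'rejected' with an
    unrelated image, grounding the preference in vision rather than
    in language priors."""
    return {
        "prompt": sample["question"],
        "chosen": {"image": sample["image"], "response": sample["answer"]},
        "rejected": {
            "image": random.choice(distractor_images),
            "response": sample["answer"],
        },
    }
```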
arXiv Detail & Related papers (2024-11-05T01:24:37Z)
- VLFeedback: A Large-Scale AI Feedback Dataset for Large Vision-Language Models Alignment [55.7956150385255]
We investigate the efficacy of AI feedback to scale supervision for aligning vision-language models.
We introduce VLFeedback, the first large-scale vision-language feedback dataset.
We train Silkie, an LVLM fine-tuned via direct preference optimization on VLFeedback.
arXiv Detail & Related papers (2024-10-12T07:56:47Z)
- MRAG-Bench: Vision-Centric Evaluation for Retrieval-Augmented Multimodal Models [115.16022378880376]
We introduce a multimodal retrieval-augmented generation benchmark, MRAG-Bench.
MRAG-Bench consists of 16,130 images and 1,353 human-annotated multiple-choice questions.
Results show that all large vision-language models (LVLMs) exhibit greater improvements when augmented with images than with textual knowledge.
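A hedged sketch of how such a comparison can be scored: run the same multiple-choice questions with and without retrieved images and compare accuracies (`model.answer` stands in for whatever generation API the evaluated LVLM exposes):

```python
def mcq_accuracy(model, benchmark, use_retrieved_images):
    """Accuracy over human-annotated multiple-choice questions,
    optionally augmenting the query image with retrieved images."""
    correct = 0
    for item in benchmark:  # keys assumed: question, choices, answer, images
        images = [item["query_image"]]
        if use_retrieved_images:
            images += item["retrieved_images"]
        pred = model.answer(item["question"], item["choices"], images)
        correct += pred == item["answer"]
    return correct / len(benchmark)

# Gain from visual augmentation, per model:
# gain = mcq_accuracy(m, bench, True) - mcq_accuracy(m, bench, False)
```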
arXiv Detail & Related papers (2024-10-10T17:55:02Z)
- Aligning Large Language Models with Self-generated Preference Data [72.99676237703099]
We propose a new framework that boosts the alignment of large language models (LLMs) with human preferences.
Our key idea is to leverage the human prior knowledge contained in the small (seed) data.
We introduce a noise-aware preference learning algorithm to mitigate the risk of low-quality generated preference data.
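The summary does not specify the algorithm; a common way to make preference learning noise-aware is label smoothing on the DPO objective (cDPO-style), sketched here as an assumption rather than the paper's exact method:

```python
import torch.nn.functional as F

def noise_aware_dpo_loss(pi_c, pi_r, ref_c, ref_r, beta=0.1, eps=0.1):
    """DPO with label smoothing: assume each preference label is flipped
    with probability eps and mix both orderings; eps=0 recovers DPO."""
    logits = beta * ((pi_c - ref_c) - (pi_r - ref_r))
    loss = -(1 - eps) * F.logsigmoid(logits) - eps * F.logsigmoid(-logits)
    return loss.mean()
```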
arXiv Detail & Related papers (2024-06-06T18:01:02Z)
- Enhancing Visual-Language Modality Alignment in Large Vision Language Models via Self-Improvement [102.22911097049953]
SIMA is a framework that enhances visual and language modality alignment through self-improvement.
It employs an in-context self-critic mechanism to select response pairs for preference tuning.
We demonstrate that SIMA achieves superior modality alignment, outperforming previous approaches.
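A minimal sketch of an in-context self-critic step as described above: the model generates candidates, then judges its own outputs with a critic prompt to yield a preference pair (`generate` and `judge` are hypothetical method names):

```python
def self_critic_pair(model, image, question, n_candidates=4):
    """Sample candidates, then ask the same model to rank them for
    preference tuning (in-context self-critique)."""
    candidates = [model.generate(image, question) for _ in range(n_candidates)]
    critic_prompt = (
        "Rank these answers to the question by helpfulness and visual "
        "faithfulness, best first. Answers:\n"
        + "\n".join(f"{i}: {c}" for i, c in enumerate(candidates))
    )
    ranking = model.judge(image, question, critic_prompt)  # indices, best first
    return candidates[ranking[0]], candidates[ranking[-1]]  # (chosen, rejected)
```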
arXiv Detail & Related papers (2024-05-24T23:09:27Z)
- Calibrated Self-Rewarding Vision Language Models [27.686545023186852]
Large Vision-Language Models (LVLMs) have made substantial progress by integrating pre-trained large language models (LLMs) and vision models through instruction tuning.
LVLMs often exhibit the hallucination phenomenon, where generated text responses appear linguistically plausible but contradict the input image.
We propose the Calibrated Self-Rewarding (CSR) approach, which enables the model to self-improve by iteratively generating candidate responses, evaluating the reward for each response, and curating preference data for fine-tuning.
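The loop the summary describes, as a hedged sketch (reward calibration details omitted; `generate`, `reward`, and `preference_finetune` are placeholder APIs):

```python
def calibrated_self_rewarding(model, dataset, rounds=3, k=4):
    """Iteratively: sample candidates, score them with the model's own
    (calibrated) reward, curate best-vs-worst pairs, fine-tune."""
    for _ in range(rounds):
        pairs = []
        for image, prompt in dataset:
            candidates = [model.generate(image, prompt) for _ in range(k)]
            ranked = sorted(candidates, key=lambda c: model.reward(image, prompt, c))
            pairs.append((image, prompt, ranked[-1], ranked[0]))  # (chosen, rejected)
        model = model.preference_finetune(pairs)  # e.g. a DPO update
    return model
```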
arXiv Detail & Related papers (2024-05-23T14:30:33Z)
- VALOR-EVAL: Holistic Coverage and Faithfulness Evaluation of Large Vision-Language Models [57.43276586087863]
Large Vision-Language Models (LVLMs) suffer from hallucination issues, wherein the models generate plausible-sounding but factually incorrect outputs.
Existing benchmarks are often limited in scope, focusing mainly on object hallucinations.
We introduce a multi-dimensional benchmark covering objects, attributes, and relations, with challenging images selected based on associative biases.
arXiv Detail & Related papers (2024-04-22T04:49:22Z)
- ALLaVA: Harnessing GPT4V-Synthesized Data for Lite Vision-Language Models [45.040292339670096]
Large vision-language models (LVLMs) have shown promise in a broad range of vision-language tasks with their strong reasoning and generalization capabilities.
This study aims to bridge the performance gap between traditional-scale LVLMs and resource-friendly lite versions by adopting high-quality training data.
arXiv Detail & Related papers (2024-02-18T19:26:49Z)
- Multi-modal Preference Alignment Remedies Degradation of Visual Instruction Tuning on Language Models [7.056824589733873]
Multi-modal large language models (MLLMs) are expected to support multi-turn queries of interchanging image and text modalities in production.
Current MLLMs trained with visual-question-answering datasets could suffer from degradation of their language capabilities.
We propose a distillation-based multi-modal alignment model with fine-grained annotations on a small dataset that restores and boosts MLLM's language capability after visual instruction tuning.
arXiv Detail & Related papers (2024-02-16T18:42:08Z)