Silkie: Preference Distillation for Large Visual Language Models
- URL: http://arxiv.org/abs/2312.10665v1
- Date: Sun, 17 Dec 2023 09:44:27 GMT
- Title: Silkie: Preference Distillation for Large Visual Language Models
- Authors: Lei Li, Zhihui Xie, Mukai Li, Shunian Chen, Peiyi Wang, Liang Chen,
  Yazheng Yang, Benyou Wang, Lingpeng Kong
- Abstract summary: This paper explores preference distillation for large vision language models (LVLMs).
We first build a vision-language feedback dataset utilizing AI annotation.
We adopt GPT-4V to assess the generated outputs regarding helpfulness, visual faithfulness, and ethical considerations.
The resulting model, Silkie, achieves 6.9% and 9.5% relative improvement on the MME benchmark in perception and cognition capabilities, respectively.
- Score: 56.10697821410489
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper explores preference distillation for large vision language models
(LVLMs), improving their ability to generate helpful and faithful responses
anchored in the visual context. We first build a vision-language feedback
(VLFeedback) dataset utilizing AI annotation. Specifically, responses are
generated by models sampled from 12 LVLMs, conditioned on multi-modal
instructions sourced from various datasets. We adopt GPT-4V to assess the
generated outputs regarding helpfulness, visual faithfulness, and ethical
considerations. Furthermore, the preference supervision is distilled into
Qwen-VL-Chat through the direct preference optimization (DPO) method. The
resulting model, Silkie, achieves 6.9% and 9.5% relative improvement on the MME
benchmark regarding the perception and cognition capabilities, respectively.
Silkie also demonstrates reduced hallucination by setting a new
state-of-the-art score of 3.02 on the MMHal-Bench benchmark. Further analysis
shows that DPO with our VLFeedback dataset mainly boosts the fine-grained
perception and complex cognition abilities of LVLMs, leading to more
comprehensive improvements compared to human-annotated preference datasets.
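
The pipeline described in the abstract (score several responses per instruction, turn the scores into preference pairs, then train with DPO) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the aspect keys, the averaging of the three aspect ratings into one score, the skipping of ties, the helper names `mean_score`, `build_pairs`, and `dpo_loss`, and the value `beta=0.1` are all assumptions made for the example. The loss itself is the standard per-pair DPO objective computed from sequence log-probabilities under the policy and a frozen reference model.

```python
import math
from itertools import combinations

# Hypothetical aspect names mirroring the three criteria in the abstract.
ASPECTS = ("helpfulness", "visual_faithfulness", "ethical_considerations")


def mean_score(ratings):
    """Average the per-aspect ratings (an assumed aggregation rule)."""
    return sum(ratings[a] for a in ASPECTS) / len(ASPECTS)


def build_pairs(responses):
    """Turn rated responses for one instruction into (chosen, rejected) pairs.

    `responses` is a list of (text, ratings_dict) tuples. Ties are skipped,
    which is an assumption of this sketch.
    """
    pairs = []
    for (text_a, r_a), (text_b, r_b) in combinations(responses, 2):
        s_a, s_b = mean_score(r_a), mean_score(r_b)
        if s_a == s_b:
            continue
        pairs.append((text_a, text_b) if s_a > s_b else (text_b, text_a))
    return pairs


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log sigmoid(beta * (chosen log-ratio - rejected log-ratio))."""
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch for numerical stability.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    rated = [
        ("A", {"helpfulness": 5, "visual_faithfulness": 5, "ethical_considerations": 5}),
        ("B", {"helpfulness": 3, "visual_faithfulness": 3, "ethical_considerations": 3}),
    ]
    print(build_pairs(rated))
    print(dpo_loss(-10.0, -12.0, -11.0, -11.0))
```

When the policy and the reference assign identical log-probabilities, the margin is zero and the loss equals log 2; it falls below that as the policy raises the chosen response relative to the rejected one.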
 
      
Related papers
- Painting with Words: Elevating Detailed Image Captioning with Benchmark and Alignment Learning [56.31096024472269]
 We introduce DeCapBench along with a novel metric, DCScore, specifically designed for detailed captioning tasks.
DCScore evaluates hallucinations and fine-grained comprehensiveness by deconstructing responses into the smallest self-sufficient units.
DeCapBench exhibits a high correlation with VLM arena results on descriptive tasks, surpassing existing benchmarks for vision-language models.
 (2025-03-10T22:53:56Z)
- Beyond Sample-Level Feedback: Using Reference-Level Feedback to Guide Data Synthesis [55.65459867300319]
 LLMs demonstrate remarkable capabilities in following natural language instructions, largely due to instruction-tuning on high-quality datasets.
Recent approaches incorporate feedback to improve data quality, but typically operate at the sample level, generating and applying feedback for each response individually.
We propose Reference-Level Feedback, a novel methodology that instead collects feedback based on high-quality reference samples from carefully curated seed data.
 (2025-02-06T21:29:00Z)
- Multimodal Preference Data Synthetic Alignment with Reward Model [23.978820500281213]
 We propose a new framework for generating synthetic data, using a reward model as a proxy for human preference, to achieve effective multimodal alignment with DPO training.
Experimental results indicate that integrating selected synthetic data, such as data from generative and reward models, can effectively reduce reliance on human-annotated data.
 (2024-12-23T09:29:40Z)
- EACO: Enhancing Alignment in Multimodal LLMs via Critical Observation [58.546205554954454]
 We propose Enhancing Alignment in MLLMs via Critical Observation (EACO).
EACO aligns MLLMs economically with self-generated preference data, using only 5k images.
EACO reduces the overall hallucinations by 65.6% on HallusionBench and improves the reasoning ability by 21.8% on MME-Cognition.
 (2024-12-06T09:59:47Z)
- V-DPO: Mitigating Hallucination in Large Vision Language Models via Vision-Guided Direct Preference Optimization [21.248617886995103]
 We propose Vision-guided Direct Preference Optimization (V-DPO) to enhance visual context learning at training time.
Our analysis indicates that V-DPO excels in learning from image-contrast preference data, demonstrating its superior ability to elicit and understand nuances of visual context.
 (2024-11-05T01:24:37Z)
- VLFeedback: A Large-Scale AI Feedback Dataset for Large Vision-Language Models Alignment [55.7956150385255]
 We investigate the efficacy of AI feedback to scale supervision for aligning vision-language models.
We introduce VLFeedback, the first large-scale vision-language feedback dataset.
We train Silkie, an LVLM fine-tuned via direct preference optimization on VLFeedback.
 (2024-10-12T07:56:47Z)
- MRAG-Bench: Vision-Centric Evaluation for Retrieval-Augmented Multimodal Models [115.16022378880376]
 We introduce a multimodal retrieval-augmented generation benchmark, MRAG-Bench.
MRAG-Bench consists of 16,130 images and 1,353 human-annotated multiple-choice questions.
Results show that all large vision-language models (LVLMs) exhibit greater improvements when augmented with images compared to textual knowledge.
 (2024-10-10T17:55:02Z)
- Aligning Large Language Models with Self-generated Preference Data [72.99676237703099]
 We propose a new framework that boosts the alignment of large language models (LLMs) with human preferences.
Our key idea is leveraging the human prior knowledge within the small (seed) data.
We introduce a noise-aware preference learning algorithm to mitigate the risk of low quality within generated preference data.
 (2024-06-06T18:01:02Z)
- RLAIF-V: Open-Source AI Feedback Leads to Super GPT-4V Trustworthiness [102.06442250444618]
 We introduce RLAIF-V, a novel framework that aligns MLLMs in a fully open-source paradigm.
 RLAIF-V maximally explores open-source MLLMs from two perspectives, including high-quality feedback data generation.
Experiments on six benchmarks in both automatic and human evaluation show that RLAIF-V substantially enhances the trustworthiness of models.
 (2024-05-27T14:37:01Z)
- Enhancing Visual-Language Modality Alignment in Large Vision Language Models via Self-Improvement [102.22911097049953]
 SIMA is a framework that enhances visual and language modality alignment through self-improvement.
It employs an in-context self-critic mechanism to select response pairs for preference tuning.
We demonstrate that SIMA achieves superior modality alignment, outperforming previous approaches.
 (2024-05-24T23:09:27Z)
- Calibrated Self-Rewarding Vision Language Models [27.686545023186852]
 Large Vision-Language Models (LVLMs) have made substantial progress by integrating pre-trained large language models (LLMs) and vision models through instruction tuning.
LVLMs often exhibit the hallucination phenomenon, where generated text responses appear linguistically plausible but contradict the input image.
We propose the Calibrated Self-Rewarding (CSR) approach, which enables the model to self-improve by iteratively generating candidate responses, evaluating the reward for each response, and curating preference data for fine-tuning.
 (2024-05-23T14:30:33Z)
- VALOR-EVAL: Holistic Coverage and Faithfulness Evaluation of Large Vision-Language Models [57.43276586087863]
 Large Vision-Language Models (LVLMs) suffer from hallucination issues, wherein the models generate plausible-sounding but factually incorrect outputs.
Existing benchmarks are often limited in scope, focusing mainly on object hallucinations.
We introduce a multi-dimensional benchmark covering objects, attributes, and relations, with challenging images selected based on associative biases.
 (2024-04-22T04:49:22Z)
- ALLaVA: Harnessing GPT4V-Synthesized Data for Lite Vision-Language Models [45.040292339670096]
 Large vision-language models (LVLMs) have shown promise in a broad range of vision-language tasks with their strong reasoning and generalization capabilities.
This study aims to bridge the performance gap between traditional-scale LVLMs and resource-friendly lite versions by adopting high-quality training data.
 (2024-02-18T19:26:49Z)
- Multi-modal Preference Alignment Remedies Degradation of Visual Instruction Tuning on Language Models [7.056824589733873]
 Multi-modal large language models (MLLMs) are expected to support multi-turn queries of interchanging image and text modalities in production.
Current MLLMs trained with visual-question-answering datasets could suffer from degradation.
We propose a distillation-based multi-modal alignment model with fine-grained annotations on a small dataset that restores and boosts MLLM's language capability after visual instruction tuning.
 (2024-02-16T18:42:08Z)
This list is automatically generated from the titles and abstracts of the papers on this site.