Multi-modal Preference Alignment Remedies Degradation of Visual Instruction Tuning on Language Models
- URL: http://arxiv.org/abs/2402.10884v2
- Date: Tue, 05 Nov 2024 05:13:13 GMT
- Title: Multi-modal Preference Alignment Remedies Degradation of Visual Instruction Tuning on Language Models
- Authors: Shengzhi Li, Rongyu Lin, Shichao Pei,
- Abstract summary: Multi-modal large language models (MLLMs) are expected to support multi-turn queries of interchanging image and text modalities in production.
Current MLLMs trained with visual-question-answering datasets could suffer from degradation.
We propose a distillation-based multi-modal alignment model with fine-grained annotations on a small dataset that restores and boosts MLLM's language capability after visual instruction tuning.
- Score: 7.056824589733873
- License:
- Abstract: Multi-modal large language models (MLLMs) are expected to support multi-turn queries of interchanging image and text modalities in production. However, the current MLLMs trained with visual-question-answering (VQA) datasets could suffer from degradation, as VQA datasets lack the diversity and complexity of the original text instruction datasets with which the underlying language model was trained. To address this degradation, we first collect a lightweight, 5k-sample VQA preference dataset where answers were annotated by Gemini for five quality metrics in a granular fashion and investigate standard Supervised Fine-tuning, rejection sampling, Direct Preference Optimization (DPO) and SteerLM algorithms. Our findings indicate that with DPO, we can surpass the instruction-following capabilities of the language model, achieving a 6.73 score on MT-Bench, compared to Vicuna's 6.57 and LLaVA's 5.99. This enhancement in textual instruction-following capability correlates with boosted visual instruction performance (+4.9\% on MM-Vet, +6\% on LLaVA-Bench), with minimal alignment tax on visual knowledge benchmarks compared to the previous RLHF approach. In conclusion, we propose a distillation-based multi-modal alignment model with fine-grained annotations on a small dataset that restores and boosts MLLM's language capability after visual instruction tuning.
Related papers
- MDCure: A Scalable Pipeline for Multi-Document Instruction-Following [40.201087646516335]
We introduce MDCure, a scalable and effective fine-tuning pipeline to enhance the MD capabilities of LLMs.
MDCure is based on generation of high-quality synthetic MD instruction data from sets of related articles via targeted prompts.
We also introduce MDCureRM, a multi-objective reward model which filters generated data based on their training utility for MD settings.
arXiv Detail & Related papers (2024-10-30T21:08:07Z) - Align$^2$LLaVA: Cascaded Human and Large Language Model Preference Alignment for Multi-modal Instruction Curation [56.75665429851673]
This paper introduces a novel instruction curation algorithm, derived from two unique perspectives, human and LLM preference alignment.
Experiments demonstrate that we can maintain or even improve model performance by compressing synthetic multimodal instructions by up to 90%.
arXiv Detail & Related papers (2024-09-27T08:20:59Z) - DataComp-LM: In search of the next generation of training sets for language models [200.5293181577585]
DataComp for Language Models (DCLM) is a testbed for controlled dataset experiments with the goal of improving language models.
We provide a standardized corpus of 240T tokens extracted from Common Crawl, effective pretraining recipes based on the OpenLM framework, and a broad suite of 53 downstream evaluations.
Participants in the DCLM benchmark can experiment with data curation strategies such as deduplication, filtering, and data mixing at model scales ranging from 412M to 7B parameters.
arXiv Detail & Related papers (2024-06-17T17:42:57Z) - COCO is "ALL'' You Need for Visual Instruction Fine-tuning [39.438410070172125]
Visual instruction fine-tuning (IFT) is a vital process for aligning MLLMs' output with user's intentions.
Recent studies propose to construct visual IFT datasets through a multifaceted approach.
We establish a new IFT dataset, with images sourced from the COCO dataset along with more diverse instructions.
arXiv Detail & Related papers (2024-01-17T04:43:45Z) - Silkie: Preference Distillation for Large Visual Language Models [56.10697821410489]
This paper explores preference distillation for large vision language models (LVLMs)
We first build a vision-language feedback dataset utilizing AI annotation.
We adopt GPT-4V to assess the generated outputs regarding helpfulness, visual faithfulness, and ethical considerations.
The resulting model Silkie, achieves 6.9% and 9.5% relative improvement on the MME benchmark regarding the perception and cognition capabilities.
arXiv Detail & Related papers (2023-12-17T09:44:27Z) - Text as Images: Can Multimodal Large Language Models Follow Printed Instructions in Pixels? [158.96530466189986]
multimodal large language models (MLLMs) have shown promising instruction following capabilities on vision-language tasks.
We investigate how well multimodal models can understand textual instructions provided in pixels, despite not being explicitly trained on such data during pretraining or fine-tuning.
We train v-MLLM, a generalizable model that is capable to conduct robust instruction following in both text-modality and visual-modality instructions.
arXiv Detail & Related papers (2023-11-29T14:08:53Z) - SVIT: Scaling up Visual Instruction Tuning [26.794950789335402]
We build a dataset of 4.2 million visual instruction tuning data including 1.6M conversation question-answer (QA) pairs, 1.6M complex reasoning QA pairs, 1.0M referring QA pairs and 106K detailed image descriptions.
Experiments verify that SVIT-v1.5, trained on the proposed dataset, outperforms state-of-the-art Multimodal Large Language Models on popular benchmarks.
arXiv Detail & Related papers (2023-07-09T03:25:14Z) - REPLUG: Retrieval-Augmented Black-Box Language Models [101.60145719119373]
REPLUG is a retrieval-augmented language modeling framework that treats the language model (LM) as a black box and augments it with a tuneable retrieval model.
We show that REPLUG significantly improves the performance of GPT-3 (175B) on language modeling by 6.3%, as well as the performance of Codex on five-shot MMLU by 5.1%.
arXiv Detail & Related papers (2023-01-30T04:18:09Z) - Enabling Multimodal Generation on CLIP via Vision-Language Knowledge
Distillation [79.72299298976525]
We propose to augment a vision-language pre-training model with a textual pre-trained language model (PLM) via vision-language knowledge distillation (VLKD)
Experiments show that the resulting model has strong zero-shot performance on multimodal generation tasks, such as open-ended visual question answering and image captioning.
The original textual language understanding and generation ability of the PLM is maintained after VLKD, which makes our model versatile for both multimodal and unimodal tasks.
arXiv Detail & Related papers (2022-03-12T09:33:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.