Multi-modal preference alignment remedies regression of visual
instruction tuning on language model
- URL: http://arxiv.org/abs/2402.10884v1
- Date: Fri, 16 Feb 2024 18:42:08 GMT
- Title: Multi-modal preference alignment remedies regression of visual
instruction tuning on language model
- Authors: Shengzhi Li, Rongyu Lin, Shichao Pei
- Abstract summary: We propose a distillation-based multi-modal alignment model with fine-grained annotations on a small dataset to restore language capability after visual instruction tuning.
Our findings indicate that with DPO we are able to surpass the instruction-following capabilities of the language model, achieving a 6.73 score on MT-Bench, compared to Vicuna's 6.57 and LLaVA's 5.99, despite the small data scale.
- Score: 7.9311636400991485
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: In production, multi-modal large language models (MLLMs) are expected to
support multi-turn queries of interchanging image and text modalities. However,
the current MLLMs trained with visual-question-answering (VQA) datasets could
suffer from degradation, as VQA datasets lack the diversity and complexity of
the original text instruction datasets which the underlying language model had
been trained with. To address this challenging degradation, we first collect a
lightweight (6k entries) VQA preference dataset where answers were annotated by
Gemini for 5 quality metrics in a granular fashion, and investigate standard
Supervised Fine-tuning, rejection sampling, Direct Preference Optimization
(DPO), and SteerLM. Our findings indicate that with DPO we are able to
surpass the instruction-following capabilities of the language model, achieving
a 6.73 score on MT-Bench, compared to Vicuna's 6.57 and LLaVA's 5.99, despite
the small data scale. This enhancement in textual instruction proficiency
correlates with boosted visual instruction performance (+4.9% on MM-Vet, +6%
on LLaVA-Bench), with minimal alignment tax on visual knowledge benchmarks
compared to previous RLHF approaches. In conclusion, we propose a
distillation-based multi-modal alignment model with fine-grained annotations on
a small dataset that reconciles the textual and visual performance of MLLMs,
restoring and boosting language capability after visual instruction tuning.
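The DPO objective the abstract refers to can be sketched in a few lines. This is an illustrative, pure-Python per-pair version of the standard DPO loss (not the authors' implementation); the argument names and `beta` value are assumptions for the sketch, and in practice each log-probability would be the summed token log-probability of a full response under the policy or the frozen reference model.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair Direct Preference Optimization (DPO) loss.

    Inputs are log-probabilities of the preferred (chosen) and
    dispreferred (rejected) responses under the trained policy and the
    frozen reference model; beta controls how far the policy may drift
    from the reference.
    """
    # Implicit reward margin: policy log-ratio minus reference log-ratio.
    logits = (policy_chosen_logp - policy_rejected_logp) \
           - (ref_chosen_logp - ref_rejected_logp)
    # loss = -log(sigmoid(beta * logits))
    return -math.log(1.0 / (1.0 + math.exp(-beta * logits)))

# When policy and reference agree exactly, the margin is zero and the
# loss is log(2); it decreases as the policy favors the chosen response.
print(dpo_loss(-10.0, -12.0, -11.0, -11.0))
```

Minimizing this pushes the policy to assign relatively higher likelihood to the annotator-preferred answer while the reference log-ratio anchors it, which is why a small 6k-pair dataset can shift instruction-following behavior without a separate reward model.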
Related papers
- DataComp-LM: In search of the next generation of training sets for language models [200.5293181577585]
DataComp for Language Models (DCLM) is a testbed for controlled dataset experiments with the goal of improving language models.
We provide a standardized corpus of 240T tokens extracted from Common Crawl, effective pretraining recipes based on the OpenLM framework, and a broad suite of 53 downstream evaluations.
Participants in the DCLM benchmark can experiment with data curation strategies such as deduplication, filtering, and data mixing at model scales ranging from 412M to 7B parameters.
arXiv Detail & Related papers (2024-06-17T17:42:57Z) - TextSquare: Scaling up Text-Centric Visual Instruction Tuning [64.55339431760727]
We introduce a new approach for creating a massive, high-quality instruction-tuning dataset, Square-10M.
Our model, TextSquare, considerably surpasses open-source previous state-of-the-art Text-centric MLLMs.
It even outperforms top-tier models like GPT4V and Gemini in 6 of 10 text-centric benchmarks.
arXiv Detail & Related papers (2024-04-19T11:38:08Z) - Less is More: Data Value Estimation for Visual Instruction Tuning [127.38740043393527]
We propose a new data selection approach to eliminate redundancy within visual instruction data.
Experiments on LLaVA-1.5 show that our approach using only about 7.5% data can achieve comparable performance as the full-data fine-tuned model.
arXiv Detail & Related papers (2024-03-14T16:47:25Z) - COCO is "ALL" You Need for Visual Instruction Fine-tuning [39.438410070172125]
Visual instruction fine-tuning (IFT) is a vital process for aligning MLLMs' output with users' intentions.
Recent studies propose to construct visual IFT datasets through a multifaceted approach.
We establish a new IFT dataset, with images sourced from the COCO dataset along with more diverse instructions.
arXiv Detail & Related papers (2024-01-17T04:43:45Z) - Salute the Classic: Revisiting Challenges of Machine Translation in the
Age of Large Language Models [91.6543868677356]
The evolution of Neural Machine Translation has been influenced by six core challenges.
These challenges include domain mismatch, amount of parallel data, rare word prediction, translation of long sentences, attention model as word alignment, and sub-optimal beam search.
This study revisits these challenges, offering insights into their ongoing relevance in the context of advanced Large Language Models.
arXiv Detail & Related papers (2024-01-16T13:30:09Z) - Silkie: Preference Distillation for Large Visual Language Models [56.10697821410489]
This paper explores preference distillation for large vision language models (LVLMs)
We first build a vision-language feedback dataset utilizing AI annotation.
We adopt GPT-4V to assess the generated outputs regarding helpfulness, visual faithfulness, and ethical considerations.
The resulting model, Silkie, achieves 6.9% and 9.5% relative improvements on the MME benchmark for perception and cognition capabilities, respectively.
arXiv Detail & Related papers (2023-12-17T09:44:27Z) - Text as Images: Can Multimodal Large Language Models Follow Printed Instructions in Pixels? [158.96530466189986]
Multimodal large language models (MLLMs) have shown promising instruction-following capabilities on vision-language tasks.
We investigate how well multimodal models can understand textual instructions provided in pixels, despite not being explicitly trained on such data during pretraining or fine-tuning.
We train v-MLLM, a generalizable model that is capable of conducting robust instruction following with both text-modality and visual-modality instructions.
arXiv Detail & Related papers (2023-11-29T14:08:53Z) - SVIT: Scaling up Visual Instruction Tuning [26.794950789335402]
We build a dataset of 4.2 million visual instruction tuning data including 1.6M conversation question-answer (QA) pairs, 1.6M complex reasoning QA pairs, 1.0M referring QA pairs and 106K detailed image descriptions.
Experiments verify that SVIT-v1.5, trained on the proposed dataset, outperforms state-of-the-art Multimodal Large Language Models on popular benchmarks.
arXiv Detail & Related papers (2023-07-09T03:25:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.