Finetuned Multimodal Language Models Are High-Quality Image-Text Data Filters
- URL: http://arxiv.org/abs/2403.02677v1
- Date: Tue, 5 Mar 2024 06:05:15 GMT
- Title: Finetuned Multimodal Language Models Are High-Quality Image-Text Data Filters
- Authors: Weizhi Wang, Khalil Mrini, Linjie Yang, Sateesh Kumar, Yu Tian, Xifeng Yan, Heng Wang
- Abstract summary: We propose a novel framework for filtering image-text data by leveraging fine-tuned Multimodal Language Models (MLMs).
Our filter can generalize to different models and tasks, and be used as a drop-in replacement for CLIPScore.
- Score: 38.41887207958015
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We propose a novel framework for filtering image-text data by leveraging fine-tuned Multimodal Language Models (MLMs). Our approach outperforms predominant filtering methods (e.g., CLIPScore) by integrating recent advances in MLMs. We design four distinct yet complementary metrics to holistically measure the quality of image-text data. A new pipeline is established to construct high-quality instruction data for fine-tuning MLMs as data filters. Compared with CLIPScore, our MLM filters produce more precise and comprehensive scores that directly improve the quality of filtered data and boost the performance of pre-trained models. We achieve significant improvements over CLIPScore on popular foundation models (i.e., CLIP and BLIP2) and various downstream tasks. Our MLM filter can generalize to different models and tasks, and can be used as a drop-in replacement for CLIPScore. An additional ablation study verifies our design choices for the MLM filter.
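The drop-in-replacement idea described in the abstract can be illustrated with a minimal sketch: any scorer mapping an (image, caption) pair to a quality value can take CLIPScore's place in a threshold filter. The function names `filter_pairs` and `toy_score` and the threshold value below are illustrative assumptions, not the paper's actual API.

```python
# A minimal sketch of score-based image-text filtering. Any quality
# scorer (CLIPScore, a fine-tuned MLM filter, etc.) can be plugged in
# as `score_fn`; only the threshold-based selection logic is shown.

def filter_pairs(pairs, score_fn, threshold):
    """Keep only image-text pairs whose quality score meets the threshold."""
    return [(img, txt) for img, txt in pairs if score_fn(img, txt) >= threshold]

# A toy scorer standing in for a fine-tuned MLM filter: it simply
# rewards longer captions, purely for demonstration.
def toy_score(image, text):
    return min(len(text.split()) / 10.0, 1.0)

pairs = [
    ("img_a", "a photo"),
    ("img_b", "a detailed photo of a red bicycle leaning on a wall"),
]
kept = filter_pairs(pairs, toy_score, threshold=0.5)
print(kept)
```

Because the selection step only depends on the scalar score, swapping CLIPScore for an MLM-produced score leaves the surrounding data pipeline unchanged.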
Related papers
- Star-Agents: Automatic Data Optimization with LLM Agents for Instruction Tuning [71.2981957820888]
We propose a novel Star-Agents framework, which automates the enhancement of data quality across datasets.
The framework initially generates diverse instruction data with multiple LLM agents through a bespoke sampling method.
The generated data undergo a rigorous evaluation using a dual-model method that assesses both difficulty and quality.
arXiv Detail & Related papers (2024-11-21T02:30:53Z)
- MDCure: A Scalable Pipeline for Multi-Document Instruction-Following [40.201087646516335]
We introduce MDCure, a scalable and effective fine-tuning pipeline to enhance the MD capabilities of LLMs.
MDCure is based on generation of high-quality synthetic MD instruction data from sets of related articles via targeted prompts.
We also introduce MDCureRM, a multi-objective reward model which filters generated data based on their training utility for MD settings.
arXiv Detail & Related papers (2024-10-30T21:08:07Z)
- World to Code: Multi-modal Data Generation via Self-Instructed Compositional Captioning and Filtering [16.03491048830499]
We present World to Code (W2C), a meticulously curated multi-modal data construction pipeline.
The pipeline organizes the final generation output into a Python code format.
Experiments have demonstrated the high quality of W2C by improving various existing visual question answering and visual grounding benchmarks.
arXiv Detail & Related papers (2024-09-30T15:49:54Z)
- Align$^2$LLaVA: Cascaded Human and Large Language Model Preference Alignment for Multi-modal Instruction Curation [56.75665429851673]
This paper introduces a novel instruction curation algorithm, derived from two unique perspectives, human and LLM preference alignment.
Experiments demonstrate that we can maintain or even improve model performance by compressing synthetic multimodal instructions by up to 90%.
arXiv Detail & Related papers (2024-09-27T08:20:59Z)
- UniMEL: A Unified Framework for Multimodal Entity Linking with Large Language Models [0.42832989850721054]
Multimodal Entity Linking (MEL) is a crucial task that aims to link ambiguous mentions in multimodal contexts to referent entities in a multimodal knowledge base, such as Wikipedia.
Existing methods overcomplicate the MEL task and overlook the visual semantic information, which makes them costly and hard to scale.
We propose UniMEL, a unified framework which establishes a new paradigm to process multimodal entity linking tasks using Large Language Models.
arXiv Detail & Related papers (2024-07-23T03:58:08Z)
- Understanding Alignment in Multimodal LLMs: A Comprehensive Study [46.33812471516309]
We analyze each aspect of preference alignment in Multimodal Large Language Models (MLLMs).
We show that combining offline and online methods can improve model performance in certain scenarios.
We introduce a novel way of creating multimodal preference data called Bias-Driven Hallucination Sampling (BDHS).
arXiv Detail & Related papers (2024-07-02T17:55:03Z)
- Your Vision-Language Model Itself Is a Strong Filter: Towards High-Quality Instruction Tuning with Data Selection [59.11430077029321]
We introduce a novel dataset selection method, Self-Filter, for vision-language models (VLMs).
In the first stage, we devise a scoring network to evaluate the difficulty of training instructions, which is co-trained with the VLM.
In the second stage, we use the trained score net to measure the difficulty of each instruction, select the most challenging samples, and penalize similar samples to encourage diversity.
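The two-stage selection described above can be sketched as a greedy procedure: rank samples by a difficulty score, then repeatedly pick the hardest remaining sample while down-weighting anything similar to what has already been chosen. The scores, similarity measure, and penalty weight below are all illustrative assumptions, not Self-Filter's actual implementation.

```python
# Hedged sketch of difficulty-based selection with a diversity penalty:
# each candidate's effective score is its difficulty minus a penalty
# proportional to its maximum similarity to already-selected samples.

def select_diverse(samples, difficulty, similarity, k, penalty=0.5):
    """Greedily select up to k samples with the highest effective difficulty."""
    chosen = []
    remaining = list(samples)
    while remaining and len(chosen) < k:
        def effective(s):
            sim = max((similarity(s, c) for c in chosen), default=0.0)
            return difficulty[s] - penalty * sim
        best = max(remaining, key=effective)
        chosen.append(best)
        remaining.remove(best)
    return chosen

# Toy example: per-sample difficulty scores, and a similarity that treats
# samples sharing a prefix letter as near-duplicates.
diff = {"a1": 0.9, "a2": 0.85, "b1": 0.6}
sim = lambda x, y: 1.0 if x[0] == y[0] else 0.0
selected = select_diverse(diff, diff, sim, k=2)
print(selected)
```

With these toy values, "a2" is the second-hardest sample but is penalized for resembling the already-chosen "a1", so the more diverse "b1" is selected instead.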
arXiv Detail & Related papers (2024-02-19T20:08:48Z)
- MLLM-DataEngine: An Iterative Refinement Approach for MLLM [62.30753425449056]
We propose a novel closed-loop system that bridges data generation, model training, and evaluation.
Within each loop, the MLLM-DataEngine first analyzes the weaknesses of the model based on the evaluation results.
For targeting, we propose an Adaptive Bad-case Sampling module, which adjusts the ratio of different types of data.
For quality, we resort to GPT-4 to generate high-quality data with each given data type.
arXiv Detail & Related papers (2023-08-25T01:41:04Z)
- Filter Pruning for Efficient CNNs via Knowledge-driven Differential Filter Sampler [103.97487121678276]
Filter pruning simultaneously accelerates the computation and reduces the memory overhead of CNNs.
We propose a novel Knowledge-driven Differential Filter Sampler (KDFS) with a Masked Filter Modeling (MFM) framework for filter pruning.
arXiv Detail & Related papers (2023-07-01T02:28:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.