Related papers: Finetuned Multimodal Language Models Are High-Quality Image-Text Data Filters

URL: http://arxiv.org/abs/2403.02677v1
Date: Tue, 5 Mar 2024 06:05:15 GMT
Title: Finetuned Multimodal Language Models Are High-Quality Image-Text Data Filters
Authors: Weizhi Wang, Khalil Mrini, Linjie Yang, Sateesh Kumar, Yu Tian, Xifeng Yan, Heng Wang
Abstract summary: We propose a novel framework for filtering image-text data by leveraging fine-tuned Multimodal Language Models (MLMs). Our filter can generalize to different models and tasks, and be used as a drop-in replacement for CLIPScore.
Score: 38.41887207958015
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: We propose a novel framework for filtering image-text data by leveraging fine-tuned Multimodal Language Models (MLMs). Our approach outperforms predominant filtering methods (e.g., CLIPScore) via integrating the recent advances in MLMs. We design four distinct yet complementary metrics to holistically measure the quality of image-text data. A new pipeline is established to construct high-quality instruction data for fine-tuning MLMs as data filters. Comparing with CLIPScore, our MLM filters produce more precise and comprehensive scores that directly improve the quality of filtered data and boost the performance of pre-trained models. We achieve significant improvements over CLIPScore on popular foundation models (i.e., CLIP and BLIP2) and various downstream tasks. Our MLM filter can generalize to different models and tasks, and be used as a drop-in replacement for CLIPScore. An additional ablation study is provided to verify our design choices for the MLM filter.
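
To make the "drop-in replacement for CLIPScore" idea concrete, here is a minimal sketch of score-based filtering. It is not the authors' implementation: the abstract does not state how scores are turned into a keep/drop decision, so keeping the top fraction of pairs is an assumption, and `filter_by_score`, `score_fn`, `keep_fraction`, and the toy scores are hypothetical names and values used only for illustration. Any scorer that maps an image-text pair to a number (CLIPScore, or an MLM filter's quality score) could be passed as `score_fn`.

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical representation of one sample: (image identifier, caption).
Pair = Tuple[str, str]


def filter_by_score(
    pairs: Iterable[Pair],
    score_fn: Callable[[Pair], float],
    keep_fraction: float,
) -> List[Pair]:
    """Keep the top `keep_fraction` of pairs ranked by `score_fn` (higher is better)."""
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    # Attach the original index so ties are broken deterministically.
    scored = [(score_fn(p), i, p) for i, p in enumerate(pairs)]
    scored.sort(key=lambda t: (-t[0], t[1]))
    # Keep at least one pair when the input is non-empty.
    n_keep = max(1, int(len(scored) * keep_fraction)) if scored else 0
    return [p for _, _, p in scored[:n_keep]]


# Toy scores standing in for a real scorer (assumed values, not real results).
toy_scores = {"img0": 0.9, "img1": 0.2, "img2": 0.6, "img3": 0.4}
pairs = [(name, f"caption for {name}") for name in toy_scores]
kept = filter_by_score(pairs, lambda p: toy_scores[p[0]], keep_fraction=0.5)
print([name for name, _ in kept])
```

Because the scorer is passed in as a function, swapping one filter for another changes only the `score_fn` argument and leaves the rest of the pipeline untouched.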

Related papers

MLLM-Selector: Necessity and Diversity-driven High-Value Data Selection for Enhanced Visual Instruction Tuning [69.7347209018861]
We introduce MLLM-Selector, an automated approach that identifies valuable data for visual instruction tuning. We calculate necessity scores for each sample in the VIT data pool to identify samples pivotal for enhancing model performance. Our findings underscore the importance of mixing necessity and diversity in data choice, leading to the creation of MLLM-Selector.
arXiv Detail & Related papers (2025-03-26T12:42:37Z)
Enhancing Multilingual LLM Pretraining with Model-Based Data Selection [33.68104398807581]
We propose a model-based filtering framework for multilingual datasets. Our approach emphasizes transparency, simplicity, and efficiency. We extend our framework to 20 languages for which we release the refined pretraining datasets.
arXiv Detail & Related papers (2025-02-14T18:42:07Z)
CHiP: Cross-modal Hierarchical Direct Preference Optimization for Multimodal LLMs [107.21334626890713]
Multimodal Large Language Models (MLLMs) still struggle with hallucinations despite their impressive capabilities. We propose a Cross-modal Hierarchical Direct Preference Optimization (CHiP) to address these limitations. We evaluate CHiP through both quantitative and qualitative analyses, with results across multiple benchmarks demonstrating its effectiveness in reducing hallucinations.
arXiv Detail & Related papers (2025-01-28T02:05:38Z)
FinerWeb-10BT: Refining Web Data with LLM-Based Line-Level Filtering [2.0140381995251713]
This paper introduces an LLM-based line-level filtering method to enhance training data quality. We use GPT-4o mini to label a 20,000-document sample from FineWeb at the line level, allowing the model to create descriptive labels for low-quality lines. To test the impact of our filtering, we train GPT-2 models on both the original and the filtered datasets.
arXiv Detail & Related papers (2025-01-13T13:26:50Z)
Molar: Multimodal LLMs with Collaborative Filtering Alignment for Enhanced Sequential Recommendation [4.518104756199573]
Molar is a sequential recommendation framework that integrates multiple content modalities with ID information to capture collaborative signals effectively. By seamlessly combining multimodal content with collaborative filtering insights, Molar captures both user interests and contextual semantics, leading to superior recommendation accuracy.
arXiv Detail & Related papers (2024-12-24T05:23:13Z)
Star-Agents: Automatic Data Optimization with LLM Agents for Instruction Tuning [71.2981957820888]
We propose a novel Star-Agents framework, which automates the enhancement of data quality across datasets. The framework initially generates diverse instruction data with multiple LLM agents through a bespoke sampling method. The generated data undergo a rigorous evaluation using a dual-model method that assesses both difficulty and quality.
arXiv Detail & Related papers (2024-11-21T02:30:53Z)
MDCure: A Scalable Pipeline for Multi-Document Instruction-Following [40.201087646516335]
We introduce MDCure, a scalable and effective fine-tuning pipeline to enhance the MD capabilities of LLMs. MDCure is based on generation of high-quality synthetic MD instruction data from sets of related articles via targeted prompts. We also introduce MDCureRM, a multi-objective reward model which filters generated data based on their training utility for MD settings.
arXiv Detail & Related papers (2024-10-30T21:08:07Z)
World to Code: Multi-modal Data Generation via Self-Instructed Compositional Captioning and Filtering [16.03491048830499]
We present World to Code (W2C), a meticulously curated multi-modal data construction pipeline. The pipeline organizes the final generation output into a Python code format. Experiments have demonstrated the high quality of W2C by improving various existing visual question answering and visual grounding benchmarks.
arXiv Detail & Related papers (2024-09-30T15:49:54Z)
Align$^2$LLaVA: Cascaded Human and Large Language Model Preference Alignment for Multi-modal Instruction Curation [56.75665429851673]
This paper introduces a novel instruction curation algorithm, derived from two unique perspectives, human and LLM preference alignment. Experiments demonstrate that we can maintain or even improve model performance by compressing synthetic multimodal instructions by up to 90%.
arXiv Detail & Related papers (2024-09-27T08:20:59Z)
UniMEL: A Unified Framework for Multimodal Entity Linking with Large Language Models [0.42832989850721054]
Multimodal Entity Linking (MEL) is a crucial task that aims at linking ambiguous mentions within multimodal contexts to referent entities in a multimodal knowledge base, such as Wikipedia. Existing methods overcomplicate the MEL task and overlook the visual semantic information, which makes them costly and hard to scale. We propose UniMEL, a unified framework which establishes a new paradigm to process multimodal entity linking tasks using Large Language Models.
arXiv Detail & Related papers (2024-07-23T03:58:08Z)
Understanding Alignment in Multimodal LLMs: A Comprehensive Study [46.33812471516309]
We analyze each aspect of preference alignment in Multimodal Large Language Models (MLLMs). We show that combining offline and online methods can improve the performance of the model in certain scenarios. We introduce a novel way of creating multimodal preference data called Bias-Driven Hallucination Sampling (BDHS).
arXiv Detail & Related papers (2024-07-02T17:55:03Z)
Your Vision-Language Model Itself Is a Strong Filter: Towards High-Quality Instruction Tuning with Data Selection [59.11430077029321]
We introduce a novel dataset selection method, Self-Filter, for vision-language models (VLMs). In the first stage, we devise a scoring network to evaluate the difficulty of training instructions, which is co-trained with the VLM. In the second stage, we use the trained score net to measure the difficulty of each instruction, select the most challenging samples, and penalize similar samples to encourage diversity.
arXiv Detail & Related papers (2024-02-19T20:08:48Z)
MLLM-DataEngine: An Iterative Refinement Approach for MLLM [62.30753425449056]
We propose a novel closed-loop system that bridges data generation, model training, and evaluation. Within each loop, the MLLM-DataEngine first analyzes the weakness of the model based on the evaluation results. For targeting, we propose an Adaptive Bad-case Sampling module, which adjusts the ratio of different types of data. For quality, we resort to GPT-4 to generate high-quality data with each given data type.
arXiv Detail & Related papers (2023-08-25T01:41:04Z)
Filter Pruning for Efficient CNNs via Knowledge-driven Differential Filter Sampler [103.97487121678276]
Filter pruning simultaneously accelerates the computation and reduces the memory overhead of CNNs. We propose a novel Knowledge-driven Differential Filter Sampler (KDFS) with Masked Filter Modeling (MFM) framework for filter pruning.
arXiv Detail & Related papers (2023-07-01T02:28:41Z)

This list is automatically generated from the titles and abstracts of the papers on this site.