Finetuned Multimodal Language Models Are High-Quality Image-Text Data Filters
- URL: http://arxiv.org/abs/2403.02677v1
- Date: Tue, 5 Mar 2024 06:05:15 GMT
- Title: Finetuned Multimodal Language Models Are High-Quality Image-Text Data Filters
- Authors: Weizhi Wang, Khalil Mrini, Linjie Yang, Sateesh Kumar, Yu Tian, Xifeng Yan, Heng Wang
- Abstract summary: We propose a novel framework for filtering image-text data by leveraging fine-tuned Multimodal Language Models (MLMs).
Our filter can generalize to different models and tasks, and be used as a drop-in replacement for CLIPScore.
- Score: 38.41887207958015
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We propose a novel framework for filtering image-text data by leveraging fine-tuned Multimodal Language Models (MLMs). Our approach outperforms predominant filtering methods (e.g., CLIPScore) by integrating recent advances in MLMs. We design four distinct yet complementary metrics to holistically measure the quality of image-text data, and establish a new pipeline for constructing high-quality instruction data to fine-tune MLMs as data filters. Compared with CLIPScore, our MLM filters produce more precise and comprehensive scores that directly improve the quality of the filtered data and boost the performance of pre-trained models. We achieve significant improvements over CLIPScore on popular foundation models (i.e., CLIP and BLIP2) and various downstream tasks. Our MLM filter can generalize to different models and tasks and serve as a drop-in replacement for CLIPScore. An additional ablation study verifies our design choices for the MLM filter.
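The abstract does not include code, but the idea lends itself to a short sketch. Below is a minimal, hypothetical illustration (not the authors' released implementation): an MLM-backed scorer rates each image-text pair on four quality metrics, and a threshold on one metric replaces a CLIPScore cutoff. The metric names and the `query_mlm` stub are assumptions, since the abstract does not name the four metrics.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical metric names; the paper defines four complementary
# quality metrics, but their exact names are not given in the abstract.
METRICS = ("image_text_matching", "object_detail", "text_quality", "semantics")

@dataclass
class Pair:
    image_path: str
    caption: str

def score_pair(pair: Pair, query_mlm: Callable[[str, str, str], float]) -> Dict[str, float]:
    """Ask the fine-tuned MLM for a score per metric.

    `query_mlm(image_path, caption, metric)` is a stand-in for a real
    call to a fine-tuned multimodal model; any backend works here.
    """
    return {m: query_mlm(pair.image_path, pair.caption, m) for m in METRICS}

def filter_pairs(pairs, query_mlm, threshold=85.0, metric="image_text_matching"):
    """Drop-in replacement for a CLIPScore cutoff: keep pairs whose
    MLM score on the chosen metric clears `threshold` (assumed scale 0-100)."""
    kept = []
    for p in pairs:
        scores = score_pair(p, query_mlm)
        if scores[metric] >= threshold:
            kept.append((p, scores))
    return kept

if __name__ == "__main__":
    # Toy scorer so the sketch runs end to end; a real deployment would
    # prompt the fine-tuned MLM and parse a numeric score from its output.
    fake_mlm = lambda img, cap, metric: 90.0 if len(cap.split()) > 3 else 40.0
    pairs = [Pair("a.jpg", "a dog"), Pair("b.jpg", "a brown dog chasing a ball")]
    for p, s in filter_pairs(pairs, fake_mlm):
        print(p.image_path, s)
```

Because the filter exposes the same keep/drop interface as a CLIPScore threshold, it can slot into an existing data pipeline without changing anything downstream, which is the "drop-in replacement" property the abstract claims.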
Related papers
- Enhancing Multilingual LLM Pretraining with Model-Based Data Selection [33.68104398807581]
We propose a model-based filtering framework for multilingual datasets.
Our approach emphasizes transparency, simplicity, and efficiency.
We extend our framework to 20 languages for which we release the refined pretraining datasets.
arXiv Detail & Related papers (2025-02-14T18:42:07Z)
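The summary does not describe this paper's quality model, so the following is only a loose sketch of model-based selection in general: score documents with a stand-in quality model and keep the top fraction per language. The scoring stub and the default fraction are assumptions.

```python
from collections import defaultdict

def select_top_fraction(docs, quality_score, fraction=0.1):
    """Generic model-based selection: score every (language, text)
    document with a stub quality model and keep the top `fraction`
    within each language, so no language is crowded out by another."""
    by_lang = defaultdict(list)
    for lang, text in docs:
        by_lang[lang].append((quality_score(text), text))
    kept = {}
    for lang, scored in by_lang.items():
        scored.sort(reverse=True)
        k = max(1, int(len(scored) * fraction))
        kept[lang] = [text for _, text in scored[:k]]
    return kept

if __name__ == "__main__":
    docs = [("de", "kurz"), ("de", "ein deutlich laengerer Beispieltext"),
            ("fi", "lyhyt"), ("fi", "pidempi suomenkielinen esimerkkiteksti")]
    stub_score = lambda t: len(t)  # placeholder for a trained quality model
    print(select_top_fraction(docs, stub_score, fraction=0.5))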
- CHiP: Cross-modal Hierarchical Direct Preference Optimization for Multimodal LLMs [107.21334626890713]
Multimodal Large Language Models (MLLMs) still struggle with hallucinations despite their impressive capabilities.
We propose a Cross-modal Hierarchical Direct Preference Optimization (CHiP) to address these limitations.
We evaluate CHiP through both quantitative and qualitative analyses, with results across multiple benchmarks demonstrating its effectiveness in reducing hallucinations.
arXiv Detail & Related papers (2025-01-28T02:05:38Z)
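The summary does not spell out CHiP's loss, but it builds on direct preference optimization. Here is a minimal sketch of the vanilla DPO objective it extends; the hierarchical and cross-modal (visual) preference terms are omitted because they are not described above.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Vanilla DPO on scalar sequence log-probs:
    -log sigmoid(beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))),
    where w is the preferred response and l the dispreferred one.
    CHiP adds hierarchical and visual preference terms on top of this."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(x)) written stably as log(1 + e^{-x})
    return math.log1p(math.exp(-margin))

if __name__ == "__main__":
    # Preferred response is more likely under the policy than the reference:
    print(dpo_loss(logp_w=-10.0, logp_l=-12.0, ref_logp_w=-11.0, ref_logp_l=-11.0))
```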
- FinerWeb-10BT: Refining Web Data with LLM-Based Line-Level Filtering [2.0140381995251713]
This paper introduces an LLM-based line-level filtering method to enhance training data quality.
We use GPT-4o mini to label a 20,000-document sample from FineWeb at the line level, allowing the model to create descriptive labels for low-quality lines.
To test the impact of our filtering, we train GPT-2 models on both the original and the filtered datasets.
arXiv Detail & Related papers (2025-01-13T13:26:50Z)
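A rough sketch of the line-level idea: a labeling function (standing in for the GPT-4o mini calls described above) tags each line of a document, and anything not tagged clean is dropped before training. The label vocabulary and the "clean" convention are assumptions.

```python
def filter_lines(document, label_line):
    """Line-level filtering: `label_line(line)` is a stand-in for an LLM
    call that returns a descriptive label for the line; lines labeled
    anything other than 'clean' are dropped from the training corpus."""
    kept, dropped = [], []
    for line in document.splitlines():
        label = label_line(line)
        (kept if label == "clean" else dropped).append((label, line))
    return "\n".join(line for _, line in kept), dropped

if __name__ == "__main__":
    doc = ("A useful paragraph about data quality.\n"
           "Click here to subscribe!\n"
           "Another informative sentence.")
    stub = lambda ln: "boilerplate" if "Click here" in ln else "clean"
    cleaned, removed = filter_lines(doc, stub)
    print(cleaned)
    print(removed)  # descriptive labels explain *why* a line was cut
```

Keeping the descriptive labels on dropped lines, as the paper does, makes the filter auditable: one can inspect which categories of text are being removed before retraining.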
- Molar: Multimodal LLMs with Collaborative Filtering Alignment for Enhanced Sequential Recommendation [4.518104756199573]
Molar is a sequential recommendation framework that integrates multiple content modalities with ID information to capture collaborative signals effectively.
By seamlessly combining multimodal content with collaborative filtering insights, Molar captures both user interests and contextual semantics, leading to superior recommendation accuracy.
arXiv Detail & Related papers (2024-12-24T05:23:13Z)
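The summary gives only the high-level recipe (content embeddings fused with ID embeddings, then used for sequential scoring), so the toy sketch below shows just that shape. The linear fusion and mean-pooled history scorer are illustrative assumptions, not Molar's actual architecture.

```python
def fuse_item_embedding(content_vec, id_vec, alpha=0.5):
    """Toy fusion of a multimodal content embedding with a collaborative
    ID embedding; Molar's real alignment mechanism is more involved."""
    assert len(content_vec) == len(id_vec)
    return [alpha * c + (1 - alpha) * i for c, i in zip(content_vec, id_vec)]

def score_next_item(user_history, candidate):
    """Dot-product relevance of a candidate item against the mean of the
    user's fused history embeddings (a minimal sequential-rec scorer)."""
    dim = len(candidate)
    mean = [sum(v[d] for v in user_history) / len(user_history) for d in range(dim)]
    return sum(m * c for m, c in zip(mean, candidate))

if __name__ == "__main__":
    hist = [fuse_item_embedding([1.0, 0.0], [0.5, 0.5]),
            fuse_item_embedding([0.8, 0.2], [0.4, 0.6])]
    print(score_next_item(hist, candidate=[0.9, 0.1]))
```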
- Star-Agents: Automatic Data Optimization with LLM Agents for Instruction Tuning [71.2981957820888]
We propose a novel Star-Agents framework, which automates the enhancement of data quality across datasets.
The framework initially generates diverse instruction data with multiple LLM agents through a bespoke sampling method.
The generated data undergo a rigorous evaluation using a dual-model method that assesses both difficulty and quality.
arXiv Detail & Related papers (2024-11-21T02:30:53Z)
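A minimal sketch of the generate-then-screen loop described above: several stub agents propose instructions, and a dual check on difficulty and quality gates what survives. The thresholds and scoring stubs are assumptions; the paper uses LLM agents and a dual-model evaluator for these roles.

```python
import random

def generate_and_screen(agents, evaluate_difficulty, evaluate_quality,
                        n_samples=4, d_min=0.5, q_min=0.7):
    """Sample instructions from several (stub) agents, then keep only
    those that pass both a difficulty check and a quality check."""
    pool = [random.choice(agents)() for _ in range(n_samples)]
    return [s for s in pool
            if evaluate_difficulty(s) >= d_min and evaluate_quality(s) >= q_min]

if __name__ == "__main__":
    agents = [lambda: "Explain why sorting is O(n log n) in the comparison model.",
              lambda: "Say hi."]
    difficulty = lambda s: min(1.0, len(s) / 60)   # stub for one evaluator model
    quality = lambda s: 0.9 if s.endswith(".") else 0.3  # stub for the other
    print(generate_and_screen(agents, difficulty, quality))
```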
- Align$^2$LLaVA: Cascaded Human and Large Language Model Preference Alignment for Multi-modal Instruction Curation [56.75665429851673]
This paper introduces a novel instruction curation algorithm derived from two complementary perspectives: human and LLM preference alignment.
Experiments demonstrate that we can maintain or even improve model performance by compressing synthetic multimodal instructions by up to 90%.
arXiv Detail & Related papers (2024-09-27T08:20:59Z)
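The summary only names the two perspectives, so this sketch arranges them as a simple cascade: rank by a stand-in human-preference reward, keep a small slice (mirroring the roughly 90% compression reported above), then let a stand-in LLM judge veto the remainder. The ordering and interfaces are assumptions.

```python
def cascade_curate(instructions, human_reward, llm_judge, keep_ratio=0.1):
    """Two-stage cascade: rank by a (stub) human-preference reward model,
    keep the top slice, then apply a (stub) LLM judge as a second gate."""
    ranked = sorted(instructions, key=human_reward, reverse=True)
    shortlist = ranked[: max(1, int(len(ranked) * keep_ratio))]
    return [x for x in shortlist if llm_judge(x)]

if __name__ == "__main__":
    data = [f"instruction-{i}" for i in range(20)]
    reward = lambda x: int(x.split("-")[1])          # stub reward model
    judge = lambda x: int(x.split("-")[1]) % 2 == 0  # stub LLM judge
    print(cascade_curate(data, reward, judge))
```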
- Understanding Alignment in Multimodal LLMs: A Comprehensive Study [46.33812471516309]
We analyze each aspect of preference alignment in Multimodal Large Language Models (MLLMs).
We show that combining offline and online methods can improve the performance of the model in certain scenarios.
We introduce a novel way of creating multimodal preference data called Bias-Driven Hallucination Sampling (BDHS).
arXiv Detail & Related papers (2024-07-02T17:55:03Z)
- Your Vision-Language Model Itself Is a Strong Filter: Towards High-Quality Instruction Tuning with Data Selection [59.11430077029321]
We introduce a novel dataset selection method, Self-Filter, for vision-language models (VLMs).
In the first stage, we devise a scoring network to evaluate the difficulty of training instructions, which is co-trained with the VLM.
In the second stage, we use the trained score net to measure the difficulty of each instruction, select the most challenging samples, and penalize similar samples to encourage diversity.
arXiv Detail & Related papers (2024-02-19T20:08:48Z)
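Stage 2 above reads like a classic hardness-plus-diversity selection, so here is a sketch in that spirit: greedily take the highest-difficulty sample after subtracting a penalty for similarity to what is already selected. The penalty form and the stub scorers are assumptions; in the paper the difficulty score comes from the co-trained score net.

```python
def self_filter_select(samples, difficulty, similarity, budget=2, penalty=1.0):
    """Greedy stage-2 selection: repeatedly take the sample with the
    highest difficulty minus a penalty on its similarity to anything
    already chosen, encouraging hard *and* diverse training data."""
    chosen = []
    pool = list(samples)
    while pool and len(chosen) < budget:
        def adjusted(s):
            sim = max((similarity(s, c) for c in chosen), default=0.0)
            return difficulty(s) - penalty * sim
        best = max(pool, key=adjusted)
        chosen.append(best)
        pool.remove(best)
    return chosen

if __name__ == "__main__":
    data = ["prove a theorem", "prove another theorem", "describe a photo"]
    diff = lambda s: 1.0 if "prove" in s else 0.4  # stub for the score net
    # crude word-overlap similarity, ignoring very short words
    sim = lambda a, b: len({w for w in a.split() if len(w) > 3}
                           & {w for w in b.split() if len(w) > 3}) / 3
    # with a strong penalty the second pick favors the diverse sample
    print(self_filter_select(data, diff, sim, budget=2, penalty=2.0))
```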
- MLLM-DataEngine: An Iterative Refinement Approach for MLLM [62.30753425449056]
We propose a novel closed-loop system that bridges data generation, model training, and evaluation.
Within each loop, the MLLM-DataEngine first analyzes the weaknesses of the model based on the evaluation results.
For targeting, we propose an Adaptive Bad-case Sampling module, which adjusts the ratio of different types of data.
For quality, we resort to GPT-4 to generate high-quality data with each given data type.
arXiv Detail & Related papers (2023-08-25T01:41:04Z)
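The loop above suggests a simple closed form for the Adaptive Bad-case Sampling step: re-weight each data type by its measured error rate, then spend the next round's generation budget (e.g., on a GPT-4-backed generator) accordingly. The proportional rule and the floor are assumptions about how the module might work, not the paper's exact formula.

```python
def update_ratios(error_rates, floor=0.05):
    """Allocate the next round's data-generation budget in proportion to
    each category's error rate, with a small floor so no category starves."""
    raw = {k: max(v, floor) for k, v in error_rates.items()}
    total = sum(raw.values())
    return {k: v / total for k, v in raw.items()}

def one_loop(model_eval, generate, rounds=2, budget=100):
    """Closed loop: evaluate to find weaknesses, re-weight data types,
    then generate targeted data for the next training round."""
    ratios = None
    for _ in range(rounds):
        errors = model_eval()           # evaluate, find weak categories
        ratios = update_ratios(errors)  # adaptive bad-case sampling
        for cat, r in ratios.items():
            generate(cat, int(budget * r))  # stub for a GPT-4-backed generator
    return ratios

if __name__ == "__main__":
    evals = lambda: {"counting": 0.4, "ocr": 0.1, "spatial": 0.25}
    gen = lambda cat, n: print(f"generate {n:3d} samples for {cat}")
    print(one_loop(evals, gen, rounds=1))
```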