Devil in the Number: Towards Robust Multi-modality Data Filter
- URL: http://arxiv.org/abs/2309.13770v1
- Date: Sun, 24 Sep 2023 22:52:35 GMT
- Title: Devil in the Number: Towards Robust Multi-modality Data Filter
- Authors: Yichen Xu, Zihan Xu, Wenhao Chai, Zhonghan Zhao, Enxin Song, Gaoang
Wang
- Abstract summary: T-MARS achieves high-quality data filtering by detecting and masking text within images and then filtering by CLIP score.
We observe that a significant proportion of the textual content consists of redundant information, such as numbers.
Our proposed text-masked filter outperforms the original CLIP score filter when selecting the top 40% of the data.
- Score: 12.33356004550808
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In order to appropriately filter multi-modality datasets at web scale, it
becomes crucial to employ suitable filtering methods to boost performance and
reduce training costs. For instance, the LAION papers employ a CLIP score filter
to select data with CLIP scores surpassing a certain threshold. On the other
hand, T-MARS achieves high-quality data filtering by detecting and masking text
within images and then filtering by CLIP score. Through analyzing the dataset,
we observe a significant proportion of redundant information, such as numbers,
present in the textual content. Our experiments on a subset of the data unveil
the profound impact of these redundant elements on the CLIP scores. A logical
approach would involve reevaluating the CLIP scores after eliminating these
influences. Experimentally, our text-based CLIP filter outperforms the
top-ranked method on the "small scale" of DataComp (a data filtering
benchmark) on ImageNet distribution shifts, achieving a 3.6% performance
improvement. The results also demonstrate that our proposed text-masked filter
outperforms the original CLIP score filter when selecting the top 40% of the
data. The impact of numbers on CLIP and their handling provide valuable
insights for improving the effectiveness of CLIP training, including language
rewrite techniques.
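A minimal sketch of the filtering idea described above: compute the CLIP score for an image-caption pair, strip the redundant numbers from the caption, and re-score before thresholding. This is an illustration rather than the authors' released implementation; the checkpoint name, the digit-stripping rule, and the 0.28 cut-off are assumptions made for the example.

```python
# Hedged sketch of number-stripped CLIP-score filtering (not the paper's code).
import re

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")      # assumed checkpoint
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image: Image.Image, caption: str) -> float:
    """Cosine similarity between one image and one caption (the usual CLIP score)."""
    tok = processor.tokenizer([caption], padding=True, truncation=True, return_tensors="pt")
    pix = processor.image_processor(images=[image], return_tensors="pt")
    with torch.no_grad():
        img = model.get_image_features(pixel_values=pix["pixel_values"])
        txt = model.get_text_features(input_ids=tok["input_ids"],
                                      attention_mask=tok["attention_mask"])
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    return float((img * txt).sum())

def strip_numbers(caption: str) -> str:
    """Drop digit runs, the 'redundant numbers' the abstract points at (assumed rule)."""
    return re.sub(r"\s+", " ", re.sub(r"\d+", " ", caption)).strip()

def keep_pair(image: Image.Image, caption: str, threshold: float = 0.28) -> bool:
    """Filter decision: re-score the caption after number removal (threshold assumed)."""
    return clip_score(image, strip_numbers(caption)) >= threshold
```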
Related papers
- Who's in and who's out? A case study of multimodal CLIP-filtering in DataComp [13.749279800238092]
We show that image-text data filtering has biases and is value-laden.
Data relating to several imputed demographic groups are associated with higher rates of exclusion.
Our conclusions point to a need for fundamental changes in dataset creation and filtering practices.
arXiv Detail & Related papers (2024-05-13T21:53:06Z)
- Leveraging Cross-Modal Neighbor Representation for Improved CLIP Classification [54.96876797812238]
We present a novel CrOss-moDal nEighbor Representation (CODER) based on the distance structure between images and their neighbor texts.
The key to constructing a high-quality CODER lies in creating a vast amount of high-quality and diverse texts to match with images.
Experiment results across various datasets and models confirm CODER's effectiveness.
arXiv Detail & Related papers (2024-04-27T02:04:36Z)
- Enhancing Low-Resource LLMs Classification with PEFT and Synthetic Data [36.09359953556684]
Large Language Models (LLMs) operating in 0-shot or few-shot settings achieve competitive results in Text Classification tasks.
In-Context Learning (ICL) typically achieves better accuracy than the 0-shot setting, but at the cost of efficiency due to the longer input prompt.
arXiv Detail & Related papers (2024-04-03T03:24:19Z)
- Finetuned Multimodal Language Models Are High-Quality Image-Text Data Filters [38.41887207958015]
We propose a novel framework for filtering image-text data by leveraging fine-tuned Multimodal Language Models (MLMs).
Our filter can generalize to different models and tasks, and be used as a drop-in replacement for CLIPScore.
arXiv Detail & Related papers (2024-03-05T06:05:15Z)
- Your Vision-Language Model Itself Is a Strong Filter: Towards High-Quality Instruction Tuning with Data Selection [59.11430077029321]
We introduce a novel dataset selection method, Self-Filter, for vision-language models (VLMs).
In the first stage, we devise a scoring network to evaluate the difficulty of training instructions, which is co-trained with the VLM.
In the second stage, we use the trained score net to measure the difficulty of each instruction, select the most challenging samples, and penalize similar samples to encourage diversity (a loose sketch of this selection step follows the related-papers list).
arXiv Detail & Related papers (2024-02-19T20:08:48Z)
- Demystifying CLIP Data [86.34045746910114]
Contrastive Language-Image Pre-training (CLIP) has advanced research and applications in computer vision.
We introduce Metadata-Curated Language-Image Pre-training (MetaCLIP).
MetaCLIP takes a raw data pool and metadata (derived from CLIP's concepts) and yields a balanced subset over the metadata distribution (a rough sketch of this balancing step follows the related-papers list).
arXiv Detail & Related papers (2023-09-28T17:59:56Z)
- The Devil is in the Details: A Deep Dive into the Rabbit Hole of Data Filtering [23.68112988933411]
This paper describes our lessons learned and our solution from participating in the DataComp challenge.
Our filtering strategy includes three stages: single-modality filtering, cross-modality filtering, and data distribution alignment.
Our approach outperforms the best method from the DataComp paper by over 4% on the average performance of 38 tasks and by over 2% on ImageNet.
arXiv Detail & Related papers (2023-09-27T19:10:43Z)
- Filter Pruning for Efficient CNNs via Knowledge-driven Differential Filter Sampler [103.97487121678276]
Filter pruning simultaneously accelerates the computation and reduces the memory overhead of CNNs.
We propose a novel Knowledge-driven Differential Filter Sampler (KDFS) with a Masked Filter Modeling (MFM) framework for filter pruning.
arXiv Detail & Related papers (2023-07-01T02:28:41Z)
- Zero-Shot Listwise Document Reranking with a Large Language Model [58.64141622176841]
We propose Listwise Reranker with a Large Language Model (LRL), which achieves strong reranking effectiveness without using any task-specific training data.
Experiments on three TREC web search datasets demonstrate that LRL not only outperforms zero-shot pointwise methods when reranking first-stage retrieval results, but can also act as a final-stage reranker.
arXiv Detail & Related papers (2023-05-03T14:45:34Z)
- Democratizing Contrastive Language-Image Pre-training: A CLIP Benchmark of Data, Model, and Supervision [26.13829720290035]
Contrastive Language-Image Pretraining (CLIP) has emerged as a novel paradigm to learn visual models from language supervision.
We propose CLIP-benchmark, a first attempt to evaluate, analyze, and benchmark CLIP and its variants.
arXiv Detail & Related papers (2022-03-11T08:41:00Z)
- Data Agnostic Filter Gating for Efficient Deep Networks [72.4615632234314]
Current filter pruning methods mainly leverage feature maps to generate importance scores for filters and prune those with smaller scores.
In this paper, we propose a data filter pruning method that uses an auxiliary network named Dagger module to induce pruning.
In addition, to help prune filters with certain FLOPs constraints, we leverage an explicit FLOPs-aware regularization to directly promote pruning filters toward target FLOPs.
arXiv Detail & Related papers (2020-10-28T15:26:40Z)
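The two-stage selection summarized in the Self-Filter entry above (rate instruction difficulty with a co-trained score net, then keep the hardest samples while penalizing near-duplicates) might look roughly like the greedy sketch below. This is not the authors' algorithm; the function name, the penalty weight, and the greedy formulation are assumptions, and the difficulty scores are presumed to come from the stage-one score network.

```python
# Loose sketch of hard-but-diverse sample selection (assumed formulation, not Self-Filter's code).
import torch
import torch.nn.functional as F

def select_hard_and_diverse(features: torch.Tensor,
                            difficulty: torch.Tensor,
                            budget: int,
                            sim_penalty: float = 0.5) -> list[int]:
    """features: (N, d) instruction embeddings; difficulty: (N,) stage-one scores.
    Greedily keep `budget` samples, discounting candidates similar to ones already kept."""
    assert budget <= features.shape[0]
    feats = F.normalize(features, dim=-1)
    scores = difficulty.clone().float()
    kept = []
    for _ in range(budget):
        idx = int(torch.argmax(scores))
        kept.append(idx)
        scores[idx] = float("-inf")                            # never pick the same sample twice
        scores = scores - sim_penalty * (feats @ feats[idx])   # penalize similar samples
    return kept
```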
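The "balanced subset over the metadata distribution" step summarized in the MetaCLIP entry above could be sketched as follows. This is a simplification, not the released MetaCLIP curation pipeline: the substring-matching rule, the 20,000 per-entry cap, and the keep probability are illustrative assumptions.

```python
# Simplified sketch of metadata-balanced subsampling (assumed rule, not MetaCLIP's code).
import random
from collections import Counter

def balanced_subset(pairs, metadata, cap=20_000, seed=0):
    """pairs: list of (image_path, caption); metadata: list of lowercase concept strings."""
    rng = random.Random(seed)
    matches, counts = [], Counter()
    for _, caption in pairs:
        hit = [m for m in metadata if m in caption.lower()]
        matches.append(hit)
        counts.update(hit)                  # how many captions mention each metadata entry

    kept = []
    for pair, hit in zip(pairs, matches):
        if not hit:
            continue                        # drop pairs that match no metadata entry
        # Down-sample pairs whose rarest matched entry is already over the cap,
        # so head entries shrink toward the cap while tail entries survive.
        p = min(1.0, cap / min(counts[m] for m in hit))
        if rng.random() < p:
            kept.append(pair)
    return kept
```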