Who's in and who's out? A case study of multimodal CLIP-filtering in DataComp
- URL: http://arxiv.org/abs/2405.08209v2
- Date: Wed, 09 Oct 2024 19:34:13 GMT
- Title: Who's in and who's out? A case study of multimodal CLIP-filtering in DataComp
- Authors: Rachel Hong, William Agnew, Tadayoshi Kohno, Jamie Morgenstern
- Abstract summary: We show that image-text data filtering has biases and is value-laden.
Data relating to several imputed demographic groups are associated with higher rates of exclusion.
Our conclusions point to a need for fundamental changes in dataset creation and filtering practices.
- Score: 13.749279800238092
- Abstract: As training datasets become increasingly drawn from unstructured, uncontrolled environments such as the web, researchers and industry practitioners have increasingly relied upon data filtering techniques to "filter out the noise" of web-scraped data. While datasets have been widely shown to reflect the biases and values of their creators, in this paper we contribute to an emerging body of research that assesses the filters used to create these datasets. We show that image-text data filtering also has biases and is value-laden, encoding specific notions of what is counted as "high-quality" data. In our work, we audit a standard approach of image-text CLIP-filtering on the academic benchmark DataComp's CommonPool by analyzing discrepancies of filtering through various annotation techniques across multiple modalities of image, text, and website source. We find that data relating to several imputed demographic groups -- such as LGBTQ+ people, older women, and younger men -- are associated with higher rates of exclusion. Moreover, we demonstrate cases of exclusion amplification: not only are certain marginalized groups already underrepresented in the unfiltered data, but CLIP-filtering excludes data from these groups at higher rates. The data-filtering step in the machine learning pipeline can therefore exacerbate representation disparities already present in the data-gathering step, especially when existing filters are designed to optimize a specifically-chosen downstream performance metric like zero-shot image classification accuracy. Finally, we show that the NSFW filter fails to remove sexually-explicit content from CommonPool, and that CLIP-filtering includes several categories of copyrighted content at high rates. Our conclusions point to a need for fundamental changes in dataset creation and filtering practices.
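To make the audited filtering step concrete, here is a minimal sketch of CLIP-score filtering plus the exclusion-rate comparison such an audit rests on. It assumes the open_clip package and a toy pool of (image_path, caption) pairs; the checkpoint, keep fraction, and helper names are illustrative choices, not the exact DataComp CommonPool configuration.
```python
# Hedged sketch of CLIP-score filtering and an exclusion-rate check.
# Assumes the open_clip package and local image files; not the exact
# DataComp pipeline or thresholds.
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

def clip_score(image_path: str, caption: str) -> float:
    """Cosine similarity between the CLIP image and text embeddings."""
    image = preprocess(Image.open(image_path)).unsqueeze(0)
    text = tokenizer([caption])
    with torch.no_grad():
        img = model.encode_image(image)
        txt = model.encode_text(text)
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    return (img @ txt.T).item()

def clip_filter(pairs, keep_fraction=0.3):
    """Keep the top `keep_fraction` of (image_path, caption) pairs by CLIP score."""
    ranked = sorted(pairs, key=lambda p: clip_score(*p), reverse=True)
    return ranked[: int(len(ranked) * keep_fraction)]

def exclusion_rate(subgroup_pairs, kept_pairs):
    """Fraction of a subgroup's pairs that the filter discards."""
    kept = set(kept_pairs)
    return 1.0 - sum(p in kept for p in subgroup_pairs) / max(len(subgroup_pairs), 1)
```
Comparing exclusion_rate on subgroup slices against the pool-wide rate is the kind of discrepancy analysis behind the exclusion-amplification finding: a group can be both underrepresented in the raw pool and filtered out at a higher rate.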
Related papers
- HYPE: Hyperbolic Entailment Filtering for Underspecified Images and Texts [49.21764163995419]
We introduce HYPerbolic Entailment filtering (HYPE) to extract meaningful and well-aligned data from noisy image-text pair datasets.
HYPE not only demonstrates a significant improvement in filtering efficiency but also sets a new state-of-the-art in the DataComp benchmark.
This breakthrough showcases the potential of HYPE to refine the data selection process, thereby contributing to the development of more accurate and efficient self-supervised learning models.
arXiv Detail & Related papers (2024-04-26T16:19:55Z)
- Scaling Laws for Data Filtering -- Data Curation cannot be Compute Agnostic [99.3682210827572]
Vision-language models (VLMs) are trained for thousands of GPU hours on carefully curated web datasets.
Data curation strategies are typically developed agnostic of the available compute for training.
We introduce neural scaling laws that account for the non-homogeneous nature of web data.
arXiv Detail & Related papers (2024-04-10T17:27:54Z)
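The snippet above only names the idea. Purely as an illustration of what a scaling law for non-homogeneous data can look like (a generic ansatz, not the cited paper's formulation), one can give each data pool its own power-law utility term:
```latex
% Generic illustration, not the cited paper's exact law: pool k contributes
% n_k examples with its own utility parameters (b_k, \alpha_k).
\[
  \mathrm{Err}(n_1,\dots,n_K) \;\approx\; E_{\infty} + \sum_{k=1}^{K} b_k\, n_k^{-\alpha_k}
\]
```
Under a form like this, the best trade-off between keeping only the highest-utility pools and keeping more total data shifts with the training budget, which is the sense in which curation cannot be compute-agnostic.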
- Data Filtering Networks [67.827994353269]
We study the problem of learning a data filtering network (DFN) for the second step of dataset construction: filtering a large uncurated dataset.
Our key finding is that the quality of a network for filtering is distinct from its performance on downstream tasks.
Based on our insights, we construct new data filtering networks that induce state-of-the-art image-text datasets.
arXiv Detail & Related papers (2023-09-29T17:37:29Z)
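As a hedged illustration of the filtering-network idea: train a small scorer on a curated set of good and bad pairs, then use it only to rank the uncurated pool. The cited work builds CLIP-style filtering networks; the logistic-regression scorer, precomputed features, and keep fraction below are stand-in assumptions.
```python
# Hedged stand-in for a data filtering network (DFN): a scorer trained on
# curated examples, used only to rank an uncurated pool.
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_filtering_network(feats_good: np.ndarray, feats_bad: np.ndarray):
    """feats_*: (n, d) joint image-text features for curated good/bad pairs."""
    X = np.vstack([feats_good, feats_bad])
    y = np.concatenate([np.ones(len(feats_good)), np.zeros(len(feats_bad))])
    return LogisticRegression(max_iter=1000).fit(X, y)

def dfn_filter(dfn, pool_feats: np.ndarray, keep_fraction: float = 0.2):
    """Keep the pool items the filtering network scores highest. Per the entry
    above, how well this model filters is a separate question from how well it
    would perform on downstream tasks itself."""
    scores = dfn.predict_proba(pool_feats)[:, 1]
    return np.argsort(-scores)[: int(len(pool_feats) * keep_fraction)]
```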
- The Devil is in the Details: A Deep Dive into the Rabbit Hole of Data Filtering [23.68112988933411]
This paper describes the lessons we learned and the solution we developed while participating in the DataComp challenge.
Our filtering strategy includes three stages: single-modality filtering, cross-modality filtering, and data distribution alignment.
Our approach outperforms the best method from the DataComp paper by over 4% on the average performance of 38 tasks and by over 2% on ImageNet.
arXiv Detail & Related papers (2023-09-27T19:10:43Z)
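A hedged skeleton of such a three-stage pipeline is sketched below; the specific checks, thresholds, and the nearest-neighbor alignment heuristic are illustrative assumptions rather than the authors' exact recipe.
```python
# Hedged sketch: single-modality checks, a cross-modality CLIP-score cut,
# then alignment toward a reference feature distribution.
import numpy as np

def single_modality_filter(captions, image_sizes, min_words=2, min_side=64):
    """Cheap per-modality checks: drop tiny images and near-empty captions."""
    return [i for i, (c, (w, h)) in enumerate(zip(captions, image_sizes))
            if len(c.split()) >= min_words and min(w, h) >= min_side]

def cross_modality_filter(idx, clip_scores, threshold=0.28):
    """Keep pairs whose image-text CLIP similarity clears a threshold."""
    return [i for i in idx if clip_scores[i] >= threshold]

def distribution_alignment(idx, pool_feats, ref_feats, keep_fraction=0.5):
    """Keep the samples closest (cosine) to any vector in a reference set."""
    ref = ref_feats / np.linalg.norm(ref_feats, axis=1, keepdims=True)
    sub = pool_feats[idx]
    sub = sub / np.linalg.norm(sub, axis=1, keepdims=True)
    closeness = (sub @ ref.T).max(axis=1)
    order = np.argsort(-closeness)[: int(len(idx) * keep_fraction)]
    return [idx[i] for i in order]
```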
- Devil in the Number: Towards Robust Multi-modality Data Filter [12.33356004550808]
T-MARS achieves high-quality data filtering by detecting and masking text within images and then filtering by CLIP score.
We observe a significant proportion of redundant information, such as numbers, present in the textual content.
Our proposed text-masked filter outperforms the original CLIP score filter when selecting the top 40% of the data.
arXiv Detail & Related papers (2023-09-24T22:52:35Z)
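A hedged sketch of the text-masked idea for captions: strip number-like tokens before computing the CLIP score so spurious digit matches do not inflate it, then keep the top 40%. The regex and the injected clip_score function are assumptions, not the paper's implementation.
```python
# Hedged sketch of a number-masked CLIP-score filter for captions.
import re

NUMBER_PATTERN = re.compile(r"\b\d[\d.,:/-]*\b")

def mask_numbers(caption: str) -> str:
    """Drop number-like substrings that carry little visual grounding."""
    return re.sub(r"\s+", " ", NUMBER_PATTERN.sub(" ", caption)).strip()

def text_masked_filter(pairs, clip_score, keep_fraction=0.4):
    """Score each (image_path, caption) pair on its masked caption and keep
    the top fraction (0.4 mirrors the 40% selection quoted above)."""
    ranked = sorted(pairs, key=lambda p: clip_score(p[0], mask_numbers(p[1])),
                    reverse=True)
    return ranked[: int(len(ranked) * keep_fraction)]
```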
"Complexity" of the data used to build the filter might heavily impact on its performance.
We propose a novel methodology, supported by software, for designing, analyzing and implementing Learned Bloom Filters.
Experiments show that the proposed methodology and the supporting software are valid and useful.
arXiv Detail & Related papers (2022-11-28T17:17:18Z)
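For readers unfamiliar with the data structure, here is a minimal learned Bloom filter: a learned score routes likely keys directly, and a small backup Bloom filter stores the scorer's false negatives so the structure never reports a false negative overall. The scorer is a placeholder; the cited paper's point is that choosing it well, relative to the data's complexity, matters.
```python
# Minimal learned Bloom filter sketch: learned scorer + backup Bloom filter.
import hashlib
import math

class BloomFilter:
    def __init__(self, n_items: int, fp_rate: float = 0.01):
        self.m = max(8, int(-n_items * math.log(fp_rate) / math.log(2) ** 2))
        self.k = max(1, int(self.m / max(n_items, 1) * math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, item: str):
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item: str):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all((self.bits[p // 8] >> (p % 8)) & 1 for p in self._positions(item))

class LearnedBloomFilter:
    def __init__(self, keys, score_fn, threshold: float):
        self.score_fn, self.threshold = score_fn, threshold
        misses = [k for k in keys if score_fn(k) < threshold]  # scorer's false negatives
        self.backup = BloomFilter(max(len(misses), 1))
        for k in misses:
            self.backup.add(k)

    def __contains__(self, item: str) -> bool:
        # Either the learned scorer says "key" or the backup filter has it.
        return self.score_fn(item) >= self.threshold or item in self.backup
```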
- Message Passing in Graph Convolution Networks via Adaptive Filter Banks [81.12823274576274]
We present a novel graph convolution operator, termed BankGCN.
It decomposes multi-channel signals on graphs into subspaces and handles the information in each subspace with its own adapted filter.
It achieves excellent performance in graph classification on a collection of benchmark graph datasets.
arXiv Detail & Related papers (2021-06-18T04:23:34Z)
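In the spirit of the entry above, and only as a hedged sketch rather than the BankGCN architecture itself, a filter-bank graph convolution can project node features into several subspaces and give each subspace its own polynomial filter of the normalized adjacency:
```python
# Hedged filter-bank graph convolution sketch (not the BankGCN implementation).
import torch
import torch.nn as nn

class FilterBankGraphConv(nn.Module):
    def __init__(self, in_dim, out_dim, n_subspaces=4, order=2):
        super().__init__()
        assert out_dim % n_subspaces == 0
        sub_dim = out_dim // n_subspaces
        self.projs = nn.ModuleList(
            [nn.Linear(in_dim, sub_dim) for _ in range(n_subspaces)])
        # One learnable set of polynomial coefficients per subspace filter.
        self.coeffs = nn.Parameter(0.1 * torch.randn(n_subspaces, order + 1))
        self.order = order

    def forward(self, x, adj_norm):
        """x: (n_nodes, in_dim); adj_norm: (n_nodes, n_nodes) normalized adjacency."""
        outs = []
        for proj, c in zip(self.projs, self.coeffs):
            h = proj(x)                      # project into this subspace
            acc, power = c[0] * h, h
            for k in range(1, self.order + 1):
                power = adj_norm @ power     # one more hop of propagation
                acc = acc + c[k] * power
            outs.append(acc)
        return torch.cat(outs, dim=-1)       # recombine subspace responses
```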
- Data Agnostic Filter Gating for Efficient Deep Networks [72.4615632234314]
Current filter pruning methods mainly leverage feature maps to generate importance scores for filters and prune those with smaller scores.
In this paper, we propose a data-agnostic filter pruning method that uses an auxiliary network, named the Dagger module, to induce pruning.
In addition, to prune filters under a given FLOPs constraint, we leverage an explicit FLOPs-aware regularization that pushes the pruned network directly toward the target FLOPs.
arXiv Detail & Related papers (2020-10-28T15:26:40Z)
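A hedged sketch of the general recipe: attach a learnable gate to each filter and add a FLOPs-aware penalty that pushes the soft FLOPs implied by the gates toward a budget. The GatedConv layer and regularizer below are illustrative, not the Dagger module.
```python
# Hedged gate-based filter pruning sketch with a FLOPs-aware penalty.
import torch
import torch.nn as nn

class GatedConv(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size, padding=kernel_size // 2)
        # One gate per output filter; start near fully open (sigmoid(3) ~ 0.95).
        self.gates = nn.Parameter(torch.full((out_ch,), 3.0))

    def forward(self, x):
        return self.conv(x) * torch.sigmoid(self.gates).view(1, -1, 1, 1)

    def soft_flops(self, feat_h, feat_w):
        """FLOPs estimate weighted by gate openness, a relaxation of the hard
        keep/drop decision applied when filters are actually removed."""
        per_filter = (self.conv.in_channels * self.conv.kernel_size[0]
                      * self.conv.kernel_size[1] * feat_h * feat_w)
        return per_filter * torch.sigmoid(self.gates).sum()

def flops_regularizer(layers, feat_sizes, target_flops):
    """Penalty added to the task loss: squared relative gap to the FLOPs budget."""
    total = sum(layer.soft_flops(h, w) for layer, (h, w) in zip(layers, feat_sizes))
    return (total / target_flops - 1.0) ** 2
```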
- Adversarial Filters of Dataset Biases [96.090959788952]
Large neural models have demonstrated human-level performance on language and vision benchmarks, yet their performance degrades considerably on adversarial or out-of-distribution samples.
We propose AFLite, which adversarially filters such dataset biases.
arXiv Detail & Related papers (2020-02-10T21:59:21Z)
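A hedged sketch of AFLite-style adversarial filtering over precomputed features: repeatedly train simple adversary classifiers on random splits and drop the instances they predict correctly most often, i.e. those carrying the most easily exploitable spurious signal. The split sizes, number of rounds, and drop counts are illustrative hyperparameters.
```python
# Hedged AFLite-style adversarial filtering sketch.
import numpy as np
from sklearn.linear_model import LogisticRegression

def aflite(X, y, n_rounds=5, n_partitions=8, drop_per_round=50, min_size=500):
    """X: (n, d) features from a frozen encoder; y: labels. Returns kept indices."""
    keep = np.arange(len(X))
    rng = np.random.default_rng(0)
    for _ in range(n_rounds):
        if len(keep) <= min_size:
            break
        correct = np.zeros(len(keep))
        counted = np.zeros(len(keep))
        for _ in range(n_partitions):
            perm = rng.permutation(len(keep))
            half = len(keep) // 2
            tr, te = perm[:half], perm[half:]
            clf = LogisticRegression(max_iter=1000).fit(X[keep[tr]], y[keep[tr]])
            correct[te] += clf.predict(X[keep[te]]) == y[keep[te]]
            counted[te] += 1
        predictability = correct / np.maximum(counted, 1)
        # Drop the instances the adversaries solve most reliably.
        keep = np.delete(keep, np.argsort(-predictability)[:drop_per_round])
    return keep
```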
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences of its use.