Related papers: The Devil is in the Details: A Deep Dive into the Rabbit Hole of Data Filtering

The Devil is in the Details: A Deep Dive into the Rabbit Hole of Data Filtering

URL: http://arxiv.org/abs/2309.15954v1
Date: Wed, 27 Sep 2023 19:10:43 GMT
Title: The Devil is in the Details: A Deep Dive into the Rabbit Hole of Data Filtering
Authors: Haichao Yu, Yu Tian, Sateesh Kumar, Linjie Yang, Heng Wang
Abstract summary: This paper describes our learning and solution when participating in the DataComp challenge. Our filtering strategy includes three stages: single-modality filtering, cross-modality filtering, and data distribution alignment. Our approach outperforms the best method from the DataComp paper by over 4% on the average performance of 38 tasks and by over 2% on ImageNet.
Score: 23.68112988933411
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The quality of pre-training data plays a critical role in the performance of foundation models. Popular foundation models often design their own recipe for data filtering, which makes it hard to analyze and compare different data filtering approaches. DataComp is a new benchmark dedicated to evaluating different methods for data filtering. This paper describes our learning and solution when participating in the DataComp challenge. Our filtering strategy includes three stages: single-modality filtering, cross-modality filtering, and data distribution alignment. We integrate existing methods and propose new solutions, such as computing CLIP score on horizontally flipped images to mitigate the interference of scene text, using vision and language models to retrieve training samples for target downstream tasks, rebalancing the data distribution to improve the efficiency of allocating the computational budget, etc. We slice and dice our design choices, provide in-depth analysis, and discuss open questions. Our approach outperforms the best method from the DataComp paper by over 4% on the average performance of 38 tasks and by over 2% on ImageNet.

Related papers

Joint Out-of-Distribution Filtering and Data Discovery Active Learning [44.29827026888824]
We propose Joint Out-of-distribution filtering and data Discovery Active learning (Joda). Unlike previous works, Joda is highly efficient and completely omits auxiliary models and training access to the unlabeled pool for filtering or selection. In extensive experiments on 18 configurations and 3 metrics, ours consistently achieves the highest accuracy with the best class discovery to OOD filtering balance.
arXiv Detail & Related papers (2025-03-04T10:57:24Z)
HYPE: Hyperbolic Entailment Filtering for Underspecified Images and Texts [49.21764163995419]
We introduce HYPerbolic Entailment filtering (HYPE) to extract meaningful and well-aligned data from noisy image-text pair datasets. HYPE not only demonstrates a significant improvement in filtering efficiency but also sets a new state-of-the-art in the DataComp benchmark. This breakthrough showcases the potential of HYPE to refine the data selection process, thereby contributing to the development of more accurate and efficient self-supervised learning models.
arXiv Detail & Related papers (2024-04-26T16:19:55Z)
From Data Deluge to Data Curation: A Filtering-WoRA Paradigm for Efficient Text-based Person Search [19.070305201045954]
In text-based person search endeavors, data generation has emerged as a prevailing practice, addressing concerns over privacy preservation and the arduous task of manual annotation. We observe that only a subset of the data in constructed datasets plays a decisive role. We introduce a new Filtering-WoRA paradigm, which contains a filtering algorithm to identify this crucial data subset and WoRA learning strategy for light fine-tuning.
arXiv Detail & Related papers (2024-04-16T05:29:14Z)
Your Vision-Language Model Itself Is a Strong Filter: Towards High-Quality Instruction Tuning with Data Selection [59.11430077029321]
We introduce a novel dataset selection method, Self-Filter, for vision-language models (VLMs) In the first stage, we devise a scoring network to evaluate the difficulty of training instructions, which is co-trained with the VLM. In the second stage, we use the trained score net to measure the difficulty of each instruction, select the most challenging samples, and penalize similar samples to encourage diversity.
arXiv Detail & Related papers (2024-02-19T20:08:48Z)
How to Train Data-Efficient LLMs [56.41105687693619]
We study data-efficient approaches for pre-training language models (LLMs) We find that Ask-LLM and Density sampling are the best methods in their respective categories. In our comparison of 19 samplers, involving hundreds of evaluation tasks and pre-training runs, we find that Ask-LLM and Density are the best methods in their respective categories.
arXiv Detail & Related papers (2024-02-15T02:27:57Z)
DsDm: Model-Aware Dataset Selection with Datamodels [81.01744199870043]
Standard practice is to filter for examples that match human notions of data quality. We find that selecting according to similarity with "high quality" data sources may not increase (and can even hurt) performance compared to randomly selecting data. Our framework avoids handpicked notions of data quality, and instead models explicitly how the learning process uses train datapoints to predict on the target tasks.
arXiv Detail & Related papers (2024-01-23T17:22:00Z)
Leveraging Image-Text Similarity and Caption Modification for the DataComp Challenge: Filtering Track and BYOD Track [9.474587055642312]
This paper presents our solution to both filtering track and BYOD track of the DataComp challenge. Our solution adopts large multimodal models CLIP and BLIP-2 to filter and modify web crawl data, and utilize external datasets along with a bag of tricks to improve the data quality.
arXiv Detail & Related papers (2023-10-23T05:40:43Z)
DST-Det: Simple Dynamic Self-Training for Open-Vocabulary Object Detection [72.25697820290502]
This work introduces a straightforward and efficient strategy to identify potential novel classes through zero-shot classification. We refer to this approach as the self-training strategy, which enhances recall and accuracy for novel classes without requiring extra annotations, datasets, and re-training. Empirical evaluations on three datasets, including LVIS, V3Det, and COCO, demonstrate significant improvements over the baseline performance.
arXiv Detail & Related papers (2023-10-02T17:52:24Z)
Data Filtering Networks [67.827994353269]
We study the problem of learning a data filtering network (DFN) for this second step of filtering a large uncurated dataset. Our key finding is that the quality of a network for filtering is distinct from its performance on downstream tasks. Based on our insights, we construct new data filtering networks that induce state-of-the-art image-text datasets.
arXiv Detail & Related papers (2023-09-29T17:37:29Z)
CNN Filter DB: An Empirical Investigation of Trained Convolutional Filters [2.0305676256390934]
We show that model pre-training can succeed on arbitrary datasets if they meet size and variance conditions. We show that many pre-trained models contain degenerated filters which make them less robust and less suitable for fine-tuning on target applications.
arXiv Detail & Related papers (2022-03-29T08:25:42Z)
An Empirical Investigation of Model-to-Model Distribution Shifts in Trained Convolutional Filters [2.0305676256390934]
We present first empirical results from our ongoing investigation of distribution shifts in image data used for various computer vision tasks. Instead of analyzing the original training and test data, we propose to study shifts in the learned weights of trained models.
arXiv Detail & Related papers (2022-01-20T21:48:12Z)

This list is automatically generated from the titles and abstracts of the papers in this site.