Related papers: Filter Like You Test: Data-Driven Data Filtering for CLIP Pretraining

URL: http://arxiv.org/abs/2503.08805v1
Date: Tue, 11 Mar 2025 18:34:12 GMT
Title: Filter Like You Test: Data-Driven Data Filtering for CLIP Pretraining
Authors: Mikey Shechter, Yair Carmon
Abstract summary: Filter Like You Test (FLYT) is a method for curating large-scale vision-language datasets. FLYT trains a scoring model that learns to weigh each example using gradient signals from downstream tasks training sets. Mixing-FLYT (M-FLYT) takes the per-example scores generated by different scoring methods and learns to unify them into a single score.
Score: 17.402771370806384
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We introduce Filter Like You Test (FLYT), a method for curating large-scale vision-language datasets that learns the usefulness of each data point as a pretraining example. FLYT trains a scoring model that learns to weigh each example using gradient signals from downstream tasks training sets. Using the same training methodology, we develop Mixing-FLYT (M-FLYT), which takes the per-example scores generated by different scoring methods and learns to unify them into a single score. Our training methodology naturally produces a distribution over the training examples, which we leverage through Soft Cap Sampling (SCS), a strategy for obtaining a filtered pretraining dataset from per-example probabilities that samples examples while preventing over-representation through a repetition penalty. Using all three methods, we achieve 40.1% ImageNet zero-shot accuracy on the DataComp medium scale filtering benchmark, a 1.9% absolute accuracy increase over all previous results and a 5.5% increase over results that -- like us -- use only public resources.
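
The Soft Cap Sampling idea in the abstract (sample from per-example probabilities while penalizing repeats) can be illustrated with a small sketch. The abstract does not give the algorithm's details, so everything below is an assumption made for illustration: the function name `soft_cap_sample`, the use of a softmax over per-example scores, drawing in batches without replacement, and subtracting a fixed `penalty` from the score of each drawn example. It is not the authors' implementation.

```python
import numpy as np


def soft_cap_sample(scores, n_samples, batch_size, penalty, seed=0):
    """Hypothetical sketch: draw n_samples indices from softmax(scores).

    After each batch, every drawn example has `penalty` subtracted from
    its score, so it becomes less likely to be drawn again (assumed form
    of the repetition penalty). Returns how often each example was drawn.
    """
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float).copy()
    counts = np.zeros(len(scores), dtype=int)
    remaining = n_samples
    while remaining > 0:
        k = min(batch_size, remaining, len(scores))
        # Numerically stable softmax over the current (penalized) scores.
        # Note: extreme score gaps can underflow to zero probability, and
        # rng.choice then needs at least k non-zero entries.
        probs = np.exp(scores - scores.max())
        probs /= probs.sum()
        # Within one batch, an example can be drawn at most once.
        picked = rng.choice(len(scores), size=k, replace=False, p=probs)
        counts[picked] += 1
        scores[picked] -= penalty
        remaining -= k
    return counts


if __name__ == "__main__":
    # Four examples with decreasing scores; draw eight samples in total.
    print(soft_cap_sample([2.0, 1.0, 0.0, -1.0], n_samples=8, batch_size=2, penalty=1.0))
```

Under these assumptions, a larger `penalty` spreads the draws more evenly across examples, while a penalty of zero reduces to repeated sampling from a fixed distribution.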

Related papers

Language Models Improve When Pretraining Data Matches Target Tasks [8.935657480912282]
BETR is a method that selects pretraining documents based on similarity to benchmark training examples. We compare data selection methods by training over 500 models spanning $10^{19}$ to $10^{22}$ FLOPs and fitting scaling laws to them. We find that BETR achieves a 2.1x compute multiplier over DCLM-Baseline and improves performance on 9 out of 10 tasks across all scales.
arXiv Detail & Related papers (2025-07-16T17:59:45Z)
CLIMB: CLustering-based Iterative Data Mixture Bootstrapping for Language Model Pre-training [63.07024608399447]
We propose an automated framework that discovers, evaluates, and refines data mixtures in a pre-training setting. We introduce ClimbLab, a filtered 1.2-trillion-token corpus with 20 clusters as a research playground, and ClimbMix, a compact yet powerful 400-billion-token dataset.
arXiv Detail & Related papers (2025-04-17T17:58:13Z)
Model-agnostic Coreset Selection via LLM-based Concept Bottlenecks [6.857632954159568]
Coreset Selection (CS) identifies a subset of training data that achieves model performance comparable to using the entire dataset. These scores are inefficient to compute and hard to interpret as they do not indicate whether a sample is difficult to learn in general or only for a specific model. Our work proposes an interpretable score that gauges a sample's difficulty using human-understandable textual attributes (concepts) independent of any downstream model.
arXiv Detail & Related papers (2025-02-23T22:14:42Z)
Evaluating Sample Utility for Efficient Data Selection by Mimicking Model Weights [11.237906163959908]
Multimodal models are trained on large-scale web-crawled datasets. These datasets often contain noise, bias, and irrelevant information. We propose an efficient, model-based approach using the Mimic Score.
arXiv Detail & Related papers (2025-01-12T04:28:14Z)
Improving Pretraining Data Using Perplexity Correlations [56.41097718862742]
We present a framework that selects high-quality pretraining data without any LLM training of our own. We build a new statistical framework for data selection centered around estimates of perplexity-benchmark correlations. Our approach outperforms DSIR on every benchmark, while matching the best data selector found in DataComp-LM.
arXiv Detail & Related papers (2024-09-09T17:23:29Z)
How to Train Data-Efficient LLMs [56.41105687693619]
We study data-efficient approaches for pre-training language models (LLMs). In our comparison of 19 samplers, involving hundreds of evaluation tasks and pre-training runs, we find that Ask-LLM and Density are the best methods in their respective categories.
arXiv Detail & Related papers (2024-02-15T02:27:57Z)
FairWASP: Fast and Optimal Fair Wasserstein Pre-processing [9.627848184502783]
We present FairWASP, a novel pre-processing approach to reduce disparities in classification datasets without modifying the original data. We show theoretically that integer weights are optimal, which means our method can be equivalently understood as duplicating or eliminating samples. Our work is based on reformulating the pre-processing task as a large-scale mixed-integer program (MIP), for which we propose a highly efficient algorithm based on the cutting plane method.
arXiv Detail & Related papers (2023-10-31T19:36:00Z)
Data Pruning via Moving-one-Sample-out [61.45441981346064]
We propose a novel data-pruning approach called moving-one-sample-out (MoSo). MoSo aims to identify and remove the least informative samples from the training set. Experimental results demonstrate that MoSo effectively mitigates severe performance degradation at high pruning ratios.
arXiv Detail & Related papers (2023-10-23T08:00:03Z)
The Devil is in the Details: A Deep Dive into the Rabbit Hole of Data Filtering [23.68112988933411]
This paper describes our learning and solution when participating in the DataComp challenge. Our filtering strategy includes three stages: single-modality filtering, cross-modality filtering, and data distribution alignment. Our approach outperforms the best method from the DataComp paper by over 4% on the average performance of 38 tasks and by over 2% on ImageNet.
arXiv Detail & Related papers (2023-09-27T19:10:43Z)
FedSampling: A Better Sampling Strategy for Federated Learning [81.85411484302952]
Federated learning (FL) is an important technique for learning models from decentralized data in a privacy-preserving way. Existing FL methods usually uniformly sample clients for local model learning in each round. We propose a novel data uniform sampling strategy for federated learning (FedSampling).
arXiv Detail & Related papers (2023-06-25T13:38:51Z)
AdaSelection: Accelerating Deep Learning Training through Data Subsampling [27.46630703428186]
We introduce AdaSelection, an adaptive sub-sampling method to identify the most informative sub-samples within each minibatch. Compared with industry-standard baselines, AdaSelection consistently displays superior performance.
arXiv Detail & Related papers (2023-06-19T07:01:28Z)
Selective In-Context Data Augmentation for Intent Detection using Pointwise V-Information [100.03188187735624]
We introduce a novel approach based on PLMs and pointwise V-information (PVI), a metric that can measure the usefulness of a datapoint for training a model. Our method first fine-tunes a PLM on a small seed of training data and then synthesizes new datapoints - utterances that correspond to given intents. Our method is thus able to leverage the expressive power of large language models to produce diverse training data.
arXiv Detail & Related papers (2023-02-10T07:37:49Z)
Data Curation Alone Can Stabilize In-context Learning [20.874674130060388]
In-context learning (ICL) enables large language models to perform new tasks by prompting them with a sequence of training examples. Randomly sampling examples from a training set leads to high variance in performance. We show that carefully curating a subset of training data greatly stabilizes ICL performance without any other changes to the ICL algorithm.
arXiv Detail & Related papers (2022-12-20T15:58:54Z)
A Data Cartography based MixUp for Pre-trained Language Models [47.90235939359225]
MixUp is a data augmentation strategy where additional samples are generated during training by combining random pairs of training samples and their labels. We propose TDMixUp, a novel MixUp strategy that leverages Training Dynamics and allows more informative samples to be combined for generating new data samples. We empirically validate that our method not only achieves competitive performance using a smaller subset of the training data compared with strong baselines, but also yields lower expected calibration error on the pre-trained language model, BERT, on both in-domain and out-of-domain settings in a wide range of NLP tasks.
arXiv Detail & Related papers (2022-05-06T17:59:19Z)
Uncertainty-aware Self-training for Text Classification with Few Labels [54.13279574908808]
We study self-training as one of the earliest semi-supervised learning approaches to reduce the annotation bottleneck. We propose an approach to improve self-training by incorporating uncertainty estimates of the underlying neural network. We show our methods leveraging only 20-30 labeled samples per class for each task for training and for validation can perform within 3% of fully supervised pre-trained language models.
arXiv Detail & Related papers (2020-06-27T08:13:58Z)

This list is automatically generated from the titles and abstracts of the papers on this site.