Data Filtering Networks
- URL: http://arxiv.org/abs/2309.17425v3
- Date: Mon, 6 Nov 2023 02:47:51 GMT
- Title: Data Filtering Networks
- Authors: Alex Fang, Albin Madappally Jose, Amit Jain, Ludwig Schmidt, Alexander
Toshev, Vaishaal Shankar
- Abstract summary: We study the problem of learning a data filtering network (DFN) for this second step of filtering a large uncurated dataset.
Our key finding is that the quality of a network for filtering is distinct from its performance on downstream tasks.
Based on our insights, we construct new data filtering networks that induce state-of-the-art image-text datasets.
- Score: 67.827994353269
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large training sets have become a cornerstone of machine learning and are the
foundation for recent advances in language modeling and multimodal learning.
While data curation for pre-training is often still ad-hoc, one common paradigm
is to first collect a massive pool of data from the Web and then filter this
candidate pool down to an actual training set via various heuristics. In this
work, we study the problem of learning a data filtering network (DFN) for this
second step of filtering a large uncurated dataset. Our key finding is that the
quality of a network for filtering is distinct from its performance on
downstream tasks: for instance, a model that performs well on ImageNet can
yield worse training sets than a model with low ImageNet accuracy that is
trained on a small amount of high-quality data. Based on our insights, we
construct new data filtering networks that induce state-of-the-art image-text
datasets. Specifically, our best-performing dataset, DFN-5B, enables us to train
state-of-the-art CLIP models for their compute budgets: among other
improvements on a variety of tasks, a ViT-H trained on our dataset achieves
84.4% zero-shot transfer accuracy on ImageNet, outperforming models trained on
other datasets such as LAION-2B, DataComp-1B, or OpenAI's WIT. In order to
facilitate further research in dataset design, we also release DFN-2B, a new
2-billion-example dataset, and show that high-performance data filtering networks
can be trained from scratch using only publicly available data.
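The filtering step described above can be pictured with a short, hedged sketch: score each candidate image-text pair with a pretrained CLIP-style model and keep only the highest-scoring pairs. The OpenCLIP checkpoint, the `clip_score`/`filter_pool` helpers, and the 30% keep fraction below are illustrative assumptions, not the paper's DFN or its actual selection rule.

```python
# Hedged sketch of the second filtering step: score candidate image-text pairs
# with a pretrained CLIP-style model and keep only the highest-scoring ones.
# Checkpoint, helper names, and keep fraction are illustrative assumptions.
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

@torch.no_grad()
def clip_score(image_path: str, caption: str) -> float:
    """Cosine similarity between the image and caption embeddings."""
    image = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    text = tokenizer([caption])
    img_feat = model.encode_image(image)
    txt_feat = model.encode_text(text)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    return (img_feat @ txt_feat.T).item()

def filter_pool(pairs, keep_fraction=0.3):
    """Keep the top `keep_fraction` of (image_path, caption) pairs by score."""
    ranked = sorted(pairs, key=lambda p: clip_score(*p), reverse=True)
    return ranked[: int(len(ranked) * keep_fraction)]
```

In the paper's terminology, the scoring model plays the role of the data filtering network; the abstract's key point is that a model's usefulness as a filter is not predicted by its own downstream (e.g. ImageNet) accuracy.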
Related papers
- When Less is More: Investigating Data Pruning for Pretraining LLMs at Scale [12.94829977468838]
Large volumes of text data have contributed significantly to the development of large language models.
To date, efforts to prune datasets down to a higher-quality subset have relied on hand-crafted heuristics encoded as rule-based filters.
We take a wider view and explore scalable estimates of data quality that can be used to measure the quality of pretraining data.
arXiv Detail & Related papers (2023-09-08T19:34:05Z)
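One concrete example of a scalable quality estimate for the entry above is perplexity under a small reference language model; the sketch below pairs it with a naive keep-the-lowest-perplexity rule. The GPT-2 scorer, the 512-token truncation, and the 50% keep fraction are illustrative assumptions, not the paper's exact setup or its recommended pruning policy.

```python
# Hedged sketch: perplexity under a small reference LM as one scalable
# estimate of text quality; the keep-lowest-perplexity rule is purely
# illustrative, not the pruning policy studied in the paper.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

@torch.no_grad()
def perplexity(text: str) -> float:
    """Per-token perplexity of `text` under the reference model."""
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    out = model(**enc, labels=enc["input_ids"])
    return math.exp(out.loss.item())

def prune_corpus(documents, keep_fraction=0.5):
    """Keep the `keep_fraction` of documents with the lowest perplexity."""
    ranked = sorted(documents, key=perplexity)
    return ranked[: int(len(ranked) * keep_fraction)]
```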
- Dataset Quantization [72.61936019738076]
We present dataset quantization (DQ), a new framework to compress large-scale datasets into small subsets.
DQ is the first method that can successfully distill large-scale datasets such as ImageNet-1k with a state-of-the-art compression ratio.
arXiv Detail & Related papers (2023-08-21T07:24:29Z)
- Self-Supervised Pre-Training for Transformer-Based Person Re-Identification [54.55281692768765]
Transformer-based supervised pre-training achieves great performance in person re-identification (ReID).
Due to the domain gap between ImageNet and ReID datasets, it usually needs a larger pre-training dataset to boost the performance.
This work aims to mitigate the gap between the pre-training and ReID datasets from the perspective of data and model structure.
arXiv Detail & Related papers (2021-11-23T18:59:08Z)
- Efficient Conditional Pre-training for Transfer Learning [71.01129334495553]
We propose efficient filtering methods to select relevant subsets from the pre-training dataset.
We validate our techniques by pre-training on ImageNet in both the unsupervised and supervised settings.
We improve standard ImageNet pre-training by 1-3% by tuning available models on our subsets and by pre-training on a dataset filtered from a larger-scale dataset.
arXiv Detail & Related papers (2020-11-20T06:16:15Z)
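A hedged sketch of the general idea behind such conditional filtering: rank the pre-training pool by cosine similarity to the mean embedding of the target dataset and keep the closest fraction. The ResNet-18 backbone, the mean-embedding heuristic, and the keep fraction are assumptions for illustration, not necessarily the filtering method proposed in the paper.

```python
# Hedged sketch: select a target-relevant subset of a pre-training pool by
# ranking candidates on cosine similarity to the mean target embedding.
# Backbone choice and the mean-embedding heuristic are illustrative assumptions.
import torch
import torch.nn.functional as F
import torchvision.models as models

backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()  # use penultimate features as embeddings
backbone.eval()

@torch.no_grad()
def embed(batch: torch.Tensor) -> torch.Tensor:
    """L2-normalized embeddings for preprocessed images of shape (N, 3, 224, 224)."""
    return F.normalize(backbone(batch), dim=-1)

@torch.no_grad()
def select_subset(candidates: torch.Tensor, targets: torch.Tensor,
                  keep_fraction: float = 0.2) -> torch.Tensor:
    """Indices of the candidate images most similar to the target domain."""
    target_center = F.normalize(embed(targets).mean(dim=0), dim=-1)
    sims = embed(candidates) @ target_center  # cosine similarity per candidate
    k = max(1, int(keep_fraction * len(candidates)))
    return sims.topk(k).indices
```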
- Neural Data Server: A Large-Scale Search Engine for Transfer Learning Data [78.74367441804183]
We introduce Neural Data Server (NDS), a large-scale search engine for finding the most useful transfer learning data for a target domain.
NDS consists of a dataserver that indexes several large, popular image datasets and aims to recommend data to a client.
We show the effectiveness of NDS in various transfer learning scenarios, demonstrating state-of-the-art performance on several target datasets.
arXiv Detail & Related papers (2020-01-09T01:21:30Z)
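A loose sketch of the recommendation step such a data server might perform, assuming one small "expert" classifier is already available per indexed source partition (training those experts, as NDS does, is omitted here). The `expert_score` proxy and the function names are illustrative assumptions rather than the paper's exact procedure.

```python
# Hedged sketch of data recommendation: rank indexed source partitions by how
# well their pre-trained "experts" behave on the client's small labeled dataset.
# The accuracy proxy and names below are illustrative assumptions only.
from typing import Dict, List, Tuple
import torch
from torch.utils.data import DataLoader

@torch.no_grad()
def expert_score(expert: torch.nn.Module, client_loader: DataLoader) -> float:
    """Proxy transferability: the expert's accuracy on the client's labeled data."""
    expert.eval()
    correct, total = 0, 0
    for images, labels in client_loader:
        preds = expert(images).argmax(dim=-1)
        correct += (preds == labels).sum().item()
        total += labels.numel()
    return correct / max(total, 1)

def recommend_sources(experts: Dict[str, torch.nn.Module],
                      client_loader: DataLoader,
                      top_k: int = 3) -> List[Tuple[str, float]]:
    """Rank source partitions by their experts' proxy scores; return the best ones."""
    scores = {name: expert_score(e, client_loader) for name, e in experts.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
```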
- DeGAN: Data-Enriching GAN for Retrieving Representative Samples from a Trained Classifier [58.979104709647295]
We bridge the gap between the abundance of available data and the lack of relevant data for the future learning tasks of a trained network.
We use the available data, which may be an imbalanced subset of the original training dataset or a dataset from a related domain, to retrieve representative samples.
We demonstrate that data from a related domain can be leveraged to achieve state-of-the-art performance.
arXiv Detail & Related papers (2019-12-27T02:05:45Z)
This list is automatically generated from the titles and abstracts of the papers on this site.