Related papers: Scaling Laws for Data Filtering -- Data Curation cannot be Compute Agnostic

Scaling Laws for Data Filtering -- Data Curation cannot be Compute Agnostic

URL: http://arxiv.org/abs/2404.07177v1
Date: Wed, 10 Apr 2024 17:27:54 GMT
Title: Scaling Laws for Data Filtering -- Data Curation cannot be Compute Agnostic
Authors: Sachin Goyal, Pratyush Maini, Zachary C. Lipton, Aditi Raghunathan, J. Zico Kolter,
Abstract summary: Vision-language models (VLMs) are trained for thousands of GPU hours on carefully curated web datasets. Data curation strategies are typically developed agnostic of the available compute for training. We introduce neural scaling laws that account for the non-homogeneous nature of web data.
Score: 99.3682210827572
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Vision-language models (VLMs) are trained for thousands of GPU hours on carefully curated web datasets. In recent times, data curation has gained prominence with several works developing strategies to retain 'high-quality' subsets of 'raw' scraped data. For instance, the LAION public dataset retained only 10% of the total crawled data. However, these strategies are typically developed agnostic of the available compute for training. In this paper, we first demonstrate that making filtering decisions independent of training compute is often suboptimal: the limited high-quality data rapidly loses its utility when repeated, eventually requiring the inclusion of 'unseen' but 'lower-quality' data. To address this quality-quantity tradeoff ($\texttt{QQT}$), we introduce neural scaling laws that account for the non-homogeneous nature of web data, an angle ignored in existing literature. Our scaling laws (i) characterize the $\textit{differing}$ 'utility' of various quality subsets of web data; (ii) account for how utility diminishes for a data point at its 'nth' repetition; and (iii) formulate the mutual interaction of various data pools when combined, enabling the estimation of model performance on a combination of multiple data pools without ever jointly training on them. Our key message is that data curation $\textit{cannot}$ be agnostic of the total compute that a model will be trained for. Our scaling laws allow us to curate the best possible pool for achieving top performance on Datacomp at various compute budgets, carving out a pareto-frontier for data curation. Code is available at https://github.com/locuslab/scaling_laws_data_filtering.

Related papers

Chunked Data Shapley: A Scalable Dataset Quality Assessment for Machine Learning [0.0]
We present a Data Shapley approach to identify a dataset's high-quality data computations, Chunked Data Shapley (C-DaSh)<n>C-DaSh scalably divides dataset into manageable chunks and estimates contribution of each chunk using optimized selection and gradient descent.<n>We empirically benchmark our method on diverse real-world classification and regression tasks, demonstrating that C-DaSh outperforms existing Shapley approximations in both computational efficiency (achieving speedups between 80x - 2300x) and accuracy in detecting low-quality data regions.
arXiv Detail & Related papers (2025-08-22T09:40:58Z)
Datasets, Documents, and Repetitions: The Practicalities of Unequal Data Quality [67.67387254989018]
We study model performance at various compute budgets and across multiple pre-training datasets created through data filtering and deduplication. We find that, given appropriate modifications to the training recipe, repeating existing aggressively filtered datasets for up to ten epochs can outperform training on the ten times larger superset for a single epoch across multiple compute budget orders of magnitude.
arXiv Detail & Related papers (2025-03-10T21:51:17Z)
DUPRE: Data Utility Prediction for Efficient Data Valuation [49.60564885180563]
Cooperative game theory-based data valuation, such as Data Shapley, requires evaluating the data utility and retraining the ML model for multiple data subsets. Our framework, textttDUPRE, takes an alternative yet complementary approach that reduces the cost per subset evaluation by predicting data utilities instead of evaluating them by model retraining. Specifically, given the evaluated data utilities of some data subsets, textttDUPRE fits a emphGaussian process (GP) regression model to predict the utility of every other data subset.
arXiv Detail & Related papers (2025-02-22T08:53:39Z)
Learning from Limited and Imperfect Data [6.30667368422346]
We develop practical algorithms for Deep Neural Networks that can learn from limited and imperfect data present in the real world. These works are divided into four segments, each covering a scenario of learning from limited or imperfect data.
arXiv Detail & Related papers (2024-11-11T18:48:31Z)
Automatic Data Curation for Self-Supervised Learning: A Clustering-Based Approach [36.47860223750303]
We consider the problem of automatic curation of high-quality datasets for self-supervised pre-training. We propose a clustering-based approach for building ones satisfying all these criteria. Our method involves successive and hierarchical applications of $k$-means on a large and diverse data repository.
arXiv Detail & Related papers (2024-05-24T14:58:51Z)
From Data Deluge to Data Curation: A Filtering-WoRA Paradigm for Efficient Text-based Person Search [19.070305201045954]
In text-based person search endeavors, data generation has emerged as a prevailing practice, addressing concerns over privacy preservation and the arduous task of manual annotation. We observe that only a subset of the data in constructed datasets plays a decisive role. We introduce a new Filtering-WoRA paradigm, which contains a filtering algorithm to identify this crucial data subset and WoRA learning strategy for light fine-tuning.
arXiv Detail & Related papers (2024-04-16T05:29:14Z)
Data Filtering Networks [67.827994353269]
We study the problem of learning a data filtering network (DFN) for this second step of filtering a large uncurated dataset. Our key finding is that the quality of a network for filtering is distinct from its performance on downstream tasks. Based on our insights, we construct new data filtering networks that induce state-of-the-art image-text datasets.
arXiv Detail & Related papers (2023-09-29T17:37:29Z)
Exploring Data Redundancy in Real-world Image Classification through Data Selection [20.389636181891515]
Deep learning models often require large amounts of data for training, leading to increased costs. We present two data valuation metrics based on Synaptic Intelligence and gradient norms, respectively, to study redundancy in real-world image data. Online and offline data selection algorithms are then proposed via clustering and grouping based on the examined data values.
arXiv Detail & Related papers (2023-06-25T03:31:05Z)
Scaling Data-Constrained Language Models [137.17302576977346]
We investigate scaling language models in data-constrained regimes. We find that with constrained data for a fixed compute budget, training with up to 4 epochs of repeated data yields negligible changes to loss compared to having unique data. We propose and empirically validate a scaling law for compute optimality that accounts for the decreasing value of repeated tokens and excess parameters.
arXiv Detail & Related papers (2023-05-25T17:18:55Z)
CvS: Classification via Segmentation For Small Datasets [52.821178654631254]
This paper presents CvS, a cost-effective classifier for small datasets that derives the classification labels from predicting the segmentation maps. We evaluate the effectiveness of our framework on diverse problems showing that CvS is able to achieve much higher classification results compared to previous methods when given only a handful of examples.
arXiv Detail & Related papers (2021-10-29T18:41:15Z)
Low Budget Active Learning via Wasserstein Distance: An Integer Programming Approach [81.19737119343438]
Active learning is the process of training a model with limited labeled data by selecting a core subset of an unlabeled data pool to label. We propose a new integer optimization problem for selecting a core set that minimizes the discrete Wasserstein distance from the unlabeled pool. Our strategy requires high-quality latent features which we obtain by unsupervised learning on the unlabeled pool.
arXiv Detail & Related papers (2021-06-05T21:25:03Z)
How to distribute data across tasks for meta-learning? [59.608652082495624]
We show that the optimal number of data points per task depends on the budget, but it converges to a unique constant value for large budgets. Our results suggest a simple and efficient procedure for data collection.
arXiv Detail & Related papers (2021-03-15T15:38:47Z)

This list is automatically generated from the titles and abstracts of the papers in this site.