Performance Scaling via Optimal Transport: Enabling Data Selection from
Partially Revealed Sources
- URL: http://arxiv.org/abs/2307.02460v1
- Date: Wed, 5 Jul 2023 17:33:41 GMT
- Title: Performance Scaling via Optimal Transport: Enabling Data Selection from
Partially Revealed Sources
- Authors: Feiyang Kang, Hoang Anh Just, Anit Kumar Sahu, Ruoxi Jia
- Abstract summary: This paper proposes a framework called <projektor>, which predicts model performance and supports data selection decisions based on partial samples of prospective data sources.
<projektor> significantly improves existing performance scaling approaches in terms of both the accuracy of performance inference and the computation costs associated with constructing the performance predictor.
<projektor> also outperforms a range of other off-the-shelf solutions by a wide margin in data selection effectiveness.
- Score: 9.359395812292291
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Traditionally, data selection has been studied in settings where all samples
from prospective sources are fully revealed to a machine learning developer.
However, in practical data exchange scenarios, data providers often reveal only
a limited subset of samples before an acquisition decision is made. Recently,
there have been efforts to fit scaling laws that predict model performance at
any size and data source composition using the limited available samples.
However, these scaling functions are black-box, computationally expensive to
fit, highly susceptible to overfitting, or/and difficult to optimize for data
selection. This paper proposes a framework called <projektor>, which predicts
model performance and supports data selection decisions based on partial
samples of prospective data sources. Our approach distinguishes itself from
existing work by introducing a novel *two-stage* performance inference process.
In the first stage, we leverage the Optimal Transport distance to predict the
model's performance for any data mixture ratio within the range of disclosed
data sizes. In the second stage, we extrapolate the performance to larger
undisclosed data sizes based on a novel parameter-free mapping technique
inspired by neural scaling laws. We further derive an efficient gradient-based
method to select data sources based on the projected model performance.
Evaluation over a diverse range of applications demonstrates that <projektor>
significantly improves existing performance scaling approaches in terms of both
the accuracy of performance inference and the computation costs associated with
constructing the performance predictor. Also, <projektor> outperforms a range of
other off-the-shelf solutions by a wide margin in data selection effectiveness.
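The following is a minimal sketch, not the authors' implementation, of the two-stage idea described in the abstract. It assumes the POT library (`ot`) for Optimal Transport, synthetic feature vectors standing in for the partially revealed sources, and a simple fitted power law as a stand-in for the paper's parameter-free scaling-law mapping; all function names and parameters here are hypothetical.

```python
import numpy as np
import ot                      # Python Optimal Transport (pip install pot)
from scipy.optimize import curve_fit

def ot_distance(source_feats, target_feats):
    """Exact OT cost between two empirical feature sets with uniform weights."""
    a = np.full(len(source_feats), 1.0 / len(source_feats))
    b = np.full(len(target_feats), 1.0 / len(target_feats))
    cost = ot.dist(source_feats, target_feats)   # pairwise squared-Euclidean costs
    return ot.emd2(a, b, cost)

def mixture_ot_distance(partial_sources, ratios, target_feats, n_draw=200, rng=None):
    """Stage 1 proxy: OT distance between a sample mixed from the revealed
    sources according to `ratios` and the target/validation set."""
    if rng is None:
        rng = np.random.default_rng(0)
    draws = []
    for feats, r in zip(partial_sources, ratios):
        k = max(1, int(round(r * n_draw)))
        idx = rng.choice(len(feats), size=k, replace=True)
        draws.append(feats[idx])
    return ot_distance(np.vstack(draws), target_feats)

def scaling_law(n, a, b, c):
    """Stage 2 stand-in: power-law form L(n) = a * n^(-b) + c, in the spirit of
    neural scaling laws (the paper's actual mapping is parameter-free)."""
    return a * n ** (-b) + c

# --- toy usage -------------------------------------------------------------
rng = np.random.default_rng(0)
sources = [rng.normal(loc=m, size=(500, 16)) for m in (0.0, 0.5, 1.0)]  # revealed samples
val_set = rng.normal(loc=0.3, size=(300, 16))                            # target set

d = mixture_ot_distance(sources, ratios=[0.5, 0.3, 0.2], target_feats=val_set)
print(f"OT distance of mixture to target: {d:.4f}")

# Extrapolate performance (e.g., error) measured at small, disclosed sizes
# to a larger, undisclosed size.
sizes = np.array([200.0, 400.0, 800.0, 1600.0])
errors = np.array([0.42, 0.35, 0.30, 0.27])     # illustrative measurements
(a, b, c), _ = curve_fit(scaling_law, sizes, errors, p0=[1.0, 0.5, 0.2], maxfev=10000)
print(f"Projected error at n=50000: {scaling_law(50000.0, a, b, c):.4f}")
```

In the paper, the projected performance is then optimized over the mixture ratios with a gradient-based method; the sketch above only illustrates the prediction side (OT-based interpolation plus scaling-law extrapolation).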
Related papers
- A CLIP-Powered Framework for Robust and Generalizable Data Selection [51.46695086779598]
Real-world datasets often contain redundant and noisy data, imposing a negative impact on training efficiency and model performance.
Data selection has shown promise in identifying the most representative samples from the entire dataset.
We propose a novel CLIP-powered data selection framework that leverages multimodal information for more robust and generalizable sample selection.
arXiv Detail & Related papers (2024-10-15T03:00:58Z) - Target-Aware Language Modeling via Granular Data Sampling [25.957424920194914]
Language model pretraining generally targets a broad range of use cases and incorporates data from diverse sources.
A cost-effective and straightforward approach is sampling with low-dimensional data features.
We show that pretrained models perform on par with the full RefinedWeb data and outperform randomly selected samples for model sizes ranging from 125M to 1.5B.
arXiv Detail & Related papers (2024-09-23T04:52:17Z) - Source-Free Domain-Invariant Performance Prediction [68.39031800809553]
We propose a source-free approach centred on uncertainty-based estimation, using a generative model for calibration in the absence of source data.
Our experiments on benchmark object recognition datasets reveal that existing source-based methods fall short with limited source sample availability.
Our approach significantly outperforms the current state-of-the-art source-free and source-based methods, affirming its effectiveness in domain-invariant performance estimation.
arXiv Detail & Related papers (2024-08-05T03:18:58Z) - Get more for less: Principled Data Selection for Warming Up Fine-Tuning in LLMs [18.242110417706]
This work focuses on leveraging and selecting from vast, unlabeled, open data to pre-fine-tune a pre-trained language model.
We show the optimality of this approach for fine-tuning tasks under certain conditions.
Our proposed method is significantly faster than existing techniques, scaling to millions of samples within a single GPU hour.
arXiv Detail & Related papers (2024-05-05T00:08:00Z) - LESS: Selecting Influential Data for Targeted Instruction Tuning [64.78894228923619]
We propose LESS, an efficient algorithm to estimate data influences and perform Low-rank gradiEnt Similarity Search for instruction data selection.
We show that training on a LESS-selected 5% of the data can often outperform training on the full dataset across diverse downstream tasks.
Our method goes beyond surface form cues to identify data that provides the necessary reasoning skills for the intended downstream application.
arXiv Detail & Related papers (2024-02-06T19:18:04Z) - DsDm: Model-Aware Dataset Selection with Datamodels [81.01744199870043]
Standard practice is to filter for examples that match human notions of data quality.
We find that selecting according to similarity with "high quality" data sources may not increase (and can even hurt) performance compared to randomly selecting data.
Our framework avoids handpicked notions of data quality, and instead models explicitly how the learning process uses train datapoints to predict on the target tasks.
arXiv Detail & Related papers (2024-01-23T17:22:00Z) - Improved Distribution Matching for Dataset Condensation [91.55972945798531]
We propose a novel dataset condensation method based on distribution matching.
Our simple yet effective method outperforms most previous optimization-oriented methods with much fewer computational resources.
arXiv Detail & Related papers (2023-07-19T04:07:33Z) - Prototypical Fine-tuning: Towards Robust Performance Under Varying Data
Sizes [47.880781811936345]
We propose a novel framework for fine-tuning pretrained language models (LMs).
Our prototypical fine-tuning approach can automatically adjust the model capacity according to the number of data points and the model's inherent attributes.
arXiv Detail & Related papers (2022-11-24T14:38:08Z) - Differentiable Neural Input Search for Recommender Systems [26.88124270897381]
Differentiable Neural Input Search (DNIS) is a method that searches for mixed feature embedding dimensions in a more flexible space.
DNIS is model-agnostic and can be seamlessly incorporated with existing latent factor models for recommendation.
arXiv Detail & Related papers (2020-06-08T10:43:59Z)