Related papers: Loss-to-Loss Prediction: Scaling Laws for All Datasets

Loss-to-Loss Prediction: Scaling Laws for All Datasets

URL: http://arxiv.org/abs/2411.12925v1
Date: Tue, 19 Nov 2024 23:23:16 GMT
Title: Loss-to-Loss Prediction: Scaling Laws for All Datasets
Authors: David Brandfonbrener, Nikhil Anand, Nikhil Vyas, Eran Malach, Sham Kakade,
Abstract summary: We derive a strategy for predicting one loss from another and apply it to predict across different pre-training datasets. Our predictions extrapolate well even at 20x the largest FLOP budget used to fit the curves.
Score: 17.078832037614397
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: While scaling laws provide a reliable methodology for predicting train loss across compute scales for a single data distribution, less is known about how these predictions should change as we change the distribution. In this paper, we derive a strategy for predicting one loss from another and apply it to predict across different pre-training datasets and from pre-training data to downstream task data. Our predictions extrapolate well even at 20x the largest FLOP budget used to fit the curves. More precisely, we find that there are simple shifted power law relationships between (1) the train losses of two models trained on two separate datasets when the models are paired by training compute (train-to-train), (2) the train loss and the test loss on any downstream distribution for a single model (train-to-test), and (3) the test losses of two models trained on two separate train datasets (test-to-test). The results hold up for pre-training datasets that differ substantially (some are entirely code and others have no code at all) and across a variety of downstream tasks. Finally, we find that in some settings these shifted power law relationships can yield more accurate predictions than extrapolating single-dataset scaling laws.

Related papers

Distributional Training Data Attribution [20.18145179467698]
We introduce distributional training data attribution (d-TDA) to predict how the distribution of model outputs depends upon the dataset.<n>We identify training examples that drastically change the distribution of some target measurement without necessarily changing the mean.<n>We also find that influence functions (IFs) emerge naturally from our distributional framework as the limit to unrolled differentiation.
arXiv Detail & Related papers (2025-06-15T21:02:36Z)
Probabilistic Contrastive Learning for Long-Tailed Visual Recognition [78.70453964041718]
Longtailed distributions frequently emerge in real-world data, where a large number of minority categories contain a limited number of samples. Recent investigations have revealed that supervised contrastive learning exhibits promising potential in alleviating the data imbalance. We propose a novel probabilistic contrastive (ProCo) learning algorithm that estimates the data distribution of the samples from each class in the feature space.
arXiv Detail & Related papers (2024-03-11T13:44:49Z)
Ask Your Distribution Shift if Pre-Training is Right for You [74.18516460467019]
In practice, fine-tuning a pre-trained model improves robustness significantly in some cases but not at all in others. We focus on two possible failure modes of models under distribution shift: poor extrapolation and biases in the training data. Our study suggests that, as a rule of thumb, pre-training can help mitigate poor extrapolation but not dataset biases.
arXiv Detail & Related papers (2024-02-29T23:46:28Z)
Pre-training on Synthetic Driving Data for Trajectory Prediction [61.520225216107306]
We propose a pipeline-level solution to mitigate the issue of data scarcity in trajectory forecasting. We adopt HD map augmentation and trajectory synthesis for generating driving data, and then we learn representations by pre-training on them. We conduct extensive experiments to demonstrate the effectiveness of our data expansion and pre-training strategies.
arXiv Detail & Related papers (2023-09-18T19:49:22Z)
Boosted Dynamic Neural Networks [53.559833501288146]
A typical EDNN has multiple prediction heads at different layers of the network backbone. To optimize the model, these prediction heads together with the network backbone are trained on every batch of training data. Treating training and testing inputs differently at the two phases will cause the mismatch between training and testing data distributions. We formulate an EDNN as an additive model inspired by gradient boosting, and propose multiple training techniques to optimize the model effectively.
arXiv Detail & Related papers (2022-11-30T04:23:12Z)
CAFA: Class-Aware Feature Alignment for Test-Time Adaptation [50.26963784271912]
Test-time adaptation (TTA) aims to address this challenge by adapting a model to unlabeled data at test time. We propose a simple yet effective feature alignment loss, termed as Class-Aware Feature Alignment (CAFA), which simultaneously encourages a model to learn target representations in a class-discriminative manner.
arXiv Detail & Related papers (2022-06-01T03:02:07Z)
Delving into Sample Loss Curve to Embrace Noisy and Imbalanced Data [17.7825114228313]
Corrupted labels and class imbalance are commonly encountered in practically collected training data. Existing approaches alleviate these issues by adopting a sample re-weighting strategy. However, biased samples with corrupted labels and of tailed classes commonly co-exist in training data.
arXiv Detail & Related papers (2021-12-30T09:20:07Z)
Managing dataset shift by adversarial validation for credit scoring [5.560471251954645]
The inconsistency between the distribution of training data and the data that actually needs to be predicted is likely to cause poor model performance. We propose a method based on adversarial validation to alleviate the dataset shift problem in credit scoring scenarios.
arXiv Detail & Related papers (2021-12-19T07:07:15Z)
Learning to Predict Trustworthiness with Steep Slope Loss [69.40817968905495]
We study the problem of predicting trustworthiness on real-world large-scale datasets. We observe that the trustworthiness predictors trained with prior-art loss functions are prone to view both correct predictions and incorrect predictions to be trustworthy. We propose a novel steep slope loss to separate the features w.r.t. correct predictions from the ones w.r.t. incorrect predictions by two slide-like curves that oppose each other.
arXiv Detail & Related papers (2021-09-30T19:19:09Z)
Towards optimally abstaining from prediction [22.937799541125607]
A common challenge across all areas of machine learning is that training data is not distributed like test data. We consider a model where one may abstain from predicting, at a fixed cost. Our work builds on a recent abstention algorithm of Goldwasser, Kalais, and Montasser ( 2020) for transductive binary classification.
arXiv Detail & Related papers (2021-05-28T21:44:48Z)
Scaling Laws for Transfer [0.5432984841650929]
We study scaling laws for transfer learning between distributions in an unsupervised, fine-tuning setting. We find that the effective data transferred is described well in the low data regime by a power-law of parameter count and fine-tuning dataset size.
arXiv Detail & Related papers (2021-02-02T04:07:38Z)

This list is automatically generated from the titles and abstracts of the papers in this site.