Scaling Laws for Transfer
- URL: http://arxiv.org/abs/2102.01293v1
- Date: Tue, 2 Feb 2021 04:07:38 GMT
- Title: Scaling Laws for Transfer
- Authors: Danny Hernandez, Jared Kaplan, Tom Henighan, and Sam McCandlish
- Abstract summary: We study scaling laws for transfer learning between distributions in an unsupervised, fine-tuning setting.
We find that the effective data transferred is described well in the low data regime by a power-law of parameter count and fine-tuning dataset size.
- Score: 0.5432984841650929
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study empirical scaling laws for transfer learning between distributions in an unsupervised, fine-tuning setting. When we train increasingly large neural networks from scratch on a fixed-size dataset, they eventually become data-limited and stop improving in performance (cross-entropy loss). When we do the same for models pre-trained on a large language dataset, the slope in performance gains is merely reduced rather than going to zero. We calculate the effective data "transferred" from pre-training by determining how much data a transformer of the same size would have required to achieve the same loss when training from scratch. In other words, we focus on units of data while holding everything else fixed. We find that the effective data transferred is described well in the low data regime by a power-law of parameter count and fine-tuning dataset size. We believe the exponents in these power-laws correspond to measures of the generality of a model and proximity of distributions (in a directed rather than symmetric sense). We find that pre-training effectively multiplies the fine-tuning dataset size. Transfer, like overall performance, scales predictably in terms of parameters, data, and compute.
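As a rough illustration of the relationship described in the abstract, the sketch below encodes effective data transferred as a power law in fine-tuning dataset size D_F and parameter count N, and shows how it adds to D_F (equivalently, acts as a multiplier on the fine-tuning set). The functional form and the constants k, alpha, beta are illustrative placeholders inferred from the abstract, not the paper's fitted values.

```python
# Minimal sketch, assuming effective data transferred follows
# D_T = k * D_F**alpha * N**beta in the low-data regime.
# k, alpha, beta below are placeholders, not fitted constants from the paper.

def effective_data_transferred(d_finetune: float, n_params: float,
                               k: float, alpha: float, beta: float) -> float:
    """Power-law form D_T = k * D_F**alpha * N**beta (low-data regime)."""
    return k * (d_finetune ** alpha) * (n_params ** beta)

def effective_dataset_size(d_finetune: float, n_params: float,
                           k: float, alpha: float, beta: float) -> float:
    """Total effective data D_E = D_F + D_T: pre-training behaves like a
    multiplier on the fine-tuning dataset size."""
    return d_finetune + effective_data_transferred(
        d_finetune, n_params, k, alpha, beta)

if __name__ == "__main__":
    # With placeholder coefficients, larger models (N) transfer more
    # effective data for the same fine-tuning set.
    for n in (1e7, 1e8, 1e9):
        d_eff = effective_dataset_size(d_finetune=1e6, n_params=n,
                                       k=1.0, alpha=0.2, beta=0.4)
        print(f"N={n:.0e}  effective data ~ {d_eff:.3e} tokens")
```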
Related papers
- Loss-to-Loss Prediction: Scaling Laws for All Datasets [17.078832037614397]
We derive a strategy for predicting one loss from another and apply it to predict across different pre-training datasets.
Our predictions extrapolate well even at 20x the largest FLOP budget used to fit the curves.
arXiv Detail & Related papers (2024-11-19T23:23:16Z)
- Scaling Laws for the Value of Individual Data Points in Machine Learning [55.596413470429475]
We introduce a new perspective by investigating scaling behavior for the value of individual data points.
We provide learning theory to support our scaling law, and we observe empirically that it holds across diverse model classes.
Our work represents a first step towards understanding and utilizing scaling properties for the value of individual data points.
arXiv Detail & Related papers (2024-05-30T20:10:24Z)
- Scaling Laws for Sparsely-Connected Foundation Models [70.41266138010657]
We explore the impact of parameter sparsity on the scaling behavior of Transformers trained on massive datasets.
We identify the first scaling law describing the relationship between weight sparsity, number of non-zero parameters, and amount of training data.
arXiv Detail & Related papers (2023-09-15T16:29:27Z)
- Scaling Data-Constrained Language Models [137.17302576977346]
We investigate scaling language models in data-constrained regimes.
We find that with constrained data for a fixed compute budget, training with up to 4 epochs of repeated data yields negligible changes to loss compared to having unique data.
We propose and empirically validate a scaling law for compute optimality that accounts for the decreasing value of repeated tokens and excess parameters.
arXiv Detail & Related papers (2023-05-25T17:18:55Z)
- Data Scaling Laws in NMT: The Effect of Noise and Architecture [59.767899982937756]
We study the effect of varying the architecture and training data quality on the data scaling properties of Neural Machine Translation (NMT).
We find that the data scaling exponents are minimally impacted, suggesting that marginally worse architectures or training data can be compensated for by adding more data.
arXiv Detail & Related papers (2022-02-04T06:53:49Z)
- Self-Supervised Pre-Training for Transformer-Based Person Re-Identification [54.55281692768765]
Transformer-based supervised pre-training achieves great performance in person re-identification (ReID).
Due to the domain gap between ImageNet and ReID datasets, it usually needs a larger pre-training dataset to boost the performance.
This work aims to mitigate the gap between the pre-training and ReID datasets from the perspective of data and model structure.
arXiv Detail & Related papers (2021-11-23T18:59:08Z)
- Effect of large-scale pre-training on full and few-shot transfer learning for natural and medical images [2.030567625639093]
We conduct large-scale pre-training on large source datasets of either natural (ImageNet-21k/1k) or medical chest X-Ray images.
We compare full and few-shot transfer using different target datasets from both natural and medical imaging domains.
Our observations provide evidence that, while pre-training and transfer on closely related datasets show a clear benefit from increasing model and data size during pre-training, such benefits are not clearly visible when the source and target datasets are further apart.
arXiv Detail & Related papers (2021-05-31T21:55:56Z)
- Learning Invariances in Neural Networks [51.20867785006147]
We show how to parameterize a distribution over augmentations and optimize the training loss simultaneously with respect to the network parameters and augmentation parameters.
We can recover the correct set and extent of invariances on image classification, regression, segmentation, and molecular property prediction from a large space of augmentations.
arXiv Detail & Related papers (2020-10-22T17:18:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.