Related papers: An Empirical Study of Scaling Laws for Transfer

URL: http://arxiv.org/abs/2408.16947v1
Date: Fri, 30 Aug 2024 00:06:29 GMT
Title: An Empirical Study of Scaling Laws for Transfer
Authors: Matthew Barnett
Abstract summary: We present a limited empirical study of scaling laws for transfer learning in transformer models. We examine a scaling law that incorporates a "transfer gap" term, indicating the effectiveness of pre-training on one distribution when optimizing for downstream performance on another distribution.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We present a limited empirical study of scaling laws for transfer learning in transformer models. More specifically, we examine a scaling law that incorporates a "transfer gap" term, indicating the effectiveness of pre-training on one distribution when optimizing for downstream performance on another distribution. When the transfer gap is low, pre-training is a cost-effective strategy for improving downstream performance. Conversely, when the gap is high, collecting high-quality fine-tuning data becomes relatively more cost-effective. Fitting the scaling law to experiments from diverse datasets reveals significant variations in the transfer gap across distributions. In theory, the scaling law can inform optimal data allocation strategies and highlights how the scarcity of downstream data can bottleneck performance. Our findings contribute to a principled way to measure transfer learning efficiency and understand how data availability affects capabilities.
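
The role of the transfer gap described in the abstract can be illustrated with a small numerical sketch. The abstract does not state the functional form of the scaling law, so the form below, (A * p^-alpha + G) * f^-beta + E, is an assumption made only for illustration, with p the pre-training data size, f the fine-tuning data size, G the transfer gap, and E an irreducible loss. The function names and all constants are hypothetical and are not fitted values from the paper.

```python
def transfer_loss(p, f, A, alpha, G, beta, E):
    """Assumed illustrative loss form: (A * p**-alpha + G) * f**-beta + E.

    p: pre-training data size, f: fine-tuning data size,
    G: transfer gap, E: irreducible loss. All values are hypothetical.
    """
    return (A * p ** (-alpha) + G) * f ** (-beta) + E


def pretraining_gain(G, f=1e6, p_small=1e8, p_large=1e10):
    """Fraction of the reducible loss removed by 100x more pre-training data."""
    shared = dict(f=f, A=50.0, alpha=0.3, G=G, beta=0.2, E=1.5)
    small = transfer_loss(p=p_small, **shared)
    large = transfer_loss(p=p_large, **shared)
    return (small - large) / (small - shared["E"])


# Low gap: extra pre-training data removes most of the reducible loss.
low_gap = pretraining_gain(G=0.01)
# High gap: the same extra pre-training data barely helps.
high_gap = pretraining_gain(G=5.0)

print(f"low transfer gap:  {low_gap:.1%} of reducible loss removed")
print(f"high transfer gap: {high_gap:.1%} of reducible loss removed")
```

Under this assumed form, a large G puts a floor under the loss that more pre-training data cannot lower, which matches the abstract's point that fine-tuning data becomes the more cost-effective lever when the gap is high.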

Related papers

Perplexity-Aware Data Scaling Law: Perplexity Landscapes Predict Performance for Continual Pre-training [46.54209378000497]
Scaling laws for pre-training define a power-law relationship between dataset size and the test loss of an LLM. We propose a novel perplexity-aware data scaling law to establish a predictive relationship between the perplexity landscape of domain-specific data and the test loss. Our method consistently identifies near-optimal training subsets and achieves superior performance on both medical and general-domain benchmarks.
arXiv Detail & Related papers (2025-12-25T05:40:46Z)
Improving Slow Transfer Predictions: Generative Methods Compared [0.33132106391262933]
This project focuses on addressing the class imbalance problem to enhance the accuracy of performance predictions. We analyze and compare various augmentation strategies, including traditional oversampling methods and generative techniques. We conclude that even the most advanced technique, such as CTGAN, does not significantly improve over simple stratified sampling.
arXiv Detail & Related papers (2025-12-16T15:55:53Z)
Sub-Scaling Laws: On the Role of Data Density and Training Strategies in LLMs [35.95748363172419]
We examine the impact of data quality and training strategies on model performance. We identify high data density and non-optimal resource allocation as key factors contributing to sub-scaling. We propose a sub-optimal scaling law that better predicts performance in sub-scaling regimes.
arXiv Detail & Related papers (2025-07-13T15:15:24Z)
Scaling Laws for Data-Efficient Visual Transfer Learning [14.114908296325277]
This paper establishes the first practical framework for data-efficient scaling laws in visual transfer learning. We propose the distillation boundary theory, revealing a critical turning point in distillation efficiency. This work redefines scaling laws for data-limited regimes, bridging the knowledge gap between large-scale pretraining and practical downstream adaptation.
arXiv Detail & Related papers (2025-04-17T07:01:01Z)
Scaling Laws for Downstream Task Performance of Large Language Models [28.904224842085064]
We study how the choice of the pretraining data affects downstream performance (translation quality) as judged by two metrics: downstream cross-entropy and BLEU score. With sufficient alignment, both downstream cross-entropy and BLEU score improve monotonically with more pretraining data.
arXiv Detail & Related papers (2024-02-06T17:31:20Z)
Robust Transfer Learning with Unreliable Source Data [13.276850367115333]
We introduce a novel quantity called the "ambiguity level" that measures the discrepancy between the target and source regression functions. We propose a simple transfer learning procedure, and establish a general theorem that shows how this new quantity is related to the transferability of learning.
arXiv Detail & Related papers (2023-10-06T21:50:21Z)
Scaling Laws for Sparsely-Connected Foundation Models [70.41266138010657]
We explore the impact of parameter sparsity on the scaling behavior of Transformers trained on massive datasets. We identify the first scaling law describing the relationship between weight sparsity, number of non-zero parameters, and amount of training data.
arXiv Detail & Related papers (2023-09-15T16:29:27Z)
On Counterfactual Data Augmentation Under Confounding [30.76982059341284]
Counterfactual data augmentation has emerged as a method to mitigate confounding biases in the training data. These biases arise due to various observed and unobserved confounding variables in the data generation process. We show how our simple augmentation method helps existing state-of-the-art methods achieve good results.
arXiv Detail & Related papers (2023-05-29T16:20:23Z)
ArCL: Enhancing Contrastive Learning with Augmentation-Robust Representations [30.745749133759304]
We develop a theoretical framework to analyze the transferability of self-supervised contrastive learning. We show that contrastive learning fails to learn domain-invariant features, which limits its transferability. Based on these theoretical insights, we propose a novel method called Augmentation-robust Contrastive Learning (ArCL).
arXiv Detail & Related papers (2023-03-02T09:26:20Z)
The Power and Limitation of Pretraining-Finetuning for Linear Regression under Covariate Shift [127.21287240963859]
We investigate a transfer learning approach with pretraining on the source data and finetuning based on the target data. For a large class of linear regression instances, transfer learning with $O(N^2)$ source data is as effective as supervised learning with $N$ target data.
arXiv Detail & Related papers (2022-08-03T05:59:49Z)
A Data-Based Perspective on Transfer Learning [76.30206800557411]
We take a closer look at the role of the source dataset's composition in transfer learning. Our framework gives rise to new capabilities such as pinpointing transfer learning brittleness.
arXiv Detail & Related papers (2022-07-12T17:58:28Z)
Why Do Self-Supervised Models Transfer? Investigating the Impact of Invariance on Downstream Tasks [79.13089902898848]
Self-supervised learning is a powerful paradigm for representation learning on unlabelled images. We show that different tasks in computer vision require features to encode different (in)variances.
arXiv Detail & Related papers (2021-11-22T18:16:35Z)
Frustratingly Easy Transferability Estimation [64.42879325144439]
We propose a simple, efficient, and effective transferability measure named TransRate. TransRate measures transferability as the mutual information between the features of target examples extracted by a pre-trained model and their labels. Despite its extraordinary simplicity (10 lines of code), TransRate performs remarkably well in extensive evaluations on 22 pre-trained models and 16 downstream tasks.
arXiv Detail & Related papers (2021-06-17T10:27:52Z)
Scaling Laws for Transfer [0.5432984841650929]
We study scaling laws for transfer learning between distributions in an unsupervised, fine-tuning setting. We find that the effective data transferred is described well in the low data regime by a power-law of parameter count and fine-tuning dataset size.
arXiv Detail & Related papers (2021-02-02T04:07:38Z)
Towards Accurate Knowledge Transfer via Target-awareness Representation Disentanglement [56.40587594647692]
We propose a novel transfer learning algorithm, introducing the idea of Target-awareness REpresentation Disentanglement (TRED). TRED disentangles the knowledge relevant to the target task from the original source model and uses it as a regularizer while fine-tuning the target model. Experiments on various real-world datasets show that our method stably improves on standard fine-tuning by more than 2% on average.
arXiv Detail & Related papers (2020-10-16T17:45:08Z)

This list is automatically generated from the titles and abstracts of the papers on this site.