Related papers: Scaling Laws for Data-Efficient Visual Transfer Learning

Scaling Laws for Data-Efficient Visual Transfer Learning

URL: http://arxiv.org/abs/2504.13219v1
Date: Thu, 17 Apr 2025 07:01:01 GMT
Title: Scaling Laws for Data-Efficient Visual Transfer Learning
Authors: Wenxuan Yang, Qingqu Wei, Chenxi Ma, Weimin Tan, Bo Yan
Abstract summary: This paper establishes the first practical framework for data-efficient scaling laws in visual transfer learning. We propose the distillation boundary theory, revealing a critical turning point in distillation efficiency. This work redefines scaling laws for data-limited regimes, bridging the knowledge gap between large-scale pretraining and practical downstream adaptation.
Score: 14.114908296325277
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Current scaling laws for visual AI models focus predominantly on large-scale pretraining, leaving a critical gap in understanding how performance scales for data-constrained downstream tasks. To address this limitation, this paper establishes the first practical framework for data-efficient scaling laws in visual transfer learning, addressing two fundamental questions: 1) How do scaling behaviors shift when downstream tasks operate with limited data? 2) What governs the efficacy of knowledge distillation under such constraints? Through systematic analysis of vision tasks across data regimes (1K-1M samples), we propose the distillation boundary theory, revealing a critical turning point in distillation efficiency: 1) Distillation superiority: In data-scarce conditions, distilled models significantly outperform their non-distillation counterparts, efficiently leveraging inherited knowledge to compensate for limited training samples. 2) Pre-training dominance: As pre-training data increases beyond a critical threshold, non-distilled models gradually surpass distilled versions, suggesting diminishing returns from knowledge inheritance when sufficient task-specific data becomes available. Empirical validation across various model scales (2.5M to 38M parameters) and data volumes demonstrates these performance inflection points, with error difference curves transitioning from positive to negative values at critical data thresholds, confirming our theoretical predictions. This work redefines scaling laws for data-limited regimes, bridging the knowledge gap between large-scale pretraining and practical downstream adaptation, addressing a critical barrier to understanding vision model scaling behaviors and optimizing computational resource allocation.
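
The turning point described in the abstract can be pictured with a small numerical sketch. Everything below is an assumption made for illustration and is not taken from the paper: the saturating power-law form `floor + a * n**(-alpha)`, the coefficient values, the sign convention for the error difference (non-distilled error minus distilled error, so a positive value means distillation is ahead), and the helper names `power_law_error`, `error_gap`, and `critical_threshold`. Only the 1K-1M sample range comes from the abstract.

```python
import math


def power_law_error(n, a, alpha, floor):
    """Hypothetical error model: an irreducible floor plus a power-law term in n."""
    return floor + a * n ** (-alpha)


# Illustrative coefficients only (assumed, not fitted to any real results).
# The distilled model starts lower but improves more slowly with data.
DISTILLED = {"a": 2.0, "alpha": 0.25, "floor": 0.10}
NON_DISTILLED = {"a": 20.0, "alpha": 0.45, "floor": 0.05}


def error_gap(n):
    """Non-distilled error minus distilled error; positive means distillation wins."""
    return power_law_error(n, **NON_DISTILLED) - power_law_error(n, **DISTILLED)


def critical_threshold(lo=1e3, hi=1e6, iters=60):
    """Bisect in log-space for the n where the error gap changes sign."""
    if error_gap(lo) <= 0 or error_gap(hi) >= 0:
        raise ValueError("no positive-to-negative sign change in the given range")
    log_lo, log_hi = math.log(lo), math.log(hi)
    for _ in range(iters):
        mid = 0.5 * (log_lo + log_hi)
        if error_gap(math.exp(mid)) > 0:
            log_lo = mid  # distillation still ahead: crossover lies at larger n
        else:
            log_hi = mid  # non-distilled already ahead: crossover lies at smaller n
    return math.exp(0.5 * (log_lo + log_hi))


n_star = critical_threshold()
for n in (1e3, 1e4, 1e5, 1e6):
    print(f"n={n:>9.0f}  gap={error_gap(n):+.4f}")
print(f"estimated crossover near n = {n_star:.0f}")
```

With these assumed coefficients the gap is positive at 1K samples and negative at 1M samples, so the bisection finds a single crossover inside the range. Real curves would need to be fitted to measured errors before any threshold could be read off.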

Related papers

Perplexity-Aware Data Scaling Law: Perplexity Landscapes Predict Performance for Continual Pre-training [46.54209378000497]
Scaling laws for pre-training define a power-law relationship between dataset size and the test loss of an LLM. We propose a novel perplexity-aware data scaling law to establish a predictive relationship between the perplexity landscape of domain-specific data and the test loss. Our method consistently identifies near-optimal training subsets and achieves superior performance on both medical and general-domain benchmarks.
arXiv Detail & Related papers (2025-12-25T05:40:46Z)
Functional Scaling Laws in Kernel Regression: Loss Dynamics and Learning Rate Schedules [9.332823269318842]
Scaling laws have emerged as a unifying lens for understanding and guiding the training of large language models. We establish a Functional Scaling Law that captures the full loss trajectory under arbitrary LRSs. We derive explicit scaling relations in both data- and compute-limited regimes.
arXiv Detail & Related papers (2025-09-23T16:05:16Z)
How to inject knowledge efficiently? Knowledge Infusion Scaling Law for Pre-training Large Language Models [17.129300781943655]
Large language models (LLMs) have attracted significant attention due to their impressive general capabilities across diverse downstream tasks. Recent studies show that strategically infusing domain knowledge during pretraining can substantially improve downstream performance. We propose a knowledge infusion scaling law that predicts the optimal amount of domain knowledge to inject into large LLMs.
arXiv Detail & Related papers (2025-09-19T07:46:10Z)
Daunce: Data Attribution through Uncertainty Estimation [7.809316632545256]
Training data attribution (TDA) methods aim to identify which training examples most influence a model's predictions on specific test data. Gradient-based TDA methods rely on gradients and second-order information, limiting their applicability at scale. We introduce Daunce, a simple yet effective data attribution approach through uncertainty estimation.
arXiv Detail & Related papers (2025-05-29T08:08:38Z)
Enhancing Training Data Attribution with Representational Optimization [57.61977909113113]
Training data attribution methods aim to measure how training data impacts a model's predictions. We propose AirRep, a representation-based approach that closes this gap by learning task-specific and model-aligned representations explicitly for TDA. AirRep introduces two key innovations: a trainable encoder tuned for attribution quality, and an attention-based pooling mechanism that enables accurate estimation of group-wise influence.
arXiv Detail & Related papers (2025-05-24T05:17:53Z)
LLMs on the Line: Data Determines Loss-to-Loss Scaling Laws [21.053622641336744]
Loss-to-loss scaling laws relate losses across pretraining datasets and downstream tasks. Our experiments reveal that the pretraining data and tokenizer determine the scaling trend.
arXiv Detail & Related papers (2025-02-17T18:45:25Z)
Scaling laws in wearable human activity recognition [19.49701947129144]
Scaling laws have the potential to help move towards more principled design by linking model capacity with pre-training data volume. By conducting an exhaustive grid search on both amount of pre-training data and Transformer architectures, we establish the first known scaling laws for HAR. We show that these scaling laws translate to downstream performance improvements on three HAR benchmark datasets.
arXiv Detail & Related papers (2025-02-05T17:00:08Z)
What Do Learning Dynamics Reveal About Generalization in LLM Reasoning? [83.83230167222852]
We find that a model's generalization behavior can be effectively characterized by a training metric we call pre-memorization train accuracy. By connecting a model's learning behavior to its generalization, pre-memorization train accuracy can guide targeted improvements to training strategies.
arXiv Detail & Related papers (2024-11-12T09:52:40Z)
Data Adaptive Traceback for Vision-Language Foundation Models in Image Classification [34.37262622415682]
We propose a new adaptation framework called Data Adaptive Traceback. Specifically, we utilize a zero-shot-based method to extract the most downstream task-related subset of the pre-training data. We adopt a pseudo-label-based semi-supervised technique to reuse the pre-training images and a vision-language contrastive learning method to address the confirmation bias issue in semi-supervised learning.
arXiv Detail & Related papers (2024-07-11T18:01:58Z)
Low-rank finetuning for LLMs: A fairness perspective [54.13240282850982]
Low-rank approximation techniques have become the de facto standard for fine-tuning Large Language Models. This paper investigates the effectiveness of these methods in capturing the shift of fine-tuning datasets from the initial pre-trained data distribution. We show that low-rank fine-tuning inadvertently preserves undesirable biases and toxic behaviors.
arXiv Detail & Related papers (2024-05-28T20:43:53Z)
AST: Effective Dataset Distillation through Alignment with Smooth and High-Quality Expert Trajectories [18.266786462036553]
We propose an effective DD framework named AST, standing for Alignment with Smooth and high-quality expert Trajectories. We conduct extensive experiments on datasets of different scales, sizes, and resolutions.
arXiv Detail & Related papers (2023-10-16T16:13:53Z)
Scaling Laws for Sparsely-Connected Foundation Models [70.41266138010657]
We explore the impact of parameter sparsity on the scaling behavior of Transformers trained on massive datasets. We identify the first scaling law describing the relationship between weight sparsity, number of non-zero parameters, and amount of training data.
arXiv Detail & Related papers (2023-09-15T16:29:27Z)
Uncovering the Hidden Cost of Model Compression [43.62624133952414]
Visual Prompting has emerged as a pivotal method for transfer learning in computer vision. Model compression detrimentally impacts the performance of visual prompting-based transfer. However, negative effects on calibration are not present when models are compressed via quantization.
arXiv Detail & Related papers (2023-08-29T01:47:49Z)
A Solvable Model of Neural Scaling Laws [72.8349503901712]
Large language models with a huge number of parameters, when trained on near internet-sized number of tokens, have been empirically shown to obey neural scaling laws. We propose a statistical model -- a joint generative data model and random feature model -- that captures this neural scaling phenomenology. Key findings are the manner in which the power laws that occur in the statistics of natural datasets are extended by nonlinear random feature maps.
arXiv Detail & Related papers (2022-10-30T15:13:18Z)
Knowledge Distillation as Efficient Pre-training: Faster Convergence, Higher Data-efficiency, and Better Transferability [53.27240222619834]
Knowledge Distillation as Efficient Pre-training aims to efficiently transfer the learned feature representation from pre-trained models to new student models for future downstream tasks. Our method performs comparably with supervised pre-training counterparts in 3 downstream tasks and 9 downstream datasets requiring 10x less data and 5x less pre-training time.
arXiv Detail & Related papers (2022-03-10T06:23:41Z)
Provably Efficient Causal Reinforcement Learning with Confounded Observational Data [135.64775986546505]
We study how to incorporate the dataset (observational data) collected offline, which is often abundantly available in practice, to improve the sample efficiency in the online setting. We propose the deconfounded optimistic value iteration (DOVI) algorithm, which incorporates the confounded observational data in a provably efficient manner.
arXiv Detail & Related papers (2020-06-22T14:49:33Z)

This list is automatically generated from the titles and abstracts of the papers on this site.