Scaling Law Phenomena Across Regression Paradigms: Multiple and Kernel Approaches
- URL: http://arxiv.org/abs/2503.01314v1
- Date: Mon, 03 Mar 2025 08:57:49 GMT
- Title: Scaling Law Phenomena Across Regression Paradigms: Multiple and Kernel Approaches
- Authors: Yifang Chen, Xuyang Guo, Xiaoyu Li, Yingyu Liang, Zhenmei Shi, Zhao Song
- Abstract summary: We show that for models with the Transformer architecture, the test loss exhibits a power-law relationship with model size, dataset size, and the amount of computation used in training. Our analysis provides deeper insights into the scaling law, potentially enhancing our understanding of Large Language Models.
- Score: 28.569601803576845
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recently, Large Language Models (LLMs) have achieved remarkable success. A key factor behind this success is the scaling law observed by OpenAI. Specifically, for models with the Transformer architecture, the test loss exhibits a power-law relationship with model size, dataset size, and the amount of computation used in training, with trends that span more than seven orders of magnitude. This scaling law challenges traditional machine learning wisdom, notably Occam's Razor, which suggests that an overparametrized model will overfit the training data and generalize poorly. Recent research has also identified the scaling law in simpler machine learning contexts, such as linear regression. However, fully explaining the scaling law in large practical models remains elusive. In this work, we advance our understanding by demonstrating that the scaling law phenomenon extends to multiple regression and kernel regression settings, which are significantly more expressive and powerful than linear methods. Our analysis provides deeper insights into the scaling law, potentially enhancing our understanding of LLMs.
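To make the power-law claim concrete, here is a minimal, self-contained sketch rather than the paper's construction: kernel ridge regression on a toy one-dimensional task, with test loss measured as the training set grows. If a scaling law holds, the loss should trace a curve of the form L(n) ≈ c · n^(−α), i.e. a straight line on log-log axes. The RBF kernel, the sine target, and the fixed ridge parameter are all assumptions made for this illustration.

```python
# Hedged sketch: measure how kernel ridge regression test error decays with
# sample size n and fit a power-law exponent. All modeling choices here
# (RBF kernel, sine target, ridge strength) are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

def rbf_kernel(A, B, gamma=10.0):
    # Gram matrix K[i, j] = exp(-gamma * ||A[i] - B[j]||^2)
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def test_mse(n, n_test=2000, lam=1e-3, noise=0.05):
    # Train kernel ridge regression on n noisy samples of a smooth target.
    X = rng.uniform(0.0, 1.0, (n, 1))
    y = np.sin(2 * np.pi * X[:, 0]) + noise * rng.standard_normal(n)
    alpha = np.linalg.solve(rbf_kernel(X, X) + lam * np.eye(n), y)
    X_test = rng.uniform(0.0, 1.0, (n_test, 1))
    pred = rbf_kernel(X_test, X) @ alpha
    return np.mean((pred - np.sin(2 * np.pi * X_test[:, 0])) ** 2)

ns = np.array([50, 100, 200, 400, 800, 1600])
losses = np.array([test_mse(n) for n in ns])

# Fit log L = log c - alpha * log n; near-linearity on log-log axes is the
# signature of power-law scaling in the dataset-size direction.
slope, _ = np.polyfit(np.log(ns), np.log(losses), 1)
print(f"estimated scaling exponent alpha ~= {-slope:.2f}")
```

The same log-log fit, applied to model size or compute rather than sample size, is how empirical scaling exponents in the LLM literature are typically estimated.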
Related papers
- Exploring Training and Inference Scaling Laws in Generative Retrieval [50.82554729023865]
We investigate how model size, training data scale, and inference-time compute jointly influence generative retrieval performance.
Our experiments show that n-gram-based methods demonstrate strong alignment with both training and inference scaling laws.
We find that LLaMA models consistently outperform T5 models, suggesting a particular advantage for larger decoder-only models in generative retrieval.
arXiv Detail & Related papers (2025-03-24T17:59:03Z)
- How to Upscale Neural Networks with Scaling Law? A Survey and Practical Guidelines [20.62274005080048]
Early research established power-law relationships in model performance, leading to compute-optimal scaling strategies (a minimal allocation sketch follows this list).
Sparse models, mixture-of-experts, retrieval-augmented learning, and multimodal models often deviate from traditional scaling patterns.
Scaling behaviors vary across domains such as vision, reinforcement learning, and fine-tuning, underscoring the need for more nuanced approaches.
arXiv Detail & Related papers (2025-02-17T17:20:41Z)
- Scaling Law for Stochastic Gradient Descent in Quadratically Parameterized Linear Regression
In machine learning, the scaling law describes how model performance improves as model and data size scale up. This paper studies the scaling law for linear regression with a quadratically parameterized model. As a result, in the canonical linear regression setting, we provide explicit separations between the generalization curves with and without feature learning, and the information-theoretic lower bound that is agnostic to the parametrization method and the algorithm.
arXiv Detail & Related papers (2025-02-13T09:29:04Z)
- Advancing Language Model Reasoning through Reinforcement Learning and Inference Scaling [52.34735382627312]
Large language models (LLMs) have demonstrated remarkable capabilities in complex reasoning tasks.
Existing approaches mainly rely on imitation learning and struggle to achieve effective test-time scaling.
We present T1, which scales reinforcement learning by encouraging exploration, and we study its inference-scaling behavior.
arXiv Detail & Related papers (2025-01-20T18:33:33Z)
- Bayesian scaling laws for in-context learning [72.17734205418502]
In-context learning (ICL) is a powerful technique for getting language models to perform complex tasks with no training updates.
We show that ICL approximates a Bayesian learner and develop a family of novel Bayesian scaling laws for ICL.
arXiv Detail & Related papers (2024-10-21T21:45:22Z)
- Mixtures of Experts Unlock Parameter Scaling for Deep RL [54.26191237981469]
In this paper, we demonstrate that incorporating Mixture-of-Experts (MoE) modules into value-based networks results in more parameter-scalable models.
This work thus provides strong empirical evidence towards developing scaling laws for reinforcement learning.
arXiv Detail & Related papers (2024-02-13T17:18:56Z)
- A Dynamical Model of Neural Scaling Laws [79.59705237659547]
We analyze a random feature model trained with gradient descent as a solvable model of network training and generalization.
Our theory shows how the gap between training and test loss can gradually build up over time due to repeated reuse of data.
arXiv Detail & Related papers (2024-02-02T01:41:38Z)
- Inverse Scaling: When Bigger Isn't Better [80.42834197416444]
Large language models (LMs) show predictable improvements to overall loss with increased scale.
We present evidence for the claim that LMs may show inverse scaling, or worse task performance with increased scale.
arXiv Detail & Related papers (2023-06-15T20:11:23Z)
- Reproducible scaling laws for contrastive language-image learning [42.354402731615444]
We investigate scaling laws for contrastive language-image pre-training (CLIP) with the public LAION dataset and the open-source OpenCLIP repository.
Our large-scale experiments involve models trained on up to two billion image-text pairs and identify power law scaling for multiple downstream tasks.
We find that the training distribution plays a key role in scaling laws as the OpenAI and OpenCLIP models exhibit different scaling behavior despite identical model architectures.
arXiv Detail & Related papers (2022-12-14T10:24:50Z)
- Scaling Laws for Deep Learning [1.90365714903665]
In this thesis, we take a systematic approach to addressing the algorithmic and methodological limitations at the root of deep learning's growing computational costs.
We first demonstrate that deep learning training and pruning are predictable and governed by scaling laws.
We then show, through the exploration of a noiseless realizable case, that deep learning is in fact dominated by error sources very far from the lower error limit.
arXiv Detail & Related papers (2021-08-17T15:37:05Z)
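The compute-optimal scaling strategies mentioned in the survey entry above can be sketched in a few lines. Assuming a Chinchilla-style parametric loss L(N, D) = E + A/N^α + B/D^β and the common approximation C ≈ 6ND training FLOPs (both modeling assumptions here, with placeholder constants rather than fitted values), the optimal budget split falls out of a one-dimensional search:

```python
# Hedged sketch of compute-optimal allocation: given a fixed FLOPs budget C,
# split it between parameters N and tokens D so a Chinchilla-style loss
# L(N, D) = E + A/N**alpha + B/D**beta is minimized. Constants are placeholders.
import numpy as np

E, A, B, ALPHA, BETA = 1.7, 400.0, 400.0, 0.34, 0.28  # assumed, not fitted

def loss(N, D):
    return E + A / N**ALPHA + B / D**BETA

def optimal_split(C, num=4000):
    # Sweep model sizes on a log grid; token count is forced by C ~ 6 * N * D.
    N = np.logspace(6, 12, num)
    D = C / (6.0 * N)
    i = np.argmin(loss(N, D))
    return N[i], D[i]

for C in (1e19, 1e21, 1e23):
    N, D = optimal_split(C)
    print(f"C={C:.0e}: N~{N:.2e} params, D~{D:.2e} tokens, loss~{loss(N, D):.3f}")
```

Because the assumed loss is a sum of two power laws, the optimum also admits a closed form via a Lagrange multiplier; the grid search is used here to keep the assumptions visible.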