The Race to Efficiency: A New Perspective on AI Scaling Laws
- URL: http://arxiv.org/abs/2501.02156v3
- Date: Wed, 08 Jan 2025 14:26:51 GMT
- Title: The Race to Efficiency: A New Perspective on AI Scaling Laws
- Authors: Chien-Ping Lu
- Abstract summary: We introduce a time- and efficiency-aware framework that extends classical AI scaling laws.
Our model shows that, without ongoing efficiency gains, advanced performance could demand millennia of training or unrealistically large GPU fleets.
By formalizing this race to efficiency, we offer a quantitative roadmap for balancing front-loaded GPU investments with incremental improvements across the AI stack.
- Abstract: As large-scale AI models expand, training becomes costlier and sustaining progress grows harder. Classical scaling laws (e.g., Kaplan et al. (2020), Hoffmann et al. (2022)) predict training loss from a static compute budget yet neglect time and efficiency, prompting the question: how can we balance ballooning GPU fleets with rapidly improving hardware and algorithms? We introduce the relative-loss equation, a time- and efficiency-aware framework that extends classical AI scaling laws. Our model shows that, without ongoing efficiency gains, advanced performance could demand millennia of training or unrealistically large GPU fleets. However, near-exponential progress remains achievable if the "efficiency-doubling rate" parallels Moore's Law. By formalizing this race to efficiency, we offer a quantitative roadmap for balancing front-loaded GPU investments with incremental improvements across the AI stack. Empirical trends suggest that sustained efficiency gains can push AI scaling well into the coming decade, providing a new perspective on the diminishing returns inherent in classical scaling.
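To make the trade-off the abstract describes concrete, the sketch below pairs a Chinchilla-style compute scaling law (in the spirit of Hoffmann et al. (2022)) with an exponential efficiency term whose doubling time plays the role of the paper's "efficiency-doubling rate". This is a minimal illustration, not the paper's relative-loss equation: the functional form, the constants, and the two-year doubling time are assumptions chosen for illustration only.

```python
# Minimal sketch, not the paper's relative-loss equation: a generic
# Chinchilla-style loss curve, loss(C) = E + K * C**(-GAMMA), evaluated on
# "effective" compute that grows as hardware/algorithmic efficiency doubles.
# All constants below are illustrative assumptions.

E, K, GAMMA = 1.7, 100.0, 0.3  # irreducible loss, scale constant, compute exponent


def loss_from_compute(effective_compute: float) -> float:
    """Training loss predicted from effective compute (arbitrary units)."""
    return E + K * effective_compute ** (-GAMMA)


def effective_compute(raw_gpu_budget: float, years_elapsed: float,
                      efficiency_doubling_years: float = 2.0) -> float:
    """Raw GPU budget scaled by efficiency gains that double every
    `efficiency_doubling_years` (a Moore's-Law-like assumption)."""
    return raw_gpu_budget * 2.0 ** (years_elapsed / efficiency_doubling_years)


if __name__ == "__main__":
    budget = 1e4  # fixed raw compute budget, arbitrary units
    for years in (0, 2, 4, 8):
        c_eff = effective_compute(budget, years)
        print(f"year {years:>2}: effective compute {c_eff:.2e}, "
              f"predicted loss {loss_from_compute(c_eff):.3f}")
```

With the raw budget held fixed, the predicted loss keeps falling only while the assumed efficiency doubling continues; freezing the efficiency term reproduces the diminishing returns of the classical, compute-only view.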
Related papers
- Scalable Thermodynamic Second-order Optimization [0.0]
We propose a scalable algorithm for employing thermodynamic computers to accelerate a popular second-order optimizer, Kronecker-factored approximate curvature (K-FAC).
Numerical experiments show that even under significant quantization noise, the benefits of second-order optimization can be preserved.
We predict substantial speedups for large-scale vision and graph problems based on realistic hardware characteristics.
arXiv Detail & Related papers (2025-02-12T17:44:40Z) - MARS: Unleashing the Power of Variance Reduction for Training Large Models [56.47014540413659]
We propose MARS, a unified optimization framework for training deep neural networks.
We introduce three instances of MARS that leverage preconditioned gradient optimization.
Results indicate that MARS consistently outperforms AdamW.
arXiv Detail & Related papers (2024-11-15T18:57:39Z) - Accelerating AI Performance using Anderson Extrapolation on GPUs [2.114333871769023]
We present a novel approach for accelerating AI performance by leveraging Anderson extrapolation.
By identifying the crossover point where a mixing penalty is incurred, the method focuses on reducing iterations to convergence.
We demonstrate significant improvements in both training and inference, motivated by scalability and efficiency extensions to the realm of high-performance computing.
arXiv Detail & Related papers (2024-10-25T10:45:17Z) - Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations [62.132347451049455]
Scale has become a main ingredient in obtaining strong machine learning models.
In this work, we argue that scale and training research has been needlessly complex due to reliance on the cosine schedule.
We show that weight averaging yields improved performance along the training trajectory, without additional training costs, across different scales.
arXiv Detail & Related papers (2024-05-28T17:33:54Z) - Time-, Memory- and Parameter-Efficient Visual Adaptation [75.28557015773217]
We propose an adaptation method which does not backpropagate gradients through the backbone.
We achieve this by designing a lightweight network in parallel that operates on features from the frozen, pretrained backbone.
arXiv Detail & Related papers (2024-02-05T10:55:47Z) - Increased Compute Efficiency and the Diffusion of AI Capabilities [1.1838866556981258]
Training advanced AI models requires large investments in computational resources, or compute.
As hardware innovation reduces the price of compute and algorithmic advances make its use more efficient, the cost of training an AI model to a given performance falls over time.
We find that while an access effect increases the number of actors who can train models to a given performance over time, a performance effect simultaneously increases the performance available to each actor.
arXiv Detail & Related papers (2023-11-26T18:36:28Z) - Towards Compute-Optimal Transfer Learning [82.88829463290041]
We argue that zero-shot structured pruning of pretrained models allows them to increase compute efficiency with minimal reduction in performance.
Our results show that pruning convolutional filters of pretrained models can lead to more than 20% performance improvement in low computational regimes.
arXiv Detail & Related papers (2023-04-25T21:49:09Z) - Scaling Laws Beyond Backpropagation [64.0476282000118]
We study the ability of Direct Feedback Alignment to train causal decoder-only Transformers efficiently.
We find that DFA fails to offer more efficient scaling than backpropagation.
arXiv Detail & Related papers (2022-10-26T10:09:14Z) - Efficient Few-Shot Object Detection via Knowledge Inheritance [62.36414544915032]
Few-shot object detection (FSOD) aims at learning a generic detector that can adapt to unseen tasks with scarce training samples.
We present an efficient pretrain-transfer framework (PTF) baseline with no computational increment.
We also propose an adaptive length re-scaling (ALR) strategy to alleviate the vector length inconsistency between the predicted novel weights and the pretrained base weights.
arXiv Detail & Related papers (2022-03-23T06:24:31Z) - Measuring the Algorithmic Efficiency of Neural Networks [1.1108287264548806]
We show that the number of floating-point operations required to train a classifier to AlexNet-level performance has decreased by a factor of 44x between 2012 and 2019.
This corresponds to algorithmic efficiency doubling every 16 months over a period of 7 years (a quick arithmetic check follows this list).
We observe that hardware and algorithmic efficiency gains multiply and can be on a similar scale over meaningful horizons, which suggests that a good model of AI progress should integrate measures from both.
arXiv Detail & Related papers (2020-05-08T22:26:37Z)
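As a quick check of the arithmetic quoted in the last entry above, the snippet below recovers the doubling period implied by a 44x reduction in training compute over seven years; the ~16-month figure is that paper's own rounding.

```python
import math

# 44x fewer training FLOPs over 7 years implies log2(44) efficiency doublings.
total_months = 7 * 12
doublings = math.log2(44)
print(f"implied doubling time ≈ {total_months / doublings:.1f} months")  # ≈ 15.4, i.e. ~16 months
```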