Fast and Accurate Model Scaling
- URL: http://arxiv.org/abs/2103.06877v1
- Date: Thu, 11 Mar 2021 18:59:14 GMT
- Title: Fast and Accurate Model Scaling
- Authors: Piotr Dollár and Mannat Singh and Ross Girshick
- Abstract summary: Example scaling strategies may include increasing model width, depth, resolution, etc.
We show that various scaling strategies affect model parameters, activations, and consequently actual runtime quite differently.
Unlike currently popular scaling strategies, which result in about $O(s)$ increase in model activations w.r.t. scaling flops by a factor of $s$, the proposed fast compound scaling results in close to $O(\sqrt{s})$ increase in activations, while achieving excellent accuracy.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this work we analyze strategies for convolutional neural network scaling;
that is, the process of scaling a base convolutional network to endow it with
greater computational complexity and consequently representational power.
Example scaling strategies may include increasing model width, depth,
resolution, etc. While various scaling strategies exist, their tradeoffs are
not fully understood. Existing analysis typically focuses on the interplay of
accuracy and flops (floating point operations). Yet, as we demonstrate, various
scaling strategies affect model parameters, activations, and consequently
actual runtime quite differently. In our experiments we show the surprising
result that numerous scaling strategies yield networks with similar accuracy
but with widely varying properties. This leads us to propose a simple fast
compound scaling strategy that encourages primarily scaling model width, while
scaling depth and resolution to a lesser extent. Unlike currently popular
scaling strategies, which result in about $O(s)$ increase in model activation
w.r.t. scaling flops by a factor of $s$, the proposed fast compound scaling
results in close to $O(\sqrt{s})$ increase in activations, while achieving
excellent accuracy. This leads to comparable speedups on modern memory-limited
hardware (e.g., GPU, TPU). More generally, we hope this work provides a
framework for analyzing and selecting scaling strategies under various
computational constraints.
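The following Python sketch illustrates the flop and activation accounting behind this argument. The exponent parameterization (e_w = alpha/2 and e_d = e_r = (1 - alpha)/3, chosen so that flops scale by exactly s, since convolutional flops grow as d * w^2 * r^2) is an assumption in the spirit of the paper's width-biased family, not its exact formulation, and the base network shape is hypothetical.

```python
# Minimal sketch of width-biased compound scaling, assuming the
# parameterization e_w = alpha/2, e_d = e_r = (1 - alpha)/3 so that
# flops grow by exactly a factor of s (flops ~ d * w^2 * r^2).
# This is an illustrative assumption, not the paper's exact recipe.

def compound_scale(depth, width, resolution, s, alpha=0.8):
    """Scale (depth, width, resolution) so that flops grow by a factor of s.

    alpha controls how much of the scaling goes into width:
      alpha = 1.0 -> width-only scaling (activations grow ~ sqrt(s))
      alpha = 0.4 -> roughly uniform scaling across depth, width, resolution
    """
    e_w = alpha / 2.0          # width exponent
    e_d = (1.0 - alpha) / 3.0  # depth exponent
    e_r = (1.0 - alpha) / 3.0  # resolution exponent
    return (depth * s ** e_d,
            width * s ** e_w,
            resolution * s ** e_r)


def flops(d, w, r):
    # Convolutional flops scale as depth * width^2 * resolution^2.
    return d * w ** 2 * r ** 2


def activations(d, w, r):
    # Activation count scales as depth * width * resolution^2.
    return d * w * r ** 2


if __name__ == "__main__":
    base = (20, 64, 224)  # hypothetical base network: depth, width, resolution
    s = 4.0               # target flop multiplier
    for alpha in (0.0, 0.4, 0.8, 1.0):
        d, w, r = compound_scale(*base, s=s, alpha=alpha)
        print(f"alpha={alpha:.1f}  "
              f"flops x{flops(d, w, r) / flops(*base):.2f}  "
              f"activations x{activations(d, w, r) / activations(*base):.2f}")
```

With s = 4, the printout shows that depth- and resolution-heavy scaling (alpha near 0) grows activations roughly linearly in s, while width-biased scaling (alpha near 1) grows them close to sqrt(s), which is the gap that drives runtime differences on memory-limited accelerators.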
Related papers
- Scaling over Scaling: Exploring Test-Time Scaling Plateau in Large Reasoning Models [7.2703757624760526]
Large reasoning models (LRMs) have exhibited the capacity of enhancing reasoning performance via internal test-time scaling. As we push these scaling boundaries, understanding the practical limits and achieving optimal resource allocation becomes a critical challenge. In this paper, we investigate the scaling plateau of test-time scaling and introduce the Test-Time Scaling Performance Model (TTSPM).
arXiv Detail & Related papers (2025-05-26T20:58:45Z) - Rethinking the Role of Prompting Strategies in LLM Test-Time Scaling: A Perspective of Probability Theory [79.63672515243765]
In this paper, we focus on a standard and realistic scaling setting: majority voting. We show that as the sampling time and computational overhead increase, complicated prompting strategies with superior initial performance gradually fall behind simple Chain-of-Thought. We propose a probabilistic method to efficiently predict scaling performance and identify the best prompting strategy under large sampling times.
arXiv Detail & Related papers (2025-05-16T08:28:57Z) - LESA: Learnable LLM Layer Scaling-Up [57.0510934286449]
Training Large Language Models (LLMs) from scratch requires immense computational resources, making it prohibitively expensive.
Model scaling-up offers a promising solution by leveraging the parameters of smaller models to create larger ones.
We propose LESA, a novel learnable method for depth scaling-up.
arXiv Detail & Related papers (2025-02-19T14:58:48Z) - How to Upscale Neural Networks with Scaling Law? A Survey and Practical Guidelines [20.62274005080048]
Early research established power-law relationships in model performance, leading to compute-optimal scaling strategies.
Sparse models, mixture-of-experts, retrieval-augmented learning, and multimodal models often deviate from traditional scaling patterns.
Scaling behaviors vary across domains such as vision, reinforcement learning, and fine-tuning, underscoring the need for more nuanced approaches.
arXiv Detail & Related papers (2025-02-17T17:20:41Z) - Revisiting the Test-Time Scaling of o1-like Models: Do they Truly Possess Test-Time Scaling Capabilities? [61.85289698610747]
We study whether o1-like large language models (LLMs) truly possess test-time scaling capabilities.
We find that longer CoTs of these o1-like models do not consistently enhance accuracy.
We propose Shortest Majority Vote, a method that combines parallel scaling strategies with CoT length characteristics.
arXiv Detail & Related papers (2025-02-17T07:21:11Z) - The N-Grammys: Accelerating Autoregressive Inference with Learning-Free Batched Speculation [48.52206677611072]
Speculative decoding aims to speed up autoregressive generation of a language model by verifying in parallel the tokens generated by a smaller draft model.
We show that combinations of simple strategies can achieve significant inference speedups over different tasks.
arXiv Detail & Related papers (2024-11-06T09:23:50Z) - The Importance of Being Scalable: Improving the Speed and Accuracy of Neural Network Interatomic Potentials Across Chemical Domains [4.340917737559795]
We study scaling in Neural Network Interatomic Potentials (NNIPs).
NNIPs act as surrogate models for ab initio quantum mechanical calculations.
We develop an NNIP architecture designed for scaling: the Efficiently Scaled Attention Interatomic Potential (EScAIP).
arXiv Detail & Related papers (2024-10-31T17:35:57Z) - Beyond Uniform Scaling: Exploring Depth Heterogeneity in Neural Architectures [9.91972450276408]
We introduce an automated scaling approach leveraging second-order loss landscape information.
Our method is flexible towards skip connections, a mainstay in modern vision transformers.
We introduce the first intact scaling mechanism for vision transformers, a step towards efficient model scaling.
arXiv Detail & Related papers (2024-02-19T09:52:45Z) - A Dynamical Model of Neural Scaling Laws [79.59705237659547]
We analyze a random feature model trained with gradient descent as a solvable model of network training and generalization.
Our theory shows how the gap between training and test loss can gradually build up over time due to repeated reuse of data.
arXiv Detail & Related papers (2024-02-02T01:41:38Z) - Effective Invertible Arbitrary Image Rescaling [77.46732646918936]
Invertible Neural Networks (INN) are able to increase upscaling accuracy significantly by optimizing the downscaling and upscaling cycle jointly.
In this work, a simple and effective invertible arbitrary rescaling network (IARN) is proposed to achieve arbitrary image rescaling by training only one model.
It is shown to achieve a state-of-the-art (SOTA) performance in bidirectional arbitrary rescaling without compromising perceptual quality in LR outputs.
arXiv Detail & Related papers (2022-09-26T22:22:30Z) - ScaleNet: Searching for the Model to Scale [44.05380012545087]
We propose ScaleNet to jointly search base model and scaling strategy.
We show that our scaled networks enjoy significant performance superiority across various FLOP budgets.
arXiv Detail & Related papers (2022-07-15T03:16:43Z) - Adaptive Perturbation for Adversarial Attack [50.77612889697216]
We propose a new gradient-based attack method for adversarial examples.
We use the exact gradient direction with a scaling factor for generating adversarial perturbations.
Our method exhibits higher transferability and outperforms the state-of-the-art methods.
arXiv Detail & Related papers (2021-11-27T07:57:41Z) - Exploring Sparse Expert Models and Beyond [51.90860155810848]
Mixture-of-Experts (MoE) models can achieve promising results with an outrageously large number of parameters but constant computation cost.
We propose a simple method called expert prototyping that splits experts into different prototypes and applies $k$ top-$1$ routing.
This strategy improves the model quality but maintains constant computational costs, and our further exploration on extremely large-scale models reflects that it is more effective in training larger models.
arXiv Detail & Related papers (2021-05-31T16:12:44Z) - Revisiting ResNets: Improved Training and Scaling Strategies [54.0162571976267]
Training and scaling strategies may matter more than architectural changes, and the resulting ResNets match recent state-of-the-art models.
We show that the best performing scaling strategy depends on the training regime.
We design a family of ResNet architectures, ResNet-RS, which are 1.7x - 2.7x faster than EfficientNets on TPUs, while achieving similar accuracies on ImageNet.
arXiv Detail & Related papers (2021-03-13T00:18:19Z) - Explaining Neural Scaling Laws [17.115592382420626]
Population loss of trained deep neural networks often follows precise power-law scaling relations.
We propose a theory that explains the origins of and connects these scaling laws.
We identify variance-limited and resolution-limited scaling behavior for both dataset and model size.
arXiv Detail & Related papers (2021-02-12T18:57:46Z)