Scalify: scale propagation for efficient low-precision LLM training
- URL: http://arxiv.org/abs/2407.17353v1
- Date: Wed, 24 Jul 2024 15:26:01 GMT
- Title: Scalify: scale propagation for efficient low-precision LLM training
- Authors: Paul Balança, Sam Hosegood, Carlo Luschi, Andrew Fitzgibbon
- Abstract summary: Low-precision formats such as float8 have been introduced in machine learning accelerator hardware to improve computational efficiency for large language model training and inference.
We present Scalify, an end-to-end scale propagation paradigm for computational graphs.
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Low-precision formats such as float8 have been introduced in machine learning accelerator hardware to improve computational efficiency for large language model training and inference. Nevertheless, adoption by the ML community has been slowed down by the complex, and sometimes brittle, techniques required to match higher precision training accuracy. In this work, we present Scalify, an end-to-end scale propagation paradigm for computational graphs, generalizing and formalizing existing tensor scaling methods. Experimental results show that Scalify supports out-of-the-box float8 matrix multiplication and gradient representation, as well as float16 optimizer state storage. Our JAX implementation of Scalify is open-sourced at https://github.com/graphcore-research/jax-scalify
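To make the scale-propagation idea concrete, here is a minimal JAX sketch of the general approach: each tensor is carried as a (data, scale) pair, and scales are combined analytically through operations such as matmul so the low-precision payload stays in a well-behaved range. The `ScaledArray` type and the propagation rules below are illustrative assumptions, not the jax-scalify API.

```python
# Minimal sketch of scale propagation (illustrative; not the jax-scalify API).
from typing import NamedTuple
import jax
import jax.numpy as jnp


class ScaledArray(NamedTuple):
    """Tensor represented as data * scale, with data kept near unit magnitude."""
    data: jax.Array   # low-precision payload (e.g. float8/float16 in practice)
    scale: jax.Array  # scalar scale factor kept in higher precision


def scaled(x: jax.Array) -> ScaledArray:
    # Pull out a power-of-two scale so `data` stays near unit magnitude.
    s = 2.0 ** jnp.round(jnp.log2(jnp.max(jnp.abs(x)) + 1e-12))
    return ScaledArray(data=(x / s).astype(jnp.float16), scale=s)


def scaled_matmul(a: ScaledArray, b: ScaledArray) -> ScaledArray:
    # Multiply the narrow-range payloads; combine scales analytically.
    out = jnp.matmul(a.data.astype(jnp.float32), b.data.astype(jnp.float32))
    return ScaledArray(data=out.astype(jnp.float16), scale=a.scale * b.scale)


x = scaled(jax.random.normal(jax.random.PRNGKey(0), (128, 64)) * 1e-3)
w = scaled(jax.random.normal(jax.random.PRNGKey(1), (64, 32)) * 50.0)
y = scaled_matmul(x, w)
print(y.data.dtype, y.scale)  # payload stays representable; the scale carries magnitude
```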
Related papers
- MPX: Mixed Precision Training for JAX [54.62458721568289]
Mixed-precision training has emerged as an indispensable tool for enhancing the efficiency of neural network training. We propose MPX, a mixed-precision training toolbox for JAX that simplifies and accelerates the training of large-scale neural networks. MPX seamlessly integrates with popular toolboxes such as Equinox and Flax, allowing users to convert full-precision pipelines to mixed-precision versions.
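For reference, the standard mixed-precision pattern that such toolboxes automate can be sketched by hand in JAX: keep a float32 master copy of the parameters and cast to bfloat16 for the compute-heavy path. The snippet below is a generic sketch of that pattern, not MPX's actual interface.

```python
# Generic mixed-precision training step in JAX (illustrative; not the MPX API).
import jax
import jax.numpy as jnp


def loss_fn(params_f32, x, y):
    # Cast parameters and inputs to bfloat16 for the compute-heavy path.
    w, b = jax.tree_util.tree_map(lambda p: p.astype(jnp.bfloat16), params_f32)
    pred = x.astype(jnp.bfloat16) @ w + b
    # Accumulate the loss in float32 for numerical stability.
    return jnp.mean((pred.astype(jnp.float32) - y) ** 2)


@jax.jit
def train_step(params, x, y, lr=1e-2):
    loss, grads = jax.value_and_grad(loss_fn)(params, x, y)
    # Parameter update stays in float32 (the "master" copy).
    new_params = jax.tree_util.tree_map(lambda p, g: p - lr * g, params, grads)
    return new_params, loss


key = jax.random.PRNGKey(0)
params = (jax.random.normal(key, (8, 1)), jnp.zeros((1,)))
x, y = jax.random.normal(key, (32, 8)), jnp.ones((32, 1))
params, loss = train_step(params, x, y)
```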
arXiv Detail & Related papers (2025-07-04T05:47:04Z) - Unified Scaling Laws for Compressed Representations [69.72517034565467]
We investigate whether a unified scaling framework can accurately predict model performance when training occurs over various compressed representations. We demonstrate both theoretically and empirically that there exists a simple "capacity" metric. We extend our formulation to directly compare the accuracy potential of different compressed formats, and to derive better algorithms for training over sparse-quantized formats.
arXiv Detail & Related papers (2025-06-02T16:52:51Z) - Recipes for Pre-training LLMs with MXFP8 [0.0]
Precision scaling has emerged as a compelling technique for improving GPU efficiency without sacrificing accuracy. MX-formats offer improved numeric stability compared to other reduced-precision representations. We show that an improved rounding mode, which uses round-to-infinity to compute scaling factors, enables successful pre-training in MXFP8 for an 8B model on 15T tokens.
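The rounding-mode point can be made concrete: when the per-block power-of-two scale is derived from the block's maximum magnitude, rounding the exponent toward infinity guarantees the largest element still fits in the float8 range, whereas round-to-nearest can overflow it. The sketch below assumes an E4M3-style maximum of 448 and only illustrates that rounding effect; it is not the paper's full MXFP8 recipe.

```python
# Illustrative block scaling for an MX-style format (assumed E4M3 max of 448).
import jax.numpy as jnp

E4M3_MAX = 448.0  # largest representable magnitude in float8 E4M3


def block_scale(block, round_up=True):
    """Power-of-two scale for a block, chosen so block/scale fits in E4M3."""
    amax = jnp.max(jnp.abs(block))
    exp = jnp.log2(amax / E4M3_MAX)
    # Round-to-infinity (ceil) guarantees amax/scale <= E4M3_MAX;
    # round-to-nearest can round the exponent down and overflow the max element.
    exp = jnp.ceil(exp) if round_up else jnp.round(exp)
    return 2.0 ** exp


block = jnp.array([0.03, -1.2, 300.0, 600.0])
for round_up in (True, False):
    s = block_scale(block, round_up)
    print(round_up, float(s), float(jnp.max(jnp.abs(block / s))))
# With ceil, max(|block| / scale) stays <= 448; with round-to-nearest it exceeds it here.
```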
arXiv Detail & Related papers (2025-05-30T21:08:15Z) - QuEST: Stable Training of LLMs with 1-Bit Weights and Activations [27.644652093888745]
QuEST is a new method for training sparse or quantized language models.
We show that QuEST induces stable scaling laws across the entire range of hardware-supported precisions.
We provide GPU kernel support showing that models produced by QuEST can be executed efficiently.
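QuEST's quantizer is more elaborate than the basic building block it shares with other quantized-training methods, namely the straight-through estimator (STE). The sketch below shows only that generic STE pattern, not QuEST's specific quantizer.

```python
# Generic straight-through-estimator (STE) pattern for quantized training in JAX.
import jax
import jax.numpy as jnp


def quantize_ste(w, num_bits=2):
    """Symmetric uniform quantization with a straight-through gradient."""
    scale = jnp.max(jnp.abs(w)) / (2 ** (num_bits - 1) - 1) + 1e-12
    w_q = jnp.round(w / scale) * scale
    # Forward pass uses the quantized weights; backward pass sees the identity.
    return w + jax.lax.stop_gradient(w_q - w)


def loss_fn(w, x, y):
    pred = x @ quantize_ste(w, num_bits=2)
    return jnp.mean((pred - y) ** 2)


key = jax.random.PRNGKey(0)
w = jax.random.normal(key, (8, 1))
x, y = jax.random.normal(key, (32, 8)), jnp.ones((32, 1))
grads = jax.grad(loss_fn)(w, x, y)  # gradients flow as if quantization were identity
```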
arXiv Detail & Related papers (2025-02-07T15:23:34Z) - Breaking the Memory Barrier: Near Infinite Batch Size Scaling for Contrastive Loss [59.835032408496545]
We propose a tile-based strategy that partitions the contrastive loss calculation into arbitrarily small blocks.
We also introduce a multi-level tiling strategy to leverage the hierarchical structure of distributed systems.
Compared to SOTA memory-efficient solutions, it achieves a two-order-of-magnitude reduction in memory while maintaining comparable speed.
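The single-device core of such a tiled computation is easy to sketch: the per-row log-sum-exp of the similarity matrix is accumulated block by block, so the full n-by-n matrix is never materialized. The paper's multi-level, distributed tiling is more involved; the snippet below only illustrates the basic idea.

```python
# Sketch of a tiled InfoNCE-style loss: per-row log-sum-exp is accumulated
# over column blocks, so only an (n, block) tile exists at any time.
import jax
import jax.numpy as jnp
from jax.scipy.special import logsumexp


def tiled_contrastive_loss(q, k, block=256, temperature=0.07):
    n = q.shape[0]
    pos = jnp.sum(q * k, axis=-1) / temperature           # positive-pair logits
    running_lse = jnp.full((n,), -jnp.inf)
    for start in range(0, n, block):                       # unrolled at trace time
        sim = q @ k[start:start + block].T / temperature   # one (n, block) tile
        running_lse = jnp.logaddexp(running_lse, logsumexp(sim, axis=-1))
    return jnp.mean(running_lse - pos)                     # -log softmax of positives


q = jax.random.normal(jax.random.PRNGKey(0), (1024, 64))
k = jax.random.normal(jax.random.PRNGKey(1), (1024, 64))
print(tiled_contrastive_loss(q, k))
```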
arXiv Detail & Related papers (2024-10-22T17:59:30Z) - Winner-Take-All Column Row Sampling for Memory Efficient Adaptation of Language Model [89.8764435351222]
We propose a new family of unbiased estimators called WTA-CRS for matrix multiplication with reduced variance.
Our work provides both theoretical and experimental evidence that, in the context of tuning transformers, our proposed estimators exhibit lower variance compared to existing ones.
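The classical column-row sampling (CRS) estimator that WTA-CRS refines can be written in a few lines: sample column/row pairs proportionally to their norms and reweight so the expectation equals the exact product. The sketch below shows plain CRS, not the winner-take-all variant.

```python
# Classic column-row sampling (CRS) estimate of A @ B, the starting point WTA-CRS
# refines; unbiased because each sampled outer product is reweighted by 1/(c * p_i).
import jax
import jax.numpy as jnp


def crs_matmul(key, A, B, c):
    """Unbiased estimate of A @ B using c sampled column/row pairs."""
    probs = jnp.linalg.norm(A, axis=0) * jnp.linalg.norm(B, axis=1)
    probs = probs / jnp.sum(probs)
    idx = jax.random.choice(key, A.shape[1], shape=(c,), p=probs, replace=True)
    # Scale each sampled outer product so the expectation equals A @ B.
    weights = 1.0 / (c * probs[idx])
    return (A[:, idx] * weights) @ B[idx, :]


key = jax.random.PRNGKey(0)
A = jax.random.normal(key, (64, 512))
B = jax.random.normal(jax.random.PRNGKey(1), (512, 32))
approx = crs_matmul(key, A, B, c=128)
print(jnp.linalg.norm(approx - A @ B) / jnp.linalg.norm(A @ B))  # relative error
```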
arXiv Detail & Related papers (2023-05-24T15:52:08Z) - DeepGEMM: Accelerated Ultra Low-Precision Inference on CPU Architectures using Lookup Tables [49.965024476651706]
DeepGEMM is a lookup-table-based approach for the execution of ultra-low-precision convolutional neural networks on SIMD hardware.
Our implementation outperforms corresponding 8-bit integer kernels by up to 1.74x on x86 platforms.
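The lookup-table idea can be illustrated for 2-bit weights: each activation has only four possible products (one per weight level), so they can be precomputed once and gathered instead of multiplied. DeepGEMM's SIMD kernels pack several weights per lookup; the sketch below, with an assumed codebook, only conveys the table idea.

```python
# Simplified LUT dot product for 2-bit weights: each activation contributes one of
# only four possible products, so products are precomputed and gathered, not multiplied.
import jax
import jax.numpy as jnp

LEVELS = jnp.array([-1.5, -0.5, 0.5, 1.5])  # assumed 2-bit codebook (illustrative)


def lut_dot(w_codes, x):
    """Dot product where w_codes[i] in {0, 1, 2, 3} indexes LEVELS."""
    table = LEVELS[:, None] * x[None, :]          # (4, n): every possible product
    return jnp.sum(table[w_codes, jnp.arange(x.shape[0])])


key = jax.random.PRNGKey(0)
x = jax.random.normal(key, (256,))
codes = jax.random.randint(jax.random.PRNGKey(1), (256,), 0, 4)
print(lut_dot(codes, x), jnp.dot(LEVELS[codes], x))  # identical results
```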
arXiv Detail & Related papers (2023-04-18T15:13:10Z) - HesScale: Scalable Computation of Hessian Diagonals [2.398608007786179]
HesScale is a scalable approach to approximating the diagonal of the Hessian matrix.
We show that HesScale has the same computational complexity as backpropagation.
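HesScale's backpropagation-style recursion is its own contribution; as a point of comparison, a standard way to estimate the Hessian diagonal is Hutchinson's estimator built on Hessian-vector products, sketched below in JAX. This is a baseline illustration, not HesScale's update rule.

```python
# Hutchinson-style estimate of the Hessian diagonal via Hessian-vector products.
# (HesScale uses a cheaper backprop-like recursion; this is a standard baseline.)
import jax
import jax.numpy as jnp


def f(w):
    return jnp.sum(jnp.sin(w) ** 2) + 0.5 * jnp.sum(w ** 4)


def hessian_diag_estimate(key, w, num_samples=100):
    def hvp(v):
        # Forward-over-reverse Hessian-vector product.
        return jax.jvp(jax.grad(f), (w,), (v,))[1]

    zs = jax.random.rademacher(key, (num_samples,) + w.shape).astype(w.dtype)
    return jnp.mean(jax.vmap(lambda z: z * hvp(z))(zs), axis=0)  # E[z * Hz] = diag(H)


w = jnp.array([0.3, -1.2, 2.0])
print(hessian_diag_estimate(jax.random.PRNGKey(0), w))  # stochastic estimate
print(jnp.diag(jax.hessian(f)(w)))                      # exact diagonal for comparison
```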
arXiv Detail & Related papers (2022-10-20T23:50:56Z) - 8-bit Optimizers via Block-wise Quantization [57.25800395197516]
Stateful optimizers maintain statistics over time, e.g., the exponentially smoothed sum (SGD with momentum) or squared sum (Adam) of past gradient values.
This state can be used to accelerate optimization compared to plain gradient descent but uses memory that might otherwise be allocated to model parameters.
In this paper, we develop the first optimizers that use 8-bit statistics while maintaining the performance levels of using 32-bit optimizer states.
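The block-wise bookkeeping behind 8-bit optimizer states can be sketched as follows: each block of a state tensor stores int8 codes plus one higher-precision absmax scale. The real 8-bit optimizers use a non-uniform dynamic quantization map and additional stability tricks; the linear version below is only an illustration.

```python
# Sketch of block-wise 8-bit quantization of an optimizer state tensor:
# each block stores int8 codes plus one float scale (absmax of the block).
import jax
import jax.numpy as jnp

BLOCK = 64


def quantize_blockwise(state):
    blocks = state.reshape(-1, BLOCK)
    scales = jnp.max(jnp.abs(blocks), axis=1, keepdims=True) + 1e-12
    codes = jnp.round(blocks / scales * 127.0).astype(jnp.int8)
    return codes, scales


def dequantize_blockwise(codes, scales):
    return (codes.astype(jnp.float32) / 127.0 * scales).reshape(-1)


state = jax.random.normal(jax.random.PRNGKey(0), (4096,)) * 1e-3
codes, scales = quantize_blockwise(state)
restored = dequantize_blockwise(codes, scales)
print(jnp.max(jnp.abs(restored - state)))  # error bounded by each block's scale / 254
```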
arXiv Detail & Related papers (2021-10-06T15:43:20Z) - Training with reduced precision of a support vector machine model for text classification [0.0]
This work focuses on comparing the efficiency of an SVM model trained using reduced precision with that of its original full-precision form.
The main advantage of quantization is a decrease in computation time and memory footprint on the dedicated hardware platform.
arXiv Detail & Related papers (2020-07-17T11:59:30Z) - AdaScale SGD: A User-Friendly Algorithm for Distributed Training [29.430153773234363]
We propose AdaScale SGD, an algorithm that reliably adapts learning rates to large-batch training.
By continually adapting to the gradient's variance, AdaScale achieves speed-ups for a wide range of batch sizes.
This includes large-batch training with no model degradation for machine translation, image classification, object detection, and speech recognition tasks.
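The core of AdaScale's learning-rate adaptation is a "gain" that compares gradient variance across replicas to the squared mean gradient. The sketch below computes that ratio from per-replica gradients; the paper's moving-average estimation details are omitted, and the exact formula here is a simplified reading.

```python
# Sketch of an AdaScale-style "gain" for rescaling the learning rate at large batch
# sizes: ratio of (variance + squared mean) at batch scale 1 vs. batch scale S.
import jax
import jax.numpy as jnp


def adascale_gain(per_replica_grads):
    """per_replica_grads: (S, d) array of each replica's gradient."""
    S = per_replica_grads.shape[0]
    g_mean = jnp.mean(per_replica_grads, axis=0)
    # Per-replica gradient variance (trace of the covariance), unbiased.
    var = jnp.mean(jnp.sum((per_replica_grads - g_mean) ** 2, axis=1)) * S / (S - 1)
    mu_sq = jnp.maximum(jnp.sum(g_mean ** 2) - var / S, 0.0)
    return (var + mu_sq) / (var / S + mu_sq + 1e-12)  # gain lies in [1, S]


grads = jax.random.normal(jax.random.PRNGKey(0), (8, 1000)) + 0.1  # 8 replicas
print(adascale_gain(grads))  # scale the base learning rate by this factor
```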
arXiv Detail & Related papers (2020-07-09T23:26:13Z) - Real-Time Regression with Dividing Local Gaussian Processes [62.01822866877782]
Local Gaussian processes are a novel, computationally efficient modeling approach based on Gaussian process regression.
Due to an iterative, data-driven division of the input space, they achieve a sublinear computational complexity in the total number of training points in practice.
A numerical evaluation on real-world data sets shows their advantages over other state-of-the-art methods in terms of accuracy as well as prediction and update speed.
arXiv Detail & Related papers (2020-06-16T18:43:31Z) - AdamP: Slowing Down the Slowdown for Momentum Optimizers on
Scale-invariant Weights [53.8489656709356]
Normalization techniques are a boon for modern deep learning.
It is often overlooked, however, that the additional introduction of momentum results in a rapid reduction in effective step sizes for scale-invariant weights.
In this paper, we verify that the widely-adopted combination of the two ingredients leads to the premature decay of effective step sizes and sub-optimal model performance.
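The mechanism AdamP uses to counteract this can be sketched as a projection: for scale-invariant weights, the radial component of the update (the part that only grows the weight norm) is removed. The snippet below shows just that projection, not AdamP's full conditional rule on top of Adam.

```python
# Sketch of the projection idea behind AdamP: remove the radial (norm-growing)
# component of an update for scale-invariant weights.
import jax.numpy as jnp


def project_radial_out(update, weight, eps=1e-8):
    w_unit = weight / (jnp.linalg.norm(weight) + eps)
    return update - jnp.sum(update * w_unit) * w_unit  # keep only the tangential part


w = jnp.array([3.0, 4.0])
u = jnp.array([1.0, 1.0])
u_proj = project_radial_out(u, w)
print(jnp.dot(u_proj, w))  # ~0: the projected update no longer changes ||w|| (to first order)
```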
arXiv Detail & Related papers (2020-06-15T08:35:15Z) - Improving the convergence of SGD through adaptive batch sizes [0.1813006808606333]
Mini-batch stochastic gradient descent (SGD) and variants thereof approximate the objective function's gradient with a small number of training examples.
This work presents a method to adapt the batch size to the model's training loss.
arXiv Detail & Related papers (2019-10-18T01:45:03Z)