Unit Scaling: Out-of-the-Box Low-Precision Training
- URL: http://arxiv.org/abs/2303.11257v2
- Date: Tue, 30 May 2023 22:05:40 GMT
- Title: Unit Scaling: Out-of-the-Box Low-Precision Training
- Authors: Charlie Blake, Douglas Orr, Carlo Luschi
- Abstract summary: Unit scaling is a paradigm for designing deep learning models that simplifies the use of low-precision number formats.
Training in FP16 or the recently proposed FP8 formats offers substantial efficiency gains, but can lack sufficient range for out-of-the-box training.
Unit scaling addresses this by introducing a principled approach to model numerics: seeking unit variance of all weights, activations and gradients at initialisation.
- Score: 1.7188280334580197
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: We present unit scaling, a paradigm for designing deep learning models that
simplifies the use of low-precision number formats. Training in FP16 or the
recently proposed FP8 formats offers substantial efficiency gains, but can lack
sufficient range for out-of-the-box training. Unit scaling addresses this by
introducing a principled approach to model numerics: seeking unit variance of
all weights, activations and gradients at initialisation. Unlike alternative
methods, this approach neither requires multiple training runs to find a
suitable scale nor has significant computational overhead. We demonstrate the
efficacy of unit scaling across a range of models and optimisers. We further
show that existing models can be adapted to be unit-scaled, training BERT-Large
in FP16 and then FP8 with no degradation in accuracy.
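To make the recipe concrete, here is a minimal PyTorch sketch of a unit-scaled matmul, assuming the usual device of applying separate, statically chosen scales in the forward and backward passes; the helper names (`ScaledMatmul`, `unit_scaled_linear`) are ours, not the paper's:

```python
import torch

class ScaledMatmul(torch.autograd.Function):
    """Matmul with independent, statically chosen forward/backward scales."""

    @staticmethod
    def forward(ctx, x, w, fwd_scale, grad_x_scale, grad_w_scale):
        ctx.save_for_backward(x, w)
        ctx.scales = (grad_x_scale, grad_w_scale)
        return (x @ w) * fwd_scale

    @staticmethod
    def backward(ctx, grad_out):
        x, w = ctx.saved_tensors
        grad_x_scale, grad_w_scale = ctx.scales
        # Each gradient gets its own scale, chosen so its variance is ~1.
        return ((grad_out @ w.t()) * grad_x_scale,
                (x.t() @ grad_out) * grad_w_scale,
                None, None, None)

def unit_scaled_linear(x, w):
    (batch, fan_in), fan_out = x.shape, w.shape[1]
    # fan_in**-0.5 normalises the output, fan_out**-0.5 the input
    # gradient, and batch**-0.5 the weight gradient.
    return ScaledMatmul.apply(x, w, fan_in ** -0.5,
                              fan_out ** -0.5, batch ** -0.5)

w = torch.randn(256, 512, requires_grad=True)  # unit-normal init, no Xavier/He factor
x = torch.randn(64, 256, requires_grad=True)
y = unit_scaled_linear(x, w)
y.backward(torch.randn_like(y))                # unit-variance upstream gradient
print(y.std(), x.grad.std(), w.grad.std())     # all ~1.0
```

Because every tensor is kept near unit variance by construction, values land comfortably in the representable range of FP16/FP8 at initialisation, with no loss scaling and no scale-finding sweeps.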
Related papers
- Towards Efficient Pre-training: Exploring FP4 Precision in Large Language Models [25.700481606604647]
Experimental results demonstrate that our FP4 training scheme achieves accuracy comparable to BF16 and FP8, at a lower theoretical computational cost.
With the advent of next-generation hardware supporting FP4, our method sets the foundation for efficient ultra-low precision training.
arXiv Detail & Related papers (2025-02-17T05:33:11Z)
- $μ$nit Scaling: Simple and Scalable FP8 LLM Training [6.447975505471247]
Large Language Model training with 8-bit floating point (FP8) formats promises significant efficiency improvements, but reduced numerical precision makes training challenging.
We demonstrate simple, scalable FP8 training that requires no dynamic scaling factors, even at large model sizes.
We validate our method by training models from 1B to 13B parameters, performing all hidden linear layer computations in FP8.
arXiv Detail & Related papers (2025-02-09T17:31:09Z)
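The "no dynamic scaling factors" claim contrasts with amax-tracking recipes, and the contrast fits in a few lines. The sketch below is our own illustration, assuming tensors that are well scaled by construction fit FP8 directly (PyTorch's `torch.float8_e4m3fn` dtype, available from 2.1, stands in for the hardware format):

```python
import torch

FP8 = torch.float8_e4m3fn  # requires PyTorch >= 2.1

def fp8_cast_static(t, scale=1.0):
    # Static recipe: one constant scale fixed up front, with no amax
    # history, per-step rescaling, or overflow bookkeeping.
    return (t * scale).to(FP8)

x = torch.randn(1024)                        # ~unit variance by construction
x8 = fp8_cast_static(x)
err = (x8.to(torch.float32) - x).abs().max()
print(err)  # small quantisation error; nothing overflowed the e4m3 range
```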
- Optimizing Large Language Model Training Using FP4 Quantization [73.55459961002371]
Quantized training presents a promising solution by enabling low-bit arithmetic operations to reduce costs.
This work introduces the first FP4 training framework for large language models (LLMs).
arXiv Detail & Related papers (2025-01-28T18:04:50Z)
- Balancing Speed and Stability: The Trade-offs of FP8 vs. BF16 Training in LLMs [4.5440077473497364]
Large Language Models (LLMs) have attracted significant attention due to their human-like language understanding and generation capabilities.
These models, characterized by their massive scale and extensive training data, continue to push the boundaries of what is possible in natural language processing.
The immense computational demands associated with training such models have spurred ongoing research into optimizing the efficiency of the training process.
arXiv Detail & Related papers (2024-11-10T15:19:42Z)
- Scaling Laws for Precision [73.24325358259753]
We devise "precision-aware" scaling laws for both training and inference.
For inference, we find that the degradation introduced by post-training quantization increases as models are trained on more data.
For training, our scaling laws allow us to predict the loss of a model with different parts in different precisions.
arXiv Detail & Related papers (2024-11-07T00:10:10Z)
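The summary gives no functional form, so the following is only an illustrative, Chinchilla-style guess in which training precision shrinks an effective parameter count; every constant below is a placeholder, not a fitted value from the paper:

```python
import math

def precision_aware_loss(N, D, P, A=400.0, B=410.0, E=1.7,
                         alpha=0.34, beta=0.28, gamma=2.0):
    """Illustrative only: training precision P (bits) shrinks the
    effective parameter count via an assumed saturating term."""
    n_eff = N * (1 - math.exp(-P / gamma))
    return A / n_eff ** alpha + B / D ** beta + E

# Predicted loss gap between FP8 and BF16 training at fixed N and D.
print(precision_aware_loss(1e9, 2e10, 8) - precision_aware_loss(1e9, 2e10, 16))
```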
- COAT: Compressing Optimizer states and Activations for Memory-Efficient FP8 Training [47.07768822212081]
COAT is a novel FP8 training framework designed to significantly reduce the memory footprint of training large models.
Compared to BF16, COAT reduces the end-to-end training memory footprint by 1.54x and achieves a 1.43x end-to-end training speedup.
arXiv Detail & Related papers (2024-10-25T05:59:30Z)
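The abstract above does not spell out COAT's quantisation scheme, so this sketch shows a generic per-group FP8 compression of an optimizer moment under our own assumptions, not COAT's actual kernels:

```python
import torch

FP8 = torch.float8_e4m3fn
E4M3_MAX = 448.0  # largest finite e4m3 value

def quantize_groupwise(t, group=128):
    # Per-group scales: an outlier only hurts resolution within its
    # own group, not across the whole tensor.
    g = t.reshape(-1, group)
    scale = g.abs().amax(dim=1, keepdim=True).clamp(min=1e-12) / E4M3_MAX
    return (g / scale).to(FP8), scale

def dequantize_groupwise(q, scale, shape):
    return (q.to(torch.float32) * scale).reshape(shape)

m = torch.randn(4096, 128) * 1e-3   # e.g. an Adam first-moment tensor
q, s = quantize_groupwise(m.flatten())
m_hat = dequantize_groupwise(q, s, m.shape)
print((m - m_hat).abs().max())      # bounded per-group quantisation error
```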
- Test-Time Model Adaptation with Only Forward Passes [68.11784295706995]
Test-time adaptation has proven effective in adapting a given trained model to unseen test samples with potential distribution shifts.
We propose a test-time Forward-Optimization Adaptation (FOA) method.
FOA runs on a quantized 8-bit ViT, outperforms gradient-based TENT on a full-precision 32-bit ViT, and achieves up to a 24-fold memory reduction on ImageNet-C.
arXiv Detail & Related papers (2024-04-02T05:34:33Z)
- FP8-LM: Training FP8 Large Language Models [47.17804713425323]
In this paper, we propose a new FP8 automatic mixed-precision framework for training large language models.
Experimental results show that, when training the GPT-175B model on the H100 GPU platform, our FP8 mixed-precision training framework not only achieved a remarkable 39% reduction in real memory usage but also ran 75% faster than the widely adopted BF16 framework.
arXiv Detail & Related papers (2023-10-27T17:59:51Z)
- Training and inference of large language models using 8-bit floating point [3.689110902209004]
This paper presents a methodology to select the scalings for FP8 linear layers, based on dynamically updating per-tensor scales for the weights, gradients and activations.
We apply this methodology to train and validate large language models in the style of GPT and Llama 2 using FP8, for model sizes ranging from 111M to 70B parameters.
arXiv Detail & Related papers (2023-09-29T13:24:33Z)
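A minimal sketch of the dynamic per-tensor idea described above, assuming an exponential-moving-average amax tracker; the class name and momentum value are our inventions, not the paper's:

```python
import torch

FP8 = torch.float8_e4m3fn
E4M3_MAX = 448.0

class TensorScaler:
    """Tracks a running amax for one tensor (weight, activation, or
    gradient) and rescales it into FP8 range before each cast."""

    def __init__(self, momentum=0.9):
        self.amax, self.momentum = None, momentum

    def update(self, t):
        cur = t.abs().max().item()
        self.amax = cur if self.amax is None else (
            self.momentum * self.amax + (1 - self.momentum) * cur)

    def cast(self, t):
        scale = E4M3_MAX / max(self.amax, 1e-12)
        return (t * scale).to(FP8), scale  # undo `scale` after the matmul

scaler = TensorScaler()
w = torch.randn(256, 256) * 0.02
scaler.update(w)            # refresh statistics every training step
w8, s = scaler.cast(w)
```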
- Train Large, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers [94.43313684188819]
We study the impact of model size in this setting, focusing on Transformer models for NLP tasks that are limited by compute.
We first show that even though smaller Transformer models execute faster per iteration, wider and deeper models converge in significantly fewer steps.
This leads to an apparent trade-off between the training efficiency of large Transformer models and the inference efficiency of small Transformer models.
arXiv Detail & Related papers (2020-02-26T21:17:13Z)