Stable and low-precision training for large-scale vision-language models
- URL: http://arxiv.org/abs/2304.13013v2
- Date: Tue, 17 Oct 2023 00:11:15 GMT
- Title: Stable and low-precision training for large-scale vision-language models
- Authors: Mitchell Wortsman, Tim Dettmers, Luke Zettlemoyer, Ari Morcos, Ali
Farhadi, Ludwig Schmidt
- Abstract summary: We introduce new methods for accelerating and stabilizing training for large language-vision models.
For acceleration, we introduce SwitchBack, a linear layer for int8 quantized training which provides a speed-up of 13-25%.
For stability, we analyze loss spikes and find they consistently occur 1-8 iterations after the squared gradients become under-estimated.
- Score: 108.62077651227607
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce new methods for 1) accelerating and 2) stabilizing training for
large language-vision models. 1) For acceleration, we introduce SwitchBack, a
linear layer for int8 quantized training which provides a speed-up of 13-25%
while matching the performance of bfloat16 training within 0.1 percentage
points for the 1B parameter CLIP ViT-Huge -- the largest int8 training to date.
Our main focus is int8 as GPU support for float8 is rare, though we also
analyze float8 training through simulation. While SwitchBack proves effective
for float8, we show that standard techniques are also successful if the network
is trained and initialized so that large feature magnitudes are discouraged,
which we accomplish via layer-scale initialized with zeros. 2) For stability,
we analyze loss spikes and find they consistently occur 1-8 iterations after
the squared gradients become under-estimated by their AdamW second moment
estimator. As a result, we recommend an AdamW-Adafactor hybrid which avoids
loss spikes when training a CLIP ViT-Huge model and outperforms gradient
clipping at the scales we test.
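The abstract does not spell out the quantization recipe, so the following is a minimal NumPy sketch of the generic pattern behind int8 quantized linear layers: symmetric absmax quantization of inputs and weights, an int8 matmul accumulated in int32, then dequantization. It illustrates only that generic pattern under those assumptions; it is not the authors' SwitchBack layer, which additionally treats the backward pass (in particular the weight-gradient matmul) specially.

```python
import numpy as np

def absmax_quantize(x, axis=None):
    """Symmetric int8 quantization with absmax scaling."""
    scale = np.max(np.abs(x), axis=axis, keepdims=True) / 127.0
    scale = np.where(scale == 0, 1.0, scale)            # guard against all-zero rows
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def int8_linear_forward(x, w):
    """Linear layer forward pass with int8 inputs and weights.

    x: (batch, in_features) activations; w: (out_features, in_features) weights.
    The matmul accumulates in int32 and the result is dequantized to float32.
    """
    xq, x_scale = absmax_quantize(x, axis=1)            # one scale per input row
    wq, w_scale = absmax_quantize(w, axis=1)            # one scale per output row
    acc = xq.astype(np.int32) @ wq.T.astype(np.int32)   # (batch, out_features)
    return acc * x_scale * w_scale.T                    # dequantize

x = np.random.randn(4, 16).astype(np.float32)
w = np.random.randn(8, 16).astype(np.float32)
err = np.abs(int8_linear_forward(x, w) - x @ w.T).max()  # small quantization error
```

Per-row input scales and per-output-row weight scales are one common choice of granularity; coarser or finer schemes are possible.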
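For the stability recommendation, the abstract names an AdamW-Adafactor hybrid but not its exact form. The sketch below assumes one plausible reading: a standard AdamW step combined with Adafactor-style update clipping, where the step shrinks whenever the second-moment estimate under-estimates the recent squared gradients (the condition the paper links to loss spikes). The function name and clipping threshold are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def adamw_with_update_clipping(p, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
                               eps=1e-8, weight_decay=0.01, clip_threshold=1.0):
    """One hypothetical AdamW step with Adafactor-style update clipping.

    p: parameters, g: gradient, m/v: first/second moment estimates,
    t: 1-indexed step count. Returns the updated (p, m, v).
    """
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # If v under-estimates the squared gradients, g^2 / v_hat grows above 1,
    # its RMS exceeds clip_threshold, and the effective step size is reduced.
    rms = np.sqrt(np.mean(g * g / np.maximum(v_hat, eps)))
    step_size = lr / max(1.0, rms / clip_threshold)
    p = p - lr * weight_decay * p                        # decoupled weight decay
    p = p - step_size * m_hat / (np.sqrt(v_hat) + eps)   # clipped Adam update
    return p, m, v

# usage: m and v start as zeros with the same shape as the parameter tensor
p, m, v = np.zeros(8), np.zeros(8), np.zeros(8)
for t in range(1, 5):
    g = np.random.randn(8)
    p, m, v = adamw_with_update_clipping(p, g, m, v, t)
```

In Adafactor the analogous RMS is computed per parameter tensor, so a per-tensor clipping factor is the natural choice here as well.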
Related papers
- Stepping Forward on the Last Mile [8.756033984943178]
We propose a series of algorithm enhancements that further reduce the memory footprint, and the accuracy gap compared to backpropagation.
Our results demonstrate that on the last mile of model customization on edge devices, training with fixed-point forward gradients is a feasible and practical approach.
arXiv Detail & Related papers (2024-11-06T16:33:21Z)
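The entry above relies on training with forward gradients, i.e. estimating the full gradient from a single directional derivative along a random direction, with no backward pass. The sketch below is a generic illustration of that idea, using a finite difference in place of forward-mode automatic differentiation; it is a hypothetical helper, not the paper's fixed-point implementation.

```python
import numpy as np

def forward_gradient(f, theta, eps=1e-3, rng=None):
    """Estimate grad f(theta) from one directional derivative along a random
    direction v (forward-mode AD approximated here by a finite difference)."""
    rng = np.random.default_rng() if rng is None else rng
    v = rng.standard_normal(theta.shape)
    directional = (f(theta + eps * v) - f(theta)) / eps  # ~ grad(f) . v
    return directional * v                               # stochastic gradient estimate

# usage on a toy quadratic loss
theta = np.array([1.0, -2.0, 0.5])
loss = lambda th: 0.5 * np.sum(th ** 2)
theta = theta - 0.1 * forward_gradient(loss, theta)      # one SGD-style step
```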
- COAT: Compressing Optimizer states and Activation for Memory-Efficient FP8 Training [47.07768822212081]
COAT (Compressing Optimizer states and Activations for FP8 Training) is a novel FP8 training framework designed to significantly reduce memory footprint when training large models.
COAT effectively reduces end-to-end training memory footprint by 1.54x compared to BF16.
COAT also achieves a 1.43x end-to-end training speedup compared to BF16.
arXiv Detail & Related papers (2024-10-25T05:59:30Z)
- Scaling FP8 training to trillion-token LLMs [26.195547788434908]
We train large language models using FP8 precision on datasets up to 2 trillion tokens.
We uncover critical instabilities in FP8 training that were not observable in earlier works with shorter durations.
We introduce Smooth-SwiGLU, a novel modification that ensures stable FP8 training without altering the function's behavior.
arXiv Detail & Related papers (2024-09-19T07:15:58Z)
- Boosting Visual-Language Models by Exploiting Hard Samples [126.35125029639168]
HELIP is a cost-effective strategy tailored to enhance the performance of existing CLIP models.
Our method allows for effortless integration with existing models' training pipelines.
On comprehensive benchmarks, HELIP consistently boosts existing models to achieve leading performance.
arXiv Detail & Related papers (2023-05-09T07:00:17Z)
- Peeling the Onion: Hierarchical Reduction of Data Redundancy for Efficient Vision Transformer Training [110.79400526706081]
Vision transformers (ViTs) have recently obtained success in many applications, but their intensive computation and heavy memory usage limit their generalization.
Previous compression algorithms usually start from the pre-trained dense models and only focus on efficient inference.
This paper proposes an end-to-end efficient training framework from three sparse perspectives, dubbed Tri-Level E-ViT.
arXiv Detail & Related papers (2022-11-19T21:15:47Z)
- Do We Need Zero Training Loss After Achieving Zero Training Error? [76.44358201918156]
We propose a direct solution called "flooding" that intentionally prevents further reduction of the training loss when it reaches a reasonably small value.
We experimentally show that flooding improves performance and, as a byproduct, induces a double descent curve of the test loss.
arXiv Detail & Related papers (2020-02-20T12:50:49Z)
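For reference, the flooding objective summarized in the entry above has a one-line form: the loss is folded around a flood level b, so minimization stops pushing the loss below b. A minimal sketch, where the flood level b is a user-chosen hyperparameter:

```python
def flooded_loss(loss, b=0.1):
    """Flooding: keep the training loss near the flood level b.

    Above b this equals the original loss; below b the gradient's sign flips,
    nudging the loss back up toward b instead of toward zero.
    """
    return abs(loss - b) + b
```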
- Shifted and Squeezed 8-bit Floating Point format for Low-Precision Training of Deep Neural Networks [13.929168096016957]
We introduce a novel methodology for training deep neural networks using 8-bit floating point (FP8) numbers.
Reduced bit precision allows for a larger effective memory and increased computational speed.
We show that, unlike previous 8-bit precision training methods, the proposed method works out-of-the-box for representative models.
arXiv Detail & Related papers (2020-01-16T06:38:27Z)
- Towards Unified INT8 Training for Convolutional Neural Network [83.15673050981624]
We build a unified 8-bit (INT8) training framework for common convolutional neural networks.
First, we empirically find the four distinctive characteristics of gradients, which provide us insightful clues for gradient quantization.
We propose two universal techniques, including Direction Sensitive Gradient Clipping, which reduces the direction deviation of gradients.
arXiv Detail & Related papers (2019-12-29T08:37:53Z)