Scaling FP8 training to trillion-token LLMs
- URL: http://arxiv.org/abs/2409.12517v2
- Date: Mon, 10 Feb 2025 09:37:59 GMT
- Title: Scaling FP8 training to trillion-token LLMs
- Authors: Maxim Fishman, Brian Chmiel, Ron Banner, Daniel Soudry
- Abstract summary: We train large language models using FP8 precision on datasets up to 2 trillion tokens.
We uncover critical instabilities in FP8 training that were not observable in earlier works with shorter durations.
We introduce Smooth-SwiGLU, a novel modification that ensures stable FP8 training without altering function behavior.
- Score: 26.195547788434908
- Abstract: We train, for the first time, large language models using FP8 precision on datasets up to 2 trillion tokens -- a 20-fold increase over previous limits. Through these extended training runs, we uncover critical instabilities in FP8 training that were not observable in earlier works with shorter durations. We trace these instabilities to outlier amplification by the SwiGLU activation function. Interestingly, we show, both analytically and empirically, that this amplification happens only over prolonged training periods, and link it to a SwiGLU weight alignment process. To address this newly identified issue, we introduce Smooth-SwiGLU, a novel modification that ensures stable FP8 training without altering function behavior. We also demonstrate, for the first time, FP8 quantization of both Adam optimizer moments. Combining these innovations, we successfully train a 7B parameter model using FP8 precision on 256 Intel Gaudi2 accelerators, achieving on-par results with the BF16 baseline while delivering up to a $\sim 34 \%$ throughput improvement. A reference implementation is supplied in https://github.com/Anonymous1252022/Megatron-DeepSpeed.
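A minimal PyTorch sketch of the idea described in the abstract for Smooth-SwiGLU: damp outlier channels of the SwiGLU up-projection branch with a per-channel scale and fold the inverse scale into the down projection, so the block computes exactly the same function while the intermediate tensor seen by FP8 quantization has a smaller dynamic range. The class and helper names, and the choice of folding the scale into the weights, are illustrative assumptions rather than the paper's reference implementation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SwiGLU(nn.Module):
        # Standard SwiGLU MLP block: W2(silu(W1 x) * (W3 x))
        def __init__(self, d_model: int, d_ff: int):
            super().__init__()
            self.w1 = nn.Linear(d_model, d_ff, bias=False)  # gate projection
            self.w3 = nn.Linear(d_model, d_ff, bias=False)  # up projection
            self.w2 = nn.Linear(d_ff, d_model, bias=False)  # down projection

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return self.w2(F.silu(self.w1(x)) * self.w3(x))

    @torch.no_grad()
    def smooth_swiglu_(block: SwiGLU, s: torch.Tensor) -> None:
        # s: positive per-channel scales, shape [d_ff]. Scaling the rows of W3
        # multiplies channel i of the gated intermediate by s[i]; dividing the
        # matching columns of W2 by s[i] cancels it exactly, so the output is
        # unchanged while outlier channels are damped before quantization.
        block.w3.weight.mul_(s.unsqueeze(1))
        block.w2.weight.div_(s.unsqueeze(0))

With s chosen, for example, from per-channel activation statistics, the rescaled block is mathematically equivalent to the original, which is what makes it possible to stabilize FP8 training without altering function behavior.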
Related papers
- $μ$nit Scaling: Simple and Scalable FP8 LLM Training [6.447975505471247]
Large Language Model training with 8-bit floating point (FP8) formats promises significant efficiency improvements, but reduced numerical precision makes training challenging.
We demonstrate simple, scalable FP8 training that requires no dynamic scaling factors, even at large model sizes.
We validate our method by training models from 1B to 13B parameters, performing all hidden linear layer computations in FP8.
arXiv Detail & Related papers (2025-02-09T17:31:09Z) - An Investigation of FP8 Across Accelerators for LLM Inference [7.910301381209274]
- An Investigation of FP8 Across Accelerators for LLM Inference [7.910301381209274]
We provide the first comprehensive analysis of FP8 computation on two AI accelerators: the NVIDIA H100 and Intel Gaudi 2.
Our findings highlight that the Gaudi 2, by leveraging FP8, achieves higher throughput-to-power efficiency during inference.
arXiv Detail & Related papers (2025-02-03T05:26:22Z) - Optimizing Large Language Model Training Using FP4 Quantization [73.55459961002371]
Quantized training presents a promising solution by enabling low-bit arithmetic operations to reduce costs.
This work introduces the first FP4 training framework for large language models (LLMs).
arXiv Detail & Related papers (2025-01-28T18:04:50Z) - COAT: Compressing Optimizer states and Activation for Memory-Efficient FP8 Training [47.07768822212081]
COAT (Compressing Optimizer States and Activations for FP8 Training) is a novel FP8 training framework designed to significantly reduce memory footprint when training large models.
COAT effectively reduces end-to-end training memory footprint by 1.54x compared to BF16.
COAT also achieves a 1.43x end-to-end training speedup compared to BF16.
arXiv Detail & Related papers (2024-10-25T05:59:30Z) - To FP8 and Back Again: Quantifying the Effects of Reducing Precision on LLM Training Stability [7.115739465137031]
- To FP8 and Back Again: Quantifying the Effects of Reducing Precision on LLM Training Stability [7.115739465137031]
BrainFloat16 (BF16) precision has become the de facto standard for large language model pretraining.
However, prior experience with FP16, which was found to be less stable than BF16, raises concerns as to whether FP8 can be a cost-effective option for LLM training.
We propose new evaluation techniques and a new metric for quantifying loss landscape sharpness in autoregressive language models.
arXiv Detail & Related papers (2024-05-29T02:42:23Z) - FP8-LM: Training FP8 Large Language Models [47.17804713425323]
In this paper, we propose a new FP8 automatic mixed-precision framework for training large language models.
Experiment results show that, during the training of GPT-175B model on H100 GPU platform, our FP8 mixed-precision training framework not only achieved a remarkable 39% reduction in real memory usage but also ran 75% faster than the widely adopted BF16 framework.
arXiv Detail & Related papers (2023-10-27T17:59:51Z) - Stable and low-precision training for large-scale vision-language models [108.62077651227607]
- Stable and low-precision training for large-scale vision-language models [108.62077651227607]
We introduce new methods for accelerating and stabilizing training for large language-vision models.
For acceleration, we introduce SwitchBack, a linear layer for int8 quantized training which provides a speed-up of 13-25%.
For stability, we analyze loss spikes and find they consistently occur 1-8 iterations after the squared gradients become under-estimated.
arXiv Detail & Related papers (2023-04-25T17:38:18Z) - LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale [80.86029795281922]
We develop a procedure for Int8 matrix multiplication for feed-forward and attention projection layers in transformers.
A 175B parameter 16/32-bit checkpoint can be loaded, converted to Int8, and used immediately without performance degradation.
arXiv Detail & Related papers (2022-08-15T17:08:50Z) - Towards Unified INT8 Training for Convolutional Neural Network [83.15673050981624]
- Towards Unified INT8 Training for Convolutional Neural Network [83.15673050981624]
We build a unified 8-bit (INT8) training framework for common convolutional neural networks.
First, we empirically find four distinctive characteristics of gradients, which provide insightful clues for gradient quantization.
We propose two universal techniques, including Direction Sensitive Gradient Clipping, which reduces the direction deviation of gradients.
arXiv Detail & Related papers (2019-12-29T08:37:53Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.