FP8-LM: Training FP8 Large Language Models
- URL: http://arxiv.org/abs/2310.18313v2
- Date: Tue, 19 Dec 2023 12:27:58 GMT
- Title: FP8-LM: Training FP8 Large Language Models
- Authors: Houwen Peng and Kan Wu and Yixuan Wei and Guoshuai Zhao and Yuxiang
Yang and Ze Liu and Yifan Xiong and Ziyue Yang and Bolin Ni and Jingcheng Hu
and Ruihang Li and Miaosen Zhang and Chen Li and Jia Ning and Ruizhe Wang and
Zheng Zhang and Shuguang Liu and Joe Chau and Han Hu and Peng Cheng
- Abstract summary: In this paper, we propose a new FP8 automatic mixed-precision framework for training large language models.
Experimental results show that, during the training of the GPT-175B model on the H100 GPU platform, our FP8 mixed-precision training framework not only achieved a remarkable 39% reduction in real memory usage but also ran 75% faster than the widely adopted BF16 framework.
- Score: 47.17804713425323
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we explore FP8 low-bit data formats for efficient training of
large language models (LLMs). Our key insight is that most variables, such as
gradients and optimizer states, in LLM training can employ low-precision data
formats without compromising model accuracy or requiring changes to
hyper-parameters. Specifically, we propose a new FP8 automatic mixed-precision
framework for training LLMs. This framework offers three levels of FP8
utilization to streamline mixed-precision and distributed parallel training for
LLMs. It incrementally incorporates 8-bit gradients, optimizer states, and
distributed learning. Experimental results show that,
during the training of the GPT-175B model on the H100 GPU platform, our FP8
mixed-precision training framework not only achieved a remarkable 39% reduction
in real memory usage but also ran 75% faster than the widely adopted BF16
framework (i.e., Megatron-LM), surpassing the speed of Nvidia Transformer
Engine by 37%. This largely reduces the training costs for large foundation
models. Furthermore, our FP8 mixed-precision training methodology is generic.
It can be seamlessly applied to other tasks such as LLM instruction tuning and
reinforcement learning with human feedback, offering savings in fine-tuning
expenses. Our FP8 low-precision training framework is open-sourced at
aka.ms/MS.AMP (https://github.com/Azure/MS-AMP).
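The percentages quoted in the abstract can be turned into ratios for easier comparison. The sketch below is only an illustration and rests on two assumptions that the abstract does not state: "X% faster" is read as a throughput ratio of 1 + X/100, and both speed figures are taken to come from the same setup. The helper names are hypothetical.

```python
# Assumption: "X% faster" means a throughput ratio of 1 + X/100,
# and "Y% reduction" means a remaining fraction of 1 - Y/100.

def speedup_ratio(percent_faster: float) -> float:
    """Convert 'X% faster' into a multiplicative throughput ratio."""
    return 1.0 + percent_faster / 100.0


def remaining_fraction(percent_reduction: float) -> float:
    """Convert 'Y% reduction' into the fraction that remains."""
    return 1.0 - percent_reduction / 100.0


fp8_vs_bf16 = speedup_ratio(75)  # FP8 framework vs. BF16 (Megatron-LM)
fp8_vs_te = speedup_ratio(37)    # FP8 framework vs. Nvidia Transformer Engine

# Implied Transformer Engine speedup over BF16, valid only if both
# figures were measured under the same conditions (an assumption here).
te_vs_bf16 = fp8_vs_bf16 / fp8_vs_te

# Memory footprint of the FP8 run relative to the BF16 run.
memory_fraction = remaining_fraction(39)

print(f"Implied TE speedup over BF16: {te_vs_bf16:.2f}x")
print(f"FP8 memory relative to BF16: {memory_fraction:.2f}x")
```

Under these assumptions, the figures imply that Transformer Engine alone is roughly 1.28x faster than BF16, and that the FP8 run uses about 0.61x the memory of the BF16 run.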
Related papers
- Balancing Speed and Stability: The Trade-offs of FP8 vs. BF16 Training in LLMs [4.5440077473497364]
Large Language Models (LLMs) have attracted significant attention due to their human-like language understanding and generation capabilities.
These models, characterized by their massive scale and extensive training data, continue to push the boundaries of what is possible in natural language processing.
The immense computational demands associated with training such models have spurred ongoing research into optimizing the efficiency of the training process.
arXiv Detail & Related papers (2024-11-10T15:19:42Z)
- COAT: Compressing Optimizer states and Activation for Memory-Efficient FP8 Training [47.07768822212081]
COAT (Compressing Optimizer States and Activations for FP8 Training) is a novel FP8 training framework designed to significantly reduce memory footprint when training large models.
COAT effectively reduces end-to-end training memory footprint by 1.54x compared to BF16.
COAT also achieves a 1.43x end-to-end training speedup compared to BF16.
arXiv Detail & Related papers (2024-10-25T05:59:30Z)
- Scaling FP8 training to trillion-token LLMs [26.195547788434908]
We train large language models using FP8 precision on datasets up to 2 trillion tokens.
We uncover critical instabilities in FP8 training that were not observable in earlier works with shorter durations.
We introduce Smooth-SwiGLU, a novel modification that ensures stable FP8 training without altering function behavior.
arXiv Detail & Related papers (2024-09-19T07:15:58Z)
- Compact Language Models via Pruning and Knowledge Distillation [61.56557874432008]
Minitron models exhibit up to a 16% improvement in MMLU scores compared to training from scratch.
Deriving 8B and 4B models from an already pretrained 15B model using our approach requires up to 40x fewer training tokens per model compared to training from scratch.
arXiv Detail & Related papers (2024-07-19T21:47:57Z)
- Towards Federated Learning with On-device Training and Communication in 8-bit Floating Point [13.693064349530795]
Recent work has shown that 8-bit floating point (FP8) can be used for efficiently training neural networks.
We present a novel method for combining FP8 client training with a global FP32 server model.
arXiv Detail & Related papers (2024-07-02T18:55:58Z)
- Efficient Continual Pre-training by Mitigating the Stability Gap [68.49269649759005]
We study the behavior of Large Language Models (LLMs) during continual pre-training.
We propose three effective strategies to enhance LLM performance within a fixed compute budget.
Our strategies improve the average medical task performance of the OpenLlama-3B model from 36.2% to 40.7% with only 40% of the original training budget.
arXiv Detail & Related papers (2024-06-21T02:28:37Z)
- To FP8 and Back Again: Quantifying the Effects of Reducing Precision on LLM Training Stability [7.115739465137031]
BrainFloat16 (BF16) precision has become the de facto standard for large language model pretraining.
However, prior experience with FP16, which was found to be less stable than BF16, raises concerns as to whether FP8 can be a cost-effective option for LLM training.
We propose new evaluation techniques and a new metric for quantifying loss landscape sharpness in autoregressive language models.
arXiv Detail & Related papers (2024-05-29T02:42:23Z)
- GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection [133.45193150403537]
Training Large Language Models (LLMs) presents significant memory challenges due to the growing size of weights and optimizer states.
In this work, we propose Gradient Low-Rank Projection (GaLore) as a memory-efficient training strategy.
Our 8-bit GaLore further reduces optimizer memory by up to 82.5% and total training memory by 63.3%, compared to a BF16 baseline.
arXiv Detail & Related papers (2024-03-06T07:29:57Z)
- BiLLM: Pushing the Limit of Post-Training Quantization for LLMs [53.31402059062365]
BiLLM is a groundbreaking 1-bit post-training quantization scheme tailored for pretrained large language models.
For the first time, it achieves high-accuracy inference (e.g., 8.41 perplexity on LLaMA2-70B) with only 1.08-bit weights across various LLM families.
arXiv Detail & Related papers (2024-02-06T09:26:34Z)
This list is automatically generated from the titles and abstracts of the papers on this site.