Balancing Speed and Stability: The Trade-offs of FP8 vs. BF16 Training in LLMs
- URL: http://arxiv.org/abs/2411.08719v1
- Date: Sun, 10 Nov 2024 15:19:42 GMT
- Title: Balancing Speed and Stability: The Trade-offs of FP8 vs. BF16 Training in LLMs
- Authors: Kazuki Fujii, Taishi Nakamura, Rio Yokota
- Abstract summary: Large Language Models (LLMs) have attracted significant attention due to their human-like language understanding and generation capabilities.
These models, characterized by their massive scale and extensive training data, continue to push the boundaries of what is possible in natural language processing.
The immense computational demands associated with training such models have spurred ongoing research into optimizing the efficiency of the training process.
- Score: 4.5440077473497364
- License:
- Abstract: Large Language Models (LLMs) have attracted significant attention due to their human-like language understanding and generation capabilities, as well as their applicability across various domains. These models, characterized by their massive scale and extensive training data, continue to push the boundaries of what is possible in natural language processing. The Llama 3 series, for instance, exemplifies this trend with its flagship model boasting 405 billion parameters trained on 15.6 trillion tokens. The immense computational demands associated with training such models have spurred ongoing research into optimizing the efficiency of the training process, particularly through the use of lower-precision formats. NVIDIA's H100 GPU, which introduces support for FP8 in addition to the more conventional FP16 and BF16 formats, has emerged as a focal point in this optimization effort. Preliminary studies suggest that FP8 could offer substantial reductions in training time without sacrificing model performance when compared to BF16, making it a promising candidate for large-scale model training. However, the broader implications of adopting FP8, particularly in terms of training stability and downstream task performance, have yet to be fully understood. In this study, we delve into the practical trade-offs involved in adopting FP8 over BF16 for training LLMs.
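To make the trade-off concrete, the sketch below toggles a single training step between BF16 autocast and FP8 execution via NVIDIA's Transformer Engine, the usual route to the H100's FP8 tensor cores mentioned in the abstract. The model, dummy loss, tensor shapes, and recipe settings are illustrative assumptions rather than details from the paper, and the API calls should be checked against the Transformer Engine version actually installed.
```python
# Hypothetical sketch: one optimizer step in BF16 vs. FP8 on an H100-class GPU.
# Requires the transformer-engine package; shapes and recipe values are illustrative.
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format

model = te.Linear(4096, 4096, bias=True).to("cuda")   # stand-in for a transformer block
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# FP8 recipe: E4M3 for forward tensors, E5M2 for gradients, with delayed
# (history-based) per-tensor scaling.
fp8_recipe = DelayedScaling(fp8_format=Format.HYBRID,
                            amax_history_len=16,
                            amax_compute_algo="max")

def train_step(x: torch.Tensor, use_fp8: bool) -> float:
    optimizer.zero_grad(set_to_none=True)
    # Activations run in BF16 either way; fp8_autocast additionally casts the
    # GEMM inputs of Transformer Engine modules to FP8 when enabled.
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        with te.fp8_autocast(enabled=use_fp8, fp8_recipe=fp8_recipe):
            loss = model(x).float().pow(2).mean()      # dummy loss for illustration
    loss.backward()                                    # backward outside the FP8 context
    optimizer.step()
    return loss.item()

x = torch.randn(16, 4096, device="cuda")
print("BF16 step loss:", train_step(x, use_fp8=False))
print("FP8  step loss:", train_step(x, use_fp8=True))
```
The comparison the paper is concerned with is end-to-end throughput versus training stability and downstream task performance, which a toy step like this does not capture.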
Related papers
- "Give Me BF16 or Give Me Death"? Accuracy-Performance Trade-Offs in LLM Quantization [67.3213104337679]
We evaluate popular quantization formats across academic benchmarks and real-world tasks.
We find that W4A16 offers the best cost-efficiency for synchronous deployments, as well as for asynchronous deployment on mid-tier architectures.
arXiv Detail & Related papers (2024-11-04T18:21:59Z)
- COAT: Compressing Optimizer states and Activation for Memory-Efficient FP8 Training [47.07768822212081]
COAT (Compressing Optimizer states and Activations for FP8 Training) is a novel FP8 training framework designed to significantly reduce memory footprint when training large models.
COAT effectively reduces end-to-end training memory footprint by 1.54x compared to BF16.
COAT also achieves a 1.43x end-to-end training speedup compared to BF16.
arXiv Detail & Related papers (2024-10-25T05:59:30Z)
- Optimization Hyper-parameter Laws for Large Language Models [56.322914260197734]
We present Opt-Laws, a framework that captures the relationship between hyper-parameters and training outcomes.
Our validation across diverse model sizes and data scales demonstrates Opt-Laws' ability to accurately predict training loss.
This approach significantly reduces computational costs while enhancing overall model performance.
arXiv Detail & Related papers (2024-09-07T09:37:19Z)
- Compact Language Models via Pruning and Knowledge Distillation [61.56557874432008]
Minitron models exhibit up to a 16% improvement in MMLU scores compared to training from scratch.
Deriving 8B and 4B models from an already pretrained 15B model using our approach requires up to 40x fewer training tokens per model compared to training from scratch.
arXiv Detail & Related papers (2024-07-19T21:47:57Z)
- Efficient Continual Pre-training by Mitigating the Stability Gap [68.49269649759005]
We study the behavior of Large Language Models (LLMs) during continual pre-training.
We propose three effective strategies to enhance LLM performance within a fixed compute budget.
Our strategies improve the average medical task performance of the OpenLlama-3B model from 36.2% to 40.7% with only 40% of the original training budget.
arXiv Detail & Related papers (2024-06-21T02:28:37Z)
- To FP8 and Back Again: Quantifying the Effects of Reducing Precision on LLM Training Stability [7.115739465137031]
BrainFloat16 (BF16) precision has become the de facto standard for large language model pretraining.
However, prior experience with FP16, which was found to be less stable than BF16, raises concerns as to whether FP8 can be a cost-effective option for LLM training.
We propose new evaluation techniques and a new metric for quantifying loss landscape sharpness in autoregressive language models (a generic sharpness proxy is sketched after this list).
arXiv Detail & Related papers (2024-05-29T02:42:23Z)
- FP8-LM: Training FP8 Large Language Models [47.17804713425323]
In this paper, we propose a new FP8 automatic mixed-precision framework for training large language models.
Experiment results show that, during the training of a GPT-175B model on the H100 GPU platform, our FP8 mixed-precision training framework not only achieved a remarkable 39% reduction in real memory usage but also ran 75% faster than the widely adopted BF16 framework.
arXiv Detail & Related papers (2023-10-27T17:59:51Z)
- Training and inference of large language models using 8-bit floating point [3.689110902209004]
This paper presents a methodology to select the scalings for FP8 linear layers, based on dynamically updating per-tensor scales for the weights, gradients and activations (a minimal sketch of per-tensor scaling follows this list).
We apply this methodology to train and validate GPT- and Llama 2-style large language models using FP8, for model sizes ranging from 111M to 70B parameters.
arXiv Detail & Related papers (2023-09-29T13:24:33Z)
- Unit Scaling: Out-of-the-Box Low-Precision Training [1.7188280334580197]
Unit scaling is a paradigm for designing deep learning models that simplifies the use of low-precision number formats.
Training in FP16 or the recently proposed FP8 formats offers substantial efficiency gains, but can lack sufficient range for out-of-the-box training.
Unit scaling addresses this by introducing a principled approach to model numerics: seeking unit variance of all weights, activations and gradients at initialisation (a simplified sketch follows this list).
arXiv Detail & Related papers (2023-03-20T16:42:25Z)
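The "To FP8 and Back Again" entry above refers to a metric for loss-landscape sharpness without describing it. As a stand-in, here is a common perturbation-based proxy: the average increase in loss when parameters are perturbed by Gaussian noise of a fixed scale. This is a generic illustration and explicitly not the metric proposed in that paper; the toy model, loss, and noise scale are assumptions.
```python
import torch

@torch.no_grad()
def sharpness_proxy(model, loss_fn, batch, sigma=1e-3, n_samples=8):
    """Average loss increase under Gaussian weight perturbations.

    A generic flatness/sharpness proxy: larger values mean the loss rises
    more steeply around the current parameters. Not the metric from the
    cited paper; purely illustrative.
    """
    base_loss = loss_fn(model, batch).item()
    params = [p for p in model.parameters() if p.requires_grad]
    increases = []
    for _ in range(n_samples):
        noises = [torch.randn_like(p) * sigma for p in params]
        for p, n in zip(params, noises):
            p.add_(n)                      # perturb in place
        increases.append(loss_fn(model, batch).item() - base_loss)
        for p, n in zip(params, noises):
            p.sub_(n)                      # restore original weights
    return sum(increases) / len(increases)

# Example usage with a toy model and a synthetic batch.
model = torch.nn.Sequential(torch.nn.Linear(16, 32), torch.nn.ReLU(),
                            torch.nn.Linear(32, 1))

def loss_fn(m, batch):
    x, y = batch
    return torch.nn.functional.mse_loss(m(x), y)

batch = (torch.randn(64, 16), torch.randn(64, 1))
print("sharpness proxy:", sharpness_proxy(model, loss_fn, batch))
```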
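The 8-bit floating point entry above selects scalings for FP8 linear layers by dynamically updating per-tensor scales for weights, gradients and activations. The sketch below shows the basic amax-based idea in plain PyTorch: track recent absolute maxima of a tensor, derive a scale that maps them near the FP8 E4M3 maximum (about 448), and run a quantize/dequantize round trip with that scale. This is a simplified illustration of per-tensor scaling in general, not that paper's exact scheme; the history length is an assumption, and the float8 cast requires a recent PyTorch with float8 dtypes.
```python
import torch

E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3

class PerTensorScaler:
    """Keeps a short history of per-tensor amax values and derives a scale
    that maps the tensor into FP8 range. Illustrative only."""
    def __init__(self, history_len=16):
        self.history = []
        self.history_len = history_len

    def update(self, t: torch.Tensor) -> float:
        amax = t.abs().max().item()
        self.history = (self.history + [amax])[-self.history_len:]
        # Scale chosen so that the largest recent value maps to E4M3_MAX.
        return E4M3_MAX / max(max(self.history), 1e-12)

def fake_fp8_cast(t: torch.Tensor, scale: float) -> torch.Tensor:
    """Simulate an FP8 round trip: scale, cast to float8_e4m3fn, cast back."""
    scaled = (t * scale).clamp(-E4M3_MAX, E4M3_MAX)
    return scaled.to(torch.float8_e4m3fn).to(t.dtype) / scale

# Usage: quantize a linear layer's weight with a dynamically updated scale.
w = torch.randn(1024, 1024) * 0.02
scaler = PerTensorScaler()
scale = scaler.update(w)
w_q = fake_fp8_cast(w, scale)
print("max abs quantization error:", (w - w_q).abs().max().item())
```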
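Unit scaling, as summarized in the last entry, seeks unit variance of weights, activations and gradients at initialization. A minimal sketch of the forward-pass idea for a linear layer: draw weights with unit variance and multiply the output by 1/sqrt(fan_in) so activations also start near unit variance. The paper's full scheme (including separate forward and backward scales) is more involved; this is a deliberately simplified, assumption-laden illustration.
```python
import math
import torch
from torch import nn

class UnitScaledLinear(nn.Module):
    """Linear layer sketched in the spirit of unit scaling: weights are drawn
    with unit variance, and the output is multiplied by 1/sqrt(fan_in) so
    activations start out with variance close to one. Simplified; the paper
    also prescribes separate scales for the backward pass."""
    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features))
        self.scale = 1.0 / math.sqrt(in_features)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return nn.functional.linear(x, self.weight) * self.scale

# A unit-variance input stays near unit variance after the layer at init.
x = torch.randn(4096, 256)
layer = UnitScaledLinear(256, 512)
print("output std at init:", layer(x).std().item())   # ~1.0
```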
This list is automatically generated from the titles and abstracts of the papers on this site.