Collage: Light-Weight Low-Precision Strategy for LLM Training
- URL: http://arxiv.org/abs/2405.03637v1
- Date: Mon, 6 May 2024 16:55:30 GMT
- Title: Collage: Light-Weight Low-Precision Strategy for LLM Training
- Authors: Tao Yu, Gaurav Gupta, Karthick Gopalswamy, Amith Mamidala, Hao Zhou, Jeffrey Huynh, Youngsuk Park, Ron Diamant, Anoop Deoras, Luke Huan
- Abstract summary: We argue that low-precision floating points can perform well provided the error is properly compensated at the critical locations in the training process.
We propose Collage, which utilizes a multi-component float representation in low precision to perform operations accurately, with numerical errors accounted for.
Our method works with commonly used low-precision formats such as half precision ($16$-bit floating point) and can be naturally extended to even lower precision such as $8$-bit.
- Score: 21.190363633580233
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Training large models is plagued by intense compute cost and limited hardware memory. A practical solution is low-precision representation, but this suffers from loss of numerical accuracy and unstable training, rendering the model less useful. We argue that low-precision floating points can perform well provided the error is properly compensated at the critical locations in the training process. We propose Collage, which utilizes a multi-component float representation in low precision to perform operations accurately, with numerical errors accounted for. To understand the impact of imprecision on training, we propose a simple and novel metric that tracks the lost information during training and differentiates various precision strategies. Our method works with commonly used low-precision formats such as half precision ($16$-bit floating point) and can be naturally extended to even lower precision such as $8$-bit. Experimental results show that pre-training using Collage removes the requirement of keeping $32$-bit floating-point copies of the model and attains similar or better training performance compared to the $(16, 32)$-bit mixed-precision strategy, with up to $3.7\times$ speedup and $\sim 15\%$ to $23\%$ less memory usage in practice.
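For intuition, below is a minimal sketch of the multi-component float idea, assuming NumPy's float16 as the low-precision format: a value is stored as an unevaluated sum of a head and a tail component, and the standard TwoSum error-free transformation retains the rounding error of each update instead of discarding it. Collage applies multi-component floats at critical points of the training loop so that a 32-bit master copy of the model is no longer needed; the class name `MCF16`, the toy weight/update values, and the "fraction lost" printout here are illustrative assumptions, not Collage's actual implementation or its lost-information metric.

```python
import numpy as np

def two_sum(a, b):
    """Knuth's TwoSum error-free transformation in float16: returns (s, e)
    where s = fl(a + b) and e is the rounding error, so s + e == a + b."""
    a, b = np.float16(a), np.float16(b)
    s = np.float16(a + b)
    bp = np.float16(s - a)                      # part of b absorbed into s
    e = np.float16((a - np.float16(s - bp)) + np.float16(b - bp))
    return s, e

class MCF16:
    """Toy two-component float16 value (head + tail): the rounding error of
    each update is carried in the tail instead of being thrown away."""
    def __init__(self, value=0.0):
        self.head = np.float16(value)
        self.tail = np.float16(0.0)

    def add(self, x):
        s, e = two_sum(self.head, x)
        self.tail = np.float16(self.tail + e)   # keep the lost bits
        self.head, self.tail = two_sum(s, self.tail)

    def value(self):
        return float(self.head) + float(self.tail)

# Demo: apply 1000 small "weight updates" of 0.05 to a weight of 256.
# Near 256, float16 spacing is 0.25, so a plain float16 weight silently
# drops every update; the two-component version accumulates them.
w_plain = np.float16(256.0)
w_mcf = MCF16(256.0)
for _ in range(1000):
    w_plain = np.float16(w_plain + np.float16(0.05))
    w_mcf.add(0.05)

intended = 256.0 + 1000 * float(np.float16(0.05))   # roughly 306
print(f"plain float16 : {float(w_plain):.2f}")       # stays at 256.00
print(f"two-component : {w_mcf.value():.2f}")        # close to 306
# Crude illustrative proxy for "lost information": fraction of the
# intended update that never reached the plain float16 weight.
lost = (intended - float(w_plain)) / (intended - 256.0)
print(f"update fraction lost in plain float16: {lost:.2%}")
```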
Related papers
- Just How Flexible are Neural Networks in Practice? [89.80474583606242]
It is widely believed that a neural network can fit a training set containing at least as many samples as it has parameters.
In practice, however, we only find solutions reachable via our training procedure, including gradient descent and regularizers, which limits flexibility.
arXiv Detail & Related papers (2024-06-17T12:24:45Z) - Language models scale reliably with over-training and on downstream tasks [121.69867718185125]
Scaling laws are useful guides for derisking expensive training runs.
However, there remain gaps between current studies and how language models are trained.
Moreover, scaling laws mostly predict loss, but models are usually compared on downstream task performance.
arXiv Detail & Related papers (2024-03-13T13:54:00Z) - ACT-Diffusion: Efficient Adversarial Consistency Training for One-step Diffusion Models [59.90959789767886]
We show that optimizing the consistency training loss minimizes the Wasserstein distance between the target and generated distributions.
By incorporating a discriminator into the consistency training framework, our method achieves improved FID scores on the CIFAR10, ImageNet 64$\times$64, and LSUN Cat 256$\times$256 datasets.
arXiv Detail & Related papers (2023-11-23T16:49:06Z) - Training with Mixed-Precision Floating-Point Assignments [8.5323697848377]
We generate precision assignments for convolutional neural networks that use less memory.
We evaluate our technique on image classification tasks by training convolutional networks on CIFAR-10, CIFAR-100, and ImageNet.
arXiv Detail & Related papers (2023-01-31T08:01:35Z) - Adaptive Low-Precision Training for Embeddings in Click-Through Rate Prediction [36.605153166169224]
Embedding tables are usually huge in click-through rate (CTR) prediction models.
We formulate a novel quantization training paradigm, termed low-precision training, that compresses the embeddings starting from the training stage.
For the first time in CTR models, we successfully train 8-bit embeddings without sacrificing prediction accuracy.
arXiv Detail & Related papers (2022-12-12T07:19:14Z) - Adversarial Unlearning: Reducing Confidence Along Adversarial Directions [88.46039795134993]
We propose a complementary regularization strategy that reduces confidence on self-generated examples.
The method, which we call RCAD, aims to reduce confidence on out-of-distribution examples lying along directions adversarially chosen to increase training loss.
Despite its simplicity, we find on many classification benchmarks that RCAD can be added to existing techniques to increase test accuracy by 1-3% in absolute value.
arXiv Detail & Related papers (2022-06-03T02:26:24Z) - BMPQ: Bit-Gradient Sensitivity Driven Mixed-Precision Quantization of DNNs from Scratch [11.32458063021286]
This paper presents BMPQ, a training method that uses bit gradients to analyze layer sensitivities and yield mixed-precision quantized models.
It requires a single training iteration but does not need a pre-trained baseline.
Compared to the baseline FP-32 models, BMPQ can yield models that have 15.4x fewer parameter bits with a negligible drop in accuracy.
arXiv Detail & Related papers (2021-12-24T03:16:58Z) - ProgFed: Effective, Communication, and Computation Efficient Federated Learning by Progressive Training [78.44473677588887]
We propose ProgFed, a progressive training framework for efficient and effective federated learning.
It inherently reduces computation and two-way communication costs while maintaining the strong performance of the final models.
Our results show that ProgFed converges at the same rate as standard training on full models.
arXiv Detail & Related papers (2021-10-11T14:45:00Z) - How Low Can We Go: Trading Memory for Error in Low-Precision Training [52.94003953419242]
Low-precision arithmetic trains deep learning models using less energy, less memory and less time.
We pay a price for the savings: lower precision may yield larger round-off error and hence larger prediction error.
We borrow ideas from meta-learning to learn the tradeoff between memory and error.
arXiv Detail & Related papers (2021-06-17T17:38:07Z) - PositNN: Training Deep Neural Networks with Mixed Low-Precision Posit [5.534626267734822]
The presented research aims to evaluate the feasibility of training deep convolutional neural networks using posits.
A software framework was developed to use simulated posits and quires in end-to-end training and inference.
Results suggest that 8-bit posits can substitute for 32-bit floats during training with no negative impact on the resulting loss and accuracy.
arXiv Detail & Related papers (2021-04-30T19:30:37Z)