Collage: Light-Weight Low-Precision Strategy for LLM Training
- URL: http://arxiv.org/abs/2405.03637v1
- Date: Mon, 6 May 2024 16:55:30 GMT
- Title: Collage: Light-Weight Low-Precision Strategy for LLM Training
- Authors: Tao Yu, Gaurav Gupta, Karthick Gopalswamy, Amith Mamidala, Hao Zhou, Jeffrey Huynh, Youngsuk Park, Ron Diamant, Anoop Deoras, Luke Huan
- Abstract summary: We argue that low-precision floating points can perform well provided the error is properly compensated at the critical locations in the training process.
We propose Collage, which utilizes a multi-component float representation in low precision to perform operations accurately, with numerical errors accounted for.
Our method works with commonly used low-precision formats such as half precision ($16$-bit floating point) and can be naturally extended to even lower precision such as $8$-bit.
- Score: 21.190363633580233
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Training large models is plagued by intense compute cost and limited hardware memory. A practical solution is low-precision representation, but this suffers from loss of numerical accuracy and unstable training, rendering the model less useful. We argue that low-precision floating points can perform well provided the error is properly compensated at the critical locations in the training process. We propose Collage, which utilizes a multi-component float representation in low precision to perform operations accurately, with numerical errors accounted for. To understand the impact of imprecision on training, we propose a simple and novel metric that tracks the lost information during training and differentiates various precision strategies. Our method works with commonly used low-precision formats such as half precision ($16$-bit floating point) and can be naturally extended to even lower precision such as $8$-bit. Experimental results show that pre-training using Collage removes the requirement of keeping $32$-bit floating-point copies of the model and attains similar or better training performance compared to the $(16, 32)$-bit mixed-precision strategy, with up to $3.7\times$ speedup and $\sim 15\%$ to $23\%$ less memory usage in practice.
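For intuition, below is a minimal sketch of the multi-component float idea, assuming NumPy's float16 as the low-precision format: a value is stored as an unevaluated sum of a head and a tail component, and the standard TwoSum error-free transformation retains the rounding error of each update instead of discarding it. Collage applies multi-component floats at critical points of the training loop so that a 32-bit master copy of the model is no longer needed; the class name `MCF16`, the toy weight/update values, and the "fraction lost" printout here are illustrative assumptions, not Collage's actual implementation or its lost-information metric.

```python
import numpy as np

def two_sum(a, b):
    """Knuth's TwoSum error-free transformation in float16: returns (s, e)
    where s = fl(a + b) and e is the rounding error, so s + e == a + b."""
    a, b = np.float16(a), np.float16(b)
    s = np.float16(a + b)
    bp = np.float16(s - a)                      # part of b absorbed into s
    e = np.float16((a - np.float16(s - bp)) + np.float16(b - bp))
    return s, e

class MCF16:
    """Toy two-component float16 value (head + tail): the rounding error of
    each update is carried in the tail instead of being thrown away."""
    def __init__(self, value=0.0):
        self.head = np.float16(value)
        self.tail = np.float16(0.0)

    def add(self, x):
        s, e = two_sum(self.head, x)
        self.tail = np.float16(self.tail + e)   # keep the lost bits
        self.head, self.tail = two_sum(s, self.tail)

    def value(self):
        return float(self.head) + float(self.tail)

# Demo: apply 1000 small "weight updates" of 0.05 to a weight of 256.
# Near 256, float16 spacing is 0.25, so a plain float16 weight silently
# drops every update; the two-component version accumulates them.
w_plain = np.float16(256.0)
w_mcf = MCF16(256.0)
for _ in range(1000):
    w_plain = np.float16(w_plain + np.float16(0.05))
    w_mcf.add(0.05)

intended = 256.0 + 1000 * float(np.float16(0.05))   # roughly 306
print(f"plain float16 : {float(w_plain):.2f}")       # stays at 256.00
print(f"two-component : {w_mcf.value():.2f}")        # close to 306
# Crude illustrative proxy for "lost information": fraction of the
# intended update that never reached the plain float16 weight.
lost = (intended - float(w_plain)) / (intended - 256.0)
print(f"update fraction lost in plain float16: {lost:.2%}")
```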
Related papers
- Just How Flexible are Neural Networks in Practice? [89.80474583606242]
It is widely believed that a neural network can fit a training set containing at least as many samples as it has parameters.
In practice, however, we only find solutions reachable via our training procedure, including gradient descent and regularizers, which limits flexibility.
arXiv Detail & Related papers (2024-06-17T12:24:45Z) - Language models scale reliably with over-training and on downstream tasks [121.69867718185125]
Scaling laws are useful guides for derisking expensive training runs.
However, there remain gaps between current studies and how language models are trained.
Moreover, scaling laws mostly predict loss, but models are usually compared on downstream task performance.
arXiv Detail & Related papers (2024-03-13T13:54:00Z) - ACT-Diffusion: Efficient Adversarial Consistency Training for One-step Diffusion Models [59.90959789767886]
We show that optimizing the consistency training loss minimizes the Wasserstein distance between the target and generated distributions.
By incorporating a discriminator into the consistency training framework, our method achieves improved FID scores on the CIFAR10, ImageNet 64$\times$64, and LSUN Cat 256$\times$256 datasets.
arXiv Detail & Related papers (2023-11-23T16:49:06Z) - Training with Mixed-Precision Floating-Point Assignments [8.5323697848377]
We generate precision assignments for convolutional neural networks that use less memory.
We evaluate our technique on image classification tasks by training convolutional networks on CIFAR-10, CIFAR-100, and ImageNet.
arXiv Detail & Related papers (2023-01-31T08:01:35Z) - Adaptive Low-Precision Training for Embeddings in Click-Through Rate Prediction [36.605153166169224]
Embedding tables are usually huge in click-through rate (CTR) prediction models.
We formulate a novel quantization training paradigm, termed low-precision training, that compresses the embeddings starting from the training stage.
For the first time in CTR models, we successfully train 8-bit embeddings without sacrificing prediction accuracy.
arXiv Detail & Related papers (2022-12-12T07:19:14Z) - Adversarial Unlearning: Reducing Confidence Along Adversarial Directions [88.46039795134993]
We propose a complementary regularization strategy that reduces confidence on self-generated examples.
The method, which we call RCAD, aims to reduce confidence on out-of-distribution examples lying along directions adversarially chosen to increase training loss.
Despite its simplicity, we find on many classification benchmarks that RCAD can be added to existing techniques to increase test accuracy by 1-3% in absolute value.
arXiv Detail & Related papers (2022-06-03T02:26:24Z) - BMPQ: Bit-Gradient Sensitivity Driven Mixed-Precision Quantization of DNNs from Scratch [11.32458063021286]
This paper presents BMPQ, a training method that uses bit gradients to analyze layer sensitivities and yield mixed-precision quantized models.
It requires a single training iteration but does not need a pre-trained baseline.
Compared to the baseline FP-32 models, BMPQ can yield models that have 15.4x fewer parameter bits with a negligible drop in accuracy.
arXiv Detail & Related papers (2021-12-24T03:16:58Z) - ProgFed: Effective, Communication, and Computation Efficient Federated Learning by Progressive Training [78.44473677588887]
We propose ProgFed, a progressive training framework for efficient and effective federated learning.
It inherently reduces computation and two-way communication costs while maintaining the strong performance of the final models.
Our results show that ProgFed converges at the same rate as standard training on full models.
arXiv Detail & Related papers (2021-10-11T14:45:00Z) - How Low Can We Go: Trading Memory for Error in Low-Precision Training [52.94003953419242]
Low-precision arithmetic trains deep learning models using less energy, less memory and less time.
We pay a price for the savings: lower precision may yield larger round-off error and hence larger prediction error.
We borrow ideas from meta-learning to learn the tradeoff between memory and error.
arXiv Detail & Related papers (2021-06-17T17:38:07Z) - PositNN: Training Deep Neural Networks with Mixed Low-Precision Posit [5.534626267734822]
The presented research aims to evaluate the feasibility of training deep convolutional neural networks using posits.
A software framework was developed to use simulated posits and quires in end-to-end training and inference.
Results suggest that 8-bit posits can substitute for 32-bit floats during training with no negative impact on the resulting loss and accuracy.
arXiv Detail & Related papers (2021-04-30T19:30:37Z)