Memory Efficient Mixed-Precision Optimizers
- URL: http://arxiv.org/abs/2309.12381v1
- Date: Thu, 21 Sep 2023 13:55:29 GMT
- Title: Memory Efficient Mixed-Precision Optimizers
- Authors: Basile Lewandowski and Atli Kosson
- Abstract summary: Mixed-precision optimization techniques use both single- and half-precision floating-point arithmetic to reduce memory requirements while maintaining model accuracy.
In practice, we achieve up to 25% lower peak memory use and 15% faster training while maintaining the same level of accuracy.
- Score: 4.295034299713293
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Traditional optimization methods rely on single-precision floating-point arithmetic, which is costly in terms of memory size and computing power. Mixed-precision optimization techniques instead use both single- and half-precision floating-point arithmetic to reduce memory requirements while maintaining model accuracy. We provide an algorithm that further reduces memory usage during training by eliminating the single-precision copy of the parameters, effectively keeping only half-precision numbers. We also explore the benefits of discarding the gradient values by executing the optimizer step during back-propagation. In practice, we achieve up to 25% lower peak memory use and 15% faster training while maintaining the same level of accuracy.
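To make the second idea concrete, here is a minimal sketch, assuming PyTorch 2.1+ and its per-parameter post-accumulate-grad hooks, of how an optimizer step can be fused into back-propagation so that each gradient is consumed and freed as soon as it is produced, with the parameters kept in half precision only. It is an illustration, not the authors' algorithm: it uses plain SGD and does not reproduce how the paper recovers the accuracy normally provided by the single-precision master copy. The helper name attach_sgd_in_backward is ours.

```python
# Minimal sketch (not the authors' released code), assuming PyTorch >= 2.1:
# apply the optimizer step inside back-propagation via post-accumulation
# hooks, freeing each gradient right away, with parameters kept in half
# precision only (no single-precision master copy).
import torch

def attach_sgd_in_backward(model: torch.nn.Module, lr: float = 1e-2):
    """Register hooks so each parameter is updated as soon as its gradient is ready."""
    for p in model.parameters():
        if not p.requires_grad:
            continue

        def hook(param):
            with torch.no_grad():
                param.add_(param.grad, alpha=-lr)  # plain SGD step, in half precision
            param.grad = None                      # free the gradient immediately

        p.register_post_accumulate_grad_hook(hook)

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.bfloat16  # CPU fp16 matmul support varies
model = torch.nn.Linear(256, 256, device=device, dtype=dtype)
attach_sgd_in_backward(model)

x = torch.randn(32, 256, device=device, dtype=dtype)
loss = model(x).float().pow(2).mean()
loss.backward()   # parameter updates happen here; no optimizer.step() call
```

Because the update runs inside backward(), no optimizer.step() call is needed and no persistent .grad buffers are kept, which is where the peak-memory saving in this style of training comes from.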
Related papers
- AlignedKV: Reducing Memory Access of KV-Cache with Precision-Aligned Quantization [5.572159724234467]
Mixed-precision quantization distinguishes between important and unimportant parameters.
Existing approaches can only identify important parameters through qualitative analysis and manual experiments.
We propose a new criterion, so-called 'precision alignment', to build a quantitative framework to holistically evaluate the importance of parameters.
arXiv Detail & Related papers (2024-09-25T01:39:02Z)
- AdaLomo: Low-memory Optimization with Adaptive Learning Rate [59.64965955386855]
We introduce low-memory optimization with adaptive learning rate (AdaLomo) for large language models.
AdaLomo achieves results on par with AdamW while significantly reducing memory requirements, thereby lowering the hardware barrier to training large language models.
arXiv Detail & Related papers (2023-10-16T09:04:28Z)
- Memory Efficient Optimizers with 4-bit States [22.605392665667136]
We push optimizer state bitwidth down to 4 bits through a detailed empirical analysis of the first and second moments.
We use a smaller block size and propose to utilize both row-wise and column-wise information for better quantization.
Our 4-bit optimizers are evaluated on a wide variety of benchmarks including natural language understanding, machine translation, image classification, and instruction tuning.
arXiv Detail & Related papers (2023-09-04T10:27:17Z)
- Guaranteed Approximation Bounds for Mixed-Precision Neural Operators [83.64404557466528]
We build on the intuition that neural operator learning inherently induces an approximation error.
We show that our approach reduces GPU memory usage by up to 50% and improves throughput by 58% with little or no reduction in accuracy.
arXiv Detail & Related papers (2023-07-27T17:42:06Z)
- CAME: Confidence-guided Adaptive Memory Efficient Optimization [20.009302737137787]
Adaptive gradient methods have demonstrated excellent performance in the training of large language models.
Maintaining second-moment estimates of the gradients incurs a high cost in extra memory overhead.
Several memory-efficient optimizers have been proposed to drastically reduce auxiliary memory usage, but at a performance penalty.
We propose CAME to simultaneously achieve two goals: fast convergence as in traditional adaptive methods, and low memory usage as in memory-efficient methods.
arXiv Detail & Related papers (2023-07-05T06:05:36Z)
- Winner-Take-All Column Row Sampling for Memory Efficient Adaptation of Language Model [89.8764435351222]
We propose a new family of unbiased estimators, called WTA-CRS, for approximating matrix products with reduced variance.
Our work provides both theoretical and experimental evidence that, in the context of tuning transformers, our proposed estimators exhibit lower variance compared to existing ones.
arXiv Detail & Related papers (2023-05-24T15:52:08Z)
- Training with Mixed-Precision Floating-Point Assignments [8.5323697848377]
We generate precision assignments for convolutional neural networks that use less memory.
We evaluate our technique on image classification tasks by training convolutional networks on CIFAR-10, CIFAR-100, and ImageNet.
arXiv Detail & Related papers (2023-01-31T08:01:35Z)
- Quantized Neural Networks for Low-Precision Accumulation with Guaranteed Overflow Avoidance [68.8204255655161]
We introduce a quantization-aware training algorithm that guarantees avoiding numerical overflow when reducing the precision of accumulators during inference.
We evaluate our algorithm across multiple quantized models that we train for different tasks, showing that our approach can reduce the precision of accumulators while maintaining model accuracy with respect to a floating-point baseline.
arXiv Detail & Related papers (2023-01-31T02:46:57Z)
- Mesa: A Memory-saving Training Framework for Transformers [58.78933015299703]
We present Mesa, a memory-saving training framework for Transformers.
Mesa uses exact activations during the forward pass while storing a low-precision version of the activations to reduce memory consumption during training.
Experiments on ImageNet, CIFAR-100 and ADE20K demonstrate that Mesa can roughly halve the memory footprint during training.
arXiv Detail & Related papers (2021-11-22T11:23:01Z)
- 8-bit Optimizers via Block-wise Quantization [57.25800395197516]
Stateful optimizers maintain gradient statistics over time, e.g., the exponentially smoothed sum (SGD with momentum) or squared sum (Adam) of past gradient values.
This state can be used to accelerate optimization compared to plain gradient descent but uses memory that might otherwise be allocated to model parameters.
In this paper, we develop the first optimizers that use 8-bit statistics while maintaining the performance levels of using 32-bit optimizer states. (A rough sketch of block-wise state quantization follows this list.)
arXiv Detail & Related papers (2021-10-06T15:43:20Z)
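For intuition on how low-bit optimizer states (as in the 8-bit and 4-bit papers above) save memory, the sketch below shows block-wise absmax quantization of a single state tensor to int8. It is a rough illustration only: the actual methods use dynamic quantization maps, tuned block sizes, and stabilization tricks that are not reproduced here, and the helper names and BLOCK value are ours.

```python
# Rough illustration of block-wise absmax quantization of an optimizer state
# tensor to int8 (1 byte per value plus one scale per block). Not the
# implementation used by the cited papers.
import torch
import torch.nn.functional as F

BLOCK = 256  # illustrative block size, not the value used in the papers

def quantize_blockwise(state: torch.Tensor):
    """Quantize a state tensor to int8 with one absmax scale per block."""
    flat = state.detach().reshape(-1).float()
    pad = (-flat.numel()) % BLOCK
    flat = F.pad(flat, (0, pad))                   # pad to a whole number of blocks
    blocks = flat.view(-1, BLOCK)
    scales = blocks.abs().amax(dim=1, keepdim=True).clamp_min(1e-12)
    q = torch.round(blocks / scales * 127).to(torch.int8)
    return q, scales, state.shape, pad

def dequantize_blockwise(q, scales, shape, pad):
    """Reconstruct an approximate float tensor from the int8 blocks."""
    flat = (q.float() / 127) * scales
    flat = flat.view(-1)
    if pad:
        flat = flat[:-pad]
    return flat.view(shape)

m = torch.randn(1000, 300)                         # e.g. an Adam first-moment buffer
q, scales, shape, pad = quantize_blockwise(m)
m_hat = dequantize_blockwise(q, scales, shape, pad)
print("max abs error:", (m - m_hat).abs().max().item())   # small per-block error
```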
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.