Related papers: S-STE: Continuous Pruning Function for Efficient 2:4 Sparse Pre-training

URL: http://arxiv.org/abs/2409.09099v3
Date: Fri, 27 Dec 2024 09:30:18 GMT
Title: S-STE: Continuous Pruning Function for Efficient 2:4 Sparse Pre-training
Authors: Yuezhou Hu, Jun Zhu, Jianfei Chen
Abstract summary: We propose S-STE, a simple yet powerful 2:4 training method that contains two parts: to continuously project weights to be 2:4 sparse, and to rescale sparse weights with a per-tensor fixed scaling factor. Results show that our method surpasses previous 2:4 pre-training recipes and is comparable even with full parameter models.
Score: 20.113352600259226
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Training deep neural networks (DNNs) is costly. Fortunately, Nvidia Ampere and Hopper GPUs can execute 2:4 sparse matrix multiplications twice as fast as their dense equivalents. However, previous STE-based 2:4 pre-training methods (e.g. STE with hard-thresholding, SR-STE) suffer from optimization difficulties because of the discontinuous pruning function. In this study, we comprehensively analyse the bottleneck of traditional N:M sparse training and recognize three drawbacks of discontinuity: incorrect descending direction, inability to predict the amount of descent, and sparse mask oscillation. In light of this, we propose S-STE, a simple yet powerful 2:4 training method that contains two parts: to continuously project weights to be 2:4 sparse, and to rescale sparse weights with a per-tensor fixed scaling factor. Besides, we adopt minimum-variance unbiased estimation for the activation gradient and FP8 quantization for the whole process. Results show that our method surpasses previous 2:4 pre-training recipes and is comparable even with full parameter models. Our toolkit is available at https://github.com/huyz2023/2by4-pretrain.
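
To make the 2:4 constraint in the abstract concrete, the following is a minimal pure-Python sketch, not the authors' implementation. `prune_2_4_hard` shows magnitude-based hard-thresholding (keep the two largest-magnitude weights in each group of four), which jumps discontinuously when two magnitudes swap rank. `prune_2_4_soft` shows one possible continuous projection, assumed here for illustration: soft-thresholding each group by its third-largest magnitude. Both function names, the choice of threshold, and the flat-list weight layout are assumptions; the per-tensor rescaling step is omitted because the abstract does not specify it.

```python
import math


def prune_2_4_hard(weights):
    """Keep the 2 largest-magnitude values in every group of 4; zero the rest."""
    if len(weights) % 4 != 0:
        raise ValueError("length must be a multiple of 4")
    out = []
    for i in range(0, len(weights), 4):
        group = weights[i:i + 4]
        # Indices of the two largest magnitudes (ties go to the earlier index).
        keep = set(sorted(range(4), key=lambda j: -abs(group[j]))[:2])
        out.extend(w if j in keep else 0.0 for j, w in enumerate(group))
    return out


def prune_2_4_soft(weights):
    """Assumed continuous variant: soft-threshold each group of 4.

    The threshold t is the third-largest magnitude in the group, so at most
    two entries stay nonzero and the output varies continuously with the input.
    """
    if len(weights) % 4 != 0:
        raise ValueError("length must be a multiple of 4")
    out = []
    for i in range(0, len(weights), 4):
        group = weights[i:i + 4]
        t = sorted(abs(w) for w in group)[1]  # second smallest = third largest
        out.extend(math.copysign(max(abs(w) - t, 0.0), w) for w in group)
    return out


if __name__ == "__main__":
    w = [4, -1, 2, -3, 0.5, 0.25, -2, 1]
    print(prune_2_4_hard(w))
    print(prune_2_4_soft(w))
```

The soft variant shrinks the surviving weights toward zero, which is one reason a rescaling factor such as the one mentioned in the abstract can be useful.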

Related papers

KurTail : Kurtosis-based LLM Quantization [51.24081396305435]
KurTail is a new post-training quantization scheme that mitigates outliers in the activations of large language models. It offers a 13.3% boost in MMLU accuracy and a 15.5% drop in Wiki perplexity compared to QuaRot. It also outperforms SpinQuant with a 2.6% MMLU gain and reduces perplexity by 2.9%, all while reducing the training cost.
arXiv Detail & Related papers (2025-03-03T12:43:06Z)
QuZO: Quantized Zeroth-Order Fine-Tuning for Large Language Models [27.730213115659986]
Large Language Models (LLMs) are often quantized to lower precision to reduce the memory cost and latency in inference. Traditional fine-tuning methods require backpropagation, which is error-prone in low-precision settings. We propose the Quantized Zeroth-Order (QuZO) framework, specifically designed for fine-tuning LLMs through low-precision forward passes.
arXiv Detail & Related papers (2025-02-17T22:20:31Z)
SVDQuant: Absorbing Outliers by Low-Rank Components for 4-Bit Diffusion Models [58.5019443418822]
Diffusion models can generate high-quality images, but as they scale, rising memory demands and higher latency pose deployment challenges. We propose SVDQuant, a new 4-bit quantization paradigm to overcome this limitation. We reduce the memory usage for the 12B FLUX.1 models by 3.5$\times$, achieving 3.0$\times$ speedup over the 4-bit weight-only quantization (W4A16) baseline.
arXiv Detail & Related papers (2024-11-07T18:59:58Z)
Rotated Runtime Smooth: Training-Free Activation Smoother for accurate INT4 inference [54.2589824716527]
Large language models incur substantial computation and memory movement costs due to their large scale. Existing approaches separate outliers and normal values into two matrices or migrate outliers from activations to weights, suffering from high latency or accuracy degradation. We propose Rotated Runtime Smooth (RRS), a plug-and-play activation smoother for quantization, consisting of Runtime Smooth and Rotation operations. The proposed method outperforms the state-of-the-art method in the LLaMA and Qwen families and improves WikiText-2 perplexity from 57.33 to 6.66 for INT4 inference.
arXiv Detail & Related papers (2024-09-30T14:59:22Z)
Accelerating Transformer Pre-training with 2:4 Sparsity [19.64391647966267]
NVIDIA Ampere GPUs can execute a fine-grained 2:4 sparse matrix multiplication twice as fast as its dense equivalent. We propose three techniques to preserve accuracy: to modify the sparse-refined straight-through estimator, to determine a feasible decay factor in the warm-up stage, and to enhance the model's quality. Our algorithm achieves similar convergence to dense training algorithms on several transformer pre-training tasks, while actual acceleration is observed across transformer blocks of different shapes.
arXiv Detail & Related papers (2024-04-02T11:12:42Z)
QUIK: Towards End-to-End 4-Bit Inference on Generative Large Language Models [57.04178959678024]
We show that the majority of inference computations for large generative models can be performed with both weights and activations being cast to 4 bits. We achieve this via a hybrid quantization strategy called QUIK, which compresses most of the weights and activations to 4-bit. We provide GPU kernels matching the QUIK format with highly-efficient layer-wise runtimes, which lead to practical end-to-end throughput improvements of up to 3.4x.
arXiv Detail & Related papers (2023-10-13T17:15:05Z)
Winner-Take-All Column Row Sampling for Memory Efficient Adaptation of Language Model [89.8764435351222]
We propose a new family of unbiased estimators called WTA-CRS, for matrix production with reduced variance. Our work provides both theoretical and experimental evidence that, in the context of tuning transformers, our proposed estimators exhibit lower variance compared to existing ones.
arXiv Detail & Related papers (2023-05-24T15:52:08Z)
Quantized Neural Networks for Low-Precision Accumulation with Guaranteed Overflow Avoidance [68.8204255655161]
We introduce a quantization-aware training algorithm that guarantees avoiding numerical overflow when reducing the precision of accumulators during inference. We evaluate our algorithm across multiple quantized models that we train for different tasks, showing that our approach can reduce the precision of accumulators while maintaining model accuracy with respect to a floating-point baseline.
arXiv Detail & Related papers (2023-01-31T02:46:57Z)
Minimum Variance Unbiased N:M Sparsity for the Neural Gradients [29.555643722721882]
In deep learning, fine-grained N:M sparsity reduces the data footprint and bandwidth of a General Matrix multiply (GEMM) by up to 2x. We examine how this method can also be used for the neural gradients.
arXiv Detail & Related papers (2022-03-21T13:59:43Z)
LG-LSQ: Learned Gradient Linear Symmetric Quantization [3.6816597150770387]
Deep neural networks with lower precision weights have advantages in terms of the cost of memory space and accelerator power. The main challenge associated with the quantization algorithm is maintaining accuracy at low bit-widths. We propose learned gradient linear symmetric quantization (LG-LSQ) as a method for quantizing weights and activation functions to low bit-widths.
arXiv Detail & Related papers (2022-02-18T03:38:12Z)
Accurate Neural Training with 4-bit Matrix Multiplications at Standard Formats [30.28190081697757]
Quantization of the weights and activations is one of the main methods to reduce the computational footprint of Deep Neural Networks (DNNs) training. We suggest a $\textit{logarithmic unbiased quantization}$ (LUQ) method to quantize both the forward and backward phases to 4-bit.
arXiv Detail & Related papers (2021-12-19T14:16:55Z)
Efficient Neural Network Training via Forward and Backward Propagation Sparsification [26.301103403328312]
We propose an efficient sparse training method with completely sparse forward and backward passes. Our algorithm is much more effective in accelerating the training process, up to an order of magnitude faster.
arXiv Detail & Related papers (2021-11-10T13:49:47Z)
8-bit Optimizers via Block-wise Quantization [57.25800395197516]
Stateful optimizers maintain statistics over time, e.g., the exponentially smoothed sum (SGD with momentum) or squared sum (Adam) of past gradient values. This state can be used to accelerate optimization compared to plain gradient descent but uses memory that might otherwise be allocated to model parameters. In this paper, we develop the first optimizers that use 8-bit statistics while maintaining the performance levels of using 32-bit optimizer states.
arXiv Detail & Related papers (2021-10-06T15:43:20Z)
Learning N:M Fine-grained Structured Sparse Neural Networks From Scratch [75.69506249886622]
Sparsity in Deep Neural Networks (DNNs) has been widely studied to compress and accelerate the models on resource-constrained environments. In this paper, we are the first to study training from scratch an N:M fine-grained structured sparse network.
arXiv Detail & Related papers (2021-02-08T05:55:47Z)
Towards Unified INT8 Training for Convolutional Neural Network [83.15673050981624]
We build a unified 8-bit (INT8) training framework for common convolutional neural networks. First, we empirically find the four distinctive characteristics of gradients, which provide us insightful clues for gradient quantization. We propose two universal techniques, including Direction Sensitive Gradient Clipping that reduces the direction deviation of gradients.
arXiv Detail & Related papers (2019-12-29T08:37:53Z)

This list is automatically generated from the titles and abstracts of the papers on this site.