S-STE: Continuous Pruning Function for Efficient 2:4 Sparse Pre-training
- URL: http://arxiv.org/abs/2409.09099v3
- Date: Fri, 27 Dec 2024 09:30:18 GMT
- Title: S-STE: Continuous Pruning Function for Efficient 2:4 Sparse Pre-training
- Authors: Yuezhou Hu, Jun Zhu, Jianfei Chen,
- Abstract summary: We propose S-STE, a simple yet powerful 2:4 training method that contains two parts: to continuously project weights to be 2:4 sparse, and to rescale sparse weights with a per-tensor fixed scaling factor.
Results show that our method surpasses previous 2:4 pre-training recipes and is comparable even with full parameter models.
- Score: 20.113352600259226
- License:
- Abstract: Training deep neural networks (DNNs) is costly. Fortunately, Nvidia Ampere and Hopper GPUs can accelerate matrix multiplications twice as fast as a dense equivalent by implementing 2:4 sparsity. However, previous STE-based 2:4 pre-training methods (e.g. STE with hard-thresholding, SR-STE) suffer from optimization difficulties because of discontinuous pruning function. In this study, we comprehensively analyse the bottleneck of traditional N:M sparse training and recognize three drawbacks with discontinuity: incorrect descending direction, inability to predict the amount of descent and sparse mask oscillation. In light of this, we propose S-STE, a simple yet powerful 2:4 training method that contains two parts: to continuously project weights to be 2:4 sparse, and to rescale sparse weights with a per-tensor fixed scaling factor. Besides, we adopt minimum-variance unbiased estimation for activation gradient and FP8 quantization for whole process. Results show that our method surpasses previous 2:4 pre-training recipes and is comparable even with full parameter models. Our toolkit is available at https://github.com/huyz2023/2by4-pretrain.
Related papers
- QuZO: Quantized Zeroth-Order Fine-Tuning for Large Language Models [27.730213115659986]
Language Models (LLMs) are often quantized to lower precision to reduce the memory cost and latency in inference.
Traditional fine-tuning methods require backpropagation, which are error-prone in the low-precision settings.
We propose the Quantized Zeroth-Order (QuZO) framework, specifically designed for fine-tuning LLMs through low-precision forward passes.
arXiv Detail & Related papers (2025-02-17T22:20:31Z) - A Proximal Operator for Inducing 2:4-Sparsity [68.98036844970986]
We derive a regularizer that exploits the local correlation of features to find better sparsity masks in trained models.
We illustrate our method on toy problems and apply it to pruning entire large language models up to 70B parameters.
arXiv Detail & Related papers (2025-01-29T22:05:17Z) - Rotated Runtime Smooth: Training-Free Activation Smoother for accurate INT4 inference [54.2589824716527]
Large language models incur substantial computation and memory movement costs due to their large scale.
Existing approaches separate outliers and normal values into two matrices or migrate outliers from activations to weights, suffering from high latency or accuracy degradation.
We propose Rotated Smooth (RRS), a plug-and-play activation smoother for quantization, consisting of Smooth and Rotation operation.
The proposed method outperforms the state-of-the-art method in the LLaMA and Qwen families and improves WikiText-2 perplexity from 57.33 to 6.66 for INT4 inference.
arXiv Detail & Related papers (2024-09-30T14:59:22Z) - Accelerating Transformer Pre-training with 2:4 Sparsity [19.64391647966267]
NVIDIA Ampere GPUs can execute a fine-grained 2:4 sparse matrix multiplication twice as fast as its dense equivalent.
We propose three techniques to preserve accuracy: to modify the sparse-refined straight-through estimator, to determine a feasible decay factor in warm-up stage, and to enhance the model's quality.
Our algorithm achieves similar convergence to dense training algorithms on several transformer pre-training tasks, while actual acceleration can be observed on different shapes of transformer block apparently.
arXiv Detail & Related papers (2024-04-02T11:12:42Z) - QUIK: Towards End-to-End 4-Bit Inference on Generative Large Language
Models [57.04178959678024]
We show that the majority of inference computations for large generative models can be performed with both weights and activations being cast to 4 bits.
We achieve this via a hybrid quantization strategy called QUIK, which compresses most of the weights and activations to 4-bit.
We provide GPU kernels matching the QUIK format with highly-efficient layer-wise runtimes, which lead to practical end-to-end throughput improvements of up to 3.4x.
arXiv Detail & Related papers (2023-10-13T17:15:05Z) - Winner-Take-All Column Row Sampling for Memory Efficient Adaptation of Language Model [89.8764435351222]
We propose a new family of unbiased estimators called WTA-CRS, for matrix production with reduced variance.
Our work provides both theoretical and experimental evidence that, in the context of tuning transformers, our proposed estimators exhibit lower variance compared to existing ones.
arXiv Detail & Related papers (2023-05-24T15:52:08Z) - Minimum Variance Unbiased N:M Sparsity for the Neural Gradients [29.555643722721882]
In deep learning, fine-grained N:M sparsity reduces the data footprint and bandwidth of a General Matrix multiply (GEMM) up to x2.
We examine how this method can be used also for the neural gradients.
arXiv Detail & Related papers (2022-03-21T13:59:43Z) - Accurate Neural Training with 4-bit Matrix Multiplications at Standard Formats [30.28190081697757]
Quantization of the weights and activations is one of the main methods to reduce the computational footprint of Deep Neural Networks (DNNs) training.
We suggest a $textitlogarithmic unbiased quantization$ (LUQ) method to quantize both the forward and backward phases to 4-bit.
arXiv Detail & Related papers (2021-12-19T14:16:55Z) - Efficient Neural Network Training via Forward and Backward Propagation
Sparsification [26.301103403328312]
We propose an efficient sparse training method with completely sparse forward and backward passes.
Our algorithm is much more effective in accelerating the training process, up to an order of magnitude faster.
arXiv Detail & Related papers (2021-11-10T13:49:47Z) - 8-bit Optimizers via Block-wise Quantization [57.25800395197516]
Statefuls maintain statistics over time, e.g., the exponentially smoothed sum (SGD with momentum) or squared sum (Adam) of past values.
This state can be used to accelerate optimization compared to plain gradient descent but uses memory that might otherwise be allocated to model parameters.
In this paper, we develop first gradients that use 8-bit statistics while maintaining the performance levels of using 32-bit gradient states.
arXiv Detail & Related papers (2021-10-06T15:43:20Z) - Learning N:M Fine-grained Structured Sparse Neural Networks From Scratch [75.69506249886622]
Sparsity in Deep Neural Networks (DNNs) has been widely studied to compress and accelerate the models on resource-constrained environments.
In this paper, we are the first to study training from scratch an N:M fine-grained structured sparse network.
arXiv Detail & Related papers (2021-02-08T05:55:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.