Related papers: Accelerating Transformer Pre-training with 2:4 Sparsity

Accelerating Transformer Pre-training with 2:4 Sparsity

URL: http://arxiv.org/abs/2404.01847v3
Date: Sun, 27 Oct 2024 14:40:08 GMT
Title: Accelerating Transformer Pre-training with 2:4 Sparsity
Authors: Yuezhou Hu, Kang Zhao, Weiyu Huang, Jianfei Chen, Jun Zhu,
Abstract summary: NVIDIA Ampere GPUs can execute a fine-grained 2:4 sparse matrix multiplication twice as fast as its dense equivalent. We propose three techniques to preserve accuracy: to modify the sparse-refined straight-through estimator, to determine a feasible decay factor in warm-up stage, and to enhance the model's quality. Our algorithm achieves similar convergence to dense training algorithms on several transformer pre-training tasks, while actual acceleration can be observed on different shapes of transformer block apparently.
Score: 19.64391647966267
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Training large transformers is slow, but recent innovations on GPU architecture give us an advantage. NVIDIA Ampere GPUs can execute a fine-grained 2:4 sparse matrix multiplication twice as fast as its dense equivalent. In the light of this property, we comprehensively investigate the feasibility of accelerating feed-forward networks (FFNs) of transformers in pre-training. First, we define a ``flip rate'' to monitor the stability of a 2:4 training process. Utilizing this metric, we propose three techniques to preserve accuracy: to modify the sparse-refined straight-through estimator by applying the masked decay term on gradients, to determine a feasible decay factor in warm-up stage, and to enhance the model's quality by a dense fine-tuning procedure near the end of pre-training. Besides, we devise two techniques to practically accelerate training: to calculate transposable 2:4 masks by convolution, and to accelerate gated activation functions by reducing GPU L2 cache miss. Experiments show that our 2:4 sparse training algorithm achieves similar convergence to dense training algorithms on several transformer pre-training tasks, while actual acceleration can be observed on different shapes of transformer block apparently. Our toolkit is available at https://github.com/huyz2023/2by4-pretrain.

Related papers

A Stable Whitening Optimizer for Efficient Neural Network Training [101.89246340672246]
Building on the Shampoo family of algorithms, we identify and alleviate three key issues, resulting in the proposed SPlus method.<n>First, we find that naive Shampoo is prone to divergence when matrix-inverses are cached for long periods.<n>Second, we adapt a shape-aware scaling to enable learning rate transfer across network width.<n>Third, we find that high learning rates result in large parameter noise, and propose a simple iterate-averaging scheme which unblocks faster learning.
arXiv Detail & Related papers (2025-06-08T18:43:31Z)
High-Frequency Prior-Driven Adaptive Masking for Accelerating Image Super-Resolution [87.56382172827526]
High-frequency regions are most critical for reconstruction.<n>We propose a training-free adaptive masking module for acceleration.<n>Our method reduces FLOPs by 24--43% for state-of-the-art models.
arXiv Detail & Related papers (2025-05-11T13:18:03Z)
S-STE: Continuous Pruning Function for Efficient 2:4 Sparse Pre-training [20.113352600259226]
We propose S-STE, a simple yet powerful 2:4 training method that contains two parts: to continuously project weights to be 2:4 sparse, and to rescale sparse weights with a per-tensor fixed scaling factor. Results show that our method surpass previous 2:4 pre-training recipes and is comparable even with full parameter models.
arXiv Detail & Related papers (2024-09-13T08:29:36Z)
Learning-to-Cache: Accelerating Diffusion Transformer via Layer Caching [56.286064975443026]
We make an interesting and somehow surprising observation: the computation of a large proportion of layers in the diffusion transformer, through a caching mechanism, can be readily removed even without updating the model parameters. We introduce a novel scheme, named Learningto-Cache (L2C), that learns to conduct caching in a dynamic manner for diffusion transformers. Experimental results show that L2C largely outperforms samplers such as DDIM and DPM-r, alongside prior cache-based methods at the same inference speed.
arXiv Detail & Related papers (2024-06-03T18:49:57Z)
Training Transformers with 4-bit Integers [21.861232105539933]
Quantizing activation, weight, and gradient to 4-bit is promising to accelerate neural network training. Existing 4-bit training methods require custom numerical formats which are not supported by contemporary hardware. In this work, we propose a training method for transformers with all matrix multiplications implemented with the INT4 arithmetic.
arXiv Detail & Related papers (2023-06-21T02:45:01Z)
MixFormerV2: Efficient Fully Transformer Tracking [49.07428299165031]
Transformer-based trackers have achieved strong accuracy on the standard benchmarks. But their efficiency remains an obstacle to practical deployment on both GPU and CPU platforms. We propose a fully transformer tracking framework, coined as emphMixFormerV2, without any dense convolutional operation and complex score prediction module.
arXiv Detail & Related papers (2023-05-25T09:50:54Z)
Winner-Take-All Column Row Sampling for Memory Efficient Adaptation of Language Model [89.8764435351222]
We propose a new family of unbiased estimators called WTA-CRS, for matrix production with reduced variance. Our work provides both theoretical and experimental evidence that, in the context of tuning transformers, our proposed estimators exhibit lower variance compared to existing ones.
arXiv Detail & Related papers (2023-05-24T15:52:08Z)
Boost Vision Transformer with GPU-Friendly Sparsity and Quantization [29.96026533220083]
This paper thoroughly designs a compression scheme to maximally utilize the GPU-friendly 2:4 fine-grained structured sparsity and quantization. Experiment results show GPUSQ-ViT scheme achieves state-of-the-art compression by reducing vision transformer models 6.4-12.7 times on model size and 30.3-62 times on FLOPs.
arXiv Detail & Related papers (2023-05-18T05:55:48Z)
Balance is Essence: Accelerating Sparse Training via Adaptive Gradient Correction [29.61757744974324]
Deep neural networks require significant memory and computation costs. Sparse training is one of the most common techniques to reduce these costs. In this work, we aim to overcome this problem and achieve space-time co-efficiency.
arXiv Detail & Related papers (2023-01-09T18:50:03Z)
NeuS2: Fast Learning of Neural Implicit Surfaces for Multi-view Reconstruction [95.37644907940857]
We propose a fast neural surface reconstruction approach, called NeuS2. NeuS2 achieves two orders of magnitude improvement in terms of acceleration without compromising reconstruction quality. We extend our method for fast training of dynamic scenes, with a proposed incremental training strategy and a novel global transformation prediction component.
arXiv Detail & Related papers (2022-12-10T07:19:43Z)
Adan: Adaptive Nesterov Momentum Algorithm for Faster Optimizing Deep Models [158.19276683455254]
Adaptive gradient algorithms borrow the moving average idea of heavy ball acceleration to estimate accurate first second-order moments of gradient for accelerating convergence. Nesterov acceleration converges faster than ball acceleration in theory and also in many empirical cases. In this paper we develop a new Nesterov momentum estimation (NME) method, which avoids the extra computation and memory overhead of computing gradient at the point. We show that Adan surpasses the corresponding SoTAs on both vision transformers (ViTs and CNNs) and sets new SoTAs for many popular networks.
arXiv Detail & Related papers (2022-08-13T16:04:39Z)
Primer: Searching for Efficient Transformers for Language Modeling [79.2677566332444]
Training and inference costs of large Transformer models have grown rapidly and become expensive. Here we aim to reduce the costs of Transformers by searching for a more efficient variant. We identify an architecture, named Primer, that has a smaller training cost than the original Transformer.
arXiv Detail & Related papers (2021-09-17T17:50:39Z)
Exploiting Activation based Gradient Output Sparsity to Accelerate Backpropagation in CNNs [15.465530153038927]
Machine/deep-learning (ML/DL) based techniques are emerging as a driving force behind many cutting-edge technologies. However, training these models involving large parameters is both time-consuming and energy-hogging.
arXiv Detail & Related papers (2021-09-16T04:12:51Z)
Stable, Fast and Accurate: Kernelized Attention with Relative Positional Encoding [63.539333383965726]
We propose a novel way to accelerate attention calculation for Transformers with relative positional encoding (RPE) Based upon the observation that relative positional encoding forms a Toeplitz matrix, we mathematically show that kernelized attention with RPE can be calculated efficiently using Fast Fourier Transform (FFT)
arXiv Detail & Related papers (2021-06-23T17:51:26Z)

This list is automatically generated from the titles and abstracts of the papers in this site.