Related papers: SLAB: Efficient Transformers with Simplified Linear Attention and Progressive Re-parameterized Batch Normalization

SLAB: Efficient Transformers with Simplified Linear Attention and Progressive Re-parameterized Batch Normalization

URL: http://arxiv.org/abs/2405.11582v2
Date: Mon, 17 Jun 2024 05:59:24 GMT
Title: SLAB: Efficient Transformers with Simplified Linear Attention and Progressive Re-parameterized Batch Normalization
Authors: Jialong Guo, Xinghao Chen, Yehui Tang, Yunhe Wang,
Abstract summary: This paper investigates the computational bottleneck modules of efficient transformer, i.e., normalization layers and attention modules. LayerNorm is commonly used in transformer architectures but is not computational friendly due to statistic calculation during inference. We propose a novel method named PRepBN to progressively replace LayerNorm with re- parameterized BatchNorm in training.
Score: 36.84275777364218
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Transformers have become foundational architectures for both natural language and computer vision tasks. However, the high computational cost makes it quite challenging to deploy on resource-constraint devices. This paper investigates the computational bottleneck modules of efficient transformer, i.e., normalization layers and attention modules. LayerNorm is commonly used in transformer architectures but is not computational friendly due to statistic calculation during inference. However, replacing LayerNorm with more efficient BatchNorm in transformer often leads to inferior performance and collapse in training. To address this problem, we propose a novel method named PRepBN to progressively replace LayerNorm with re-parameterized BatchNorm in training. Moreover, we propose a simplified linear attention (SLA) module that is simple yet effective to achieve strong performance. Extensive experiments on image classification as well as object detection demonstrate the effectiveness of our proposed method. For example, our SLAB-Swin obtains $83.6\%$ top-1 accuracy on ImageNet-1K with $16.2$ms latency, which is $2.4$ms less than that of Flatten-Swin with $0.1\%$ higher accuracy. We also evaluated our method for language modeling task and obtain comparable performance and lower latency.Codes are publicly available at https://github.com/xinghaochen/SLAB and https://github.com/mindspore-lab/models/tree/master/research/huawei-noah/SLAB.

Related papers

Sliding Window Attention Training for Efficient Large Language Models [55.56483740523027]
We introduce SWAT, which enables efficient long-context handling via Sliding Window Attention Training.<n>This paper first attributes the inefficiency of Transformers to the attention sink phenomenon.<n>We replace softmax with the sigmoid function and utilize a balanced ALiBi and Rotary Position Embedding for efficient information compression and retention.
arXiv Detail & Related papers (2025-02-26T05:31:44Z)
Hamming Attention Distillation: Binarizing Keys and Queries for Efficient Long-Context Transformers [18.469378618426294]
We introduce Hamming Attention Distillation (HAD), a framework that binarizes keys and queries in the attention mechanism to achieve significant efficiency gains. We implement HAD in custom hardware simulations, demonstrating superior performance characteristics compared to a custom hardware implementation of standard attention.
arXiv Detail & Related papers (2025-02-03T19:24:01Z)
VeLoRA: Memory Efficient Training using Rank-1 Sub-Token Projections [35.133698935322634]
Large language models (LLMs) have recently emerged as powerful tools for tackling many language-processing tasks. We identify and characterise the important components needed for effective model convergence using gradient descent. This result leads us to a cheap and memory-efficient algorithm for both fine-tuning and pre-training LLMs.
arXiv Detail & Related papers (2024-05-28T09:23:14Z)
The Need for Speed: Pruning Transformers with One Recipe [18.26707877972931]
OPTIN is a tool to increase the efficiency of pre-trained transformer architectures without re-training. It produces state-of-the-art results on natural language, image classification, transfer learning, and semantic segmentation tasks. We show a $leq 2$% accuracy degradation from NLP baselines and a $0.5$% improvement from state-of-the-art methods on image classification at competitive FLOPs reductions.
arXiv Detail & Related papers (2024-03-26T17:55:58Z)
Scaling Efficient LLMs [0.0]
"AI scaling law" for transformers suggests that the number of parameters must scale linearly with the size of the data.<n>We propose recurrent transformers, combining the efficacy of transformers with the efficiency of recurrent networks.
arXiv Detail & Related papers (2024-02-22T18:06:19Z)
Winner-Take-All Column Row Sampling for Memory Efficient Adaptation of Language Model [89.8764435351222]
We propose a new family of unbiased estimators called WTA-CRS, for matrix production with reduced variance. Our work provides both theoretical and experimental evidence that, in the context of tuning transformers, our proposed estimators exhibit lower variance compared to existing ones.
arXiv Detail & Related papers (2023-05-24T15:52:08Z)
Unified Normalization for Accelerating and Stabilizing Transformers [35.07454490355906]
Layer Normalization (LN) normalizes activations within each token to boost robustness. LN requires on-the-fly statistics calculation in inference as well as division and square root operations. We propose Unified Normalization (UN), which can speed up the inference by being fused with other linear operations.
arXiv Detail & Related papers (2022-08-02T08:41:31Z)
Monarch: Expressive Structured Matrices for Efficient and Accurate Training [64.6871423399431]
Large neural networks excel in many domains, but they are expensive to train and fine-tune. A popular approach to reduce their compute or memory requirements is to replace dense weight matrices with structured ones. We propose a class of matrices (Monarch) that is hardware-efficient.
arXiv Detail & Related papers (2022-04-01T17:37:29Z)
A Fast Post-Training Pruning Framework for Transformers [74.59556951906468]
Pruning is an effective way to reduce the huge inference cost of large Transformer models. Prior work on model pruning requires retraining the model. We propose a fast post-training pruning framework for Transformers that does not require any retraining.
arXiv Detail & Related papers (2022-03-29T07:41:11Z)
Robust Training of Neural Networks using Scale Invariant Architectures [70.67803417918854]
In contrast to SGD, adaptive gradient methods like Adam allow robust training of modern deep networks. We show that this general approach is robust to rescaling of parameter and loss. We design a scale invariant version of BERT, called SIBERT, which when trained simply by vanilla SGD achieves performance comparable to BERT trained by adaptive methods like Adam.
arXiv Detail & Related papers (2022-02-02T11:58:56Z)
FQ-ViT: Fully Quantized Vision Transformer without Retraining [13.82845665713633]
We present a systematic method to reduce the performance degradation and inference complexity of Quantized Transformers. We are the first to achieve comparable accuracy degradation (1%) on fully quantized Vision Transformers.
arXiv Detail & Related papers (2021-11-27T06:20:53Z)
Understanding the Difficulty of Training Transformers [120.99980924577787]
We show that unbalanced gradients are not the root cause of the instability of training. We propose Admin to stabilize the early stage's training and unleash its full potential in the late stage.
arXiv Detail & Related papers (2020-04-17T13:59:07Z)

This list is automatically generated from the titles and abstracts of the papers in this site.