Folding Attention: Memory and Power Optimization for On-Device
Transformer-based Streaming Speech Recognition
- URL: http://arxiv.org/abs/2309.07988v3
- Date: Fri, 19 Jan 2024 00:28:45 GMT
- Title: Folding Attention: Memory and Power Optimization for On-Device
Transformer-based Streaming Speech Recognition
- Authors: Yang Li, Liangzhen Lai, Yuan Shangguan, Forrest N. Iandola, Zhaoheng
Ni, Ernie Chang, Yangyang Shi, Vikas Chandra
- Abstract summary: Streaming speech recognition models usually process a limited number of tokens each time.
The bottleneck lies in the linear projection layers of multi-head attention and feedforward networks.
We propose folding attention, a technique targeting these linear layers, significantly reducing model size and improving memory and power efficiency.
- Score: 19.772585241974138
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformer-based models excel in speech recognition. Existing efforts to
optimize Transformer inference, typically for long-context applications, center
on simplifying attention score calculations. However, streaming speech
recognition models usually process a limited number of tokens each time, making
attention score calculation less of a bottleneck. Instead, the bottleneck lies
in the linear projection layers of multi-head attention and feedforward
networks, constituting a substantial portion of the model size and contributing
significantly to computation, memory, and power usage.
To address this bottleneck, we propose folding attention, a technique
targeting these linear layers, significantly reducing model size and improving
memory and power efficiency. Experiments on on-device Transformer-based
streaming speech recognition models show that folding attention reduces model
size (and corresponding memory consumption) by up to 24% and power consumption
by up to 23%, all without compromising model accuracy or adding computation overhead.
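To make the bottleneck claim concrete, the following back-of-the-envelope comparison (our own illustration, not from the paper; the values T=32 tokens and d=512 model width are hypothetical) tallies the multiply-accumulates of one standard Transformer layer and shows that, at streaming-scale token counts, the d^2 projection and feed-forward terms dwarf the T^2*d attention-score term.

```python
# Back-of-the-envelope multiply-accumulate (MAC) counts for one Transformer layer.
# Illustrative only: T (tokens per step) and d (model width) are hypothetical.

def layer_macs(T: int, d: int, ffn_mult: int = 4) -> dict:
    """Rough MAC counts for the main matrix multiplies in a standard Transformer layer."""
    projections = 4 * T * d * d                # Q, K, V and output projections
    attn_scores = 2 * T * T * d                # QK^T plus the attention-weighted sum over V
    feedforward = 2 * T * d * (ffn_mult * d)   # two feed-forward linear layers
    return {"linear projections": projections,
            "attention scores": attn_scores,
            "feed-forward": feedforward}

if __name__ == "__main__":
    # A streaming ASR model processes only a handful of tokens per decoding step.
    macs = layer_macs(T=32, d=512)
    total = sum(macs.values())
    for name, m in macs.items():
        print(f"{name:>18}: {m:>12,} MACs ({100 * m / total:.1f}%)")
```

With these example numbers, the attention-score computation is only about 1% of the layer's multiply-accumulates; the linear projections and feed-forward layers account for the rest, which is consistent with the abstract's argument that the linear layers, not the score calculation, dominate cost in streaming models.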
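The abstract does not spell out the folding mechanism itself. Purely as a hypothetical illustration of one plausible reading (the module name FoldedSelfAttention, the fold factor, and the reshape scheme below are our assumptions, not the paper's implementation), folding the embedding dimension into more, shorter tokens shrinks the attention projection matrices, while the attention-score computation, cheap at small token counts, sees proportionally more tokens.

```python
import torch
import torch.nn as nn

class FoldedSelfAttention(nn.Module):
    """Hypothetical sketch: split each d-dim token into `fold` sub-tokens of
    dimension d/fold, so the Q/K/V/output projections are (d/fold) x (d/fold)
    instead of d x d. An illustration only, not the paper's implementation."""

    def __init__(self, d_model: int, n_heads: int, fold: int):
        super().__init__()
        assert d_model % fold == 0 and (d_model // fold) % n_heads == 0
        self.fold = fold
        d_sub = d_model // fold
        # The projections inside MultiheadAttention now act on the folded (smaller) dimension.
        self.attn = nn.MultiheadAttention(d_sub, n_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, d = x.shape
        # Fold: (B, T, d) -> (B, T*fold, d/fold), i.e. more but shorter tokens.
        x_f = x.reshape(B, T * self.fold, d // self.fold)
        out, _ = self.attn(x_f, x_f, x_f, need_weights=False)
        # Unfold back to the original token layout.
        return out.reshape(B, T, d)


if __name__ == "__main__":
    d_model, n_heads, fold = 512, 4, 4
    baseline = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
    folded = FoldedSelfAttention(d_model, n_heads, fold)
    count = lambda m: sum(p.numel() for p in m.parameters())
    print("baseline attention params:", count(baseline))
    print("folded attention params:  ", count(folded))
    print("output shape:", tuple(folded(torch.randn(2, 32, d_model)).shape))
```

In this sketch the folded module has roughly fold^2 times fewer attention-projection parameters than a standard nn.MultiheadAttention of the same width (about 16x for fold=4); the paper's actual folding design and its reported 24% whole-model reduction should be taken from the paper itself.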
Related papers
- Quality Scalable Quantization Methodology for Deep Learning on Edge [0.20718016474717196]
Deep learning architectures employ heavy computation, and the bulk of the computational energy is taken up by the convolution operations in Convolutional Neural Networks.
The proposed work reduces the energy consumption and size of CNNs so that machine learning techniques can be used for edge computing on ubiquitous computing devices.
Experiments on LeNet and ConvNets show an increase of up to 6% in zeros and memory savings of up to 82.4919% while keeping the accuracy near the state of the art.
arXiv Detail & Related papers (2024-07-15T22:00:29Z)
- Lean Attention: Hardware-Aware Scalable Attention Mechanism for the Decode-Phase of Transformers [4.674454841332859]
Transformer-based models have emerged as one of the most widely used architectures for natural language processing.
These huge models are memory-hungry and incur significant inference latency even on cutting-edge AI accelerators.
We propose LeanAttention, a scalable technique for computing self-attention during the token-generation phase.
arXiv Detail & Related papers (2024-05-17T00:52:39Z)
- Point Transformer V3: Simpler, Faster, Stronger [88.80496333515325]
This paper focuses on overcoming the existing trade-offs between accuracy and efficiency within the context of point cloud processing.
We present Point Transformer V3 (PTv3), which prioritizes simplicity and efficiency over the accuracy of certain mechanisms.
PTv3 attains state-of-the-art results on over 20 downstream tasks that span both indoor and outdoor scenarios.
arXiv Detail & Related papers (2023-12-15T18:59:59Z)
- FLatten Transformer: Vision Transformer using Focused Linear Attention [80.61335173752146]
Linear attention offers a much more efficient alternative with its linear complexity.
Current linear attention approaches either suffer from significant performance degradation or introduce additional computation overhead.
We propose a novel Focused Linear Attention module to achieve both high efficiency and expressiveness.
arXiv Detail & Related papers (2023-08-01T10:37:12Z)
- When to Use Efficient Self Attention? Profiling Text, Speech and Image Transformer Variants [39.00433193973159]
We present the first unified study of the efficiency of self-attention-based Transformer variants spanning text, speech and vision.
We identify input length thresholds (tipping points) at which efficient Transformer variants become more efficient than vanilla models.
To conduct this analysis for speech, we introduce L-HuBERT, a novel local-attention variant of a self-supervised speech model.
arXiv Detail & Related papers (2023-06-14T17:59:02Z)
- CageViT: Convolutional Activation Guided Efficient Vision Transformer [90.69578999760206]
This paper presents an efficient vision Transformer, called CageViT, that is guided by convolutional activation to reduce computation.
Our CageViT, unlike current Transformers, utilizes a new encoder to handle the rearranged tokens.
Experimental results demonstrate that the proposed CageViT outperforms the most recent state-of-the-art backbones by a large margin in terms of efficiency.
arXiv Detail & Related papers (2023-05-17T03:19:18Z)
- Variable Bitrate Neural Fields [75.24672452527795]
We present a dictionary method for compressing feature grids, reducing their memory consumption by up to 100x.
We formulate the dictionary optimization as a vector-quantized auto-decoder problem which lets us learn end-to-end discrete neural representations in a space where no direct supervision is available.
arXiv Detail & Related papers (2022-06-15T17:58:34Z)
- Space-time Mixing Attention for Video Transformer [55.50839896863275]
We propose a Video Transformer model whose complexity scales linearly with the number of frames in the video sequence.
We demonstrate that our model produces very high recognition accuracy on the most popular video recognition datasets.
arXiv Detail & Related papers (2021-06-10T17:59:14Z)
- Efficient End-to-End Speech Recognition Using Performers in Conformers [74.71219757585841]
We propose to reduce the complexity of model architectures in addition to model sizes.
The proposed model yields competitive performance on the LibriSpeech corpus with 10 million parameters and linear complexity.
arXiv Detail & Related papers (2020-11-09T05:22:57Z)
- Streaming Attention-Based Models with Augmented Memory for End-to-End Speech Recognition [26.530909772863417]
We build a compact and streaming speech recognition system on top of the end-to-end neural transducer architecture with attention-based modules augmented with convolution.
The proposed system equips the end-to-end model with streaming capability and uses augmented memory to reduce the large footprint of streaming attention-based models.
On the LibriSpeech dataset, our proposed system achieves word error rates of 2.7% on test-clean and 5.8% on test-other, which are, to the best of our knowledge, the lowest among streaming approaches reported so far.
arXiv Detail & Related papers (2020-11-03T00:43:58Z)
- Self-attention encoding and pooling for speaker recognition [16.96341561111918]
We propose a tandem Self-Attention Encoding and Pooling (SAEP) mechanism to obtain a discriminative speaker embedding from variable-length speech utterances.
SAEP encodes short-term speaker spectral features into speaker embeddings to be used in text-independent speaker verification.
We have evaluated this approach on both VoxCeleb1 & 2 datasets.
arXiv Detail & Related papers (2020-08-03T09:31:27Z)