Compute Cost Amortized Transformer for Streaming ASR
- URL: http://arxiv.org/abs/2207.02393v1
- Date: Tue, 5 Jul 2022 03:06:53 GMT
- Title: Compute Cost Amortized Transformer for Streaming ASR
- Authors: Yi Xie, Jonathan Macoskey, Martin Radfar, Feng-Ju Chang, Brian King,
Ariya Rastrow, Athanasios Mouchtaris, Grant P. Strimel
- Abstract summary: We present a streaming, Transformer-based end-to-end automatic speech recognition architecture.
Our architecture creates sparse computation pathways dynamically at inference time, resulting in selective use of compute resources throughout decoding.
Our best model can achieve a 60% compute cost reduction with only a 3% relative word error rate (WER) increase.
- Score: 23.950740806308687
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: We present a streaming, Transformer-based end-to-end automatic speech
recognition (ASR) architecture which achieves efficient neural inference
through compute cost amortization. Our architecture creates sparse computation
pathways dynamically at inference time, resulting in selective use of compute
resources throughout decoding, enabling significant reductions in compute with
minimal impact on accuracy. The fully differentiable architecture is trained
end-to-end with an accompanying lightweight arbitrator mechanism operating at
the frame-level to make dynamic decisions on each input while a tunable loss
function is used to regularize the overall level of compute against predictive
performance. We report empirical results from experiments using the compute
amortized Transformer-Transducer (T-T) model conducted on LibriSpeech data. Our
best model can achieve a 60% compute cost reduction with only a 3% relative
word error rate (WER) increase.
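The trade-off described in the abstract (a tunable loss that regularizes compute against predictive performance, driven by frame-level decisions) can be illustrated with a small sketch. This is not the paper's formulation: the sigmoid gate, the cost constants, the weight `lam`, and every function name below are assumptions chosen only to show the general shape of a compute-regularized objective, and the WER figures in the usage note are made-up numbers that only illustrate what a "relative" increase means.

```python
import math


def sigmoid(x: float) -> float:
    """Map a gate logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def expected_compute(gate_logits, full_cost=1.0, cheap_cost=0.1):
    """Mean expected per-frame cost.

    Assumption: frame t takes the expensive path with probability
    sigmoid(gate_logits[t]) and a cheaper path otherwise.
    """
    probs = [sigmoid(z) for z in gate_logits]
    costs = [p * full_cost + (1.0 - p) * cheap_cost for p in probs]
    return sum(costs) / len(costs)


def regularized_loss(task_loss, gate_logits, lam=0.5):
    """Task loss plus a compute penalty weighted by the hypothetical knob lam."""
    return task_loss + lam * expected_compute(gate_logits)


def relative_change(baseline, value):
    """Relative change of value with respect to baseline (0.03 means +3%)."""
    return (value - baseline) / baseline
```

With these definitions, raising `lam` penalizes frames that choose the expensive path more heavily, so training is pushed toward cheaper pathways at some cost in accuracy. As an illustration of the reported metric, a hypothetical baseline WER of 10.0 rising to 10.3 gives `relative_change(10.0, 10.3)` of about 0.03, that is, a 3% relative increase.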
Related papers
- Embedding-Free Transformer with Inference Spatial Reduction for Efficient Semantic Segmentation [15.377463849213033]
EFA is a novel global context modeling mechanism that focuses on the global non-linearity.
Our ISR method reduces the key-value resolution at the inference phase, which can mitigate the computation-performance trade-off gap.
EDAFormer achieves state-of-the-art performance with efficient computation compared with existing transformer-based semantic segmentation models.
arXiv Detail & Related papers (2024-07-24T13:24:25Z)
- UIO-LLMs: Unbiased Incremental Optimization for Long-Context LLMs [111.12010207132204]
UIO-LLMs is an incremental optimization approach for memory-enhanced transformers under long-context settings.
We refine the training process using the Truncated Backpropagation Through Time (TBPTT) algorithm.
UIO-LLMs successfully handle long contexts, for example extending the context window of Llama2-7b-chat from 4K to 100K tokens with only 2% additional parameters.
arXiv Detail & Related papers (2024-06-26T08:44:36Z)
- Switchable Decision: Dynamic Neural Generation Networks [98.61113699324429]
We propose a switchable decision to accelerate inference by dynamically assigning resources for each data instance.
Our method incurs lower cost during inference while maintaining the same accuracy.
arXiv Detail & Related papers (2024-05-07T17:44:54Z)
- Adaptive Computation Modules: Granular Conditional Computation For Efficient Inference [13.000030080938078]
The computational cost of transformer models makes them inefficient in low-latency or low-power applications.
We introduce the Adaptive Computation Module (ACM), a generic module that dynamically adapts its computational load to match the estimated difficulty of the input on a per-token basis.
Our evaluation of transformer models in computer vision and speech recognition demonstrates that substituting layers with ACMs significantly reduces inference costs without degrading the downstream accuracy for a wide interval of user-defined budgets.
arXiv Detail & Related papers (2023-12-15T20:39:43Z)
- Efficient Decoder-free Object Detection with Transformers [75.00499377197475]
Vision transformers (ViTs) are changing the landscape of object detection approaches.
We propose a decoder-free fully transformer-based (DFFT) object detector.
DFFT_SMALL achieves high efficiency in both training and inference stages.
arXiv Detail & Related papers (2022-06-14T13:22:19Z)
- Accelerating Attention through Gradient-Based Learned Runtime Pruning [9.109136535767478]
Self-attention is a key enabler of state-of-the-art accuracy for transformer-based Natural Language Processing models.
This paper formulates the search for the pruning threshold through a soft differentiable regularizer integrated into the training loss function.
We devise a bit-serial architecture, dubbed LeOPArd, for transformer language models with a bit-level early-termination microarchitectural mechanism.
arXiv Detail & Related papers (2022-04-07T05:31:13Z)
- Efficient Micro-Structured Weight Unification and Pruning for Neural Network Compression [56.83861738731913]
Deep Neural Network (DNN) models are essential for practical applications, especially on resource-limited devices.
Previous unstructured or structured weight pruning methods hardly accelerate inference in practice.
We propose a generalized weight unification framework at a hardware-compatible micro-structured level to achieve a high degree of compression and acceleration.
arXiv Detail & Related papers (2021-06-15T17:22:59Z)
- Transformer-based ASR Incorporating Time-reduction Layer and Fine-tuning with Self-Knowledge Distillation [11.52842516726486]
We propose a Transformer-based ASR model with a time-reduction layer incorporated inside the Transformer encoder layers.
We also introduce a fine-tuning approach for pre-trained ASR models using self-knowledge distillation (S-KD) which further improves the performance of our ASR model.
With language model (LM) fusion, we achieve new state-of-the-art word error rate (WER) results for Transformer-based ASR models.
arXiv Detail & Related papers (2021-03-17T21:02:36Z)
- Bayesian Transformer Language Models for Speech Recognition [59.235405107295655]
State-of-the-art neural language models (LMs) represented by Transformers are highly complex.
This paper proposes a full Bayesian learning framework for Transformer LM estimation.
arXiv Detail & Related papers (2021-02-09T10:55:27Z)
- Controlling Computation versus Quality for Neural Sequence Models [42.525463454120256]
Conditional computation makes neural sequence models (Transformers) more efficient and computation-aware during inference.
We evaluate our approach on two tasks: (i) WMT English-French translation and (ii) unsupervised representation learning (BERT).
arXiv Detail & Related papers (2020-02-17T17:54:27Z)
- Channel Assignment in Uplink Wireless Communication using Machine Learning Approach [54.012791474906514]
This letter investigates a channel assignment problem in uplink wireless communication systems.
Our goal is to maximize the sum rate of all users subject to integer channel assignment constraints.
Due to the high computational complexity, machine learning approaches are employed to obtain computationally efficient solutions.
arXiv Detail & Related papers (2020-01-12T15:54:20Z)
This list is automatically generated from the titles and abstracts of the papers on this site.