Emformer: Efficient Memory Transformer Based Acoustic Model For Low
Latency Streaming Speech Recognition
- URL: http://arxiv.org/abs/2010.10759v4
- Date: Wed, 30 Dec 2020 07:07:35 GMT
- Title: Emformer: Efficient Memory Transformer Based Acoustic Model For Low
Latency Streaming Speech Recognition
- Authors: Yangyang Shi, Yongqiang Wang, Chunyang Wu, Ching-Feng Yeh, Julian
Chan, Frank Zhang, Duc Le, Mike Seltzer
- Abstract summary: Long-range history context is distilled into an augmented memory bank to reduce self-attention's computation complexity.
A cache mechanism saves the computation for the key and value in self-attention for the left context.
Under an average latency of 960 ms, Emformer achieves a WER of $2.50\%$ on test-clean and $5.62\%$ on test-other.
- Score: 23.496223778642758
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper proposes an efficient memory transformer, Emformer, for low-latency
streaming speech recognition. In Emformer, the long-range history context is
distilled into an augmented memory bank to reduce the computational complexity
of self-attention. A cache mechanism saves the computation of the keys and
values in self-attention for the left context. Emformer applies parallelized
block processing in training to support low-latency models. We carry out
experiments on the benchmark LibriSpeech data. Under an average latency of
960 ms, Emformer achieves a WER of $2.50\%$ on test-clean and $5.62\%$ on
test-other. Compared with a strong augmented memory transformer (AM-TRF)
baseline, Emformer achieves a $4.6\times$ training speedup and an $18\%$
relative real-time factor (RTF) reduction in decoding, with relative WER
reductions of $17\%$ on test-clean and $9\%$ on test-other. For a low-latency
scenario with an average latency of 80 ms, Emformer achieves a WER of $3.01\%$
on test-clean and $7.09\%$ on test-other. Compared with an LSTM baseline of
the same latency and model size, Emformer yields relative WER reductions of
$9\%$ and $16\%$ on test-clean and test-other, respectively.
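
To make the mechanisms in the abstract concrete, below is a minimal, single-head PyTorch sketch of block-wise streaming attention that combines an augmented memory bank with cached left-context keys and values. Everything here (the `stream` and `summarize` names, mean pooling as the chunk summary, the chunk and cache sizes) is an illustrative assumption rather than the authors' implementation; a complete implementation is available in torchaudio as `torchaudio.models.Emformer`.

```python
# Minimal, single-head sketch of Emformer-style streaming attention
# (illustration only; layer stacking, right context, and training-time
# parallel block processing are omitted).
import torch
import torch.nn.functional as F

d_model, chunk_size = 64, 16
torch.manual_seed(0)
wq, wk, wv = (torch.randn(d_model, d_model) * d_model ** -0.5 for _ in range(3))

def summarize(chunk):
    # Distill a finished chunk into a single memory vector (mean pooling here;
    # the choice of summary function is an assumption of this sketch).
    return chunk.mean(dim=0, keepdim=True)

def stream(chunks, left_context=32, max_memory=4):
    memory = torch.zeros(0, d_model)    # augmented memory bank (chunk summaries)
    cached_k = torch.zeros(0, d_model)  # cached left-context keys
    cached_v = torch.zeros(0, d_model)  # cached left-context values
    outputs = []
    for chunk in chunks:                # each chunk: (chunk_size, d_model)
        q = chunk @ wq
        k_new, v_new = chunk @ wk, chunk @ wv   # computed once, reused as cache
        k = torch.cat([memory @ wk, cached_k, k_new])
        v = torch.cat([memory @ wv, cached_v, v_new])
        attn = F.softmax(q @ k.T / d_model ** 0.5, dim=-1)
        outputs.append(attn @ v)
        # Append this chunk's summary to the memory bank and roll the
        # left-context key/value cache forward.
        memory = torch.cat([memory, summarize(chunk)])[-max_memory:]
        cached_k = torch.cat([cached_k, k_new])[-left_context:]
        cached_v = torch.cat([cached_v, v_new])[-left_context:]
    return torch.cat(outputs)

out = stream([torch.randn(chunk_size, d_model) for _ in range(5)])
print(out.shape)  # torch.Size([80, 64])
```

Each chunk attends to the memory bank, the cached left context, and itself; its keys and values are computed once and reused as the next chunk's left-context cache, which is exactly the repeated computation the abstract's cache mechanism avoids.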
Related papers
- Simple ReFlow: Improved Techniques for Fast Flow Models [68.32300636049008]
Diffusion and flow-matching models achieve remarkable generative performance but at the cost of many sampling steps.
We propose seven improvements for training dynamics, learning and inference.
We achieve state-of-the-art FID scores (without / with guidance, resp.) for fast generation via neural ODEs.
arXiv Detail & Related papers (2024-10-10T11:00:55Z)
- Training-Free Exponential Context Extension via Cascading KV Cache [49.608367376911694]
We introduce a novel mechanism that leverages cascading sub-cache buffers to selectively retain the most relevant tokens.
Our method reduces prefill stage latency by a factor of 6.8 when compared to flash attention on 1M tokens.
arXiv Detail & Related papers (2024-06-24T03:59:17Z)
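
The cascading-cache entry above is summarized only at a high level, so the following is merely a toy illustration of the general idea (not the paper's algorithm): a chain of fixed-size sub-caches where overflow demotes the least-attended key/value pair to the next buffer, so that the most relevant tokens are retained the longest.

```python
# Toy cascading KV cache: an illustration of the general idea only.
from collections import deque

class CascadingCache:
    def __init__(self, sizes=(8, 4, 2)):          # sub-cache capacities (assumed)
        self.buffers = [deque() for _ in sizes]
        self.sizes = sizes

    def add(self, key, value, score):
        # `score` stands in for some relevance signal, e.g. accumulated attention.
        item = [key, value, score]
        for buf, cap in zip(self.buffers, self.sizes):
            buf.append(item)
            if len(buf) <= cap:
                return                              # fits; nothing cascades further
            item = min(buf, key=lambda e: e[2])     # least-relevant entry
            buf.remove(item)                        # demote it to the next buffer
        # Entries that fall out of the last sub-cache are discarded.

    def keys_values(self):
        items = [e for buf in self.buffers for e in buf]
        return [e[0] for e in items], [e[1] for e in items]
```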
- SLAB: Efficient Transformers with Simplified Linear Attention and Progressive Re-parameterized Batch Normalization [36.84275777364218]
This paper investigates the computational bottleneck modules of efficient transformers, i.e., normalization layers and attention modules.
LayerNorm is commonly used in transformer architectures but is not computationally friendly due to the statistics calculation during inference.
We propose a novel method named PRepBN to progressively replace LayerNorm with re-parameterized BatchNorm in training.
arXiv Detail & Related papers (2024-05-19T15:22:25Z)
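
As a rough sketch of the progressive-replacement idea above (an assumption-laden illustration, not the PRepBN method itself), a normalization layer can blend LayerNorm with BatchNorm using a weight that the training loop anneals toward BatchNorm, so that inference can eventually run with the cheaper BatchNorm path alone.

```python
import torch
import torch.nn as nn

class ProgressiveNorm(nn.Module):
    """Blend LayerNorm and BatchNorm, annealing toward BatchNorm in training."""

    def __init__(self, dim):
        super().__init__()
        self.ln = nn.LayerNorm(dim)
        self.bn = nn.BatchNorm1d(dim)
        self.register_buffer("lam", torch.tensor(1.0))  # 1.0 = pure LayerNorm

    def forward(self, x):                    # x: (batch, seq, dim)
        ln_out = self.ln(x)
        bn_out = self.bn(x.transpose(1, 2)).transpose(1, 2)  # BN over channels
        return self.lam * ln_out + (1 - self.lam) * bn_out

    def step(self, decay=1e-3):
        # Called once per training iteration; lam -> 0 means pure BatchNorm.
        self.lam = torch.clamp(self.lam - decay, min=0.0)
```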
- HiRE: High Recall Approximate Top-$k$ Estimation for Efficient LLM Inference [68.59839755875252]
HiRE comprises two novel components: (i) a compression scheme to cheaply predict top-$k$ rows/columns with high recall, followed by full computation restricted to the predicted subset, and (ii) DA-TOP-$k$: an efficient multi-device approximate top-$k$ operator.
We demonstrate that on a one-billion-parameter model, HiRE applied to both the softmax and feedforward layers achieves almost matching pretraining and downstream accuracy, and speeds up inference latency by $1.47\times$ on a single TPUv5e device.
arXiv Detail & Related papers (2024-02-14T18:04:36Z)
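
Only the first HiRE component lends itself to a compact sketch; the cheap predictor below is simplified to a shared random projection (an assumption, not the paper's compression scheme), and the multi-device DA-TOP-$k$ operator is omitted. The pattern is: score cheaply, keep a slightly larger candidate set for recall, then do the exact computation on that subset.

```python
import torch

def approx_topk_matvec(W, x, k, proj_dim=32, overshoot=2):
    # Cheap stage: low-rank scores W @ (P P^T x) stand in for a real
    # compression scheme; keep overshoot * k candidates for high recall.
    P = torch.randn(W.shape[1], proj_dim) / proj_dim ** 0.5
    cheap_scores = (W @ P) @ (P.T @ x)
    candidates = torch.topk(cheap_scores, overshoot * k).indices
    # Expensive stage: exact dot products restricted to the predicted subset.
    exact = W[candidates] @ x
    best = torch.topk(exact, k).indices
    return candidates[best], exact[best]

W, x = torch.randn(4096, 512), torch.randn(512)
rows, scores = approx_topk_matvec(W, x, k=8)
```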
- Instant Complexity Reduction in CNNs using Locality-Sensitive Hashing [50.79602839359522]
We propose HASTE (Hashing for Tractable Efficiency), a parameter-free and data-free module that acts as a plug-and-play replacement for any regular convolution module.
We are able to drastically compress latent feature maps without sacrificing much accuracy by using locality-sensitive hashing (LSH).
In particular, we are able to instantly drop 46.72% of FLOPs while only losing 1.25% accuracy by just swapping the convolution modules in a ResNet34 on CIFAR-10 for our HASTE module.
arXiv Detail & Related papers (2023-09-29T13:09:40Z)
- A low latency attention module for streaming self-supervised speech representation learning [0.4288177321445912]
Self-supervised speech representation learning (SSRL) is a popular use case for the transformer architecture.
We present an implementation of the attention module that enables training of SSRL architectures with low compute and memory requirements.
Our implementation also reduces the inference latency from 1.92 to 0.16 seconds.
arXiv Detail & Related papers (2023-02-27T00:44:22Z)
- Neural Transducer Training: Reduced Memory Consumption with Sample-wise Computation [5.355990925686149]
We propose a memory-efficient training method that computes the transducer loss and gradients sample by sample.
We show that our sample-wise method significantly reduces memory usage and performs at a competitive speed compared to the default batched computation.
As a highlight, we manage to compute the transducer loss and gradients for a batch size of 1024, and audio length of 40 seconds, using only 6 GB of memory.
arXiv Detail & Related papers (2022-11-29T14:57:23Z)
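
The sample-wise idea above amounts to per-sample gradient accumulation; the sketch below keeps `model` and `loss_fn` generic (for a transducer loss one could plug in, e.g., `torchaudio.functional.rnnt_loss`) and does not reproduce the paper's memory-optimized loss computation itself.

```python
import torch

def samplewise_step(model, loss_fn, batch, optimizer):
    """Accumulate gradients one sample at a time instead of over the whole batch."""
    optimizer.zero_grad()
    total, n = 0.0, len(batch)
    for sample in batch:                   # one utterance at a time
        loss = loss_fn(model(sample)) / n  # scale so gradients match the batched mean
        loss.backward()                    # graph is freed after each sample
        total += loss.item()
    optimizer.step()
    return total
```

Only one sample's loss lattice and activations are alive at a time, which is what keeps peak memory low at the cost of some serialization.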
- Conditional DETR V2: Efficient Detection Transformer with Box Queries [58.9706842210695]
We are interested in an end-to-end object detection approach based on a transformer encoder-decoder architecture without hand-crafted postprocessing, such as NMS.
Inspired by Conditional DETR, an improved DETR with fast training convergence, we reformulate the object query into the format of the box query that is a composition of the embeddings of the reference point.
We learn the box queries from the image content, further improving the detection quality of Conditional DETR still with fast training convergence.
arXiv Detail & Related papers (2022-07-18T20:08:55Z)
- Dynamic Latency for CTC-Based Streaming Automatic Speech Recognition With Emformer [0.4588028371034407]
A frame-level model using an efficient augmented memory transformer block and a dynamic latency training method is employed for streaming automatic speech recognition.
With an average latency of 640 ms, our model achieves a relative WER reduction of 6.4% on test-clean and 3.0% on test-other versus the truncated chunk-wise Transformer.
arXiv Detail & Related papers (2022-03-29T14:31:06Z)
- Listen Attentively, and Spell Once: Whole Sentence Generation via a Non-Autoregressive Architecture for Low-Latency Speech Recognition [66.47000813920619]
We propose a non-autoregressive end-to-end speech recognition system called LASO.
Because of its non-autoregressive property, LASO predicts a textual token in the sequence without depending on other tokens.
We conduct experiments on the publicly available Chinese dataset AISHELL-1.
arXiv Detail & Related papers (2020-05-11T04:45:02Z)
This list is automatically generated from the titles and abstracts of the papers on this site.