Emformer: Efficient Memory Transformer Based Acoustic Model For Low
Latency Streaming Speech Recognition
- URL: http://arxiv.org/abs/2010.10759v4
- Date: Wed, 30 Dec 2020 07:07:35 GMT
- Title: Emformer: Efficient Memory Transformer Based Acoustic Model For Low
Latency Streaming Speech Recognition
- Authors: Yangyang Shi, Yongqiang Wang, Chunyang Wu, Ching-Feng Yeh, Julian
Chan, Frank Zhang, Duc Le, Mike Seltzer
- Abstract summary: Long-range history context is distilled into an augmented memory bank to reduce self-attention's computation complexity.
A cache mechanism saves the computation for the key and value in self-attention for the left context.
Under an average latency of 960 ms, Emformer achieves a WER of $2.50\%$ on test-clean and $5.62\%$ on test-other.
- Score: 23.496223778642758
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper proposes an efficient memory transformer, Emformer, for low-latency
streaming speech recognition. In Emformer, the long-range history context is
distilled into an augmented memory bank to reduce the computational complexity
of self-attention. A cache mechanism saves the computation of the keys and
values in self-attention for the left context. Emformer applies parallelized
block processing in training to support low-latency models. We carry out
experiments on the benchmark LibriSpeech data. Under an average latency of
960 ms, Emformer achieves a WER of $2.50\%$ on test-clean and $5.62\%$ on
test-other. Compared with a strong augmented memory transformer (AM-TRF)
baseline, Emformer achieves a $4.6\times$ training speedup and an $18\%$
relative real-time factor (RTF) reduction in decoding, with relative WER
reductions of $17\%$ on test-clean and $9\%$ on test-other. For a low-latency
scenario with an average latency of 80 ms, Emformer achieves a WER of $3.01\%$
on test-clean and $7.09\%$ on test-other. Compared with an LSTM baseline of
the same latency and model size, Emformer yields relative WER reductions of
$9\%$ and $16\%$ on test-clean and test-other, respectively.
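
To make the mechanisms in the abstract concrete, below is a minimal, single-head PyTorch sketch of block-wise streaming attention that combines an augmented memory bank with cached left-context keys and values. Everything here (the `stream` and `summarize` names, mean pooling as the chunk summary, the chunk and cache sizes) is an illustrative assumption rather than the authors' implementation; a complete implementation is available in torchaudio as `torchaudio.models.Emformer`.

```python
# Minimal, single-head sketch of Emformer-style streaming attention
# (illustration only; layer stacking, right context, and training-time
# parallel block processing are omitted).
import torch
import torch.nn.functional as F

d_model, chunk_size = 64, 16
torch.manual_seed(0)
wq, wk, wv = (torch.randn(d_model, d_model) * d_model ** -0.5 for _ in range(3))

def summarize(chunk):
    # Distill a finished chunk into a single memory vector (mean pooling here;
    # the choice of summary function is an assumption of this sketch).
    return chunk.mean(dim=0, keepdim=True)

def stream(chunks, left_context=32, max_memory=4):
    memory = torch.zeros(0, d_model)    # augmented memory bank (chunk summaries)
    cached_k = torch.zeros(0, d_model)  # cached left-context keys
    cached_v = torch.zeros(0, d_model)  # cached left-context values
    outputs = []
    for chunk in chunks:                # each chunk: (chunk_size, d_model)
        q = chunk @ wq
        k_new, v_new = chunk @ wk, chunk @ wv   # computed once, reused as cache
        k = torch.cat([memory @ wk, cached_k, k_new])
        v = torch.cat([memory @ wv, cached_v, v_new])
        attn = F.softmax(q @ k.T / d_model ** 0.5, dim=-1)
        outputs.append(attn @ v)
        # Append this chunk's summary to the memory bank and roll the
        # left-context key/value cache forward.
        memory = torch.cat([memory, summarize(chunk)])[-max_memory:]
        cached_k = torch.cat([cached_k, k_new])[-left_context:]
        cached_v = torch.cat([cached_v, v_new])[-left_context:]
    return torch.cat(outputs)

out = stream([torch.randn(chunk_size, d_model) for _ in range(5)])
print(out.shape)  # torch.Size([80, 64])
```

Each chunk attends to the memory bank, the cached left context, and itself; its keys and values are computed once and reused as the next chunk's left-context cache, which is exactly the repeated computation the abstract's cache mechanism avoids.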
Related papers
- Simple ReFlow: Improved Techniques for Fast Flow Models [68.32300636049008]
Diffusion and flow-matching models achieve remarkable generative performance but at the cost of many sampling steps.
We propose seven improvements for training dynamics, learning and inference.
We achieve state-of-the-art FID scores (without / with guidance, resp.) for fast generation via neural ODEs.
arXiv Detail & Related papers (2024-10-10T11:00:55Z)
- Training-Free Exponential Context Extension via Cascading KV Cache [49.608367376911694]
We introduce a novel mechanism that leverages cascading sub-cache buffers to selectively retain the most relevant tokens.
Our method reduces prefill stage latency by a factor of 6.8 when compared to flash attention on 1M tokens.
arXiv Detail & Related papers (2024-06-24T03:59:17Z)
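
The cascading-cache entry above is summarized only at a high level, so the following is merely a toy illustration of the general idea (not the paper's algorithm): a chain of fixed-size sub-caches where overflow demotes the least-attended key/value pair to the next buffer, so that the most relevant tokens are retained the longest.

```python
# Toy cascading KV cache: an illustration of the general idea only.
from collections import deque

class CascadingCache:
    def __init__(self, sizes=(8, 4, 2)):          # sub-cache capacities (assumed)
        self.buffers = [deque() for _ in sizes]
        self.sizes = sizes

    def add(self, key, value, score):
        # `score` stands in for some relevance signal, e.g. accumulated attention.
        item = [key, value, score]
        for buf, cap in zip(self.buffers, self.sizes):
            buf.append(item)
            if len(buf) <= cap:
                return                              # fits; nothing cascades further
            item = min(buf, key=lambda e: e[2])     # least-relevant entry
            buf.remove(item)                        # demote it to the next buffer
        # Entries that fall out of the last sub-cache are discarded.

    def keys_values(self):
        items = [e for buf in self.buffers for e in buf]
        return [e[0] for e in items], [e[1] for e in items]
```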
- SLAB: Efficient Transformers with Simplified Linear Attention and Progressive Re-parameterized Batch Normalization [36.84275777364218]
This paper investigates the computational bottleneck modules of efficient transformers, i.e., normalization layers and attention modules.
LayerNorm is commonly used in transformer architectures but is not computationally friendly due to the statistics calculation during inference.
We propose a novel method named PRepBN to progressively replace LayerNorm with re-parameterized BatchNorm in training.
arXiv Detail & Related papers (2024-05-19T15:22:25Z)
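
As a rough sketch of the progressive-replacement idea above (an assumption-laden illustration, not the PRepBN method itself), a normalization layer can blend LayerNorm with BatchNorm using a weight that the training loop anneals toward BatchNorm, so that inference can eventually run with the cheaper BatchNorm path alone.

```python
import torch
import torch.nn as nn

class ProgressiveNorm(nn.Module):
    """Blend LayerNorm and BatchNorm, annealing toward BatchNorm in training."""

    def __init__(self, dim):
        super().__init__()
        self.ln = nn.LayerNorm(dim)
        self.bn = nn.BatchNorm1d(dim)
        self.register_buffer("lam", torch.tensor(1.0))  # 1.0 = pure LayerNorm

    def forward(self, x):                    # x: (batch, seq, dim)
        ln_out = self.ln(x)
        bn_out = self.bn(x.transpose(1, 2)).transpose(1, 2)  # BN over channels
        return self.lam * ln_out + (1 - self.lam) * bn_out

    def step(self, decay=1e-3):
        # Called once per training iteration; lam -> 0 means pure BatchNorm.
        self.lam = torch.clamp(self.lam - decay, min=0.0)
```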
- HiRE: High Recall Approximate Top-$k$ Estimation for Efficient LLM Inference [68.59839755875252]
HiRE comprises two novel components: (i) a compression scheme to cheaply predict top-$k$ rows/columns with high recall, followed by full computation restricted to the predicted subset, and (ii) DA-TOP-$k$: an efficient multi-device approximate top-$k$ operator.
We demonstrate that on a one-billion-parameter model, HiRE applied to both the softmax and feedforward layers achieves almost matching pretraining and downstream accuracy, and speeds up inference latency by $1.47\times$ on a single TPUv5e device.
arXiv Detail & Related papers (2024-02-14T18:04:36Z)
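
Only the first HiRE component lends itself to a compact sketch; the cheap predictor below is simplified to a shared random projection (an assumption, not the paper's compression scheme), and the multi-device DA-TOP-$k$ operator is omitted. The pattern is: score cheaply, keep a slightly larger candidate set for recall, then do the exact computation on that subset.

```python
import torch

def approx_topk_matvec(W, x, k, proj_dim=32, overshoot=2):
    # Cheap stage: low-rank scores W @ (P P^T x) stand in for a real
    # compression scheme; keep overshoot * k candidates for high recall.
    P = torch.randn(W.shape[1], proj_dim) / proj_dim ** 0.5
    cheap_scores = (W @ P) @ (P.T @ x)
    candidates = torch.topk(cheap_scores, overshoot * k).indices
    # Expensive stage: exact dot products restricted to the predicted subset.
    exact = W[candidates] @ x
    best = torch.topk(exact, k).indices
    return candidates[best], exact[best]

W, x = torch.randn(4096, 512), torch.randn(512)
rows, scores = approx_topk_matvec(W, x, k=8)
```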
- Instant Complexity Reduction in CNNs using Locality-Sensitive Hashing [50.79602839359522]
We propose HASTE (Hashing for Tractable Efficiency), a parameter-free and data-free module that acts as a plug-and-play replacement for any regular convolution module.
We are able to drastically compress latent feature maps without sacrificing much accuracy by using locality-sensitive hashing (LSH).
In particular, we are able to instantly drop 46.72% of FLOPs while only losing 1.25% accuracy by just swapping the convolution modules in a ResNet34 on CIFAR-10 for our HASTE module.
arXiv Detail & Related papers (2023-09-29T13:09:40Z)
- A low latency attention module for streaming self-supervised speech representation learning [0.4288177321445912]
Self-supervised speech representation learning (SSRL) is a popular use case for the transformer architecture.
We present an implementation of the attention module that enables training of SSRL architectures with low compute and memory requirements.
Our implementation also reduces the inference latency from 1.92 to 0.16 seconds.
arXiv Detail & Related papers (2023-02-27T00:44:22Z)
- Neural Transducer Training: Reduced Memory Consumption with Sample-wise Computation [5.355990925686149]
We propose a memory-efficient training method that computes the transducer loss and gradients sample by sample.
We show that our sample-wise method significantly reduces memory usage and performs at a competitive speed compared to the default batched computation.
As a highlight, we manage to compute the transducer loss and gradients for a batch size of 1024, and audio length of 40 seconds, using only 6 GB of memory.
arXiv Detail & Related papers (2022-11-29T14:57:23Z)
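
The sample-wise idea above amounts to per-sample gradient accumulation; the sketch below keeps `model` and `loss_fn` generic (for a transducer loss one could plug in, e.g., `torchaudio.functional.rnnt_loss`) and does not reproduce the paper's memory-optimized loss computation itself.

```python
import torch

def samplewise_step(model, loss_fn, batch, optimizer):
    """Accumulate gradients one sample at a time instead of over the whole batch."""
    optimizer.zero_grad()
    total, n = 0.0, len(batch)
    for sample in batch:                   # one utterance at a time
        loss = loss_fn(model(sample)) / n  # scale so gradients match the batched mean
        loss.backward()                    # graph is freed after each sample
        total += loss.item()
    optimizer.step()
    return total
```

Only one sample's loss lattice and activations are alive at a time, which is what keeps peak memory low at the cost of some serialization.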
- Conditional DETR V2: Efficient Detection Transformer with Box Queries [58.9706842210695]
We are interested in an end-to-end object detection approach based on a transformer encoder-decoder architecture without hand-crafted postprocessing, such as NMS.
Inspired by Conditional DETR, an improved DETR with fast training convergence, we reformulate the object query into the format of the box query that is a composition of the embeddings of the reference point.
We learn the box queries from the image content, further improving the detection quality of Conditional DETR still with fast training convergence.
arXiv Detail & Related papers (2022-07-18T20:08:55Z)
- Dynamic Latency for CTC-Based Streaming Automatic Speech Recognition With Emformer [0.4588028371034407]
A frame-level model using an efficient augmented memory transformer block and a dynamic latency training method is employed for streaming automatic speech recognition.
With an average latency of 640 ms, our model achieves a relative WER reduction of 6.4% on test-clean and 3.0% on test-other versus the truncated chunk-wise Transformer.
arXiv Detail & Related papers (2022-03-29T14:31:06Z)
- Listen Attentively, and Spell Once: Whole Sentence Generation via a Non-Autoregressive Architecture for Low-Latency Speech Recognition [66.47000813920619]
We propose a non-autoregressive end-to-end speech recognition system called LASO.
Because of its non-autoregressive property, LASO predicts a textual token in the sequence without depending on other tokens.
We conduct experiments on the publicly available Chinese dataset AISHELL-1.
arXiv Detail & Related papers (2020-05-11T04:45:02Z)
This list is automatically generated from the titles and abstracts of the papers on this site.