Streaming Attention-Based Models with Augmented Memory for End-to-End
Speech Recognition
- URL: http://arxiv.org/abs/2011.07120v1
- Date: Tue, 3 Nov 2020 00:43:58 GMT
- Title: Streaming Attention-Based Models with Augmented Memory for End-to-End
Speech Recognition
- Authors: Ching-Feng Yeh, Yongqiang Wang, Yangyang Shi, Chunyang Wu, Frank
Zhang, Julian Chan, Michael L. Seltzer
- Abstract summary: We build a compact and streaming speech recognition system on top of the end-to-end neural transducer architecture with attention-based modules augmented with convolution.
The proposed system equips the end-to-end models with streaming capability and uses augmented memory to reduce the large footprint of the streaming attention-based model.
On the LibriSpeech dataset, our proposed system achieves word error rates of 2.7% on test-clean and 5.8% on test-other, to the best of our knowledge the lowest among streaming approaches reported so far.
- Score: 26.530909772863417
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Attention-based models have been gaining popularity recently for their strong
performance demonstrated in fields such as machine translation and automatic
speech recognition. One major challenge of attention-based models is the need
for access to the full sequence and a computational cost that grows
quadratically with the sequence length. These characteristics pose challenges,
especially for low-latency scenarios, where the system is often required to be
streaming. In this paper, we build a compact and streaming speech recognition
system on top of the end-to-end neural transducer architecture with
attention-based modules augmented with convolution. The proposed system equips
the end-to-end models with streaming capability and uses augmented memory to
reduce the large footprint of the streaming attention-based model. On
the LibriSpeech dataset, our proposed system achieves word error rates of 2.7%
on test-clean and 5.8% on test-other, which are, to the best of our knowledge,
the lowest among streaming approaches reported so far.
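To make the mechanism concrete, below is a minimal sketch of chunk-wise attention over a bounded memory bank, the idea that "augmented memory" refers to; the function name, the mean-pooled chunk summaries, and all sizes are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of chunk-wise attention with an augmented memory bank.
# Illustrative only: projections are omitted and the chunk summary is a
# simple mean, which is not necessarily what the paper uses.
import torch
import torch.nn.functional as F

def augmented_memory_attention(x, chunk_size=64, bank_size=8):
    """x: (time, d_model). Each chunk attends to itself plus a bounded
    bank of summary vectors from earlier chunks, so per-chunk cost stays
    constant instead of growing quadratically with sequence length."""
    d_model = x.size(1)
    memory, outputs = [], []
    for start in range(0, x.size(0), chunk_size):
        chunk = x[start:start + chunk_size]                    # (c, d)
        bank = memory[-bank_size:]                             # bounded memory
        keys = torch.cat(bank + [chunk]) if bank else chunk    # (m + c, d)
        attn = F.softmax(chunk @ keys.T / d_model ** 0.5, dim=-1)
        outputs.append(attn @ keys)                            # (c, d)
        # summarize the finished chunk into one memory vector
        memory.append(chunk.mean(dim=0, keepdim=True))
    return torch.cat(outputs)

y = augmented_memory_attention(torch.randn(512, 256))
print(y.shape)  # torch.Size([512, 256])
```

The point of the bank is that the attention context grows as a fixed number of summary vectors rather than as raw frames, which is what keeps the footprint small in streaming operation.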
Related papers
- Live2Diff: Live Stream Translation via Uni-directional Attention in Video Diffusion Models [64.2445487645478]
Large Language Models have shown remarkable efficacy in generating streaming data such as text and audio.
We present Live2Diff, the first attempt at designing a video diffusion model with uni-directional temporal attention, specifically targeting live streaming video translation.
arXiv Detail & Related papers (2024-07-11T17:34:51Z)
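As a rough illustration of uni-directional temporal attention, the sketch below builds a causal mask over frames; the helper name is hypothetical, and Live2Diff's actual attention design is not reproduced here.

```python
# Hedged sketch: a uni-directional (causal) temporal attention mask.
# True = blocked; frame t may attend only to frames <= t, so the model
# never needs future frames and can run on a live stream.
import torch

def causal_temporal_mask(n_frames: int) -> torch.Tensor:
    return torch.triu(torch.ones(n_frames, n_frames, dtype=torch.bool), diagonal=1)

print(causal_temporal_mask(4))
# tensor([[False,  True,  True,  True],
#         [False, False,  True,  True],
#         [False, False, False,  True],
#         [False, False, False, False]])
```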
- End-to-end streaming model for low-latency speech anonymization [11.098498920630782]
We propose a streaming model that achieves speaker anonymization with low latency.
The system is trained in an end-to-end autoencoder fashion using a lightweight content encoder.
We present evaluation results from two implementations of our system.
arXiv Detail & Related papers (2024-06-13T16:15:53Z)
- Folding Attention: Memory and Power Optimization for On-Device Transformer-based Streaming Speech Recognition [19.772585241974138]
Streaming speech recognition models usually process a limited number of tokens each time.
The bottleneck lies in the linear projection layers of multi-head attention and feedforward networks.
We propose folding attention, a technique targeting these linear layers, significantly reducing model size and improving memory and power efficiency.
arXiv Detail & Related papers (2023-09-14T19:01:08Z)
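The sketch below first illustrates the claim above, that for short streaming chunks the projection parameters dwarf the attention score computation, and then shows a low-rank factorization as one illustrative way to shrink a projection; the rank-r factorization is an assumption for illustration, not the paper's actual folding construction.

```python
# Hedged sketch: projection parameters vs. attention scores for a short
# streaming chunk, plus a low-rank "folded" projection for illustration.
# The rank-r factorization is NOT the paper's exact folding method.
import torch.nn as nn

d_model, chunk = 512, 32
mha = nn.MultiheadAttention(d_model, num_heads=8)
proj_params = sum(p.numel() for p in mha.parameters())
print(proj_params, chunk * chunk)   # ~1.05M projection params vs. 1024 scores

r = 64  # assumed bottleneck rank
folded = nn.Sequential(nn.Linear(d_model, r, bias=False), nn.Linear(r, d_model))
print(sum(p.numel() for p in folded.parameters()))  # ~66K vs. ~263K full
```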
- Audio-Visual Speech Enhancement with Score-Based Generative Models [22.559617939136505]
This paper introduces an audio-visual speech enhancement system that leverages score-based generative models.
We exploit audio-visual embeddings obtained from a self-supervised learning model that has been fine-tuned on lipreading.
Experimental evaluations show that the proposed audio-visual speech enhancement system yields improved speech quality.
arXiv Detail & Related papers (2023-06-02T10:43:42Z)
- Dynamic Latency for CTC-Based Streaming Automatic Speech Recognition With Emformer [0.4588028371034407]
A frame-level model using an efficient augmented memory transformer block and a dynamic latency training method is employed for streaming automatic speech recognition.
With an average latency of 640 ms, our model achieves relative WER reductions of 6.4% on test-clean and 3.0% on test-other versus the truncated chunk-wise Transformer.
arXiv Detail & Related papers (2022-03-29T14:31:06Z)
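A plausible reading of dynamic latency training is sampling a different attention chunk size per batch, so that one model covers several latency operating points; the sketch below follows that reading, with the chunk sizes and mask construction as illustrative assumptions.

```python
# Hedged sketch of dynamic-latency training: vary the chunk size per batch.
# 64 frames at a 10 ms hop is roughly 640 ms, the latency quoted above.
import random
import torch

CHUNK_SIZES = [8, 16, 32, 64]  # assumed latency operating points, in frames

def chunk_mask(n_frames: int, chunk: int) -> torch.Tensor:
    """True = blocked. Queries attend within their own chunk and to all
    previous chunks, mimicking truncated chunk-wise streaming attention."""
    idx = torch.arange(n_frames) // chunk
    return idx.unsqueeze(0) > idx.unsqueeze(1)  # block future chunks only

for step in range(3):                    # training-loop skeleton
    chunk = random.choice(CHUNK_SIZES)   # new latency setting each batch
    mask = chunk_mask(128, chunk)
    # ... run the streaming encoder with `mask` and take a CTC loss step ...
```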
- A Study of Designing Compact Audio-Visual Wake Word Spotting System Based on Iterative Fine-Tuning in Neural Network Pruning [57.28467469709369]
We investigate designing a compact audio-visual wake word spotting (WWS) system by utilizing visual information.
We introduce a neural network pruning strategy via the lottery ticket hypothesis in an iterative fine-tuning manner (LTH-IF).
The proposed audio-visual system achieves significant performance improvements over the single-modality (audio-only or video-only) system under different noisy conditions.
arXiv Detail & Related papers (2022-02-17T08:26:25Z)
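In the spirit of iterative pruning with fine-tuning, the sketch below alternates magnitude pruning and fine-tuning using PyTorch's pruning utilities; the round count and pruning amounts are assumptions, and the lottery-ticket weight-rewinding step is omitted for brevity.

```python
# Hedged sketch of iterative magnitude pruning with fine-tuning rounds.
# Amounts and round count are illustrative; LTH weight rewinding omitted.
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 64))

for round_idx in range(5):               # prune / fine-tune rounds
    for module in model.modules():
        if isinstance(module, nn.Linear):
            # drop the 20% smallest-magnitude remaining weights
            prune.l1_unstructured(module, name="weight", amount=0.2)
    # ... fine-tune the surviving weights for a few epochs here ...

for module in model.modules():           # bake the masks into the tensors
    if isinstance(module, nn.Linear):
        prune.remove(module, "weight")
```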
- Speech Summarization using Restricted Self-Attention [79.89680891246827]
We introduce a single model optimized end-to-end for speech summarization.
We demonstrate that the proposed model learns to directly summarize speech for the How-2 corpus of instructional videos.
arXiv Detail & Related papers (2021-10-12T18:21:23Z)
- VoiceFilter-Lite: Streaming Targeted Voice Separation for On-Device Speech Recognition [60.462770498366524]
We introduce VoiceFilter-Lite, a single-channel source separation model that runs on the device to preserve only the speech signals from a target user.
We show that such a model can be quantized as an 8-bit integer model and run in real time.
arXiv Detail & Related papers (2020-09-09T14:26:56Z)
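Post-training dynamic quantization is one standard way to obtain the kind of 8-bit integer model mentioned above; the toy LSTM below stands in for the real separation model and is purely an assumption.

```python
# Hedged sketch: post-training dynamic quantization to 8-bit integers.
# The toy LSTM is a stand-in for the actual separation model.
import torch
import torch.nn as nn

model = nn.LSTM(input_size=128, hidden_size=256, num_layers=2)
quantized = torch.quantization.quantize_dynamic(
    model, {nn.LSTM}, dtype=torch.qint8  # store weights as int8
)
print(quantized)
```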
- Knowing What, Where and When to Look: Efficient Video Action Modeling with Attention [84.83632045374155]
Attentive video modeling is essential for action recognition in unconstrained videos.
The What-Where-When (W3) video attention module models all three facets of video attention jointly.
Experiments show that our attention model brings significant improvements to existing action recognition models.
arXiv Detail & Related papers (2020-04-02T21:48:11Z)
- Streaming automatic speech recognition with the transformer model [59.58318952000571]
We propose a transformer-based end-to-end system for streaming ASR.
We apply time-restricted self-attention for the encoder and triggered attention for the encoder-decoder attention mechanism.
Our proposed streaming transformer architecture achieves 2.8% and 7.2% WER for the "clean" and "other" test data of LibriSpeech.
arXiv Detail & Related papers (2020-01-08T18:58:02Z)
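To illustrate time-restricted self-attention, the sketch below constructs a mask limiting each frame to a fixed window of left and right context, which is what bounds latency; the window sizes here are assumptions, not the paper's settings.

```python
# Hedged sketch of a time-restricted self-attention mask. True = blocked;
# frame t may attend only to frames in [t - left, t + right].
import torch

def time_restricted_mask(n_frames: int, left: int, right: int) -> torch.Tensor:
    t = torch.arange(n_frames)
    dist = t.unsqueeze(0) - t.unsqueeze(1)  # key index minus query index
    return (dist < -left) | (dist > right)

mask = time_restricted_mask(6, left=2, right=1)
# usable as `attn_mask` in torch.nn.MultiheadAttention, for example
```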
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.