Streaming parallel transducer beam search with fast-slow cascaded
encoders
- URL: http://arxiv.org/abs/2203.15773v1
- Date: Tue, 29 Mar 2022 17:29:39 GMT
- Title: Streaming parallel transducer beam search with fast-slow cascaded
encoders
- Authors: Jay Mahadeokar, Yangyang Shi, Ke Li, Duc Le, Jiedan Zhu, Vikas
Chandra, Ozlem Kalinli, Michael L Seltzer
- Abstract summary: Streaming and non-streaming ASR for RNN Transducers can be unified by cascading causal and non-causal encoders.
We propose a novel parallel time-synchronous beam search algorithm for transducers that decodes from fast-slow encoders.
- Score: 23.416682253435837
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Streaming ASR with strict latency constraints is required in many speech
recognition applications. In order to achieve the required latency, streaming
ASR models sacrifice accuracy compared to non-streaming ASR models due to lack
of future input context. Previous research has shown that streaming and
non-streaming ASR for RNN Transducers can be unified by cascading causal and
non-causal encoders. This work improves upon this cascaded encoders framework
by leveraging two streaming non-causal encoders with variable input context
sizes that can produce outputs at different audio intervals (e.g. fast and
slow). We propose a novel parallel time-synchronous beam search algorithm for
transducers that decodes from fast-slow encoders, where the slow encoder
corrects the mistakes generated from the fast encoder. The proposed algorithm
achieves up to 20% WER reduction with a slight increase in token emission
delays on the public Librispeech dataset and in-house datasets. We also explore
techniques to reduce the computation by distributing processing between the
fast and slow encoders. Lastly, we explore sharing the parameters in the fast
encoder to reduce the memory footprint. This enables low latency processing on
edge devices with low computation cost and a low memory footprint.
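A minimal sketch of the fast-slow decoding idea is below. It is a toy simplification, not the paper's algorithm: per-frame label log-posteriors stand in for full transducer lattices, a "fast" model is decoded time-synchronously with a shared beam, and a "slow" model re-scores that beam at chunk boundaries, correcting fast-model mistakes. All names (`Hyp`, `step_beam`, `rescore_with_slow`) are hypothetical.

```python
# Toy sketch (hypothetical, not the paper's exact algorithm): a "fast"
# model emits per-frame label log-posteriors with low latency; a "slow"
# model re-scores the shared beam at chunk boundaries.
from dataclasses import dataclass

VOCAB = ["<b>", "a", "b"]                     # <b> = blank / no emission

@dataclass(frozen=True)
class Hyp:
    tokens: tuple = ()
    score: float = 0.0                        # running log-probability

def step_beam(beams, frame_logprobs, beam_size=4):
    """One time-synchronous expansion over a frame of label log-probs."""
    expanded = [
        Hyp(h.tokens if i == 0 else h.tokens + (VOCAB[i],), h.score + lp)
        for h in beams
        for i, lp in enumerate(frame_logprobs)
    ]
    best = {}                                 # merge equal label sequences
    for h in expanded:
        if h.tokens not in best or h.score > best[h.tokens].score:
            best[h.tokens] = h
    return sorted(best.values(), key=lambda h: -h.score)[:beam_size]

def rescore_with_slow(beams, slow_scores):
    """Slow-encoder correction: swap in slow-model scores where known."""
    rescored = [Hyp(h.tokens, slow_scores.get(h.tokens, h.score))
                for h in beams]
    return sorted(rescored, key=lambda h: -h.score)

# fast per-frame log-posteriors over [<b>, a, b] for 4 frames
fast = [[-0.2, -1.8, -2.5], [-1.5, -0.4, -2.0],
        [-0.3, -2.0, -1.9], [-1.6, -2.2, -0.5]]
beams = [Hyp()]
for t, frame in enumerate(fast):
    beams = step_beam(beams, frame)
    if t == 1:                                # slow chunk of 2 frames ready
        beams = rescore_with_slow(beams, {("a",): -0.3})
print(beams[0].tokens)                        # best hypothesis so far
```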
Related papers
- Local Clustering Decoder: a fast and adaptive hardware decoder for the surface code [0.0]
We introduce the Local Clustering Decoder as a solution that simultaneously achieves the accuracy and speed requirements of a real-time decoding system.
Our decoder is implemented on FPGAs and exploits hardware parallelism to keep pace with the fastest qubit types.
It enables one million error-free quantum operations with 4x fewer physical qubits when compared to standard non-adaptive decoding.
arXiv Detail & Related papers (2024-11-15T16:43:59Z) - Faster Diffusion: Rethinking the Role of the Encoder for Diffusion Model Inference [95.42299246592756]
We study the UNet encoder and empirically analyze the encoder features.
We find that encoder features change minimally, whereas the decoder features exhibit substantial variations across different time-steps.
We validate our approach on other tasks: text-to-video, personalized generation and reference-guided generation.
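As a rough illustration of that observation, here is a hypothetical sketch of reusing cached encoder features across diffusion steps; `encoder`, `decoder`, and `update_latent` are placeholder callables, not a real library API.

```python
# Hypothetical sketch of encoder-feature reuse across diffusion steps:
# run the UNet encoder only at "key" steps and reuse the cached features
# in between, since encoder features were observed to change minimally.
def sample(latent, timesteps, encoder, decoder, update_latent, key_every=2):
    cached = None
    for i, t in enumerate(timesteps):
        if cached is None or i % key_every == 0:
            cached = encoder(latent, t)           # full encoder pass
        noise_pred = decoder(cached, latent, t)   # decoder runs every step
        latent = update_latent(latent, noise_pred, t)
    return latent
```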
arXiv Detail & Related papers (2023-12-15T08:46:43Z) - Practical Conformer: Optimizing size, speed and flops of Conformer for
on-Device and cloud ASR [67.63332492134332]
We design an optimized conformer that is small enough to meet on-device restrictions and has fast inference on TPUs.
Our proposed encoder can double as a strong standalone encoder for on-device use and as the first part of a high-performance ASR pipeline.
arXiv Detail & Related papers (2023-03-31T23:30:48Z) - NAF: Neural Attenuation Fields for Sparse-View CBCT Reconstruction [79.13750275141139]
This paper proposes a novel and fast self-supervised solution for sparse-view CBCT reconstruction.
The desired attenuation coefficients are represented as a continuous function of 3D spatial coordinates, parameterized by a fully-connected deep neural network.
A learning-based encoder entailing hash coding is adopted to help the network capture high-frequency details.
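A simplified sketch of such a coordinate-to-attenuation network is below; it swaps the paper's learned hash encoding for a plain Fourier-feature encoding to stay short, and all module names are hypothetical.

```python
import torch
import torch.nn as nn

class CoordinateAttenuationField(nn.Module):
    """Simplified sketch: an MLP mapping a 3D point to an attenuation
    coefficient. A fixed Fourier-feature encoding stands in for the
    paper's learned hash encoding."""
    def __init__(self, n_freqs=8, hidden=64):
        super().__init__()
        self.n_freqs = n_freqs
        in_dim = 3 * 2 * n_freqs
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, xyz):                   # xyz: (N, 3) in [-1, 1]
        freqs = 2.0 ** torch.arange(self.n_freqs, device=xyz.device)
        ang = xyz[..., None] * freqs          # (N, 3, n_freqs)
        enc = torch.cat([ang.sin(), ang.cos()], dim=-1).flatten(1)
        return self.mlp(enc)                  # (N, 1) attenuation

mu = CoordinateAttenuationField()(torch.rand(4, 3) * 2 - 1)
```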
arXiv Detail & Related papers (2022-09-29T04:06:00Z) - Streaming Align-Refine for Non-autoregressive Deliberation [42.748839817396046]
We propose a streaming non-autoregressive (non-AR) decoding algorithm to deliberate the hypothesis alignment of a streaming RNN-T model.
Our algorithm uses a simple greedy decoding procedure and can produce a decoding result at each frame with limited right context.
Experiments on voice search datasets and Librispeech show that with reasonable right context, our streaming model performs as well as the offline counterpart.
arXiv Detail & Related papers (2022-04-15T17:24:39Z) - Fast-MD: Fast Multi-Decoder End-to-End Speech Translation with
Non-Autoregressive Hidden Intermediates [59.678108707409606]
We propose Fast-MD, a fast MD model that generates HI by non-autoregressive decoding based on connectionist temporal classification (CTC) outputs followed by an ASR decoder.
Fast-MD achieved about 2x and 4x faster decoding than the naïve MD model on GPU and CPU, respectively, with comparable translation quality.
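The non-autoregressive read-out from CTC outputs can be illustrated with the standard greedy CTC rule; this is a generic sketch, not Fast-MD's exact pipeline.

```python
def ctc_greedy_decode(frame_ids, blank=0):
    """Standard greedy CTC read-out: collapse repeats, drop blanks.
    This yields a label sequence in a single non-autoregressive pass,
    the kind of fast hidden-intermediate generation described above."""
    out, prev = [], None
    for i in frame_ids:
        if i != blank and i != prev:
            out.append(i)
        prev = i
    return out

assert ctc_greedy_decode([0, 3, 3, 0, 0, 5, 5, 5, 0]) == [3, 5]
```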
arXiv Detail & Related papers (2021-09-27T05:21:30Z) - FastEmit: Low-latency Streaming ASR with Sequence-level Emission
Regularization [78.46088089185156]
Streaming automatic speech recognition (ASR) aims to emit each hypothesized word as quickly and accurately as possible.
Existing approaches penalize emission delay by manipulating per-token or per-frame probability prediction in sequence transducer models.
We propose a sequence-level emission regularization method, named FastEmit, that applies latency regularization directly on per-sequence probability in training transducer models.
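A loose, hypothetical stand-in for the idea (not FastEmit's actual gradient scaling): add a training term that penalizes blank probability mass more heavily at later frames, pushing the model to emit labels earlier.

```python
import torch

def toy_latency_regularizer(log_probs, blank=0, lam=0.01):
    """Hypothetical, simplified latency regularizer, not the exact
    FastEmit method: weight blank probability by frame position so
    late blanks cost more, encouraging earlier label emission.
    log_probs: (T, V) per-frame log posteriors."""
    T = log_probs.shape[0]
    weights = torch.linspace(0.0, 1.0, T)    # later frames cost more
    blank_prob = log_probs[:, blank].exp()
    return lam * (weights * blank_prob).sum()

# usage sketch: total_loss = transducer_loss + toy_latency_regularizer(lp)
```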
arXiv Detail & Related papers (2020-10-21T17:05:01Z) - Conv-Transformer Transducer: Low Latency, Low Frame Rate, Streamable
End-to-End Speech Recognition [8.046120977786702]
Transformer has achieved competitive performance against state-of-the-art end-to-end models in automatic speech recognition (ASR).
The original Transformer, with encoder-decoder architecture, is only suitable for offline ASR.
We show that this architecture, named Conv-Transformer Transducer, achieves competitive performance on LibriSpeech dataset (3.6% WER on test-clean) without external language models.
arXiv Detail & Related papers (2020-08-13T08:20:02Z) - Minimum Latency Training Strategies for Streaming Sequence-to-Sequence
ASR [44.229256049718316]
Streaming attention-based sequence-to-sequence (S2S) models have been proposed to perform online speech recognition with linear-time decoding complexity.
In these models, the decisions to generate tokens are delayed compared to the actual acoustic boundaries since their unidirectional encoders lack future information.
We propose several strategies during training by leveraging external hard alignments extracted from the hybrid model.
Experiments on the Cortana voice search task demonstrate that our proposed methods can significantly reduce the latency, and even improve the recognition accuracy in certain cases on the decoder side.
arXiv Detail & Related papers (2020-04-10T12:24:49Z) - Streaming automatic speech recognition with the transformer model [59.58318952000571]
We propose a transformer based end-to-end ASR system for streaming ASR.
We apply time-restricted self-attention for the encoder and triggered attention for the encoder-decoder attention mechanism.
Our proposed streaming transformer architecture achieves 2.8% and 7.2% WER for the "clean" and "other" test data of LibriSpeech.
arXiv Detail & Related papers (2020-01-08T18:58:02Z)
This list is automatically generated from the titles and abstracts of the papers on this site.