Delay-penalized transducer for low-latency streaming ASR
- URL: http://arxiv.org/abs/2211.00490v1
- Date: Mon, 31 Oct 2022 07:03:50 GMT
- Title: Delay-penalized transducer for low-latency streaming ASR
- Authors: Wei Kang, Zengwei Yao, Fangjun Kuang, Liyong Guo, Xiaoyu Yang, Long
Lin, Piotr Żelasko, Daniel Povey
- Abstract summary: We propose a simple way to penalize symbol delay in the transducer model, so that we can balance the trade-off between symbol delay and accuracy for streaming models without external alignments.
Our method achieves a similar delay-accuracy trade-off to the previously published FastEmit, but we believe our method is preferable because it has a better justification: it is equivalent to penalizing the average symbol delay.
- Score: 26.39851372961386
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In streaming automatic speech recognition (ASR), it is desirable to reduce
latency as much as possible while having minimum impact on recognition
accuracy. Although a few existing methods are able to achieve this goal, they
are difficult to implement due to their dependency on external alignments. In
this paper, we propose a simple way to penalize symbol delay in the transducer
model, so that we can balance the trade-off between symbol delay and accuracy
for streaming models without external alignments. Specifically, our method adds
a small constant times (T/2 - t), where T is the number of frames and t is the
current frame, to all the non-blank log-probabilities (after normalization)
that are fed into the two-dimensional transducer recursion. For both streaming
Conformer models and unidirectional long short-term memory (LSTM) models,
experimental results show that it can significantly reduce the symbol delay
with an acceptable performance degradation. Our method achieves a similar
delay-accuracy trade-off to the previously published FastEmit, but we believe
our method is preferable because it has a better justification: it is
equivalent to penalizing the average symbol delay. Our work is open-sourced and
publicly available (https://github.com/k2-fsa/k2).
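The penalty described in the abstract, a small constant times (T/2 - t) added to the non-blank log-probabilities before the transducer recursion, can be sketched as follows. This is a minimal NumPy illustration of the idea, not the k2 implementation: the function name `apply_delay_penalty`, the (T, U, V) tensor layout, and the default `penalty` value are assumptions for illustration only.

```python
import numpy as np

def apply_delay_penalty(log_probs: np.ndarray,
                        penalty: float = 0.05,
                        blank_id: int = 0) -> np.ndarray:
    """Shift non-blank log-probabilities by penalty * (T/2 - t).

    log_probs: array of shape (T, U, V) holding normalized joiner
    log-probabilities for T frames, U label positions, V symbols.
    Early frames (t < T/2) receive a bonus and late frames a penalty,
    so the transducer recursion prefers paths that emit symbols sooner.
    """
    T = log_probs.shape[0]
    t = np.arange(T, dtype=log_probs.dtype)
    offset = penalty * (T / 2.0 - t)                # shape (T,)
    out = log_probs + offset[:, None, None]         # broadcast over (U, V)
    out[..., blank_id] = log_probs[..., blank_id]   # blank scores stay untouched
    return out
```

Because only non-blank scores are shifted, the total offset accumulated along a path is penalty * (U * T/2 - sum of emission times), i.e. a path is scored down in proportion to the sum of the frames at which it emits symbols, which is why the penalty is equivalent to penalizing the average symbol delay.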
Related papers
- Towards More Accurate Diffusion Model Acceleration with A Timestep
Aligner [84.97253871387028]
A diffusion model, which is formulated to produce an image using thousands of denoising steps, usually suffers from a slow inference speed.
We propose a timestep aligner that helps find a more accurate integral direction for a particular interval at the minimum cost.
Experiments show that our plug-in design can be trained efficiently and boost the inference performance of various state-of-the-art acceleration methods.
arXiv Detail & Related papers (2023-10-14T02:19:07Z) - Semi-Autoregressive Streaming ASR With Label Context [70.76222767090638]
We propose a streaming "semi-autoregressive" ASR model that incorporates the labels emitted in previous blocks as additional context.
Experiments show that our method outperforms the existing streaming NAR model by 19% relative on Tedlium2, 16%/8% on Librispeech-100 clean/other test sets, and 19%/8% on the Switchboard(SWB)/Callhome(CH) test sets.
arXiv Detail & Related papers (2023-09-19T20:55:58Z) - Practical Conformer: Optimizing size, speed and flops of Conformer for
on-Device and cloud ASR [67.63332492134332]
We design an optimized conformer that is small enough to meet on-device restrictions and has fast inference on TPUs.
Our proposed encoder can double as a strong standalone encoder on device, and as the first part of a high-performance ASR pipeline.
arXiv Detail & Related papers (2023-03-31T23:30:48Z) - Minimum Latency Training of Sequence Transducers for Streaming
End-to-End Speech Recognition [38.28868751443619]
We propose a new training method to explicitly model and reduce the latency of sequence transducer models.
Experimental results show that the proposed minimum latency training reduces the latency of causal Conformer-T from 220 ms to 27 ms within a WER degradation of 0.7%.
arXiv Detail & Related papers (2022-11-04T09:19:59Z) - Streaming End-to-End ASR based on Blockwise Non-Autoregressive Models [57.20432226304683]
Non-autoregressive (NAR) modeling has gained increasing attention in speech processing.
We propose a novel end-to-end streaming NAR speech recognition system.
We show that the proposed method improves online ASR recognition in low latency conditions.
arXiv Detail & Related papers (2021-07-20T11:42:26Z) - Reducing Streaming ASR Model Delay with Self Alignment [20.61461084287351]
Constrained alignment is a well-known existing approach that penalizes predicted word boundaries using external low-latency acoustic models.
FastEmit is a sequence-level delay regularization scheme encouraging vocabulary tokens over blanks without any reference alignments.
In this paper, we propose a novel delay constraining method, named self alignment.
arXiv Detail & Related papers (2021-05-06T18:00:11Z) - Unified Streaming and Non-streaming Two-pass End-to-end Model for Speech
Recognition [19.971343876930767]
We present a novel two-pass approach to unify streaming and non-streaming end-to-end (E2E) speech recognition in a single model.
Our model adopts the hybrid CTC/attention architecture, in which the conformer layers in the encoder are modified.
Experiments on the open 170-hour AISHELL-1 dataset show that, the proposed method can unify the streaming and non-streaming model simply and efficiently.
arXiv Detail & Related papers (2020-12-10T06:54:54Z) - FastEmit: Low-latency Streaming ASR with Sequence-level Emission
Regularization [78.46088089185156]
Streaming automatic speech recognition (ASR) aims to emit each hypothesized word as quickly and accurately as possible.
Existing approaches penalize emission delay by manipulating per-token or per-frame probability prediction in sequence transducer models.
We propose a sequence-level emission regularization method, named FastEmit, that applies latency regularization directly on per-sequence probability in training transducer models.
arXiv Detail & Related papers (2020-10-21T17:05:01Z) - Boosting Continuous Sign Language Recognition via Cross Modality
Augmentation [135.30357113518127]
Continuous sign language recognition deals with unaligned video-text pairs.
We propose a novel architecture with cross modality augmentation.
The proposed framework can be easily extended to other existing CTC based continuous SLR architectures.
arXiv Detail & Related papers (2020-10-11T15:07:50Z) - Minimum Latency Training Strategies for Streaming Sequence-to-Sequence
ASR [44.229256049718316]
Streaming attention-based sequence-to-sequence (S2S) models have been proposed to perform online speech recognition with linear-time decoding complexity.
In these models, the decisions to generate tokens are delayed compared to the actual acoustic boundaries since their unidirectional encoders lack future information.
We propose several strategies during training by leveraging external hard alignments extracted from the hybrid model.
Experiments on the Cortana voice search task demonstrate that our proposed methods can significantly reduce the latency, and even improve the recognition accuracy in certain cases on the decoder side.
arXiv Detail & Related papers (2020-04-10T12:24:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.