VAD-free Streaming Hybrid CTC/Attention ASR for Unsegmented Recording
- URL: http://arxiv.org/abs/2107.07509v1
- Date: Thu, 15 Jul 2021 17:59:10 GMT
- Title: VAD-free Streaming Hybrid CTC/Attention ASR for Unsegmented Recording
- Authors: Hirofumi Inaguma, Tatsuya Kawahara
- Abstract summary: We propose a block-synchronous beam search decoding to take advantage of efficient batched output-synchronous and low-latency input-synchronous searches.
We also propose a VAD-free inference algorithm that leverages CTC probabilities to determine a suitable timing to reset the model states.
Experimental evaluations demonstrate that the block-synchronous decoding achieves comparable accuracy to the label-synchronous one.
- Score: 46.69852287267763
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this work, we propose novel decoding algorithms to enable streaming
automatic speech recognition (ASR) on unsegmented long-form recordings without
voice activity detection (VAD), based on monotonic chunkwise attention (MoChA)
with an auxiliary connectionist temporal classification (CTC) objective. We
propose a block-synchronous beam search decoding to take advantage of efficient
batched output-synchronous and low-latency input-synchronous searches. We also
propose a VAD-free inference algorithm that leverages CTC probabilities to
determine a suitable timing to reset the model states to tackle the
vulnerability to long-form data. Experimental evaluations demonstrate that the
block-synchronous decoding achieves comparable accuracy to the
label-synchronous one. Moreover, the VAD-free inference can recognize long-form
speech robustly for up to a few hours.
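The CTC-probability-based reset can be sketched as a simple heuristic: when the blank symbol dominates the CTC posteriors for a sustained run of frames, the input is treated as non-speech and the model states are reset there. The Python sketch below illustrates the idea; the threshold, run length, and function name are illustrative assumptions, not values from the paper.
```python
import numpy as np

def find_reset_points(ctc_posteriors: np.ndarray,
                      blank_id: int = 0,
                      blank_threshold: float = 0.99,
                      min_blank_run: int = 50) -> list[int]:
    """Return frame indices at which to reset encoder/decoder states.

    A sustained run of frames whose CTC blank posterior exceeds
    blank_threshold is taken as non-speech; resetting there keeps the
    model from drifting on unsegmented long-form input.
    """
    reset_points = []
    run = 0
    for t, frame in enumerate(ctc_posteriors):
        if frame[blank_id] >= blank_threshold:
            run += 1
            if run == min_blank_run:
                # Reset in the middle of the detected non-speech region.
                reset_points.append(t - min_blank_run // 2)
        else:
            run = 0
    return reset_points

# Toy example: 200 frames over a 5-symbol vocabulary, with frames
# 80-159 dominated by the blank symbol (simulated silence).
rng = np.random.default_rng(0)
post = rng.dirichlet(np.ones(5), size=200)
post[80:160] = [0.996, 0.001, 0.001, 0.001, 0.001]
print(find_reset_points(post))  # -> [104]
```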
Related papers
- Fast Context-Biasing for CTC and Transducer ASR models with CTC-based Word Spotter [57.64003871384959]
This work presents a new approach to fast context-biasing with CTC-based Word Spotter.
The proposed method matches CTC log-probabilities against a compact context graph to detect potential context-biasing candidates.
The results demonstrate a significant acceleration of the context-biasing recognition with a simultaneous improvement in F-score and WER.
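As an illustration of the core matching step, the sketch below scores one candidate word against a window of CTC log-probabilities with a best-path (Viterbi) CTC pass; a real word spotter walks a compact context graph over many candidates at once, and all names and parameters here are assumptions rather than the paper's implementation.
```python
import numpy as np

def ctc_word_score(log_probs: np.ndarray, word: list[int], blank: int = 0) -> float:
    """Best-path (Viterbi) CTC log-score of ``word`` over a window.

    log_probs is a (T, V) matrix of per-frame CTC log-probabilities.
    The word is expanded with interleaved blanks, and the standard
    CTC transition rules are applied with max instead of sum.
    """
    ext = [blank]
    for u in word:                     # interleave labels with blanks
        ext += [u, blank]
    S = len(ext)
    dp = np.full(S, -np.inf)
    dp[0] = log_probs[0, ext[0]]
    if S > 1:
        dp[1] = log_probs[0, ext[1]]
    for t in range(1, len(log_probs)):
        prev = dp.copy()
        for s in range(S):
            best = prev[s]                      # stay on the same state
            if s >= 1:
                best = max(best, prev[s - 1])   # advance one state
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                best = max(best, prev[s - 2])   # skip a blank
            dp[s] = best + log_probs[t, ext[s]]
    return max(dp[-1], dp[-2] if S > 1 else -np.inf)
```
Spotting would then slide a window over the recording and accept the word wherever its length-normalized score beats a blank-only baseline by some margin.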
arXiv Detail & Related papers (2024-06-11T09:37:52Z) - A Demonstration of Over-the-Air Computation for Federated Edge Learning [8.22379888383833]
The proposed method relies on the detection of a synchronization waveform in both receive and transmit directions.
By implementing this synchronization method on a set of low-cost SDRs, we demonstrate the performance of frequency-shift keying (FSK)-based majority vote (MV) computation.
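Synchronization waveforms of this kind are typically located with a matched filter; the sketch below shows a generic normalized cross-correlation detector, which is an assumption about the mechanism rather than the paper's exact SDR implementation.
```python
import numpy as np

def detect_sync(rx: np.ndarray, sync: np.ndarray, threshold: float = 0.8):
    """Find the offset of a known sync waveform via a matched filter.

    Returns the sample index where the normalized cross-correlation
    peaks, or None if no peak clears ``threshold``.
    """
    # np.correlate conjugates its second argument, so this is a matched filter.
    corr = np.abs(np.correlate(rx, sync, mode="valid"))
    # Normalize by the energy of sync and of each received window.
    win_energy = np.convolve(np.abs(rx) ** 2, np.ones(len(sync)), mode="valid")
    ncc = corr / (np.sqrt(win_energy) * np.linalg.norm(sync) + 1e-12)
    peak = int(np.argmax(ncc))
    return peak if ncc[peak] >= threshold else None
```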
arXiv Detail & Related papers (2022-09-20T19:08:49Z) - An Investigation of Enhancing CTC Model for Triggered Attention-based Streaming ASR [19.668440671541546]
An attempt is made to combine Mask-CTC and the triggered attention mechanism to construct a streaming end-to-end automatic speech recognition (ASR) system.
The proposed method achieves higher accuracy with lower latency than the conventional triggered attention-based streaming ASR system.
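The trigger mechanism can be sketched as follows: greedy CTC output is scanned for "spikes" (the first frame of each new non-blank token), and each spike tells the attention decoder when it may emit the next token, bounding how far ahead it must look. Function and parameter names below are illustrative.
```python
import numpy as np

def ctc_trigger_frames(log_probs: np.ndarray, blank: int = 0) -> list[int]:
    """Extract trigger frames from greedy CTC output.

    In triggered attention, each non-blank CTC spike marks a frame at
    which the attention decoder may emit the next token, so the decoder
    only attends to encoder frames up to the trigger.
    """
    best = log_probs.argmax(axis=1)          # greedy per-frame labels
    triggers = []
    prev = blank
    for t, lab in enumerate(best):
        if lab != blank and lab != prev:     # first frame of a new token
            triggers.append(t)
        prev = lab
    return triggers
```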
arXiv Detail & Related papers (2021-10-20T06:44:58Z) - Fast-MD: Fast Multi-Decoder End-to-End Speech Translation with Non-Autoregressive Hidden Intermediates [59.678108707409606]
We propose Fast-MD, a fast MD model that generates HI by non-autoregressive decoding based on connectionist temporal classification (CTC) outputs followed by an ASR decoder.
Fast-MD achieved about 2x and 4x faster decoding speed than the naïve MD model on GPU and CPU, respectively, with comparable translation quality.
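The non-autoregressive step rests on standard CTC greedy decoding, which collapses a per-frame argmax sequence in a single pass instead of generating tokens one by one; a minimal sketch:
```python
def ctc_greedy_collapse(frame_ids: list[int], blank: int = 0) -> list[int]:
    """Collapse a per-frame CTC argmax sequence into a token sequence.

    Merges consecutive repeats, then drops blanks. This single parallel
    pass is what makes CTC-based intermediate generation fast, in
    contrast to step-by-step autoregressive decoding.
    """
    out = []
    prev = None
    for i in frame_ids:
        if i != prev and i != blank:
            out.append(i)
        prev = i
    return out

# Blank id 0; repeats merge unless separated by a blank.
print(ctc_greedy_collapse([3, 3, 0, 3, 5, 0, 5]))  # -> [3, 3, 5, 5]
```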
arXiv Detail & Related papers (2021-09-27T05:21:30Z) - Streaming End-to-End ASR based on Blockwise Non-Autoregressive Models [57.20432226304683]
Non-autoregressive (NAR) modeling has gained increasing attention in speech processing.
We propose a novel end-to-end streaming NAR speech recognition system.
We show that the proposed method improves online ASR recognition in low latency conditions.
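Blockwise streaming encoders consume the input as a sequence of blocks with limited context rather than waiting for the full utterance; below is a minimal sketch of the block slicing, with illustrative sizes that are assumptions, not values from the paper.
```python
def stream_blocks(features, block: int = 16, left_context: int = 8,
                  lookahead: int = 4):
    """Yield feature blocks for blockwise streaming encoding.

    Each block carries left_context past frames plus block current
    frames and a bounded lookahead, so latency stays fixed regardless
    of utterance length.
    """
    t = 0
    while t < len(features):
        start = max(0, t - left_context)
        end = min(len(features), t + block + lookahead)
        yield features[start:end]
        t += block
```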
arXiv Detail & Related papers (2021-07-20T11:42:26Z) - Alignment Knowledge Distillation for Online Streaming Attention-based Speech Recognition [46.69852287267763]
This article describes an efficient training method for online streaming attention-based encoder-decoder (AED) automatic speech recognition (ASR) systems.
The proposed method significantly reduces recognition errors and emission latency simultaneously.
The best MoChA system shows performance comparable to that of an RNN-transducer (RNN-T).
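A heavily hedged sketch of the general idea behind alignment distillation: pull the student's expected token emission frames toward boundaries obtained from a CTC forced alignment of a teacher. The objective below and its delta-shift knob are assumptions for illustration, not the paper's exact formulation.
```python
import numpy as np

def alignment_kd_loss(student_boundaries: np.ndarray,
                      teacher_boundaries: np.ndarray,
                      delta: int = 0) -> float:
    """L1 distance between student and teacher token emission frames.

    student_boundaries[i]: the streaming AED model's expected emission
    frame for token i (e.g., the mean of MoChA's monotonic attention).
    teacher_boundaries[i]: the frame given by a CTC forced alignment.
    delta shifts the target to trade accuracy against emission latency.
    """
    return float(np.mean(np.abs(student_boundaries
                                - (teacher_boundaries + delta))))
```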
arXiv Detail & Related papers (2021-02-28T08:17:38Z) - Adapting End-to-End Speech Recognition for Readable Subtitles [15.525314212209562]
In some use cases such as subtitling, verbatim transcription would reduce output readability given limited screen size and reading time.
We first investigate a cascaded system, where an unsupervised compression model is used to post-edit the transcribed speech.
Experiments show that with limited data far less than needed for training a model from scratch, we can adapt a Transformer-based ASR model to incorporate both transcription and compression capabilities.
arXiv Detail & Related papers (2020-05-25T14:42:26Z) - End-to-End Automatic Speech Recognition Integrated With CTC-Based Voice Activity Detection [48.80449801938696]
This paper integrates a voice activity detection function with end-to-end automatic speech recognition.
We focus on connectionist temporal classification (CTC) and its extension of CTC/attention architectures.
We use the blank labels predicted by CTC as a cue for detecting speech segments with simple thresholding.
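That thresholding step can be sketched directly: frames whose blank posterior stays below a threshold are treated as speech, and runs shorter than a minimum length are discarded. Parameter values below are illustrative assumptions.
```python
import numpy as np

def speech_segments(blank_posteriors: np.ndarray,
                    threshold: float = 0.95,
                    min_len: int = 10) -> list[tuple[int, int]]:
    """Detect speech segments from per-frame CTC blank posteriors.

    A frame is speech when its blank posterior is below ``threshold``;
    contiguous speech runs of at least ``min_len`` frames are returned
    as (start, end) frame intervals.
    """
    is_speech = blank_posteriors < threshold
    segments, start = [], None
    for t, s in enumerate(is_speech):
        if s and start is None:
            start = t                      # segment opens
        elif not s and start is not None:
            if t - start >= min_len:
                segments.append((start, t))
            start = None                   # segment closes
    if start is not None and len(is_speech) - start >= min_len:
        segments.append((start, len(is_speech)))
    return segments
```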
arXiv Detail & Related papers (2020-02-03T03:36:34Z)