Alignment Knowledge Distillation for Online Streaming Attention-based
Speech Recognition
- URL: http://arxiv.org/abs/2103.00422v1
- Date: Sun, 28 Feb 2021 08:17:38 GMT
- Title: Alignment Knowledge Distillation for Online Streaming Attention-based
Speech Recognition
- Authors: Hirofumi Inaguma, Tatsuya Kawahara
- Abstract summary: This article describes an efficient training method for online streaming attention-based encoder-decoder (AED) automatic speech recognition (ASR) systems.
The proposed method significantly reduces recognition errors and emission latency simultaneously.
The best MoChA system shows performance comparable to that of RNN-transducer (RNN-T).
- Score: 46.69852287267763
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This article describes an efficient training method for online streaming
attention-based encoder-decoder (AED) automatic speech recognition (ASR)
systems. AED models have achieved competitive performance in offline scenarios
by jointly optimizing all components. They have recently been extended to an
online streaming framework via models such as monotonic chunkwise attention
(MoChA). However, the elaborate attention calculation process is not robust for
long-form speech utterances. Moreover, the sequence-level training objective
and time-restricted streaming encoder cause a nonnegligible delay in token
emission during inference. To address these problems, we propose CTC
synchronous training (CTC-ST), in which CTC alignments are leveraged as a
reference for token boundaries to enable a MoChA model to learn optimal
monotonic input-output alignments. We formulate a purely end-to-end training
objective to synchronize the boundaries of MoChA to those of CTC. The CTC model
shares an encoder with the MoChA model to enhance the encoder representation.
Moreover, the proposed method provides alignment information learned in the CTC
branch to the attention-based decoder. Therefore, CTC-ST can be regarded as
self-distillation of alignment knowledge from CTC to MoChA. Experimental
evaluations on a variety of benchmark datasets show that the proposed method
significantly reduces recognition errors and emission latency simultaneously,
especially for long-form and noisy speech. We also compare CTC-ST with several
methods that distill alignment knowledge from a hybrid ASR system and show that
the CTC-ST can achieve a comparable tradeoff of accuracy and latency without
relying on external alignment information. The best MoChA system shows
performance comparable to that of RNN-transducer (RNN-T).
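To make the synchronization idea concrete, below is a minimal PyTorch sketch (hypothetical names and weights; not the authors' code). MoChA's selection probabilities yield an expected boundary frame per output token; CTC-ST penalizes the distance between these and the token boundary frames taken from a CTC forced alignment, on top of the usual attention and CTC losses.

```python
import torch

def expected_boundaries(alpha):
    """alpha: (num_tokens, num_frames) MoChA selection probabilities,
    each row summing to ~1 over frames. Returns the expected boundary
    frame index for each output token."""
    frames = torch.arange(alpha.size(1), dtype=alpha.dtype)
    return alpha @ frames                      # (num_tokens,)

def ctc_st_loss(alpha, ctc_boundaries):
    """Synchronization term: mean distance (in frames) between MoChA's
    expected boundaries and boundaries from a CTC forced alignment."""
    return (expected_boundaries(alpha) - ctc_boundaries.float()).abs().mean()

# Joint objective (weights are illustrative):
# loss = (1 - w_ctc) * att_ce + w_ctc * ctc_loss + w_sync * ctc_st_loss(alpha, b_ctc)
```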
Related papers
- CR-CTC: Consistency regularization on CTC for improved speech recognition [18.996929774821822]
Connectionist Temporal Classification (CTC) is a widely used method for automatic speech recognition (ASR).
However, it often falls short in recognition performance compared to transducer or systems combining CTC and attention-based encoder-decoder (CTC/AED).
We propose Consistency-Regularized CTC (CR-CTC), which enforces consistency between two CTC distributions obtained from different augmented views of the input speech mel-spectrogram (see the sketch below).
arXiv Detail & Related papers (2024-10-07T14:56:07Z)
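A minimal sketch of the consistency term above, assuming the model is run twice on two augmented views of the same utterance and produces per-frame CTC log-probabilities for each (hypothetical shapes; PyTorch):

```python
import torch.nn.functional as F

def cr_ctc_regularizer(log_p_a, log_p_b):
    """Symmetric KL between two per-frame CTC distributions of shape
    (time, vocab), each from a different augmentation of one input."""
    kl_ab = F.kl_div(log_p_b, log_p_a, log_target=True, reduction="batchmean")
    kl_ba = F.kl_div(log_p_a, log_p_b, log_target=True, reduction="batchmean")
    return 0.5 * (kl_ab + kl_ba)

# total = ctc_loss_view_a + ctc_loss_view_b + w * cr_ctc_regularizer(lp_a, lp_b)
```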
- Fast Context-Biasing for CTC and Transducer ASR models with CTC-based Word Spotter [57.64003871384959]
This work presents a new approach to fast context-biasing with a CTC-based Word Spotter.
The proposed method matches CTC log-probabilities against a compact context graph to detect potential context-biasing candidates (a toy scoring routine is sketched below).
The results demonstrate a significant acceleration of the context-biasing recognition with a simultaneous improvement in F-score and WER.
arXiv Detail & Related papers (2024-06-11T09:37:52Z)
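As a toy illustration of matching CTC log-probabilities against context entries (a simplified stand-in for the paper's context-graph word spotter), the routine below scores one context word against a window of frames with a small Viterbi-style DP:

```python
import numpy as np

def spot_score(log_probs, word_ids, blank=0):
    """log_probs: (T, V) per-frame CTC log-probabilities for a window.
    word_ids: token-id sequence of one context word.
    Each frame either emits the word's next token or a blank; returns
    the best total log-score of spotting the whole word in the window."""
    T = log_probs.shape[0]
    J = len(word_ids)
    dp = np.full(J + 1, -np.inf)
    dp[0] = 0.0
    for t in range(T):
        emit = dp[:-1] + log_probs[t, word_ids]    # advance one token
        dp = dp + log_probs[t, blank]              # or stay on a blank frame
        dp[1:] = np.maximum(dp[1:], emit)
    return dp[J]
```

Candidates whose score beats a blank-only baseline by some margin would be forwarded as biasing hypotheses; the paper organizes the vocabulary as a compact context graph rather than scoring words one by one.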
- Stateful Conformer with Cache-based Inference for Streaming Automatic Speech Recognition [20.052245837954175]
We propose an efficient and accurate streaming speech recognition model based on the FastConformer architecture.
We introduce an activation caching mechanism to enable the non-autoregressive encoder to operate autoregressively during inference (see the sketch after this entry).
A hybrid CTC/RNNT architecture utilizes a shared encoder with both a CTC and an RNNT decoder to boost accuracy and save computation.
arXiv Detail & Related papers (2023-12-27T21:04:26Z)
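A minimal sketch of the activation-caching idea from the entry above (hypothetical module; the paper builds it into a FastConformer encoder): each new chunk attends over a bounded cache of past activations, so streaming inference never recomputes the left context.

```python
import torch
import torch.nn as nn

class CachedSelfAttention(nn.Module):
    """Chunkwise self-attention over [cached left context | new chunk]."""

    def __init__(self, d_model=256, n_heads=4, cache_frames=64):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cache_frames = cache_frames
        self.cache = None            # (batch, <=cache_frames, d_model)

    def forward(self, chunk):        # chunk: (batch, chunk_len, d_model)
        kv = chunk if self.cache is None else torch.cat([self.cache, chunk], 1)
        out, _ = self.attn(chunk, kv, kv)     # queries: new frames only
        self.cache = kv[:, -self.cache_frames:].detach()
        return out
```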
- CTC Alignments Improve Autoregressive Translation [145.90587287444976]
We argue that CTC does in fact make sense for translation if applied in a joint CTC/attention framework (see the sketch below).
Our proposed joint CTC/attention models outperform pure-attention baselines across six benchmark translation tasks.
arXiv Detail & Related papers (2022-10-11T07:13:50Z)
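The joint CTC/attention framework referenced above typically interpolates the two losses computed over a shared encoder; a minimal sketch with an illustrative weighting:

```python
import torch.nn.functional as F

def joint_loss(ctc_log_probs, targets, in_lens, tgt_lens,
               att_logits, att_targets, w_ctc=0.3):
    """ctc_log_probs: (T, batch, vocab) from the shared encoder's CTC head;
    att_logits: (batch, L, vocab) from the attention decoder."""
    ctc = F.ctc_loss(ctc_log_probs, targets, in_lens, tgt_lens, blank=0)
    att = F.cross_entropy(att_logits.transpose(1, 2), att_targets,
                          ignore_index=-1)     # -1 marks padded positions
    return w_ctc * ctc + (1 - w_ctc) * att
```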
- An Investigation of Enhancing CTC Model for Triggered Attention-based Streaming ASR [19.668440671541546]
An attempt is made to combine Mask-CTC and the triggered attention mechanism to construct a streaming end-to-end automatic speech recognition (ASR) system (a toy trigger-extraction routine is sketched below).
The proposed method achieves higher accuracy with lower latency than the conventional triggered attention-based streaming ASR system.
arXiv Detail & Related papers (2021-10-20T06:44:58Z)
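A toy sketch of reading trigger points off a greedy CTC path (one common way to realize triggered attention; hypothetical helper, not the paper's code): a trigger fires whenever a new non-blank label first appears, and the decoder attends only up to that frame.

```python
import numpy as np

def ctc_trigger_frames(log_probs, blank=0):
    """log_probs: (T, V) per-frame CTC log-probabilities.
    Returns frames where the greedy path first emits a new non-blank
    label; repeated labels and blanks fire no trigger."""
    path = log_probs.argmax(axis=1)
    triggers, prev = [], blank
    for t, p in enumerate(path):
        if p != blank and p != prev:
            triggers.append(t)
        prev = p
    return triggers
```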
- Fast-MD: Fast Multi-Decoder End-to-End Speech Translation with Non-Autoregressive Hidden Intermediates [59.678108707409606]
We propose Fast-MD, a fast MD model that generates hidden intermediates (HI) by non-autoregressive decoding based on connectionist temporal classification (CTC) outputs, followed by an ASR decoder (greedy CTC decoding is sketched below).
Fast-MD achieved about 2x and 4x faster decoding speed than the naïve MD model on GPU and CPU, respectively, with comparable translation quality.
arXiv Detail & Related papers (2021-09-27T05:21:30Z)
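In its simplest greedy form, the non-autoregressive CTC decoding that produces the hidden intermediates collapses the per-frame argmax path; a sketch:

```python
import numpy as np

def ctc_greedy_decode(log_probs, blank=0):
    """Standard greedy CTC rule: take the per-frame argmax, collapse
    consecutive repeats, then drop blanks. log_probs: (T, V)."""
    path = log_probs.argmax(axis=1)
    out, prev = [], blank
    for p in path:
        if p != prev and p != blank:
            out.append(int(p))
        prev = p
    return out
```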
- VAD-free Streaming Hybrid CTC/Attention ASR for Unsegmented Recording [46.69852287267763]
We propose a block-synchronous beam search decoding to take advantage of efficient batched output-synchronous and low-latency input-synchronous searches.
We also propose a VAD-free inference algorithm that leverages CTC probabilities to determine a suitable timing to reset the model states (a toy reset heuristic is sketched below).
Experimental evaluations demonstrate that the block-synchronous decoding achieves comparable accuracy to the label-synchronous one.
arXiv Detail & Related papers (2021-07-15T17:59:10Z)
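A toy version of the state-reset decision (hypothetical window and threshold; the paper's algorithm is more involved): when recent frames are almost surely CTC blanks, treat the region as non-speech and reset the model states.

```python
import numpy as np

def should_reset(blank_probs, window=40, threshold=0.99):
    """blank_probs: per-frame CTC blank probabilities seen so far.
    Reset when the last `window` frames are almost surely blank."""
    if len(blank_probs) < window:
        return False
    return float(np.mean(blank_probs[-window:])) > threshold
```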
- Intermediate Loss Regularization for CTC-based Speech Recognition [58.33721897180646]
We present a simple and efficient auxiliary loss function for automatic speech recognition (ASR) based on the connectionist temporal classification (CTC) objective (sketched below).
We evaluate the proposed method on various corpora, reaching a word error rate (WER) of 9.9% on the WSJ corpus and a character error rate (CER) of 5.2% on the AISHELL-1 corpus.
arXiv Detail & Related papers (2021-02-05T15:01:03Z)
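The auxiliary loss above attaches an extra CTC objective to an intermediate encoder layer and interpolates it with the final one; a minimal sketch (hypothetical weight):

```python
import torch.nn.functional as F

def intermediate_ctc_loss(final_lp, inter_lp, targets, in_lens, tgt_lens, w=0.3):
    """final_lp / inter_lp: (T, batch, vocab) CTC log-probs from the last
    and an intermediate encoder layer (through a shared softmax head)."""
    final = F.ctc_loss(final_lp, targets, in_lens, tgt_lens, blank=0)
    inter = F.ctc_loss(inter_lp, targets, in_lens, tgt_lens, blank=0)
    return (1 - w) * final + w * inter
```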
- CTC-synchronous Training for Monotonic Attention Model [43.0382262234792]
Backward probabilities cannot be leveraged in the alignment process during training due to the left-to-right dependency in the decoder.
We propose CTC-synchronous training (CTC-ST), in which MoChA uses CTC alignments to learn optimal monotonic alignments.
The entire model is jointly optimized so that the expected boundaries from MoChA are synchronized with the alignments.
arXiv Detail & Related papers (2020-05-10T16:48:23Z)