Dual-mode ASR: Unify and Improve Streaming ASR with Full-context Modeling
- URL: http://arxiv.org/abs/2010.06030v2
- Date: Wed, 27 Jan 2021 17:56:46 GMT
- Title: Dual-mode ASR: Unify and Improve Streaming ASR with Full-context Modeling
- Authors: Jiahui Yu, Wei Han, Anmol Gulati, Chung-Cheng Chiu, Bo Li, Tara N. Sainath, Yonghui Wu, Ruoming Pang
- Abstract summary: We propose a unified framework, Dual-mode ASR, to train a single end-to-end ASR model with shared weights for both streaming and full-context speech recognition.
We show that the latency and accuracy of streaming ASR significantly benefit from weight sharing and joint training of full-context ASR.
- Score: 76.43479696760996
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Streaming automatic speech recognition (ASR) aims to emit each hypothesized
word as quickly and accurately as possible, while full-context ASR waits for
the completion of a full speech utterance before emitting completed hypotheses.
In this work, we propose a unified framework, Dual-mode ASR, to train a single
end-to-end ASR model with shared weights for both streaming and full-context
speech recognition. We show that the latency and accuracy of streaming ASR
significantly benefit from weight sharing and joint training of full-context
ASR, especially with in-place knowledge distillation during training. The
Dual-mode ASR framework can be applied to recent state-of-the-art
convolution-based and transformer-based ASR networks. We present extensive
experiments with two state-of-the-art ASR networks, ContextNet and Conformer,
on two datasets, a widely used public dataset LibriSpeech and a large-scale
dataset MultiDomain. Experiments and ablation studies demonstrate that
Dual-mode ASR not only simplifies the workflow of training and deploying
streaming and full-context ASR models, but also significantly improves both
emission latency and recognition accuracy of streaming ASR. With Dual-mode ASR,
we achieve new state-of-the-art streaming ASR results on both LibriSpeech and
MultiDomain in terms of accuracy and latency.
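To make the training recipe concrete, here is a minimal sketch of a dual-mode training step in PyTorch. Everything here is our illustration, not the paper's code: `encoder(feats, streaming=...)` is an assumed interface for switching the shared weights between causal and full-context behavior, and cross-entropy stands in for the RNN-T loss used in the paper.

```python
import torch
import torch.nn.functional as F

def dual_mode_step(encoder, decoder, feats, targets, optimizer, kd_weight=1.0):
    """One dual-mode training step (hypothetical interface)."""
    # Full-context pass: the whole utterance is visible to the encoder.
    full_logits = decoder(encoder(feats, streaming=False))
    # Streaming pass: the same shared weights, causal convolution/attention only.
    stream_logits = decoder(encoder(feats, streaming=True))

    # Supervised losses for both modes (cross-entropy as a stand-in
    # for the transducer loss).
    loss_full = F.cross_entropy(full_logits.transpose(1, 2), targets)
    loss_stream = F.cross_entropy(stream_logits.transpose(1, 2), targets)

    # In-place knowledge distillation: the full-context mode teaches the
    # streaming mode within the same update; detach() stops gradients from
    # flowing back through the teacher.
    kd = F.kl_div(
        F.log_softmax(stream_logits, dim=-1),
        F.softmax(full_logits.detach(), dim=-1),
        reduction="batchmean",
    )

    loss = loss_full + loss_stream + kd_weight * kd
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The `detach()` on the teacher logits is what makes the distillation "in-place": no separate teacher model is trained or stored; the full-context mode of the very same network supervises its streaming mode.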
Related papers
- AnySR: Realizing Image Super-Resolution as Any-Scale, Any-Resource [84.74855803555677]
We introduce AnySR, which rebuilds existing arbitrary-scale SR methods into any-scale, any-resource implementations.
Our AnySR innovates in two ways: 1) treating arbitrary-scale tasks as any-resource implementations, reducing resource requirements for smaller scales without additional parameters; 2) enhancing any-scale performance in a feature-interweaving fashion.
Results show that AnySR implements SISR tasks in a more compute-efficient fashion while performing on par with existing arbitrary-scale SISR methods.
arXiv Detail & Related papers (2024-07-05T04:00:14Z) - Joint Optimization of Streaming and Non-Streaming Automatic Speech Recognition with Multi-Decoder and Knowledge Distillation [44.94458898538114]
We present joint optimization of streaming and non-streaming ASR based on multi-decoder and knowledge distillation.
Evaluation results show 2.6%-5.3% relative character error rate reductions (CERR) on CSJ for streaming ASR, and 8.3%-9.7% relative CERRs for non-streaming ASR within a single model.
arXiv Detail & Related papers (2024-05-22T10:17:30Z) - Exploring the Integration of Speech Separation and Recognition with Self-Supervised Learning Representation [83.36685075570232]
This work provides an insightful investigation of speech separation in reverberant and noisy-reverberant scenarios as an ASR front-end.
We explore multi-channel separation methods, mask-based beamforming and complex spectral mapping, as well as the best features to use in the ASR back-end model.
A proposed integration using TF-GridNet-based complex spectral mapping and WavLM-based SSLR achieves a 2.5% word error rate on the reverberant WHAMR! test set.
arXiv Detail & Related papers (2023-07-23T05:39:39Z) - Learning a Dual-Mode Speech Recognition Model via Self-Pruning [18.248552732790852]
This work aims to jointly learn a compact sparse on-device streaming ASR model, and a large dense server non-streaming model, in a single supernet.
We show that performing supernet training on both wav2vec 2.0 self-supervised learning and supervised ASR fine-tuning not only substantially improves the large non-streaming model, as shown in prior works, but also improves the compact sparse streaming model.
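As a rough illustration of the supernet idea, the sketch below shares one weight tensor between a dense (server, non-streaming) path and a magnitude-pruned sparse (on-device, streaming) path. The class name and the 50% sparsity are our assumptions, not the paper's recipe.

```python
import torch
import torch.nn as nn

class PrunableLinear(nn.Linear):
    """Linear layer shared by a dense supernet path and a sparse subnet path."""

    def forward(self, x, sparse=False, sparsity=0.5):
        if not sparse:
            return super().forward(x)  # dense server model
        # Self-pruning: drop the `sparsity` fraction of smallest-magnitude weights.
        k = int(self.weight.numel() * sparsity)
        threshold = self.weight.abs().flatten().kthvalue(k).values
        mask = (self.weight.abs() > threshold).float()
        return nn.functional.linear(x, self.weight * mask, self.bias)

layer = PrunableLinear(80, 256)
x = torch.randn(4, 80)
dense_out = layer(x)                # large non-streaming path
sparse_out = layer(x, sparse=True)  # compact streaming path, same weights
```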
arXiv Detail & Related papers (2022-07-25T05:03:13Z) - CUSIDE: Chunking, Simulating Future Context and Decoding for Streaming
ASR [17.999404155015647]
We propose a new framework - Chunking, Simulating Future Context and Decoding (CUSIDE) for streaming speech recognition.
A new simulation module is introduced to simulate future contextual frames, removing the need to wait for real future context.
Experiments show that, compared to using real future frames as right context, using simulated future context can drastically reduce latency while maintaining recognition accuracy.
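A toy version of the simulation idea is sketched below: a small recurrent module predicts a few future feature frames from the current chunk, and those simulated frames are appended as right context so the encoder never waits for real future input. Module names and sizes are our assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class FutureContextSimulator(nn.Module):
    """Toy module that predicts `n_future` feature frames from a chunk."""

    def __init__(self, feat_dim=80, hidden=256, n_future=4):
        super().__init__()
        self.n_future = n_future
        self.rnn = nn.GRU(feat_dim, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, feat_dim * n_future)

    def forward(self, chunk):                      # chunk: (B, T, feat_dim)
        _, h = self.rnn(chunk)                     # summarize the chunk
        future = self.proj(h[-1])                  # predict future frames
        future = future.view(chunk.size(0), self.n_future, -1)
        # Append simulated right context; the encoder sees it as look-ahead.
        return torch.cat([chunk, future], dim=1)

sim = FutureContextSimulator()
chunk = torch.randn(2, 40, 80)                     # one streaming chunk
extended = sim(chunk)                              # (2, 44, 80), no waiting
```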
arXiv Detail & Related papers (2022-03-31T02:28:48Z) - Attention-based Multi-hypothesis Fusion for Speech Summarization [83.04957603852571]
Speech summarization can be achieved by combining automatic speech recognition (ASR) with text summarization (TS).
ASR errors directly affect the quality of the output summary in the cascade approach.
We propose a cascade speech summarization model that is robust to ASR errors and that exploits multiple hypotheses generated by ASR to attenuate the effect of ASR errors on the summary.
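One simple way to realize such a fusion is to attention-pool the encodings of the N-best ASR hypotheses before summarization, so that hypotheses the model scores as unreliable contribute less. The sketch below is our own simplification, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class HypothesisFusion(nn.Module):
    """Attention-pool N-best hypothesis encodings into one summarizer input."""

    def __init__(self, d_model=256):
        super().__init__()
        self.score = nn.Linear(d_model, 1)

    def forward(self, hyp_encodings):              # (B, N, T, d_model)
        pooled = hyp_encodings.mean(dim=2)         # (B, N, d_model), per hypothesis
        weights = torch.softmax(self.score(pooled), dim=1)  # (B, N, 1)
        # Hypotheses with low attention weight (likely ASR errors) are attenuated.
        return (weights.unsqueeze(-1) * hyp_encodings).sum(dim=1)  # (B, T, d_model)

fusion = HypothesisFusion()
enc = torch.randn(2, 5, 30, 256)                   # encodings of 5-best hypotheses
fused = fusion(enc)                                # (2, 30, 256)
```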
arXiv Detail & Related papers (2021-11-16T03:00:29Z) - Dual Causal/Non-Causal Self-Attention for Streaming End-to-End Speech Recognition [58.69803243323346]
Attention-based end-to-end automatic speech recognition (ASR) systems have recently demonstrated state-of-the-art results for numerous tasks.
However, the application of self-attention and attention-based encoder-decoder models remains challenging for streaming ASR.
We present the dual causal/non-causal self-attention architecture, which, in contrast to restricted self-attention, prevents the overall context from growing beyond the look-ahead of a single layer.
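A simplified rendering of the idea: each layer computes two attention outputs from the same projections, one strictly causal and one with a fixed look-ahead, and keeps the two streams separate so look-ahead does not accumulate across layers. The function below is our sketch, not the paper's implementation.

```python
import torch

def dual_attention(q, k, v, lookahead=2):
    """Sketch of dual causal/non-causal self-attention (our simplification).

    Because the next layer's causal stream consumes only causal outputs,
    the total look-ahead stays fixed instead of growing with depth.
    """
    T = q.size(-2)
    idx = torch.arange(T)
    causal_mask = idx[None, :] <= idx[:, None]                  # j <= i
    noncausal_mask = idx[None, :] <= idx[:, None] + lookahead   # j <= i + L

    def attend(mask):
        scores = q @ k.transpose(-2, -1) / k.size(-1) ** 0.5
        scores = scores.masked_fill(~mask, float("-inf"))
        return scores.softmax(dim=-1) @ v

    return attend(causal_mask), attend(noncausal_mask)

q = k = v = torch.randn(1, 10, 64)
causal_out, noncausal_out = dual_attention(q, k, v)
```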
arXiv Detail & Related papers (2021-07-02T20:56:13Z) - Fine-tuning of Pre-trained End-to-end Speech Recognition with Generative Adversarial Networks [10.723935272906461]
Adversarial training of end-to-end (E2E) ASR systems using generative adversarial networks (GAN) has recently been explored.
We introduce a novel framework for fine-tuning a pre-trained ASR model using the GAN objective.
Our proposed approach outperforms baselines and conventional GAN-based adversarial models.
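In spirit, such a fine-tuning objective pairs a discriminator that separates reference transcriptions from ASR hypotheses with a generator loss that pushes the ASR model to fool it. The sketch below uses standard non-saturating GAN losses on transcription embeddings; the interface and shapes are our assumptions, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def gan_losses(disc, real_emb, fake_emb):
    """GAN losses over transcription embeddings (hypothetical interface).

    `disc` maps an embedding (B, d) to a realness logit (B, 1); `real_emb`
    comes from ground-truth text, `fake_emb` from ASR hypotheses. In the
    discriminator update, `fake_emb` would be detached from the ASR graph.
    """
    real_logit, fake_logit = disc(real_emb), disc(fake_emb)
    # Discriminator: tell reference transcripts from ASR outputs.
    d_loss = (
        F.binary_cross_entropy_with_logits(real_logit, torch.ones_like(real_logit))
        + F.binary_cross_entropy_with_logits(fake_logit, torch.zeros_like(fake_logit))
    )
    # Generator (the ASR model): make hypotheses indistinguishable from references.
    g_loss = F.binary_cross_entropy_with_logits(fake_logit, torch.ones_like(fake_logit))
    return d_loss, g_loss

disc = torch.nn.Linear(256, 1)
d_loss, g_loss = gan_losses(disc, torch.randn(4, 256), torch.randn(4, 256))
```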
arXiv Detail & Related papers (2021-03-10T17:40:48Z) - Improving RNN Transducer Based ASR with Auxiliary Tasks [21.60022481898402]
End-to-end automatic speech recognition (ASR) models with a single neural network have recently demonstrated state-of-the-art results.
In this work, we examine ways in which recurrent neural network transducer (RNN-T) can achieve better ASR accuracy via performing auxiliary tasks.
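A common instance of such an auxiliary task, shown below purely as an illustration, is an extra CTC head attached to an intermediate encoder layer; its loss is added to the transducer loss during training, and the head is dropped at inference.

```python
import torch
import torch.nn as nn

class EncoderWithAuxCTC(nn.Module):
    """RNN-T style encoder with an auxiliary CTC head on a middle layer.

    The auxiliary head (an assumption for illustration) regularizes the
    encoder during training; only the transducer path is used at inference.
    """

    def __init__(self, feat_dim=80, hidden=320, vocab=100):
        super().__init__()
        self.lower = nn.LSTM(feat_dim, hidden, num_layers=2, batch_first=True)
        self.upper = nn.LSTM(hidden, hidden, num_layers=2, batch_first=True)
        self.aux_ctc = nn.Linear(hidden, vocab)    # auxiliary head

    def forward(self, x):
        mid, _ = self.lower(x)
        top, _ = self.upper(mid)
        return top, self.aux_ctc(mid).log_softmax(-1)

enc = EncoderWithAuxCTC()
top, aux_logp = enc(torch.randn(2, 50, 80))
# total_loss = rnnt_loss(...) + aux_weight * ctc_loss(aux_logp, ...)
```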
arXiv Detail & Related papers (2020-11-05T21:46:32Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.