CUSIDE: Chunking, Simulating Future Context and Decoding for Streaming ASR
- URL: http://arxiv.org/abs/2203.16758v1
- Date: Thu, 31 Mar 2022 02:28:48 GMT
- Title: CUSIDE: Chunking, Simulating Future Context and Decoding for Streaming ASR
- Authors: Keyu An, Huahuan Zheng, Zhijian Ou, Hongyu Xiang, Ke Ding, and Guanglu Wan
- Abstract summary: We propose a new framework - Chunking, Simulating Future Context and Decoding (CUSIDE) for streaming speech recognition.
A new simulation module is introduced to simulate the future contextual frames, without waiting for future context.
Experiments show that, compared to using real future frames as right context, using simulated future context can drastically reduce latency while maintaining recognition accuracy.
- Score: 17.999404155015647
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: History and future contextual information are known to be important for accurate acoustic modeling. However, acquiring future context brings latency for streaming ASR. In this paper, we propose a new framework - Chunking, Simulating Future Context and Decoding (CUSIDE) for streaming speech recognition. A new simulation module is introduced to recursively simulate the future contextual frames, without waiting for future context. The simulation module is jointly trained with the ASR model using a self-supervised loss; the ASR model is optimized with the usual ASR loss, e.g., CTC-CRF as used in our experiments. Experiments show that, compared to using real future frames as right context, using simulated future context can drastically reduce latency while maintaining recognition accuracy. With CUSIDE, we obtain new state-of-the-art streaming ASR results on the AISHELL-1 dataset.
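To make the mechanism concrete, below is a minimal PyTorch sketch of the CUSIDE idea: each chunk is encoded together with simulated, rather than real, future frames, and the simulator is trained with a self-supervised reconstruction loss. All module names, dimensions, and the L1 loss are illustrative assumptions; the paper's actual architecture and its CTC-CRF loss are not reproduced here.

```python
# Minimal sketch of the CUSIDE idea: encode each chunk together with
# SIMULATED future frames instead of waiting for real ones. Module
# names, sizes, and the L1 simulation loss are illustrative
# assumptions; the paper's model uses CTC-CRF and its own architecture.
import torch
import torch.nn as nn

class FutureSimulator(nn.Module):
    """Predicts the next n_future feature frames from the current chunk."""
    def __init__(self, feat_dim=80, hidden=256, n_future=16):
        super().__init__()
        self.n_future = n_future
        self.rnn = nn.GRU(feat_dim, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, feat_dim * n_future)

    def forward(self, chunk):                        # chunk: (B, T, F)
        _, h = self.rnn(chunk)                       # h: (1, B, H)
        sim = self.proj(h[-1])                       # (B, F * n_future)
        return sim.view(chunk.size(0), self.n_future, -1)

class CusideStyleModel(nn.Module):
    def __init__(self, feat_dim=80, hidden=256, vocab=5000, n_future=16):
        super().__init__()
        self.simulator = FutureSimulator(feat_dim, hidden, n_future)
        self.encoder = nn.GRU(feat_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab)

    def forward(self, chunk, real_future=None):
        sim_future = self.simulator(chunk)           # (B, n_future, F)
        # Encode the chunk extended by the simulated right context.
        enc, _ = self.encoder(torch.cat([chunk, sim_future], dim=1))
        logits = self.out(enc[:, : chunk.size(1)])   # keep chunk frames only
        # Self-supervised simulation loss against the real future frames,
        # which are available at training time only.
        sim_loss = (nn.functional.l1_loss(sim_future, real_future)
                    if real_future is not None else None)
        return logits, sim_loss

model = CusideStyleModel()
chunk, future = torch.randn(2, 40, 80), torch.randn(2, 16, 80)
logits, sim_loss = model(chunk, future)   # training: ASR loss + sim_loss
logits, _ = model(chunk)                  # inference: no look-ahead needed
```

At inference time the simulator replaces real look-ahead entirely, which is where the latency saving comes from.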
Related papers
- DCTX-Conformer: Dynamic context carry-over for low latency unified streaming and non-streaming Conformer ASR [20.42366884075422]
We propose the integration of a novel dynamic contextual carry-over mechanism in a state-of-the-art unified ASR system.
Our proposed dynamic context Conformer (DCTX-Conformer) utilizes a non-overlapping contextual carry-over mechanism.
We outperform the SOTA by a relative 25.0% in word error rate, with a negligible latency impact from the additional context embeddings.
arXiv Detail & Related papers (2023-06-13T23:42:53Z)
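A hedged sketch of the contextual carry-over idea summarized in the entry above: a fixed set of context embeddings produced while encoding one chunk is fed back when encoding the next. All names, shapes, and the summary-query design are assumptions, not the DCTX-Conformer implementation.

```python
# Illustrative sketch of a contextual carry-over loop: a fixed number of
# context embeddings produced while encoding chunk t are fed back when
# encoding chunk t+1. Names, shapes, and the summary-query design are
# assumptions, not the DCTX-Conformer implementation.
import torch
import torch.nn as nn

class CarryOverEncoder(nn.Module):
    def __init__(self, dim=256, n_ctx=4, n_heads=4):
        super().__init__()
        self.n_ctx = n_ctx
        self.layer = nn.TransformerEncoderLayer(dim, n_heads,
                                                batch_first=True)
        # Learned queries that summarize a chunk into n_ctx embeddings.
        self.ctx_query = nn.Parameter(torch.randn(1, n_ctx, dim))

    def forward(self, chunk, ctx=None):              # chunk: (B, T, D)
        b, d = chunk.size(0), chunk.size(-1)
        if ctx is None:                              # first chunk: no history
            ctx = torch.zeros(b, self.n_ctx, d, device=chunk.device)
        q = self.ctx_query.expand(b, -1, -1)
        # Prepend carried-over context, append the summary queries.
        y = self.layer(torch.cat([ctx, chunk, q], dim=1))
        out = y[:, self.n_ctx : self.n_ctx + chunk.size(1)]
        new_ctx = y[:, -self.n_ctx:]                 # carry to next chunk
        return out, new_ctx

enc, ctx = CarryOverEncoder(), None
for chunk in torch.randn(5, 2, 32, 256):             # 5 chunks of 32 frames
    out, ctx = enc(chunk, ctx)                       # non-overlapping carry-over
```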
- A Lexical-aware Non-autoregressive Transformer-based ASR Model [9.500518278458905]
We propose a lexical-aware non-autoregressive Transformer-based (LA-NAT) ASR framework, which consists of an acoustic encoder, a speech-text shared encoder, and a speech-text shared decoder.
LA-NAT aims to make the ASR model aware of lexical information, so the resulting model is expected to achieve better results by leveraging the learned linguistic knowledge.
arXiv Detail & Related papers (2023-05-18T09:50:47Z)
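The module decomposition named in the entry above lends itself to a structural skeleton. The sketch below wires an acoustic encoder, a speech-text shared encoder, and a non-autoregressive shared decoder together; how the real LA-NAT fuses the two streams is an assumption made for illustration.

```python
# Structural skeleton only, following the three modules named above:
# an acoustic encoder, a speech-text shared encoder, and a shared
# non-autoregressive decoder. The fusion by concatenation is an
# illustrative choice, not the paper's design.
import torch
import torch.nn as nn

class LANATSkeleton(nn.Module):
    def __init__(self, feat_dim=80, dim=256, vocab=5000):
        super().__init__()
        self.acoustic_encoder = nn.GRU(feat_dim, dim, batch_first=True)
        self.text_embed = nn.Embedding(vocab, dim)
        layer = nn.TransformerEncoderLayer(dim, 4, batch_first=True)
        self.shared_encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.shared_decoder = nn.Linear(dim, vocab)  # NAR per-frame head

    def forward(self, speech, text_ids):
        a, _ = self.acoustic_encoder(speech)         # (B, T, D)
        t = self.text_embed(text_ids)                # (B, L, D)
        # One encoder sees both modalities, exposing lexical information
        # to the acoustic frames.
        fused = self.shared_encoder(torch.cat([a, t], dim=1))
        return self.shared_decoder(fused[:, : a.size(1)])
```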
- Synthetic Wave-Geometric Impulse Responses for Improved Speech Dereverberation [69.1351513309953]
We show that accurately simulating the low-frequency components of Room Impulse Responses (RIRs) is important to achieving good dereverberation.
We demonstrate that speech dereverberation models trained on hybrid synthetic RIRs outperform models trained on RIRs generated by prior geometric ray tracing methods.
arXiv Detail & Related papers (2022-12-10T20:15:23Z)
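The hybrid idea above can be sketched as a simple crossover: take the low band from a wave-based RIR (where wave simulation is accurate) and the high band from a ray-traced RIR. The crossover frequency and filter order below are illustrative assumptions.

```python
# Sketch of a hybrid RIR as a Butterworth crossover between a wave-based
# simulation (low band) and geometric ray tracing (high band). The
# crossover frequency and filter order are illustrative assumptions.
import numpy as np
from scipy.signal import butter, sosfilt

def hybrid_rir(wave_rir, geometric_rir, sr=16000, crossover_hz=500,
               order=4):
    """Combine two equal-length RIRs with a Butterworth crossover."""
    lo = butter(order, crossover_hz, btype="lowpass", fs=sr, output="sos")
    hi = butter(order, crossover_hz, btype="highpass", fs=sr, output="sos")
    return sosfilt(lo, wave_rir) + sosfilt(hi, geometric_rir)

rng = np.random.default_rng(0)                   # toy stand-in RIRs
rir = hybrid_rir(rng.standard_normal(4096), rng.standard_normal(4096))
```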
- SimOn: A Simple Framework for Online Temporal Action Localization [51.27476730635852]
We propose a framework, termed SimOn, that learns to predict action instances using the popular Transformer architecture.
Experimental results on the THUMOS14 and ActivityNet1.3 datasets show that our model remarkably outperforms the previous methods.
arXiv Detail & Related papers (2022-11-08T04:50:54Z)
- Streaming End-to-End ASR based on Blockwise Non-Autoregressive Models [57.20432226304683]
Non-autoregressive (NAR) modeling has attracted increasing attention in speech processing.
We propose a novel end-to-end streaming NAR speech recognition system.
We show that the proposed method improves online ASR recognition in low latency conditions.
arXiv Detail & Related papers (2021-07-20T11:42:26Z)
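A minimal sketch of the blockwise non-autoregressive streaming recipe described above: each incoming block is encoded (carrying recurrent state forward) and greedily decoded as soon as it arrives. The toy encoder, block size, and plain per-block CTC collapse are simplifying assumptions, not the paper's system.

```python
# Sketch of blockwise non-autoregressive streaming: each block is
# encoded with carried-over recurrent state and CTC-greedy-decoded
# immediately. The per-block collapse ignores merges across block
# boundaries; all components here are toy stand-ins.
import torch
import torch.nn as nn

BLANK = 0

def ctc_collapse(ids):
    """Remove repeats, then blanks (standard CTC greedy decoding)."""
    out, prev = [], None
    for i in ids:
        if i != prev and i != BLANK:
            out.append(i)
        prev = i
    return out

encoder = nn.GRU(80, 256, batch_first=True)      # toy streaming encoder
head = nn.Linear(256, 100)                       # toy vocabulary of 100

def stream(features, block=32):
    """features: (T, 80). Yields token ids block by block."""
    state = None
    for t in range(0, features.size(0), block):
        x = features[t : t + block].unsqueeze(0)     # (1, T_blk, 80)
        enc, state = encoder(x, state)               # carry RNN state
        ids = head(enc).argmax(-1).squeeze(0).tolist()
        yield ctc_collapse(ids)

for tokens in stream(torch.randn(128, 80)):
    print(tokens)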
- Multi-mode Transformer Transducer with Stochastic Future Context [53.005638503544866]
Multi-mode speech recognition models can process longer future context to achieve higher accuracy, and when the latency budget is not flexible, they can still achieve reliable accuracy.
We show that a Multi-mode ASR model rivals, if not surpasses, a set of competitive streaming baselines trained with different latency budgets.
arXiv Detail & Related papers (2021-06-17T18:42:11Z)
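The stochastic-future-context trick summarized above reduces to mask construction: at training time a right-context budget is sampled per batch and self-attention is restricted to it, so one model learns to operate under many latency budgets. The candidate budgets below are hypothetical.

```python
# Sketch of training with stochastic future context: sample a
# right-context budget per batch and mask self-attention accordingly.
# The candidate budgets are hypothetical.
import torch

def context_mask(T, right_ctx):
    """(T, T) boolean mask, True = frame i may attend to frame j.
    Full left context, at most right_ctx future frames."""
    idx = torch.arange(T)
    return idx.unsqueeze(0) <= idx.unsqueeze(1) + right_ctx

# Training: draw a random budget per batch, e.g. 0, 4, or 16 frames.
right_ctx = [0, 4, 16][int(torch.randint(0, 3, (1,)))]
mask = context_mask(T=8, right_ctx=right_ctx)
# Inference: fix right_ctx to the deployed latency budget.
```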
- Transformer-based ASR Incorporating Time-reduction Layer and Fine-tuning with Self-Knowledge Distillation [11.52842516726486]
We propose a Transformer-based ASR model that incorporates a time-reduction layer inside the Transformer encoder layers.
We also introduce a fine-tuning approach for pre-trained ASR models using self-knowledge distillation (S-KD) which further improves the performance of our ASR model.
With language model (LM) fusion, we achieve new state-of-the-art word error rate (WER) results for Transformer-based ASR models.
arXiv Detail & Related papers (2021-03-17T21:02:36Z)
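A time-reduction layer, as used in the entry above, is compact enough to sketch in full: every r adjacent frames are concatenated and projected back to the model dimension, shortening the sequence by a factor of r. The reduction factor and placement are illustrative.

```python
# Sketch of a time-reduction layer: concatenate every r adjacent frames
# and project back to the model dimension, shortening the sequence by a
# factor of r inside the encoder stack. The reduction factor is
# illustrative.
import torch
import torch.nn as nn

class TimeReduction(nn.Module):
    def __init__(self, dim=256, r=2):
        super().__init__()
        self.r = r
        self.proj = nn.Linear(dim * r, dim)

    def forward(self, x):                        # x: (B, T, D)
        b, t, d = x.shape
        t = (t // self.r) * self.r               # drop trailing frames
        x = x[:, :t].reshape(b, t // self.r, d * self.r)
        return self.proj(x)                      # (B, T // r, D)

print(TimeReduction()(torch.randn(2, 9, 256)).shape)  # (2, 4, 256)
```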
- Fine-tuning of Pre-trained End-to-end Speech Recognition with Generative Adversarial Networks [10.723935272906461]
Adversarial training of end-to-end (E2E) ASR systems using generative adversarial networks (GAN) has recently been explored.
We introduce a novel framework for fine-tuning a pre-trained ASR model using the GAN objective.
Our proposed approach outperforms baselines and conventional GAN-based adversarial models.
arXiv Detail & Related papers (2021-03-10T17:40:48Z)
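A heavily simplified sketch of GAN-based fine-tuning as described above, under the assumption that a discriminator scores encoder representations and the pre-trained ASR encoder is updated to fool it alongside its usual ASR loss; the paper's actual generator/discriminator design is not reproduced.

```python
# Heavily simplified GAN-style fine-tuning sketch. The discriminator
# target ("real" representations) and both architectures are
# hypothetical stand-ins, not the paper's recipe.
import torch
import torch.nn as nn

asr_encoder = nn.GRU(80, 256, batch_first=True)  # stands in for a
                                                 # pre-trained ASR encoder
discriminator = nn.Sequential(nn.Linear(256, 128), nn.ReLU(),
                              nn.Linear(128, 1))
bce = nn.BCEWithLogitsLoss()

feats = torch.randn(4, 50, 80)
target_repr = torch.randn(4, 50, 256)            # hypothetical "real" side
enc, _ = asr_encoder(feats)

# Discriminator step: real representations vs. ASR encoder outputs.
d_real = discriminator(target_repr)
d_fake = discriminator(enc.detach())
d_loss = (bce(d_real, torch.ones_like(d_real)) +
          bce(d_fake, torch.zeros_like(d_fake)))

# Generator step: adversarial term added to the usual ASR loss.
g_fake = discriminator(enc)
g_loss = bce(g_fake, torch.ones_like(g_fake))    # + ASR loss, omitted here
```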
- Dual-mode ASR: Unify and Improve Streaming ASR with Full-context Modeling [76.43479696760996]
We propose a unified framework, Dual-mode ASR, to train a single end-to-end ASR model with shared weights for both streaming and full-context speech recognition.
We show that the latency and accuracy of streaming ASR significantly benefit from weight sharing and joint training of full-context ASR.
arXiv Detail & Related papers (2020-10-12T21:12:56Z)
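The weight sharing described above can be sketched as two forward passes through one encoder: once with a causal (streaming) mask and once with full context, with the two ASR losses summed. Mask construction and loss weighting below are illustrative.

```python
# Sketch of dual-mode training: one shared encoder run twice per batch,
# once with a causal (streaming) mask and once with full context, and
# the two ASR losses summed. All sizes are toy values.
import torch
import torch.nn as nn

layer = nn.TransformerEncoderLayer(256, 4, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)  # shared weights
head = nn.Linear(256, 100)                            # toy output head

x = torch.randn(2, 30, 256)
T = x.size(1)
# True above the diagonal = future positions are blocked.
causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)

stream_logits = head(encoder(x, mask=causal))         # streaming mode
full_logits = head(encoder(x))                        # full-context mode
# loss = asr_loss(stream_logits, y) + asr_loss(full_logits, y)
```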
- Streaming automatic speech recognition with the transformer model [59.58318952000571]
We propose a transformer-based end-to-end ASR system for streaming ASR.
We apply time-restricted self-attention for the encoder and triggered attention for the encoder-decoder attention mechanism.
Our proposed streaming transformer architecture achieves 2.8% and 7.2% WER for the "clean" and "other" test data of LibriSpeech.
arXiv Detail & Related papers (2020-01-08T18:58:02Z)
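Time-restricted self-attention, as used in the encoder above, reduces to a banded attention mask: each frame may attend to at most `left` past and `right` future frames, bounding look-ahead latency. A minimal sketch follows; window sizes are illustrative, and triggered attention for the decoder is not shown.

```python
# Sketch of time-restricted self-attention as a banded mask: each frame
# attends to at most `left` past and `right` future frames. Window
# sizes are illustrative.
import torch

def time_restricted_mask(T, left=16, right=4):
    """(T, T) boolean mask, True = blocked, usable as a Transformer
    attention mask."""
    idx = torch.arange(T)
    rel = idx.unsqueeze(0) - idx.unsqueeze(1)    # rel[i, j] = j - i
    return (rel > right) | (rel < -left)

print(time_restricted_mask(8, left=2, right=1).int())
```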
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.