DCTX-Conformer: Dynamic context carry-over for low latency unified
streaming and non-streaming Conformer ASR
- URL: http://arxiv.org/abs/2306.08175v2
- Date: Fri, 1 Mar 2024 21:25:16 GMT
- Title: DCTX-Conformer: Dynamic context carry-over for low latency unified
streaming and non-streaming Conformer ASR
- Authors: Goeric Huybrechts, Srikanth Ronanki, Xilai Li, Hadis Nosrati, Sravan
Bodapati, Katrin Kirchhoff
- Abstract summary: We propose the integration of a novel dynamic contextual carry-over mechanism in a state-of-the-art unified ASR system.
Our proposed dynamic context Conformer (DCTX-Conformer) utilizes a non-overlapping contextual carry-over mechanism.
We outperform the SOTA by a relative 25.0% reduction in word error rate, with a negligible latency impact due to the additional context embeddings.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Conformer-based end-to-end models have become ubiquitous these days and are
commonly used in both streaming and non-streaming automatic speech recognition
(ASR). Techniques like dual-mode and dynamic chunk training helped unify
streaming and non-streaming systems. However, a performance gap remains between
streaming with a full past context and streaming with a limited one. To address this issue,
we propose the integration of a novel dynamic contextual carry-over mechanism
in a state-of-the-art (SOTA) unified ASR system. Our proposed dynamic context
Conformer (DCTX-Conformer) utilizes a non-overlapping contextual carry-over
mechanism that takes into account both the left context of a chunk and one or
more preceding context embeddings. We outperform the SOTA by a relative 25.0%
reduction in word error rate, with a negligible latency impact due to the additional context
embeddings.
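As a rough illustration of chunk-wise carry-over, the sketch below (plain NumPy, single-head attention) lets each chunk attend to its limited left context plus a few carried-over context embeddings that summarize older chunks. The mean-pooling summarizer and all function and parameter names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(queries, keys, values):
    # Single-head scaled dot-product attention.
    scores = queries @ keys.T / np.sqrt(queries.shape[-1])
    return softmax(scores) @ values

def stream_with_context_carry_over(frames, chunk_size=4, left_context=4, num_ctx=2):
    """Encode `frames` chunk by chunk. Each chunk attends to itself, a limited
    left context, and up to `num_ctx` carried-over context embeddings that
    summarize everything older than the left context (illustrative summarizer:
    mean pooling of the previous chunk's outputs)."""
    ctx_embeddings = []  # dynamic context carried across chunk boundaries
    outputs = []
    for start in range(0, len(frames), chunk_size):
        chunk = frames[start:start + chunk_size]
        left = frames[max(0, start - left_context):start]
        kv_parts = ([np.stack(ctx_embeddings)] if ctx_embeddings else []) + [left, chunk]
        kv = np.concatenate(kv_parts)
        outputs.append(attend(chunk, kv, kv))
        # Summarize this chunk into one embedding, keeping only the most recent few.
        ctx_embeddings = (ctx_embeddings + [outputs[-1].mean(axis=0)])[-num_ctx:]
    return np.concatenate(outputs)

encoded = stream_with_context_carry_over(np.random.randn(16, 8))
print(encoded.shape)  # (16, 8)
```

The carried embeddings give each chunk an approximate view of the distant past at roughly constant cost, which is the gap between full and limited past context that the paper targets.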
Related papers
- Streaming Sequence Transduction through Dynamic Compression [55.0083843520833]
We introduce STAR (Stream Transduction with Anchor Representations), a novel Transformer-based model designed for efficient sequence-to-sequence transduction over streams.
STAR dynamically segments input streams to create compressed anchor representations, achieving nearly lossless compression (12x) in Automatic Speech Recognition (ASR).
STAR demonstrates superior segmentation and latency-quality trade-offs in simultaneous speech-to-text tasks, optimizing latency, memory footprint, and quality.
arXiv Detail & Related papers (2024-02-02T06:31:50Z)
- CUSIDE: Chunking, Simulating Future Context and Decoding for Streaming ASR [17.999404155015647]
We propose a new framework - Chunking, Simulating Future Context and Decoding (CUSIDE) for streaming speech recognition.
A new simulation module is introduced to simulate the future contextual frames, without waiting for future context.
Experiments show that, compared to using real future frames as right context, using simulated future context can drastically reduce latency while maintaining recognition accuracy.
arXiv Detail & Related papers (2022-03-31T02:28:48Z)
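A rough sketch of the CUSIDE idea of simulating future context, assuming a random linear map as a stand-in for the paper's trained simulation module; names, shapes, and the placeholder encoder are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
feat_dim, sim_frames, chunk_size = 8, 2, 4
# Stand-in for CUSIDE's simulation module: a random linear map from the last
# observed frame to `sim_frames` pseudo future frames (the paper trains a
# dedicated network; the weights here are placeholders).
W_sim = rng.standard_normal((feat_dim, feat_dim * sim_frames)) * 0.1

def simulate_future(chunk):
    # Predict pseudo right-context frames from the chunk's last frame.
    return (chunk[-1] @ W_sim).reshape(sim_frames, feat_dim)

def encode_chunk(chunk_with_context):
    # Placeholder for a real streaming encoder.
    return np.tanh(chunk_with_context)

frames = rng.standard_normal((12, feat_dim))
outputs = []
for start in range(0, len(frames), chunk_size):
    chunk = frames[start:start + chunk_size]
    # Append simulated future context so decoding need not wait for real frames.
    extended = np.concatenate([chunk, simulate_future(chunk)])
    outputs.append(encode_chunk(extended)[:len(chunk)])  # keep outputs for real frames only
encoded = np.concatenate(outputs)
```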
- End-to-End Active Speaker Detection [58.7097258722291]
We propose an end-to-end training network where feature learning and contextual predictions are jointly learned.
We also introduce intertemporal graph neural network (iGNN) blocks, which split the message passing according to the main sources of context in the ASD problem.
Experiments show that the aggregated features from the iGNN blocks are more suitable for ASD, resulting in state-of-the-art performance.
arXiv Detail & Related papers (2022-03-27T08:55:28Z)
- Streaming End-to-End ASR based on Blockwise Non-Autoregressive Models [57.20432226304683]
Non-autoregressive (NAR) modeling has attracted increasing attention in speech processing.
We propose a novel end-to-end streaming NAR speech recognition system.
We show that the proposed method improves online ASR recognition in low latency conditions.
arXiv Detail & Related papers (2021-07-20T11:42:26Z)
- Dual Causal/Non-Causal Self-Attention for Streaming End-to-End Speech Recognition [58.69803243323346]
Attention-based end-to-end automatic speech recognition (ASR) systems have recently demonstrated state-of-the-art results for numerous tasks.
However, the application of self-attention and attention-based encoder-decoder models remains challenging for streaming ASR.
We present the dual causal/non-causal self-attention architecture, which in contrast to restricted self-attention prevents the overall context from growing beyond the look-ahead of a single layer.
arXiv Detail & Related papers (2021-07-02T20:56:13Z)
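The contrast with look-ahead-based restricted self-attention can be illustrated with masks. The toy sketch below (plain NumPy, not the paper's dual-stream architecture) applies a strictly causal mask and a one-frame look-ahead mask to the same frames; all names and window sizes are assumptions.

```python
import numpy as np

def attention_mask(num_frames, look_ahead):
    # mask[i, j] is True where frame i may attend to frame j.
    idx = np.arange(num_frames)
    return idx[None, :] <= idx[:, None] + look_ahead

def masked_attention(x, mask):
    scores = x @ x.T / np.sqrt(x.shape[-1])
    scores = np.where(mask, scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x

x = np.random.randn(6, 8)
causal_out = masked_attention(x, attention_mask(6, look_ahead=0))     # strictly causal stream
lookahead_out = masked_attention(x, attention_mask(6, look_ahead=1))  # limited look-ahead stream
# Per the abstract, the paper combines causal and limited look-ahead streams so that
# the overall context does not grow beyond a single layer's look-ahead; only the
# masks themselves are shown here.
```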
- Dual-mode ASR: Unify and Improve Streaming ASR with Full-context Modeling [76.43479696760996]
We propose a unified framework, Dual-mode ASR, to train a single end-to-end ASR model with shared weights for both streaming and full-context speech recognition.
We show that the latency and accuracy of streaming ASR significantly benefit from weight sharing and joint training of full-context ASR.
arXiv Detail & Related papers (2020-10-12T21:12:56Z)
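A toy sketch of the Dual-mode ASR weight-sharing idea, assuming a placeholder encoder and a squared-error objective in place of the paper's ASR losses; everything named here is illustrative. The same parameters are run once in streaming mode and once with full context, and the two losses are combined.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 8)) * 0.1  # one set of weights shared by both modes

def encode(frames, streaming):
    """Toy stand-in for an ASR encoder: streaming mode sees only past frames,
    full-context mode sees the whole utterance, with the same weights."""
    outputs = []
    for t in range(len(frames)):
        context = frames[: t + 1] if streaming else frames
        outputs.append(np.tanh(context.mean(axis=0) @ W))
    return np.stack(outputs)

frames = rng.standard_normal((10, 8))
targets = rng.standard_normal((10, 8))  # placeholder supervision

# Joint training: every batch is scored in both modes with shared weights, so the
# streaming mode benefits from the training signal of the full-context pass.
loss_streaming = np.mean((encode(frames, streaming=True) - targets) ** 2)
loss_full = np.mean((encode(frames, streaming=False) - targets) ** 2)
loss = 0.5 * loss_streaming + 0.5 * loss_full
```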
- Conv-Transformer Transducer: Low Latency, Low Frame Rate, Streamable End-to-End Speech Recognition [8.046120977786702]
Transformer has achieved competitive performance against state-of-the-art end-to-end models in automatic speech recognition (ASR).
The original Transformer, with encoder-decoder architecture, is only suitable for offline ASR.
We show that this architecture, named Conv-Transformer Transducer, achieves competitive performance on LibriSpeech dataset (3.6% WER on test-clean) without external language models.
arXiv Detail & Related papers (2020-08-13T08:20:02Z)
- Streaming automatic speech recognition with the transformer model [59.58318952000571]
We propose a transformer based end-to-end ASR system for streaming ASR.
We apply time-restricted self-attention for the encoder and triggered attention for the encoder-decoder attention mechanism.
Our proposed streaming transformer architecture achieves 2.8% and 7.2% WER for the "clean" and "other" test data of LibriSpeech.
arXiv Detail & Related papers (2020-01-08T18:58:02Z)
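A minimal sketch of time-restricted self-attention as used in that encoder, assuming single-head attention and illustrative window sizes (the triggered attention used on the decoder side is not shown): each frame attends only within a fixed past/future window, which bounds the look-ahead latency.

```python
import numpy as np

def time_restricted_mask(num_frames, left=6, right=2):
    # Each frame may attend only to a bounded window of past and future frames,
    # which caps the encoder's look-ahead and hence its algorithmic latency.
    idx = np.arange(num_frames)
    offset = idx[None, :] - idx[:, None]  # position of key relative to query
    return (offset >= -left) & (offset <= right)

def self_attention(x, mask):
    scores = x @ x.T / np.sqrt(x.shape[-1])
    scores = np.where(mask, scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x

frames = np.random.randn(12, 8)
encoded = self_attention(frames, time_restricted_mask(len(frames)))
```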