On Comparison of Encoders for Attention based End to End Speech
Recognition in Standalone and Rescoring Mode
- URL: http://arxiv.org/abs/2206.12829v1
- Date: Sun, 26 Jun 2022 09:12:27 GMT
- Title: On Comparison of Encoders for Attention based End to End Speech
Recognition in Standalone and Rescoring Mode
- Authors: Raviraj Joshi, Subodh Kumar
- Abstract summary: Non-streaming models provide better performance as they look at the entire audio context.
We show that the Transformer model offers acceptable WER with the lowest latency requirements.
We highlight the importance of a CNN front-end with the Transformer architecture for achieving comparable word error rates (WER).
- Score: 1.7704011486040847
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Streaming automatic speech recognition (ASR) models are popular and
well suited to voice-based applications. However, non-streaming models provide
better performance as they look at the entire audio context. To leverage the
benefits of a non-streaming model in streaming applications like voice
search, it is commonly used in second-pass re-scoring mode: the candidate
hypotheses generated by streaming models are re-scored using a non-streaming
model.
model. In this work, we evaluate the non-streaming attention-based end-to-end
ASR models on the Flipkart voice search task in both standalone and re-scoring
modes. These models are based on Listen-Attend-Spell (LAS) encoder-decoder
architecture. We experiment with different encoder variations based on LSTM,
Transformer, and Conformer. We compare the latency requirements of these models
along with their performance. Overall, we show that the Transformer model offers
an acceptable word error rate (WER) with the lowest latency requirements. We
report a relative WER improvement of around 16% with second-pass LAS re-scoring,
at a latency overhead under 5 ms. We also highlight the importance of a CNN
front-end with the Transformer architecture for achieving comparable WERs.
Moreover, we observe that in second-pass re-scoring mode all the encoders
provide similar benefits, whereas the differences in performance are prominent
in standalone text-generation mode.
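To make the pipeline concrete, here is a minimal PyTorch sketch of second-pass rescoring with a non-streaming encoder-decoder, including a CNN front-end of the kind the abstract emphasizes. Everything here (class names, layer sizes, the vocabulary, and the interpolation weight `lam`) is an illustrative assumption, not the paper's implementation; positional encodings and padding handling are omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvFrontEnd(nn.Module):
    """Two strided convolutions: ~4x time subsampling before the Transformer."""
    def __init__(self, d_model: int = 256, feat_dim: int = 80):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.proj = nn.Linear(32 * ((feat_dim + 3) // 4), d_model)

    def forward(self, feats):                       # feats: (B, T, feat_dim)
        x = self.conv(feats.unsqueeze(1))           # (B, 32, ~T/4, ~feat_dim/4)
        b, c, t, f = x.shape
        return self.proj(x.permute(0, 2, 1, 3).reshape(b, t, c * f))

class NonStreamingEncoder(nn.Module):
    """CNN front-end + full-context Transformer encoder (no streaming mask)."""
    def __init__(self, d_model: int = 256):
        super().__init__()
        self.front = ConvFrontEnd(d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.enc = nn.TransformerEncoder(layer, num_layers=4)

    def forward(self, feats):
        return self.enc(self.front(feats))          # (B, ~T/4, d_model)

class TinyLASDecoder(nn.Module):
    """Attention decoder used here only to score sequences via teacher forcing."""
    def __init__(self, vocab: int, d_model: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
        self.dec = nn.TransformerDecoder(layer, num_layers=2)
        self.out = nn.Linear(d_model, vocab)

    def forward(self, ys, enc):                     # ys: (B, U), enc: (B, T', d)
        mask = nn.Transformer.generate_square_subsequent_mask(ys.size(1))
        return self.out(self.dec(self.embed(ys), enc, tgt_mask=mask))

def rescore(encoder, decoder, feats, hyps, lam=0.5):
    """Score each first-pass hypothesis with the non-streaming model and
    interpolate with its first-pass score; return the best token sequence."""
    enc = encoder(feats)
    best, best_score = None, float("-inf")
    for tokens, first_pass_score in hyps:           # tokens include BOS/EOS ids
        ys = torch.tensor(tokens).unsqueeze(0)      # (1, U)
        logp = F.log_softmax(decoder(ys[:, :-1], enc), dim=-1)
        las = logp.gather(-1, ys[:, 1:].unsqueeze(-1)).sum().item()
        score = lam * first_pass_score + (1 - lam) * las
        if score > best_score:
            best, best_score = tokens, score
    return best

# Toy usage: two fake hypotheses for 1 s of 80-dim log-mel features.
enc, dec = NonStreamingEncoder(), TinyLASDecoder(vocab=1000)
feats = torch.randn(1, 100, 80)
hyps = [([1, 42, 7, 2], -3.2), ([1, 42, 9, 2], -3.5)]  # (ids, first-pass log-prob)
print(rescore(enc, dec, feats, hyps))
```

Swapping the `NonStreamingEncoder` body for an LSTM or Conformer stack is the kind of encoder variation the paper compares; the rescoring loop itself is unchanged.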
Related papers
- Frieren: Efficient Video-to-Audio Generation Network with Rectified Flow Matching [51.70360630470263]
Video-to-audio (V2A) generation aims to synthesize content-matching audio from silent video.
We propose Frieren, a V2A model based on rectified flow matching.
Experiments indicate that Frieren achieves state-of-the-art performance in both generation quality and temporal alignment.
arXiv Detail & Related papers (2024-06-01T06:40:22Z)
- Joint Audio/Text Training for Transformer Rescorer of Streaming Speech Recognition [13.542483062256109]
We present our Joint Audio/Text training method for Transformer Rescorer.
Our training method can improve word error rate (WER) significantly compared to standard Transformer Rescorer.
arXiv Detail & Related papers (2022-10-31T22:38:28Z)
- FastLTS: Non-Autoregressive End-to-End Unconstrained Lip-to-Speech Synthesis [77.06890315052563]
We propose FastLTS, a non-autoregressive end-to-end model that directly synthesizes high-quality speech audio from unconstrained talking videos with low latency.
Experiments show that our model achieves a $19.76\times$ speedup for audio generation compared with the current autoregressive model on input sequences of 3 seconds.
arXiv Detail & Related papers (2022-07-08T10:10:39Z)
- Unified Streaming and Non-streaming Two-pass End-to-end Model for Speech Recognition [19.971343876930767]
We present a novel two-pass approach to unify streaming and non-streaming end-to-end (E2E) speech recognition in a single model.
Our model adopts the hybrid CTC/attention architecture, in which the conformer layers in the encoder are modified.
Experiments on the open 170-hour AISHELL-1 dataset show that the proposed method unifies streaming and non-streaming models simply and efficiently.
arXiv Detail & Related papers (2020-12-10T06:54:54Z)
- Cascaded encoders for unifying streaming and non-streaming ASR [68.62941009369125]
This work presents cascaded encoders for building a single E2E ASR model that can operate in both these modes simultaneously.
A single decoder then learns to decode using the output of either the streaming or the non-streaming encoder (a minimal sketch of this cascade follows this entry).
Results show that this model achieves similar word error rates (WER) as a standalone streaming model when operating in streaming mode, and obtains a 10-27% relative improvement when operating in non-streaming mode.
arXiv Detail & Related papers (2020-10-27T20:59:50Z)
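The cascade described above is easy to sketch: a causal encoder serves streaming requests, a few extra full-context layers refine its output for the second pass, and one decoder reads either. The sketch below uses generic PyTorch Transformer layers; the class name, layer counts, and sizes are illustrative assumptions, not the paper's configuration (a real streamable encoder would also need causal convolutions and chunked inference, not just a mask).

```python
import torch
import torch.nn as nn

class CascadedEncoder(nn.Module):
    """Causal (streamable) encoder + extra full-context layers; a single
    decoder can consume either output, as in the cascaded-encoder idea."""
    def __init__(self, d_model: int = 256, nhead: int = 4):
        super().__init__()
        mk = lambda: nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.causal = nn.TransformerEncoder(mk(), num_layers=4)
        self.extra = nn.TransformerEncoder(mk(), num_layers=2)

    def forward(self, x, streaming: bool):          # x: (B, T, d_model)
        T = x.size(1)
        no_future = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        h = self.causal(x, mask=no_future)          # left context only
        return h if streaming else self.extra(h)   # full context in 2nd pass

enc = CascadedEncoder()
x = torch.randn(2, 50, 256)                         # already-subsampled features
h_stream = enc(x, streaming=True)                   # low-latency first pass
h_full = enc(x, streaming=False)                    # higher-accuracy second pass
```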
- Transformer Transducer: One Model Unifying Streaming and Non-streaming Speech Recognition [16.082949461807335]
We present a Transformer-Transducer model architecture and a training technique to unify streaming and non-streaming speech recognition models into one model.
We show that we can run this model in a Y-model architecture with the top layers running in parallel in low latency and high latency modes.
This allows us to have streaming speech recognition results with limited latency and delayed speech recognition results with large improvements in accuracy.
arXiv Detail & Related papers (2020-10-07T05:58:28Z)
- A Streaming On-Device End-to-End Model Surpassing Server-Side Conventional Model Quality and Latency [88.08721721440429]
We develop a first-pass Recurrent Neural Network Transducer (RNN-T) model and a second-pass Listen, Attend, Spell (LAS) rescorer.
We find that RNN-T+LAS offers a better WER and latency tradeoff compared to a conventional model.
arXiv Detail & Related papers (2020-03-28T05:00:33Z)
- Deliberation Model Based Two-Pass End-to-End Speech Recognition [52.45841282906516]
A two-pass model has been proposed to rescore streamed hypotheses using the non-streaming Listen, Attend and Spell (LAS) model.
The model attends to acoustics to rescore hypotheses, as opposed to a class of neural correction models that use only first-pass text hypotheses.
A bidirectional encoder is used to extract context information from first-pass hypotheses (a toy sketch of the two-source attention follows this entry).
arXiv Detail & Related papers (2020-03-17T22:01:12Z)
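The deliberation idea of attending to both acoustics and first-pass text fits in a few lines. Below is a toy PyTorch sketch of one decoder step attending jointly to the acoustic encoding and to an encoding of the first-pass hypothesis; the class name, fusion scheme, and dimensions are hypothetical, not the paper's code.

```python
import torch
import torch.nn as nn

class DeliberationAttention(nn.Module):
    """One decoder step with two attention sources: the acoustic encoding and a
    (bidirectionally encoded) first-pass hypothesis, fused by a linear layer."""
    def __init__(self, d_model: int = 256, nhead: int = 4):
        super().__init__()
        self.att_acoustic = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.att_hyp = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.merge = nn.Linear(2 * d_model, d_model)

    def forward(self, dec_state, acoustic_enc, hyp_enc):
        a, _ = self.att_acoustic(dec_state, acoustic_enc, acoustic_enc)
        h, _ = self.att_hyp(dec_state, hyp_enc, hyp_enc)
        return self.merge(torch.cat([a, h], dim=-1))   # fused context vector

step = DeliberationAttention()
dec_state = torch.randn(1, 1, 256)     # current decoder query
acoustics = torch.randn(1, 120, 256)   # acoustic encoder output
first_pass = torch.randn(1, 8, 256)    # encoded first-pass hypothesis tokens
ctx = step(dec_state, acoustics, first_pass)           # (1, 1, 256)
```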
- Streaming automatic speech recognition with the transformer model [59.58318952000571]
We propose a transformer-based end-to-end ASR system for streaming recognition.
We apply time-restricted self-attention in the encoder and triggered attention for the encoder-decoder attention mechanism (a mask-construction sketch follows this entry).
Our proposed streaming transformer architecture achieves 2.8% and 7.2% WER for the "clean" and "other" test data of LibriSpeech.
arXiv Detail & Related papers (2020-01-08T18:58:02Z)
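Time-restricted self-attention can be realized as a plain attention mask. Here is a small PyTorch sketch under assumed window sizes (triggered attention is not shown): each frame may attend only to a bounded window of past and future frames, which caps the encoder's lookahead and hence its latency.

```python
import torch

def time_restricted_mask(T: int, left: int, right: int) -> torch.Tensor:
    """Boolean self-attention mask (True = blocked): frame i may attend only to
    frames j with i - left <= j <= i + right, bounding lookahead and latency."""
    idx = torch.arange(T)
    rel = idx.unsqueeze(0) - idx.unsqueeze(1)   # rel[i, j] = j - i
    return (rel < -left) | (rel > right)

# Example: 8 frames of left context, 2 frames of lookahead; pass the result as
# `mask=` to torch.nn.TransformerEncoder (True entries are disallowed).
mask = time_restricted_mask(T=6, left=8, right=2)
print(mask)
```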