Joint Audio/Text Training for Transformer Rescorer of Streaming Speech Recognition
- URL: http://arxiv.org/abs/2211.00174v1
- Date: Mon, 31 Oct 2022 22:38:28 GMT
- Title: Joint Audio/Text Training for Transformer Rescorer of Streaming Speech Recognition
- Authors: Suyoun Kim, Ke Li, Lucas Kabela, Rongqing Huang, Jiedan Zhu, Ozlem Kalinli, Duc Le
- Abstract summary: We present our Joint Audio/Text training method for Transformer Rescorer.
Our training method can improve word error rate (WER) significantly compared to the standard Transformer Rescorer.
- Score: 13.542483062256109
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, there has been increasing interest in two-pass streaming
end-to-end speech recognition (ASR), which adds a 2nd-pass rescoring model on
top of a conventional 1st-pass streaming ASR model to improve recognition
accuracy while keeping latency low. One of the latest 2nd-pass rescoring
models, the Transformer Rescorer, takes the n-best initial outputs and audio
embeddings from the 1st-pass model and then chooses the best output by
re-scoring those n-best initial outputs. However, training this Transformer
Rescorer requires expensive paired audio-text training data because the model
uses audio embeddings as input. In this work, we present our Joint Audio/Text
training method for the Transformer Rescorer, which leverages unpaired
text-only data that is much cheaper than paired audio-text data. We evaluate
the Transformer Rescorer with our Joint Audio/Text training on the LibriSpeech
dataset as well as our large-scale in-house dataset, and show that our
training method improves word error rate (WER) significantly compared to the
standard Transformer Rescorer without requiring any extra model parameters or
added latency.
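As a rough illustration of the mechanism above, the sketch below re-scores
n-best hypotheses with a small Transformer decoder that cross-attends to
first-pass audio embeddings, then runs one text-only training step in which a
learned placeholder stands in for the missing audio embeddings. This is a
minimal sketch under stated assumptions, not the authors' implementation: the
module sizes, the placeholder mechanism, and the training details are all
illustrative.

```python
# Minimal sketch (not the paper's code) of second-pass rescoring plus a
# joint audio/text training step. Dimensions, the learned placeholder for
# text-only batches, and the loss are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TransformerRescorer(nn.Module):
    def __init__(self, vocab_size=1000, d_model=256, nhead=4, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.out = nn.Linear(d_model, vocab_size)
        # Learned stand-in for audio embeddings on text-only batches
        # (an assumed mechanism for joint audio/text training).
        self.text_only_memory = nn.Parameter(torch.zeros(1, 1, d_model))

    def forward(self, tokens, audio_emb=None):
        # tokens: (batch, seq); audio_emb: (batch, frames, d_model) or None
        if audio_emb is None:
            audio_emb = self.text_only_memory.expand(tokens.size(0), -1, -1)
        x = self.embed(tokens)
        n = tokens.size(1)
        causal = torch.triu(torch.full((n, n), float("-inf")), diagonal=1)
        h = self.decoder(x, audio_emb, tgt_mask=causal)
        return self.out(h)  # (batch, seq, vocab)

def hypothesis_log_prob(model, tokens, audio_emb):
    """Sum of token log-probabilities, used to re-rank one n-best hypothesis."""
    logits = model(tokens[:, :-1], audio_emb)
    logp = F.log_softmax(logits, dim=-1)
    return logp.gather(-1, tokens[:, 1:].unsqueeze(-1)).squeeze(-1).sum(dim=-1)

model = TransformerRescorer()

# Rescoring: score each first-pass hypothesis and keep the best one.
audio_emb = torch.randn(1, 50, 256)  # stand-in for 1st-pass audio embeddings
nbest = [torch.randint(0, 1000, (1, 12)) for _ in range(4)]
scores = [hypothesis_log_prob(model, hyp, audio_emb).item() for hyp in nbest]
best = nbest[max(range(len(nbest)), key=lambda i: scores[i])]

# Text-only training step: the placeholder replaces audio embeddings, so
# cheap unpaired text still updates the rescorer's token probabilities.
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
text_only = torch.randint(0, 1000, (8, 12))
logits = model(text_only[:, :-1], audio_emb=None)
loss = F.cross_entropy(logits.reshape(-1, 1000), text_only[:, 1:].reshape(-1))
opt.zero_grad()
loss.backward()
opt.step()
```

Because paired and text-only batches share the same decoder and loss in this
sketch, the text-only data adds no model parameters and no inference-time
cost, consistent with the abstract's claim.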
Related papers
- Frieren: Efficient Video-to-Audio Generation Network with Rectified Flow Matching [51.70360630470263]
Video-to-audio (V2A) generation aims to synthesize content-matching audio from silent video.
We propose Frieren, a V2A model based on rectified flow matching.
Experiments indicate that Frieren achieves state-of-the-art performance in both generation quality and temporal alignment.
arXiv Detail & Related papers (2024-06-01T06:40:22Z)
- Token-Level Serialized Output Training for Joint Streaming ASR and ST Leveraging Textual Alignments [49.38965743465124]
This paper introduces a streaming Transformer-Transducer that jointly generates automatic speech recognition (ASR) and speech translation (ST) outputs using a single decoder.
Experiments in monolingual and multilingual settings demonstrate that our approach achieves the best quality-latency balance.
arXiv Detail & Related papers (2023-07-07T02:26:18Z)
- Transfer Learning from Pre-trained Language Models Improves End-to-End Speech Summarization [48.35495352015281]
End-to-end speech summarization (E2E SSum) directly summarizes input speech into easy-to-read short sentences with a single model.
Due to the high cost of collecting speech-summary pairs, an E2E SSum model tends to suffer from training data scarcity and output unnatural sentences.
We propose for the first time to integrate a pre-trained language model (LM) into the E2E SSum decoder via transfer learning.
arXiv Detail & Related papers (2023-06-07T08:23:58Z)
- Improving Deliberation by Text-Only and Semi-Supervised Training [42.942428288428836]
We propose incorporating text-only and semi-supervised training into an attention-based deliberation model.
We achieve a 4%-12% WER reduction on various tasks compared to the baseline deliberation model.
We show that the deliberation model also achieves a positive human side-by-side evaluation.
arXiv Detail & Related papers (2022-06-29T15:30:44Z)
- On Comparison of Encoders for Attention based End to End Speech Recognition in Standalone and Rescoring Mode [1.7704011486040847]
Non-streaming models provide better performance as they look at the entire audio context.
We show that the Transformer model offers acceptable WER with the lowest latency requirements.
We highlight the importance of a CNN front-end with the Transformer architecture for achieving comparable word error rates (WER).
arXiv Detail & Related papers (2022-06-26T09:12:27Z)
- Advanced Long-context End-to-end Speech Recognition Using Context-expanded Transformers [56.56220390953412]
We extend our prior work by introducing the Conformer architecture to further improve the accuracy.
We demonstrate that the extended Transformer provides state-of-the-art end-to-end ASR performance.
arXiv Detail & Related papers (2021-04-19T16:18:00Z)
- Transformer Based Deliberation for Two-Pass Speech Recognition [46.86118010771703]
Speech recognition systems must generate words quickly while also producing accurate results.
Two-pass models excel at these requirements by employing a first-pass decoder that quickly emits words, and a second-pass decoder that requires more context but is more accurate.
Previous work has established that a deliberation network can be an effective second-pass model.
arXiv Detail & Related papers (2021-01-27T18:05:22Z)
- Conv-Transformer Transducer: Low Latency, Low Frame Rate, Streamable End-to-End Speech Recognition [8.046120977786702]
The Transformer has achieved competitive performance against state-of-the-art end-to-end models in automatic speech recognition (ASR).
The original Transformer, with encoder-decoder architecture, is only suitable for offline ASR.
We show that this architecture, named Conv-Transformer Transducer, achieves competitive performance on the LibriSpeech dataset (3.6% WER on test-clean) without external language models.
arXiv Detail & Related papers (2020-08-13T08:20:02Z)
- JDI-T: Jointly trained Duration Informed Transformer for Text-To-Speech without Explicit Alignment [2.7402733069181]
We propose the Jointly trained Duration Informed Transformer (JDI-T), a feed-forward Transformer with a duration predictor jointly trained without explicit alignments.
We extract the phoneme duration from the autoregressive Transformer on the fly during the joint training.
arXiv Detail & Related papers (2020-05-15T22:06:13Z)
- You Do Not Need More Data: Improving End-To-End Speech Recognition by Text-To-Speech Data Augmentation [59.31769998728787]
We build our TTS system on an ASR training database and then extend the data with synthesized speech to train a recognition model.
Our system establishes a competitive result for end-to-end ASR trained on the LibriSpeech train-clean-100 set, with a WER of 4.3% on test-clean and 13.5% on test-other.
arXiv Detail & Related papers (2020-05-14T17:24:57Z)
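The last entry above describes a recipe simple enough to sketch: train a TTS
system on the ASR corpus, synthesize speech for extra text, and mix the
synthetic pairs into ASR training. In the sketch below, `synthesize` is a
hypothetical stand-in for a real TTS model, not an API from the paper.

```python
# Minimal sketch of TTS-based data augmentation for ASR training.
# `synthesize` is a hypothetical placeholder for a trained TTS system.
import random

def synthesize(text: str) -> list:
    # Placeholder: a real TTS model would return a waveform (or feature
    # sequence) for the given text; here we emit deterministic noise.
    rng = random.Random(text)
    return [rng.random() for _ in range(16000)]

real_pairs = [([0.0] * 16000, "hello world")]             # real (audio, text)
extra_text = ["speech recognition", "data augmentation"]  # unpaired text
synthetic_pairs = [(synthesize(t), t) for t in extra_text]

# Mix real and synthetic pairs and feed them to the E2E ASR trainer as usual.
train_set = real_pairs + synthetic_pairs
random.Random(0).shuffle(train_set)
```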