Improving Deliberation by Text-Only and Semi-Supervised Training
- URL: http://arxiv.org/abs/2206.14716v1
- Date: Wed, 29 Jun 2022 15:30:44 GMT
- Title: Improving Deliberation by Text-Only and Semi-Supervised Training
- Authors: Ke Hu, Tara N. Sainath, Yanzhang He, Rohit Prabhavalkar, Trevor
Strohman, Sepand Mavandadi, Weiran Wang
- Abstract summary: We propose incorporating text-only and semi-supervised training into an attention-based deliberation model.
We achieve 4%-12% WER reduction for various tasks compared to the baseline deliberation.
We show that the deliberation model also achieves a positive human side-by-side evaluation.
- Score: 42.942428288428836
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text-only and semi-supervised training based on audio-only data has gained
popularity recently due to the wide availability of unlabeled text and speech
data. In this work, we propose incorporating text-only and semi-supervised
training into an attention-based deliberation model. By using text-only data to
train a Bidirectional Encoder Representations from Transformers (BERT) model for
the deliberation text encoder, and by leveraging large-scale text-to-speech and
audio-only utterances through a joint acoustic and text decoder (JATD) and
semi-supervised training, we achieve a 4%-12% WER reduction on various tasks
compared to the baseline deliberation model. Compared to a state-of-the-art language
model (LM) rescoring method, the deliberation model reduces the Google Voice
Search WER by 11% relative. We show that the deliberation model also achieves a
positive human side-by-side evaluation compared to the state-of-the-art LM
rescorer with reasonable endpointer latencies.
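The two-pass idea behind deliberation can be illustrated with a toy re-ranking sketch: a first pass emits N-best hypotheses with scores, and a second-pass scorer re-ranks them using both the hypothesis text and (stand-in) acoustic evidence. All names, hypotheses, and scores below are invented for illustration; this is not the paper's actual model.

```python
def first_pass_nbest():
    # (hypothesis, first-pass log-probability) pairs, as a first-pass
    # streaming recognizer might emit them. Values are made up.
    return [
        ("call john at noon", -3.2),
        ("call jon at noon", -3.0),   # scores higher in the first pass
        ("fall john at noon", -5.1),
    ]

def deliberation_score(hypothesis, acoustic_hint):
    # Stand-in for a second-pass scorer that attends jointly to the text
    # hypothesis and the audio encoding. Here we simply reward word
    # overlap with a (fake) set of acoustically supported keywords.
    words = set(hypothesis.split())
    return len(words & acoustic_hint)

def rescore(nbest, acoustic_hint, weight=1.0):
    # Combine the first-pass score with the second-pass score, re-rank,
    # and return the best hypothesis.
    rescored = [
        (hyp, fp_score + weight * deliberation_score(hyp, acoustic_hint))
        for hyp, fp_score in nbest
    ]
    return max(rescored, key=lambda pair: pair[1])[0]

best = rescore(first_pass_nbest(), acoustic_hint={"call", "john", "noon"})
print(best)  # "call john at noon" wins after second-pass rescoring
```

In this toy setup the hypothesis that ranked second in the first pass overtakes the top candidate once second-pass evidence is added, which is the effect a deliberation rescorer aims for.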
Related papers
- Code-Switching Text Generation and Injection in Mandarin-English ASR [57.57570417273262]
We investigate text generation and injection for improving the performance of a widely used industrial streaming model, the Transformer-Transducer (T-T).
We first propose a strategy to generate code-switching text data, and then investigate injecting the generated text into the T-T model explicitly via Text-To-Speech (TTS) conversion or implicitly by tying the speech and text latent spaces.
Experimental results on the T-T model trained with a dataset containing 1,800 hours of real Mandarin-English code-switched speech show that our approaches to inject generated code-switching text significantly boost the performance of T-T models.
arXiv Detail & Related papers (2023-03-20T09:13:27Z)
- Speech-text based multi-modal training with bidirectional attention for improved speech recognition [26.47071418582507]
We propose a novel bidirectional attention mechanism (BiAM) to jointly learn both the ASR encoder (bottom layers) and a text encoder with a multi-modal learning method.
BiAM facilitates feature sampling-rate exchange, so that the quality of features transformed from one modality can be measured in the other modality's space.
Experimental results on the Librispeech corpus show up to 6.15% word error rate reduction (WERR) with paired data alone, and 9.23% WERR when additional unpaired text data is employed.
arXiv Detail & Related papers (2022-11-01T08:25:11Z)
- SpeechUT: Bridging Speech and Text with Hidden-Unit for Encoder-Decoder Based Speech-Text Pre-training [106.34112664893622]
We propose a unified-modal speech-unit-text pre-training model, SpeechUT, to connect the representations of a speech encoder and a text decoder with a shared unit encoder.
Our proposed SpeechUT is fine-tuned and evaluated on automatic speech recognition (ASR) and speech translation (ST) tasks.
arXiv Detail & Related papers (2022-10-07T17:57:45Z)
- Text-Aware End-to-end Mispronunciation Detection and Diagnosis [17.286013739453796]
Mispronunciation detection and diagnosis (MDD) technology is a key component of computer-assisted pronunciation training (CAPT) systems.
In this paper, we present a gating strategy that assigns more importance to the relevant audio features while suppressing irrelevant text information.
arXiv Detail & Related papers (2022-06-15T04:08:10Z)
- Wav2Seq: Pre-training Speech-to-Text Encoder-Decoder Models Using Pseudo Languages [58.43299730989809]
We introduce Wav2Seq, the first self-supervised approach to pre-train both parts of encoder-decoder models for speech data.
We induce a pseudo language as a compact discrete representation, and formulate a self-supervised pseudo speech recognition task.
This process stands on its own, or can be applied as low-cost second-stage pre-training.
arXiv Detail & Related papers (2022-05-02T17:59:02Z)
- SLAM: A Unified Encoder for Speech and Language Modeling via Speech-Text Joint Pre-Training [33.02912456062474]
We build a single encoder with the BERT objective on unlabeled text together with the w2v-BERT objective on unlabeled speech.
We demonstrate that incorporating both speech and text data during pre-training can significantly improve downstream quality on CoVoST2 speech translation.
arXiv Detail & Related papers (2021-10-20T00:59:36Z)
- A study on the efficacy of model pre-training in developing neural text-to-speech system [55.947807261757056]
This study aims to understand better why and how model pre-training can positively contribute to TTS system performance.
It is found that the TTS system achieves comparable performance even when the pre-training data is reduced to 1/8 of its original size.
arXiv Detail & Related papers (2021-10-08T02:09:28Z)
- Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration [62.75234183218897]
We propose a one-stage context-aware framework to generate natural and coherent target speech without any training data of the speaker.
We generate the mel-spectrogram of the edited speech with a transformer-based decoder.
It outperforms a recent zero-shot TTS engine by a large margin.
arXiv Detail & Related papers (2021-09-12T04:17:53Z)
- Multimodal Semi-supervised Learning Framework for Punctuation Prediction in Conversational Speech [17.602098162338137]
We explore a multimodal semi-supervised learning approach for punctuation prediction.
We learn representations from large amounts of unlabelled audio and text data.
When trained on 1 hour of speech and text data, the proposed model achieves a 9-18% absolute improvement over the baseline model.
arXiv Detail & Related papers (2020-08-03T08:13:09Z)
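Several of the semi-supervised setups above share one pattern: a teacher model transcribes unlabeled audio, and the resulting pseudo-labeled pairs are added to the training pool, often after confidence filtering. The minimal sketch below illustrates that pattern; the teacher, utterance IDs, transcripts, and confidence values are all invented stand-ins, not any paper's actual pipeline.

```python
def teacher_transcribe(audio_id):
    # Stand-in teacher: returns (transcript, confidence) for an utterance.
    # A real system would run a trained ASR model on the audio.
    fake = {
        "utt1": ("turn on the lights", 0.92),
        "utt2": ("hmm uh", 0.40),       # low-confidence, likely noisy
        "utt3": ("play some jazz", 0.88),
    }
    return fake[audio_id]

def pseudo_label(unlabeled_ids, threshold=0.8):
    # Keep only confident pseudo-labels, a common filtering heuristic
    # used to avoid training on the teacher's mistakes.
    pool = []
    for uid in unlabeled_ids:
        text, conf = teacher_transcribe(uid)
        if conf >= threshold:
            pool.append((uid, text))
    return pool

pool = pseudo_label(["utt1", "utt2", "utt3"])
print(pool)  # [('utt1', 'turn on the lights'), ('utt3', 'play some jazz')]
```

The filtered pool would then be mixed with the supervised data for another round of training; the threshold trades pseudo-label quantity against quality.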
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.