EAT: Enhanced ASR-TTS for Self-supervised Speech Recognition
- URL: http://arxiv.org/abs/2104.07474v1
- Date: Tue, 13 Apr 2021 23:18:25 GMT
- Title: EAT: Enhanced ASR-TTS for Self-supervised Speech Recognition
- Authors: Murali Karthick Baskar, Lukáš Burget, Shinji Watanabe, Ramon
Fernandez Astudillo, and Jan "Honza" Černocký
- Abstract summary: We propose an enhanced ASR-TTS (EAT) model that incorporates two main features.
EAT significantly reduces the performance gap between supervised and self-supervised training, by 2.6% and 2.7% absolute on Librispeech and BABEL, respectively.
- Score: 43.702644305349054
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Self-supervised ASR-TTS models suffer in out-of-domain data conditions. Here
we propose an enhanced ASR-TTS (EAT) model that incorporates two main features:
1) The ASR$\rightarrow$TTS direction is equipped with a language model reward
to penalize the ASR hypotheses before forwarding them to TTS. 2) In the
TTS$\rightarrow$ASR direction, a hyper-parameter is introduced to scale the
attention context from synthesized speech before sending it to ASR to handle
out-of-domain data. Training strategies and the effectiveness of the EAT model
are explored under out-of-domain data conditions. The results show that EAT
reduces the performance gap between supervised and self-supervised training
significantly, by 2.6% and 2.7% absolute on Librispeech and BABEL,
respectively.
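The two features described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names `lm_penalized_weights` and `scale_context`, the parameters `lm_weight` and `gamma`, the additive log-domain combination of ASR and language model scores, and all numeric values are assumptions made purely for illustration.

```python
import math


def lm_penalized_weights(asr_logprobs, lm_logprobs, lm_weight=0.5):
    """Return softmax weights over ASR hypotheses.

    Each hypothesis score is its ASR log-probability plus a scaled
    language model log-probability (an assumed form of an LM reward),
    so hypotheses the LM finds unlikely receive a smaller weight.
    """
    scores = [a + lm_weight * l for a, l in zip(asr_logprobs, lm_logprobs)]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scale_context(context, gamma=0.5):
    """Scale an attention context vector by a scalar hyper-parameter."""
    return [gamma * c for c in context]


# Two hypothetical ASR hypotheses: the first is slightly preferred by
# the ASR model, but the language model strongly prefers the second.
weights = lm_penalized_weights([-1.0, -1.2], [-8.0, -3.0], lm_weight=0.5)

# A hypothetical attention context computed from synthesized speech.
scaled = scale_context([0.2, -0.4, 1.0], gamma=0.5)
```

In this toy setup the second hypothesis ends up with the larger weight, even though the ASR score alone favours the first, and the context vector is uniformly shrunk by `gamma` before it would be passed on.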
Related papers
- Unified Speech Recognition: A Single Model for Auditory, Visual, and Audiovisual Inputs [73.74375912785689]
This paper proposes unified training strategies for speech recognition systems.
We demonstrate that training a single model for all three tasks enhances VSR and AVSR performance.
We also introduce a greedy pseudo-labelling approach to more effectively leverage unlabelled samples.
arXiv Detail & Related papers (2024-11-04T16:46:53Z)
- Multiple-hypothesis RNN-T Loss for Unsupervised Fine-tuning and Self-training of Neural Transducer [20.8850874806462]
This paper proposes a new approach to perform unsupervised fine-tuning and self-training using unlabeled speech data.
For the self-training task, ASR models are trained using supervised data from Wall Street Journal (WSJ) and Aurora-4, along with CHiME-4 real noisy data as unlabeled data.
arXiv Detail & Related papers (2022-07-29T15:14:03Z)
- Listen, Adapt, Better WER: Source-free Single-utterance Test-time Adaptation for Automatic Speech Recognition [65.84978547406753]
Test-time Adaptation aims to adapt the model trained on source domains to yield better predictions for test samples.
To the best of our knowledge, Single-Utterance Test-time Adaptation (SUTA) is the first TTA study in the speech area.
arXiv Detail & Related papers (2022-03-27T06:38:39Z)
- ASR-Aware End-to-end Neural Diarization [15.172086811068962]
We present a Conformer-based end-to-end neural diarization (EEND) model that uses both acoustic input and features derived from an automatic speech recognition (ASR) model.
Three modifications to the Conformer-based EEND architecture are proposed to incorporate the features.
Experiments on the two-speaker English conversations of Switchboard+SRE data sets show that multi-task learning with position-in-word information is the most effective way of utilizing ASR features.
arXiv Detail & Related papers (2022-02-02T21:17:14Z)
- Attention-based Multi-hypothesis Fusion for Speech Summarization [83.04957603852571]
Speech summarization can be achieved by combining automatic speech recognition (ASR) and text summarization (TS).
ASR errors directly affect the quality of the output summary in the cascade approach.
We propose a cascade speech summarization model that is robust to ASR errors and that exploits multiple hypotheses generated by ASR to attenuate the effect of ASR errors on the summary.
arXiv Detail & Related papers (2021-11-16T03:00:29Z)
- Fine-tuning of Pre-trained End-to-end Speech Recognition with Generative Adversarial Networks [10.723935272906461]
Adversarial training of end-to-end (E2E) ASR systems using generative adversarial networks (GAN) has recently been explored.
We introduce a novel framework for fine-tuning a pre-trained ASR model using the GAN objective.
Our proposed approach outperforms baselines and conventional GAN-based adversarial models.
arXiv Detail & Related papers (2021-03-10T17:40:48Z)
- Dual-mode ASR: Unify and Improve Streaming ASR with Full-context Modeling [76.43479696760996]
We propose a unified framework, Dual-mode ASR, to train a single end-to-end ASR model with shared weights for both streaming and full-context speech recognition.
We show that the latency and accuracy of streaming ASR significantly benefit from weight sharing and joint training of full-context ASR.
arXiv Detail & Related papers (2020-10-12T21:12:56Z)
- Joint Contextual Modeling for ASR Correction and Language Understanding [60.230013453699975]
We propose multi-task neural approaches to perform contextual language correction on ASR outputs jointly with language understanding (LU).
We show that the error rates of off-the-shelf ASR and the LU systems that follow it can be reduced by 14% relative with joint models trained using small amounts of in-domain data.
arXiv Detail & Related papers (2020-01-28T22:09:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.