Semi-supervised transfer learning for language expansion of end-to-end
speech recognition models to low-resource languages
- URL: http://arxiv.org/abs/2111.10047v1
- Date: Fri, 19 Nov 2021 05:09:16 GMT
- Title: Semi-supervised transfer learning for language expansion of end-to-end
speech recognition models to low-resource languages
- Authors: Jiyeon Kim, Mehul Kumar, Dhananjaya Gowda, Abhinav Garg, Chanwoo Kim
- Abstract summary: We propose a three-stage training methodology to improve the speech recognition accuracy of low-resource languages.
We leverage a well-trained English model, an unlabeled text corpus, and an unlabeled audio corpus using transfer learning, TTS augmentation, and SSL, respectively.
Overall, our two-pass speech recognition system with Monotonic Chunkwise Attention (MoChA) in the first pass achieves a WER reduction of 42% relative to the baseline.
- Score: 19.44975351652865
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we propose a three-stage training methodology to improve the
speech recognition accuracy of low-resource languages. We explore and propose
an effective combination of techniques such as transfer learning, encoder
freezing, data augmentation using Text-To-Speech (TTS), and Semi-Supervised
Learning (SSL). To improve the accuracy of a low-resource Italian ASR, we
leverage a well-trained English model, an unlabeled text corpus, and an unlabeled
audio corpus using transfer learning, TTS augmentation, and SSL, respectively.
In the first stage, we use transfer learning from a well-trained English model.
This primarily helps the model learn acoustic information from a resource-rich
language. This stage achieves around 24% relative Word Error Rate (WER)
reduction over the baseline. In stage two, we utilize unlabeled text data via
TTS data augmentation to incorporate language information into the model. We
also explore freezing the acoustic encoder at this stage. TTS data augmentation
helps us further reduce the WER by ~ 21% relative. Finally, in stage three, we
reduce the WER by another 4% relative by using SSL from unlabeled audio data.
Overall, our two-pass speech recognition system with a Monotonic Chunkwise
Attention (MoChA) in the first pass and a full-attention in the second pass
achieves a WER reduction of ~ 42% relative to the baseline.
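As a sanity check, the stage-wise gains compound roughly to the headline number: 1 − (1 − 0.24) × (1 − 0.21) × (1 − 0.04) ≈ 0.42, i.e. ~42% overall relative WER reduction. Below is a minimal sketch of how the three stages could be wired together in PyTorch; the model interface and the helpers (train_supervised, tts_synthesize, transcribe) are hypothetical placeholders, not the authors' implementation.

```python
# Minimal sketch of the three-stage recipe, assuming a generic PyTorch seq2seq
# ASR model with an `.encoder` submodule. The helpers train_supervised,
# tts_synthesize, and transcribe are hypothetical placeholders.
import torch

def stage1_transfer(model, english_ckpt):
    """Stage 1: initialize from a well-trained English model (transfer learning)."""
    state = torch.load(english_ckpt, map_location="cpu")
    # English and Italian output vocabularies differ, so load non-strictly:
    # shared weights (mostly the acoustic encoder) transfer over, while
    # mismatched prediction heads start fresh.
    model.load_state_dict(state, strict=False)
    return model

def stage2_tts_augmentation(model, unlabeled_text, freeze_encoder=True):
    """Stage 2: inject language information via TTS-synthesized training pairs."""
    if freeze_encoder:
        for p in model.encoder.parameters():
            p.requires_grad = False  # preserve the acoustics learned in stage 1
    synthetic_pairs = [(tts_synthesize(text), text) for text in unlabeled_text]
    train_supervised(model, synthetic_pairs)
    return model

def stage3_ssl(model, unlabeled_audio):
    """Stage 3: semi-supervised learning from unlabeled audio via pseudo-labels."""
    pseudo_pairs = [(audio, transcribe(model, audio)) for audio in unlabeled_audio]
    for p in model.encoder.parameters():
        p.requires_grad = True  # unfreeze everything for the final pass
    train_supervised(model, pseudo_pairs)
    return model
```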
Related papers
- Rapid Speaker Adaptation in Low Resource Text to Speech Systems using Synthetic Data and Transfer learning [6.544954579068865]
We propose a transfer learning approach using high-resource language data and synthetically generated data.
We employ a three-step approach to train a high-quality single-speaker TTS system in Hindi, a low-resource Indian language.
arXiv Detail & Related papers (2023-12-02T10:52:00Z)
- Few-Shot Cross-Lingual TTS Using Transferable Phoneme Embedding [55.989376102986654]
This paper studies a transferable phoneme embedding framework that aims to deal with the cross-lingual text-to-speech problem under the few-shot setting.
We propose a framework that consists of a phoneme-based TTS model and a codebook module to project phonemes from different languages into a learned latent space.
arXiv Detail & Related papers (2022-06-27T11:24:40Z)
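A minimal sketch of what such a shared phoneme codebook might look like, assuming a soft attention-based lookup; the class name, dimensions, and mechanism below are illustrative guesses, not the paper's exact module:

```python
import torch
import torch.nn as nn

class PhonemeCodebook(nn.Module):
    """Toy codebook: soft-assign per-language phoneme embeddings to a shared
    set of latent codes, so phonemes from different (or unseen) languages
    land in the same learned latent space."""
    def __init__(self, n_codes=64, dim=256):
        super().__init__()
        self.codes = nn.Parameter(torch.randn(n_codes, dim))  # shared codes

    def forward(self, phoneme_emb):                    # (batch, seq, dim)
        attn = torch.softmax(phoneme_emb @ self.codes.T, dim=-1)
        return attn @ self.codes                       # projected (batch, seq, dim)
```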
- Enhanced Direct Speech-to-Speech Translation Using Self-supervised Pre-training and Data Augmentation [76.13334392868208]
Direct speech-to-speech translation (S2ST) models suffer from data scarcity issues.
In this work, we explore self-supervised pre-training with unlabeled speech data and data augmentation to tackle this issue.
arXiv Detail & Related papers (2022-04-06T17:59:22Z)
- Transfer Learning Framework for Low-Resource Text-to-Speech using a Large-Scale Unlabeled Speech Corpus [10.158584616360669]
Training a text-to-speech (TTS) model requires a large-scale, text-labeled speech corpus.
We propose a transfer learning framework for TTS that utilizes a large amount of unlabeled speech data for pre-training.
arXiv Detail & Related papers (2022-03-29T11:26:56Z)
- Textless Speech-to-Speech Translation on Real Data [49.134208897722246]
We present a textless speech-to-speech translation (S2ST) system that can translate speech from one language into another.
We tackle the challenge in modeling multi-speaker target speech and train the systems with real-world S2ST data.
arXiv Detail & Related papers (2021-12-15T18:56:35Z)
- Wav2vec-Switch: Contrastive Learning from Original-noisy Speech Pairs for Robust Speech Recognition [52.71604809100364]
We propose wav2vec-Switch, a method to encode noise robustness into contextualized representations of speech.
Specifically, we feed original-noisy speech pairs simultaneously into the wav2vec 2.0 network.
In addition to the existing contrastive learning task, we switch the quantized representations of the original and noisy speech as additional prediction targets.
arXiv Detail & Related papers (2021-10-11T00:08:48Z)
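The switching step can be sketched in a few lines; the model interface and contrastive_loss below are hypothetical stand-ins for the wav2vec 2.0 training objective, not the released implementation:

```python
def wav2vec_switch_loss(model, clean_wav, noisy_wav, contrastive_loss):
    # Feed the original-noisy pair through the same network.
    ctx_clean, q_clean = model(clean_wav)   # contextual reps, quantized targets
    ctx_noisy, q_noisy = model(noisy_wav)
    # Standard contrastive task: each view predicts its own quantized targets.
    loss = contrastive_loss(ctx_clean, q_clean) + contrastive_loss(ctx_noisy, q_noisy)
    # The "switch": each view additionally predicts the *other* view's targets,
    # pushing the network toward noise-invariant representations.
    loss = loss + contrastive_loss(ctx_clean, q_noisy) + contrastive_loss(ctx_noisy, q_clean)
    return loss
```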
- LRSpeech: Extremely Low-Resource Speech Synthesis and Recognition [148.43282526983637]
We develop LRSpeech, a TTS and ASR system for languages with low data cost.
We conduct experiments on an experimental language (English) and a truly low-resource language (Lithuanian) to verify the effectiveness of LRSpeech.
We are currently deploying LRSpeech into a commercialized cloud speech service to support TTS on more rare languages.
arXiv Detail & Related papers (2020-08-09T08:16:33Z)
- Improving Cross-Lingual Transfer Learning for End-to-End Speech Recognition with Speech Translation [63.16500026845157]
We introduce speech-to-text translation as an auxiliary task to incorporate additional knowledge of the target language.
We show that training ST with human translations is not necessary.
Even with pseudo-labels from low-resource MT (200K examples), ST-enhanced transfer brings up to 8.9% WER reduction over direct transfer.
arXiv Detail & Related papers (2020-06-09T19:34:11Z)
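One plausible reading of the auxiliary-task setup is plain multi-task training over a shared speech encoder; the weighted joint loss below is an illustrative assumption, not the paper's exact recipe:

```python
def joint_training_step(model, asr_batch, st_batch, st_weight=0.3):
    # Shared speech encoder; separate heads for transcription (ASR) and
    # speech-to-text translation (ST). st_batch may carry MT pseudo-labels.
    asr_loss = model.asr_loss(asr_batch)
    st_loss = model.st_loss(st_batch)
    return asr_loss + st_weight * st_loss
```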