ASR data augmentation in low-resource settings using cross-lingual
multi-speaker TTS and cross-lingual voice conversion
- URL: http://arxiv.org/abs/2204.00618v5
- Date: Sat, 20 May 2023 16:59:28 GMT
- Title: ASR data augmentation in low-resource settings using cross-lingual
multi-speaker TTS and cross-lingual voice conversion
- Authors: Edresson Casanova, Christopher Shulby, Alexander Korolev, Arnaldo
Candido Junior, Anderson da Silva Soares, Sandra Aluísio, Moacir Antonelli
Ponti
- Abstract summary: We show that our approach permits the application of speech synthesis and voice conversion to improve ASR systems using only one target-language speaker during model training.
It is possible to obtain promising ASR training results with our data augmentation method using only a single real speaker in a target language.
- Score: 49.617722668505834
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We explore cross-lingual multi-speaker speech synthesis and cross-lingual
voice conversion applied to data augmentation for automatic speech recognition
(ASR) systems in low/medium-resource scenarios. Through extensive experiments,
we show that our approach permits the application of speech synthesis and voice
conversion to improve ASR systems using only one target-language speaker during
model training. Relative to prior works that rely on many speakers, we also
close the gap between ASR models trained on synthesized versus human speech.
Finally, we show that it is possible to obtain promising ASR training
results with our data augmentation method using only a single real speaker in a
target language.
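For concreteness, here is a minimal sketch of the augmentation loop, assuming a YourTTS-style multilingual checkpoint served through the Coqui TTS Python API; the model names, file paths, and language code are illustrative stand-ins, and the paper's exact models and training recipe may differ.

```python
# Minimal sketch: build synthetic ASR training data by cloning one real
# target-language speaker with cross-lingual TTS and voice conversion.
# Model names and paths are illustrative; the paper's exact models and
# settings may differ.
from TTS.api import TTS

TARGET_SPEAKER_WAV = "target_speaker.wav"  # the single real speaker (hypothetical path)

# Cross-lingual multi-speaker TTS (public YourTTS checkpoint as a stand-in).
tts = TTS(model_name="tts_models/multilingual/multi-dataset/your_tts")

# Cross-lingual voice conversion (public Coqui VC checkpoint as a stand-in
# for the paper's VC model).
vc = TTS(model_name="voice_conversion_models/multilingual/vctk/freevc24")

def synthesize(text: str, out_path: str, language: str = "pt-br") -> None:
    """TTS branch: render the transcript in the target speaker's voice."""
    tts.tts_to_file(text=text, speaker_wav=TARGET_SPEAKER_WAV,
                    language=language, file_path=out_path)

def convert(source_wav: str, out_path: str) -> None:
    """VC branch: re-voice existing speech as the target speaker."""
    vc.voice_conversion_to_file(source_wav=source_wav,
                                target_wav=TARGET_SPEAKER_WAV,
                                file_path=out_path)

# Every transcript becomes a synthetic (audio, text) pair that is mixed
# with the real recordings when training the ASR model.
with open("transcripts.txt", encoding="utf-8") as f:
    for i, line in enumerate(f):
        synthesize(line.strip(), f"synth/utt_{i:05d}.wav")
```

Mixing the resulting synthetic pairs with the available real recordings is what lets the ASR model hear many voices even though only one real target-language speaker was ever recorded.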
Related papers
- Speculative Speech Recognition by Audio-Prefixed Low-Rank Adaptation of Language Models [21.85677682584916]
We propose a model that performs speculative speech recognition (SSR) by combining an RNN-Transducer-based ASR system with an audio-prefixed language model (LM).
arXiv Detail & Related papers (2024-07-05T16:52:55Z)
- Multilingual Audio-Visual Speech Recognition with Hybrid CTC/RNN-T Fast Conformer [59.57249127943914]
We present a multilingual Audio-Visual Speech Recognition model incorporating several enhancements to improve performance and audio noise robustness.
We increase the amount of audio-visual training data for six distinct languages, generating automatic transcriptions of unlabelled multilingual datasets.
Our proposed model achieves new state-of-the-art performance on the LRS3 dataset, reaching a WER of 0.8%.
arXiv Detail & Related papers (2024-03-14T01:16:32Z)
- Cross-lingual Knowledge Distillation via Flow-based Voice Conversion for Robust Polyglot Text-To-Speech [6.243356997302935]
We introduce a framework for cross-lingual speech synthesis, which involves an upstream Voice Conversion (VC) model and a downstream Text-To-Speech (TTS) model.
In the first two stages, we use a VC model to convert utterances in the target locale to the voice of the target speaker.
In the third stage, the converted data is combined with the linguistic features and durations from recordings in the target language, which are then used to train a single-speaker acoustic model.
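Read as pseudocode, the staged pipeline might look like the hypothetical sketch below; every function here is a placeholder stub standing in for the paper's models, not a published API.

```python
# Hypothetical sketch of the staged pipeline summarized above; all
# functions are placeholder stubs, not a published API.
class StubVC:
    def convert(self, utt):
        return utt  # placeholder: would return the utterance re-voiced

def train_vc(locale_corpus, speaker_corpus):
    return StubVC()  # stages 1-2: train the flow-based VC model

def extract_linguistic_features(utt):
    return {}  # placeholder: phoneme sequence and durations

def train_tts(paired_data):
    return paired_data  # stage 3: train the single-speaker acoustic model

def build_polyglot_tts(target_locale_corpus, target_speaker_corpus):
    vc = train_vc(target_locale_corpus, target_speaker_corpus)
    # Stages 1-2: re-voice target-locale recordings as the target speaker.
    converted = [vc.convert(u) for u in target_locale_corpus]
    # Stage 3: pair the converted audio with linguistic features and
    # durations from the original recordings, then train the model.
    feats = [extract_linguistic_features(u) for u in target_locale_corpus]
    return train_tts(list(zip(converted, feats)))
```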
arXiv Detail & Related papers (2023-09-15T09:03:14Z)
- Adapting Multi-Lingual ASR Models for Handling Multiple Talkers [63.151811561972515]
State-of-the-art large-scale universal speech models (USMs) show decent automatic speech recognition (ASR) performance across multiple domains and languages.
We propose an approach to adapt USMs for multi-talker ASR.
We first develop an enhanced version of serialized output training to jointly perform multi-talker ASR and utterance timestamp prediction.
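For intuition, serialized output training (SOT) collapses overlapping speech into one target sequence: transcripts are concatenated in order of start time, separated by a speaker-change token. The `<sc>` token name and the timestamp format in this sketch are assumptions, not the paper's exact scheme.

```python
# Illustrative sketch of building an SOT target with utterance
# timestamps; token names and the timestamp format are assumptions.
SC = "<sc>"  # speaker-change token (name is illustrative)

def sot_target(utterances: list[tuple[float, float, str]]) -> str:
    """utterances: (start_sec, end_sec, transcript) per speaker turn."""
    ordered = sorted(utterances, key=lambda u: u[0])
    # The enhanced variant also predicts utterance timestamps; here we
    # simply wrap each segment in start/end time tokens.
    parts = [f"<{s:.2f}> {text} <{e:.2f}>" for s, e, text in ordered]
    return f" {SC} ".join(parts)

print(sot_target([(0.0, 2.1, "hello there"), (1.3, 3.0, "hi how are you")]))
# -> <0.00> hello there <2.10> <sc> <1.30> hi how are you <3.00>
```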
arXiv Detail & Related papers (2023-05-30T05:05:52Z)
- From English to More Languages: Parameter-Efficient Model Reprogramming for Cross-Lingual Speech Recognition [50.93943755401025]
We propose a new parameter-efficient learning framework based on neural model reprogramming for cross-lingual speech recognition.
We design different auxiliary neural architectures focusing on learnable pre-trained feature enhancement.
Our methods outperform existing ASR tuning architectures and their extension with self-supervised losses.
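The reprogramming idea can be sketched as follows: the pre-trained backbone stays frozen and only a small auxiliary module that enhances the input features is trained for the new language. The module size and placement below are assumptions, not the paper's exact architecture.

```python
# PyTorch sketch of input-feature reprogramming over a frozen backbone;
# the enhancer design is one illustrative choice, not the paper's.
import torch
import torch.nn as nn

class ReprogrammedASR(nn.Module):
    def __init__(self, backbone: nn.Module, feat_dim: int = 80):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():
            p.requires_grad = False  # keep the pre-trained model fixed
        # Trainable auxiliary network: learns an additive enhancement
        # of the input features.
        self.enhancer = nn.Sequential(
            nn.Conv1d(feat_dim, feat_dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(feat_dim, feat_dim, kernel_size=3, padding=1),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, feat_dim, time); only the enhancer gets gradients.
        return self.backbone(feats + self.enhancer(feats))
```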
arXiv Detail & Related papers (2023-01-19T02:37:56Z)
- ERNIE-SAT: Speech and Text Joint Pretraining for Cross-Lingual Multi-Speaker Text-to-Speech [58.93395189153713]
We extend the pretraining method for cross-lingual multi-speaker speech synthesis tasks.
We propose a speech-text joint pretraining framework, where we randomly mask the spectrogram and the phonemes.
Our model shows great improvements over speaker-embedding-based multi-speaker TTS methods.
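A toy illustration of the joint masking step: random spans of the spectrogram and of the phoneme sequence are hidden, and the model learns to reconstruct them. Mask ratios and span lengths below are illustrative, not ERNIE-SAT's settings.

```python
# Toy span masking over a mel spectrogram and a phoneme sequence;
# ratios and span lengths are illustrative.
import numpy as np

rng = np.random.default_rng(0)

def mask_spans(length: int, ratio: float = 0.15, span: int = 5) -> np.ndarray:
    """Boolean mask marking random fixed-length spans to hide."""
    mask = np.zeros(length, dtype=bool)
    n_spans = max(1, int(length * ratio / span))
    for start in rng.integers(0, max(1, length - span), size=n_spans):
        mask[start:start + span] = True
    return mask

mel = rng.normal(size=(200, 80))   # (frames, mel bins)
phonemes = list("h@loU w3rld")     # stand-in phoneme sequence

frame_mask = mask_spans(len(mel))
phone_mask = mask_spans(len(phonemes), span=2)

mel[frame_mask] = 0.0              # masked frames to reconstruct
masked_phones = ["<mask>" if m else p for p, m in zip(phonemes, phone_mask)]
```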
arXiv Detail & Related papers (2022-11-07T13:35:16Z)
- Improve Cross-lingual Voice Cloning Using Low-quality Code-switched Data [11.18504333789534]
We propose using low-quality code-switched found data from non-target speakers to achieve cross-lingual voice cloning for the target speakers.
Experiments show that our proposed method can generate high-quality code-switched speech in the target voices in terms of both naturalness and speaker consistency.
arXiv Detail & Related papers (2021-10-14T08:16:06Z)
- An Adaptive Learning based Generative Adversarial Network for One-To-One Voice Conversion [9.703390665821463]
We propose an adaptive learning-based GAN model called ALGAN-VC for efficient one-to-one voice conversion (VC) between speakers.
The model is tested on Voice Conversion Challenge (VCC) 2016, 2018, and 2020 datasets as well as on our self-prepared speech dataset.
A subjective and objective evaluation of the generated speech samples indicated that the proposed model performs the voice conversion task well.
arXiv Detail & Related papers (2021-04-25T13:44:32Z)
- Semi-supervised Learning for Multi-speaker Text-to-speech Synthesis Using Discrete Speech Representation [125.59372403631006]
We propose a semi-supervised learning approach for multi-speaker text-to-speech (TTS).
A multi-speaker TTS model can learn from the untranscribed audio via the proposed encoder-decoder framework with discrete speech representation.
We found the model can benefit from the proposed semi-supervised learning approach even when part of the unpaired speech data is noisy.
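A toy sketch of the discrete speech representation: encoder outputs are snapped to their nearest codebook entry, so untranscribed audio can still supervise the decoder through these discrete codes. The codebook size and feature dimension are illustrative, not the paper's configuration.

```python
# Toy vector quantization: map encoder frames to nearest-codebook
# indices; sizes are illustrative.
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.normal(size=(256, 64))  # 256 codes, 64-dim features

def quantize(features: np.ndarray) -> np.ndarray:
    """Map each frame in (T, 64) to the index of its nearest code."""
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return dists.argmin(axis=1)        # (T,) discrete code indices

frames = rng.normal(size=(100, 64))    # stand-in encoder outputs
codes = quantize(frames)               # targets usable without transcripts
```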
arXiv Detail & Related papers (2020-05-16T15:47:11Z)