Voice Conversion Can Improve ASR in Very Low-Resource Settings
- URL: http://arxiv.org/abs/2111.02674v1
- Date: Thu, 4 Nov 2021 07:57:00 GMT
- Title: Voice Conversion Can Improve ASR in Very Low-Resource Settings
- Authors: Matthew Baas and Herman Kamper
- Abstract summary: We study whether a VC system can be used cross-lingually to improve low-resource speech recognition.
We combine several recent techniques to design and train a practical VC system in English.
We find that when using a sensible amount of augmented data, speech recognition performance is improved in all four low-resource languages considered.
- Score: 32.170748231414365
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Voice conversion (VC) has been proposed to improve speech recognition systems
in low-resource languages by using it to augment limited training data. But
until recently, practical issues such as compute speed have limited the use of
VC for this purpose. Moreover, it is still unclear whether a VC model trained
on one well-resourced language can be applied to speech from another
low-resource language for the purpose of data augmentation. In this work we
assess whether a VC system can be used cross-lingually to improve low-resource
speech recognition. Concretely, we combine several recent techniques to design
and train a practical VC system in English, and then use this system to augment
data for training a speech recognition model in several low-resource languages.
We find that when using a sensible amount of augmented data, speech recognition
performance is improved in all four low-resource languages considered.
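The augmentation setup described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `augment_with_vc`, `convert_voice`, `copies_per_utterance`, and the `dummy_vc` stand-in are hypothetical names, and a real system would call a trained VC model where the placeholder is used.

```python
import random
from typing import Callable, List, Sequence, Tuple

# An utterance is (waveform samples, transcript). Plain lists keep the sketch dependency-free.
Utterance = Tuple[List[float], str]


def augment_with_vc(
    train_set: Sequence[Utterance],
    convert_voice: Callable[[List[float], int], List[float]],
    target_speakers: Sequence[int],
    copies_per_utterance: int = 1,
    seed: int = 0,
) -> List[Utterance]:
    """Return the original utterances plus voice-converted copies.

    `convert_voice` stands in for a trained VC model (hypothetical interface):
    it maps (waveform, target speaker id) to a waveform in the target voice.
    The transcript is reused unchanged, on the assumption that VC preserves
    the spoken content.
    """
    rng = random.Random(seed)
    augmented = list(train_set)  # keep every original utterance
    for wav, text in train_set:
        for _ in range(copies_per_utterance):
            speaker = rng.choice(target_speakers)
            augmented.append((convert_voice(wav, speaker), text))
    return augmented


def dummy_vc(wav: List[float], speaker: int) -> List[float]:
    # Placeholder "conversion" (assumption): scales amplitude per speaker id.
    scale = 1.0 + 0.1 * speaker
    return [sample * scale for sample in wav]


train = [([0.1, -0.2, 0.3], "hello"), ([0.0, 0.5], "world")]
augmented = augment_with_vc(
    train, dummy_vc, target_speakers=[1, 2], copies_per_utterance=2
)
```

Here `copies_per_utterance` controls how much converted data is added relative to the original set, which is the quantity the abstract refers to as a "sensible amount" of augmented data; the value that works best is an empirical question the sketch does not answer.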
Related papers
- XLAVS-R: Cross-Lingual Audio-Visual Speech Representation Learning for Noise-Robust Speech Perception [62.660135152900615]
Speech recognition and translation systems perform poorly on noisy inputs.
XLAVS-R is a cross-lingual audio-visual speech representation model for noise-robust speech recognition and translation.
arXiv Detail & Related papers (2024-03-21T13:52:17Z)
- Multilingual self-supervised speech representations improve the speech recognition of low-resource African languages with codeswitching [65.74653592668743]
Finetuning self-supervised multilingual representations reduces absolute word error rates by up to 20%.
When training data is limited, finetuning self-supervised representations is a better-performing and viable solution.
arXiv Detail & Related papers (2023-11-25T17:05:21Z)
- Learning Cross-lingual Mappings for Data Augmentation to Improve Low-Resource Speech Recognition [31.575930914290762]
Exploiting cross-lingual resources is an effective way to compensate for data scarcity of low resource languages.
We extend the concept of learnable cross-lingual mappings for end-to-end speech recognition.
The results show that any source-language ASR model can be used for low-resource target-language recognition.
arXiv Detail & Related papers (2023-06-14T15:24:31Z)
- Language-agnostic Code-Switching in Sequence-To-Sequence Speech Recognition [62.997667081978825]
Code-switching (CS) refers to the phenomenon of alternately using words and phrases from different languages.
We propose a simple yet effective data augmentation in which audio and corresponding labels of different source languages are concatenated.
We show that this augmentation can even improve the model's performance on inter-sentential language switches not seen during training by 5.03% WER.
arXiv Detail & Related papers (2022-10-17T12:15:57Z)
- ASR data augmentation in low-resource settings using cross-lingual multi-speaker TTS and cross-lingual voice conversion [49.617722668505834]
We show that our approach permits the application of speech synthesis and voice conversion to improve ASR systems using only one target-language speaker during model training.
It is possible to obtain promising ASR training results with our data augmentation method using only a single real speaker in a target language.
arXiv Detail & Related papers (2022-03-29T11:55:30Z)
- A Survey of Multilingual Models for Automatic Speech Recognition [6.657361001202456]
Cross-lingual transfer is an attractive solution to the problem of multilingual Automatic Speech Recognition.
Recent advances in Self Supervised Learning are opening up avenues for unlabeled speech data to be used in multilingual ASR models.
We present best practices for building multilingual models from research across diverse languages and techniques.
arXiv Detail & Related papers (2022-02-25T09:31:40Z)
- Cross-lingual Transfer for Speech Processing using Acoustic Language Similarity [81.51206991542242]
Cross-lingual transfer offers a compelling way to help bridge the digital divide.
Current cross-lingual algorithms have shown success in text-based tasks and speech-related tasks over some low-resource languages.
We propose a language similarity approach that can efficiently identify acoustic cross-lingual transfer pairs across hundreds of languages.
arXiv Detail & Related papers (2021-11-02T01:55:17Z)
- Meta-Transfer Learning for Code-Switched Speech Recognition [72.84247387728999]
We propose a new learning method, meta-transfer learning, to transfer learn on a code-switched speech recognition system in a low-resource setting.
Our model learns to recognize individual languages, and transfer them so as to better recognize mixed-language speech by conditioning the optimization on the code-switching data.
arXiv Detail & Related papers (2020-04-29T14:27:19Z)