Low-Resourced Speech Recognition for Iu Mien Language via Weakly-Supervised Phoneme-based Multilingual Pre-training
- URL: http://arxiv.org/abs/2407.13292v2
- Date: Mon, 16 Sep 2024 08:23:30 GMT
- Title: Low-Resourced Speech Recognition for Iu Mien Language via Weakly-Supervised Phoneme-based Multilingual Pre-training
- Authors: Lukuan Dong, Donghong Qin, Fengbo Bai, Fanhua Song, Yan Liu, Chen Xu, Zhijian Ou
- Abstract summary: With less than 10 hours of transcribed Iu Mien speech, this paper investigates and compares three approaches to Iu Mien speech recognition.
Phoneme supervision is found to achieve better results than subword supervision and self-supervision.
- Score: 12.460425460648516
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The mainstream automatic speech recognition (ASR) technology usually requires hundreds to thousands of hours of annotated speech data. Three approaches to low-resourced ASR are phoneme-based supervised pre-training, subword-based supervised pre-training, and self-supervised pre-training over multilingual data. The Iu Mien language is the main ethnic language of the Yao ethnic group in China and is low-resourced in the sense that annotated speech is very limited. With less than 10 hours of transcribed Iu Mien speech, this paper investigates and compares the three approaches for Iu Mien speech recognition. Our experiments are based on three recently released backbone models pretrained over 10 languages from the CommonVoice dataset (CV-Lang10), which correspond to the three approaches for low-resourced ASR. It is found that phoneme supervision achieves better results than subword supervision and self-supervision, thereby providing higher data efficiency. In particular, the Whistle models, i.e., those obtained by weakly-supervised phoneme-based multilingual pre-training, obtain the most competitive results.
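As a concrete illustration of the phoneme-supervised approach compared in this paper, the sketch below fine-tunes a pretrained multilingual encoder with a CTC head on phoneme-labelled speech. It is not the authors' code: the toy encoder, feature dimensions, and phoneme-inventory size are placeholder assumptions, whereas the paper's actual backbones come from the CV-Lang10 release.

```python
# Minimal sketch of phoneme-supervised CTC fine-tuning (illustrative only).
# The pretrained multilingual encoder and the Iu Mien phoneme inventory are
# placeholders; the paper's actual models come from the CV-Lang10 release.
import torch
import torch.nn as nn

class PhonemeCTCModel(nn.Module):
    def __init__(self, encoder: nn.Module, hidden_dim: int, num_phonemes: int):
        super().__init__()
        self.encoder = encoder                                 # pretrained multilingual encoder
        self.head = nn.Linear(hidden_dim, num_phonemes + 1)   # +1 for the CTC blank

    def forward(self, speech_features):
        hidden = self.encoder(speech_features)                 # (batch, frames, hidden_dim)
        return self.head(hidden).log_softmax(dim=-1)

# Toy stand-in encoder so the sketch runs end to end.
encoder = nn.Sequential(nn.Linear(80, 256), nn.ReLU(), nn.Linear(256, 256))
model = PhonemeCTCModel(encoder, hidden_dim=256, num_phonemes=60)

ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# One fake batch: 2 utterances of 100 frames of 80-dim filterbanks,
# each labelled with a short phoneme-ID sequence (IDs 1..60, 0 = blank).
feats = torch.randn(2, 100, 80)
targets = torch.randint(1, 61, (2, 12))
log_probs = model(feats).transpose(0, 1)             # (frames, batch, classes) for CTCLoss
input_lengths = torch.full((2,), 100, dtype=torch.long)
target_lengths = torch.full((2,), 12, dtype=torch.long)

loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
loss.backward()
optimizer.step()
print(f"CTC loss on the toy batch: {loss.item():.3f}")
```

In a phoneme-based setting like this, the decoded phoneme sequences would subsequently be mapped to words, e.g. with a pronunciation lexicon and a word-level language model.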
Related papers
- Whistle: Data-Efficient Multilingual and Crosslingual Speech Recognition via Weakly Phonetic Supervision [16.992058149317753]
This paper explores the approach of pre-training with weakly phonetic supervision towards data-efficient multilingual and crosslingual automatic speech recognition (MCL-ASR).
We relax the requirement of gold-standard human-validated phonetic transcripts, and obtain International Phonetic Alphabet (IPA) based transcription by leveraging the LanguageNet grapheme-to-phoneme (G2P) models.
Experiments demonstrate the advantages of phoneme-based models for MCL-ASR, in terms of speech recognition for seen languages, crosslingual performance for unseen languages with different amounts of few-shot data, overcoming catastrophic forgetting, and training efficiency.
arXiv Detail & Related papers (2024-06-04T09:56:05Z)
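To make the weak phonetic supervision used by Whistle concrete, the sketch below converts orthographic transcripts into IPA with an off-the-shelf G2P tool. The phonemizer package and its espeak backend are used here purely as stand-ins (the paper relies on LanguageNet G2P models), and the example sentences and language code are placeholders.

```python
# Illustrative stand-in for weak phonetic supervision: convert orthographic
# transcripts to IPA with an off-the-shelf G2P backend.  The Whistle paper uses
# LanguageNet G2P models; phonemizer/espeak here is only a rough substitute.
from phonemizer import phonemize

transcripts = [
    "hello world",
    "speech recognition for low resource languages",
]

ipa = phonemize(
    transcripts,
    language="en-us",        # placeholder language code; swap per training language
    backend="espeak",
    strip=True,
)
for text, phones in zip(transcripts, ipa):
    print(f"{text!r} -> {phones}")
```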
- Language-Universal Phonetic Representation in Multilingual Speech Pretraining for Low-Resource Speech Recognition [28.21805271848413]
We leverage an International Phonetic Alphabet (IPA) multilingual model to create frame-level pseudo labels for unlabeled speech.
We use these pseudo labels to guide hidden-unit BERT (HuBERT) based speech pretraining in a phonetically-informed manner.
Our approach outperforms most state-of-the-art methods, while using much less pretraining data in terms of hours and language diversity.
arXiv Detail & Related papers (2023-05-19T10:15:11Z)
- Hindi as a Second Language: Improving Visually Grounded Speech with Semantically Similar Samples [89.16814518860357]
The objective of this work is to explore the learning of visually grounded speech (VGS) models from a multilingual perspective.
Our key contribution in this work is to leverage the power of a high-resource language in a bilingual visually grounded speech model to improve the performance of a low-resource language.
arXiv Detail & Related papers (2023-03-30T16:34:10Z)
- Multilingual Zero Resource Speech Recognition Base on Self-Supervise Pre-Trained Acoustic Models [14.887781621924255]
This paper is the first attempt to extend the use of pre-trained models into word-level zero-resource speech recognition.
It is done by fine-tuning the pre-trained models on IPA phoneme transcriptions and decoding with a language model trained on extra texts.
Experiments on Wav2vec 2.0 and HuBERT models show that this method can achieve less than 20% word error rate on some languages.
arXiv Detail & Related papers (2022-10-13T12:11:18Z)
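The recipe summarized above, fine-tuning on IPA phoneme transcriptions and then decoding with a language model trained on extra text, can be sketched with a CTC beam-search decoder that applies an external n-gram LM. The snippet below uses pyctcdecode for illustration; the phoneme inventory, the KenLM file path, and the random acoustic log-probabilities are placeholders rather than the paper's actual setup.

```python
# Sketch of the "fine-tune on IPA phonemes + decode with an external LM" recipe.
# The acoustic-model output and the KenLM file path are placeholders; the paper
# fine-tunes wav2vec 2.0 / HuBERT and trains the LM on extra text.
import numpy as np
from pyctcdecode import build_ctcdecoder

# Toy phoneme inventory for the target language; "" is the CTC blank token.
phoneme_labels = ["", "a", "e", "i", "o", "u", "m", "n", "s", "t", "k"]

decoder = build_ctcdecoder(
    phoneme_labels,
    kenlm_model_path="phoneme_lm.arpa",  # hypothetical n-gram LM over phoneme strings
    alpha=0.5,   # LM weight
    beta=1.0,    # insertion bonus
)

# Placeholder acoustic log-probabilities: (frames, len(phoneme_labels)).
log_probs = np.log(np.random.dirichlet(np.ones(len(phoneme_labels)), size=50))

print(decoder.decode(log_probs))
```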
- Language-Agnostic Meta-Learning for Low-Resource Text-to-Speech with Articulatory Features [30.37026279162593]
In this work, we use embeddings derived from articulatory vectors rather than embeddings derived from phoneme identities to learn phoneme representations that hold across languages.
This enables us to fine-tune a high-quality text-to-speech model on just 30 minutes of data in a previously unseen language spoken by a previously unseen speaker.
arXiv Detail & Related papers (2022-03-07T07:58:01Z)
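To illustrate the articulatory-feature idea summarized above, the sketch below maps an IPA string to per-phoneme articulatory feature vectors with the panphon library, which could replace identity-based phoneme embeddings. This is only an approximation under stated assumptions: the paper derives its embeddings from its own articulatory vectors, and the example word here is hypothetical.

```python
# Sketch: represent phonemes by articulatory feature vectors instead of IDs.
# panphon is used here for illustration; the paper's exact feature set may differ.
import panphon

ft = panphon.FeatureTable()

# IPA string for a hypothetical word in a previously unseen language.
ipa_word = "mjɛn"

# Each phoneme becomes a vector of articulatory features in {-1, 0, +1}
# (e.g. syllabic, sonorant, voiced, nasal, ...), shared across languages.
vectors = ft.word_to_vector_list(ipa_word, numeric=True)
for seg, vec in zip(ft.ipa_segs(ipa_word), vectors):
    print(seg, vec)
```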
- Discovering Phonetic Inventories with Crosslingual Automatic Speech Recognition [71.49308685090324]
This paper investigates the influence of different factors (i.e., model architecture, phonotactic model, type of speech representation) on phone recognition in an unknown language.
We find that unique sounds, similar sounds, and tone languages remain a major challenge for phonetic inventory discovery.
arXiv Detail & Related papers (2022-01-26T22:12:55Z)
- Intent Classification Using Pre-Trained Embeddings For Low Resource Languages [67.40810139354028]
Building Spoken Language Understanding systems that do not rely on language-specific Automatic Speech Recognition is an important yet less explored problem in language processing.
We present a comparative study aimed at employing a pre-trained acoustic model to perform Spoken Language Understanding in low resource scenarios.
We perform experiments across three different languages: English, Sinhala, and Tamil, each with different data sizes to simulate high-, medium-, and low-resource scenarios.
arXiv Detail & Related papers (2021-10-18T13:06:59Z)
- LRSpeech: Extremely Low-Resource Speech Synthesis and Recognition [148.43282526983637]
We develop LRSpeech, a TTS and ASR system with low data cost for low-resource languages.
We conduct experiments on an experimental language (English) and a truly low-resource language (Lithuanian) to verify the effectiveness of LRSpeech.
We are currently deploying LRSpeech into a commercialized cloud speech service to support TTS on more rare languages.
arXiv Detail & Related papers (2020-08-09T08:16:33Z)
- Unsupervised Cross-lingual Representation Learning for Speech Recognition [63.85924123692923]
XLSR learns cross-lingual speech representations by pretraining a single model from the raw waveform of speech in multiple languages.
We build on wav2vec 2.0, which is trained by solving a contrastive task over masked latent speech representations.
Experiments show that cross-lingual pretraining significantly outperforms monolingual pretraining.
arXiv Detail & Related papers (2020-06-24T18:25:05Z)
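The contrastive task over masked latent speech representations that wav2vec 2.0 and XLSR optimize can be viewed as an InfoNCE-style loss: the context vector at each masked frame must identify its true quantized target among sampled distractors. The toy implementation below is deliberately simplified (random tensors, distractors sampled with replacement, no quantizer) and is not the fairseq implementation.

```python
# Minimal sketch of the contrastive objective behind wav2vec 2.0 / XLSR:
# the context network must pick the true latent for each masked frame
# among a set of distractors.  Shapes and sampling are simplified.
import torch
import torch.nn.functional as F

def contrastive_loss(context, targets, num_distractors=10, temperature=0.1):
    """context, targets: (num_masked_frames, dim)."""
    n, _ = targets.shape
    # Sample distractors from the other masked frames (simplification: the true
    # target index may occasionally be re-drawn as a distractor here).
    distractor_idx = torch.randint(0, n, (n, num_distractors))
    candidates = torch.cat([targets.unsqueeze(1),             # true target first
                            targets[distractor_idx]], dim=1)  # (n, 1+K, dim)
    sims = F.cosine_similarity(context.unsqueeze(1), candidates, dim=-1) / temperature
    labels = torch.zeros(n, dtype=torch.long)                 # index 0 = true target
    return F.cross_entropy(sims, labels)

# Toy example: 32 masked frames with 256-dim context and target vectors.
context = torch.randn(32, 256, requires_grad=True)
targets = torch.randn(32, 256)
loss = contrastive_loss(context, targets)
loss.backward()
print(f"contrastive loss: {loss.item():.3f}")
```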
- That Sounds Familiar: an Analysis of Phonetic Representations Transfer Across Languages [72.9927937955371]
We use the resources existing in other languages to train a multilingual automatic speech recognition model.
We observe significant improvements across all languages in the multilingual setting, and stark degradation in the crosslingual setting.
Our analysis uncovered that even the phones that are unique to a single language can benefit greatly from adding training data from other languages.
arXiv Detail & Related papers (2020-05-16T22:28:09Z)