The Effects of Input Type and Pronunciation Dictionary Usage in Transfer
Learning for Low-Resource Text-to-Speech
- URL: http://arxiv.org/abs/2306.00535v1
- Date: Thu, 1 Jun 2023 10:42:56 GMT
- Title: The Effects of Input Type and Pronunciation Dictionary Usage in Transfer
Learning for Low-Resource Text-to-Speech
- Authors: Phat Do, Matt Coler, Jelske Dijkstra, Esther Klabbers
- Abstract summary: We compare phone labels and articulatory features as input for cross-lingual transfer learning in text-to-speech for low-resource languages (LRLs).
Experiments with FastSpeech 2 and the LRL West Frisian show that using articulatory features outperformed using phone labels in both intelligibility and naturalness.
- Score: 1.1852406625172218
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We compare phone labels and articulatory features as input for cross-lingual
transfer learning in text-to-speech (TTS) for low-resource languages (LRLs).
Experiments with FastSpeech 2 and the LRL West Frisian show that using
articulatory features outperformed using phone labels in both intelligibility
and naturalness. For LRLs without pronunciation dictionaries, we propose two
novel approaches: a) using a massively multilingual model to perform
grapheme-to-phone (G2P) conversion during both training and synthesis, and b)
using a universal phone recognizer to create a makeshift dictionary. Results
show that the G2P approach performs largely on par with using a ground-truth
dictionary, and that the phone recognition approach, while performing
generally worse, remains a viable option for LRLs less suited to the G2P
approach. Within each
approach, using articulatory features as input outperforms using phone labels.
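As a rough, non-authoritative illustration of what "articulatory features as input" can look like in practice, the sketch below converts an IPA phone string into per-phone feature vectors with the open-source panphon library. The word and the choice of panphon are assumptions for illustration; the paper's exact feature set and tooling may differ.

```python
# Illustrative sketch only: derive articulatory feature vectors from IPA
# phones with panphon (pip install panphon). Not the paper's pipeline.
import panphon

ft = panphon.FeatureTable()

ipa_word = "frisk"  # hypothetical IPA phone string for illustration

# Each phone maps to a vector of {-1, 0, +1} values over features such as
# [voice], [nasal], and [continuant]; such vectors can replace one-hot
# phone labels as the TTS encoder input.
vectors = ft.word_to_vector_list(ipa_word, numeric=True)
for phone, vec in zip(ipa_word, vectors):
    print(phone, vec)
```

Because phones that sound alike share feature values, a model trained on a high-resource language can transfer those dimensions to unseen phones in the target LRL, which is one plausible reading of why this input type helps.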
Related papers
- Whistle: Data-Efficient Multilingual and Crosslingual Speech Recognition via Weakly Phonetic Supervision [16.992058149317753]
This paper explores pre-training with weakly phonetic supervision for data-efficient multilingual and crosslingual automatic speech recognition (MCL-ASR).
We relax the requirement of gold-standard human-validated phonetic transcripts, and obtain International Phonetic Alphabet (IPA)-based transcriptions by leveraging the LanguageNet grapheme-to-phoneme (G2P) models.
Experiments demonstrate the advantages of phoneme-based models for MCL-ASR, in terms of speech recognition for seen languages, crosslingual performance for unseen languages with different amounts of few-shot data, overcoming catastrophic forgetting, and training efficiency.
arXiv Detail & Related papers (2024-06-04T09:56:05Z)
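The Whistle entry above hinges on G2P-generated IPA instead of human-validated transcripts. The paper uses LanguageNet G2P models; purely as an installable stand-in, the sketch below shows the same idea with the epitran library (the language code and text are illustrative).

```python
# Sketch of weak phonetic supervision: produce IPA transcripts with an
# off-the-shelf G2P tool rather than human annotation. Whistle uses
# LanguageNet G2P models; epitran (pip install epitran) stands in here.
import epitran

epi = epitran.Epitran("spa-Latn")  # ISO 639-3 language code + script
ipa = epi.transliterate("supervisión fonética débil")
print(ipa)  # rule-based IPA; noisy, but cheap to obtain at scale
```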
- Optimizing Two-Pass Cross-Lingual Transfer Learning: Phoneme Recognition and Phoneme to Grapheme Translation [9.118302330129284]
This research optimizes two-pass cross-lingual transfer learning in low-resource languages.
We optimize phoneme vocabulary coverage by merging phonemes based on shared articulatory characteristics.
We introduce a global phoneme noise generator for realistic ASR noise during phoneme-to-grapheme training to reduce error propagation.
arXiv Detail & Related papers (2023-12-06T06:37:24Z)
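The "global phoneme noise generator" in the entry above corrupts gold phoneme sequences at training time so that phoneme-to-grapheme translation learns to cope with ASR errors. A minimal sketch, with made-up rates and inventory (the paper builds its noise from realistic ASR behavior):

```python
# Hypothetical phoneme noise generator: corrupt a gold phoneme sequence
# with random substitutions, deletions, and insertions to mimic ASR
# errors during phoneme-to-grapheme training. Rates are illustrative.
import random

PHONE_INVENTORY = ["p", "b", "t", "d", "k", "g", "a", "e", "i", "o", "u"]

def add_phoneme_noise(phones, p_sub=0.05, p_del=0.03, p_ins=0.02):
    noisy = []
    for ph in phones:
        r = random.random()
        if r < p_del:                            # deletion
            continue
        if r < p_del + p_sub:                    # substitution
            noisy.append(random.choice(PHONE_INVENTORY))
        else:
            noisy.append(ph)
        if random.random() < p_ins:              # spurious insertion
            noisy.append(random.choice(PHONE_INVENTORY))
    return noisy

print(add_phoneme_noise(["b", "a", "t"]))
```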
- Improving grapheme-to-phoneme conversion by learning pronunciations from speech recordings [12.669655363646257]
The Grapheme-to-Phoneme (G2P) task aims to convert orthographic input into a discrete phonetic representation.
We propose a method to improve the G2P conversion task by learning pronunciation examples from audio recordings.
arXiv Detail & Related papers (2023-07-31T13:25:38Z)
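One hedged way to picture "learning pronunciations from audio recordings": run a phone recognizer over recordings whose word transcripts are known and treat its output as extra (word, pronunciation) pairs for G2P training. The sketch uses the open-source allosaurus recognizer and a hypothetical local file; the paper's own learning method differs.

```python
# Hedged sketch, NOT the paper's method: mine (word, pronunciation) pairs
# from audio with the open-source allosaurus universal phone recognizer
# (pip install allosaurus).
from allosaurus.app import read_recognizer

model = read_recognizer()  # downloads the default universal phone model

def mine_pronunciation(word, audio_path):
    phones = model.recognize(audio_path)  # space-separated IPA phones
    return word, phones.split()

# Assumes a local recording "hello.wav" of the word being spoken.
print(mine_pronunciation("hello", "hello.wav"))
```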
- MMSpeech: Multi-modal Multi-task Encoder-Decoder Pre-training for Speech Recognition [75.12948999653338]
We propose a novel multi-task encoder-decoder pre-training framework (MMSpeech) for Mandarin automatic speech recognition (ASR).
We employ a multi-task learning framework including five self-supervised and supervised tasks with speech and text data.
Experiments on AISHELL-1 show that our proposed method achieves state-of-the-art performance, with a more than 40% relative improvement compared with other pre-training methods.
arXiv Detail & Related papers (2022-11-29T13:16:09Z)
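MMSpeech's five tasks share one encoder-decoder; at its simplest, such multi-task pre-training reduces to a weighted sum of per-task losses on the shared backbone. A schematic sketch with invented task names and weights (see the paper for the actual tasks):

```python
# Schematic multi-task objective: several task losses on one shared
# encoder-decoder, combined by weighted sum. Names/weights are invented.
import torch

def multitask_loss(losses, weights):
    """Weighted sum of per-task losses from a shared backbone."""
    return sum(weights[name] * loss for name, loss in losses.items())

losses = {
    "masked_phone_prediction": torch.tensor(2.1),  # self-supervised
    "masked_speech_modeling": torch.tensor(1.4),   # self-supervised
    "supervised_asr": torch.tensor(0.9),           # supervised
}
weights = {"masked_phone_prediction": 1.0,
           "masked_speech_modeling": 0.5,
           "supervised_asr": 1.0}
print(multitask_loss(losses, weights))  # one scalar to backpropagate
```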
- SoundChoice: Grapheme-to-Phoneme Models with Semantic Disambiguation [10.016862617549991]
This paper proposes SoundChoice, a novel Grapheme-to-Phoneme (G2P) architecture that processes entire sentences rather than operating at the word level.
SoundChoice achieves a Phoneme Error Rate (PER) of 2.65% on whole-sentence transcription using data from LibriSpeech and Wikipedia.
arXiv Detail & Related papers (2022-07-27T01:14:59Z)
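The 2.65% figure above is a phoneme error rate: the Levenshtein edit distance between predicted and reference phoneme sequences, normalized by reference length. A minimal reference implementation of the standard definition (the paper's evaluation toolkit may differ in details):

```python
# Phoneme Error Rate: edit distance over phoneme sequences divided by
# the reference length. Standard definition; shown for clarity.
def phoneme_error_rate(ref, hyp):
    n, m = len(ref), len(hyp)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[n][m] / max(n, 1)

print(phoneme_error_rate(["k", "æ", "t"], ["k", "a", "t"]))  # 0.33...
```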
- Few-Shot Cross-Lingual TTS Using Transferable Phoneme Embedding [55.989376102986654]
This paper studies a transferable phoneme embedding framework that aims to deal with the cross-lingual text-to-speech problem under the few-shot setting.
We propose a framework that consists of a phoneme-based TTS model and a codebook module to project phonemes from different languages into a learned latent space.
arXiv Detail & Related papers (2022-06-27T11:24:40Z)
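The codebook module above maps phonemes from any language into one shared latent space. A rough sketch of the lookup step, with invented dimensions: embed the new language's phonemes, then snap each embedding to its nearest entry in a learned codebook.

```python
# Rough sketch of a shared-codebook projection (dimensions invented):
# quantize per-language phoneme embeddings to nearest codebook entries.
import torch

codebook = torch.randn(64, 128)      # 64 shared latent codes, dim 128
phoneme_emb = torch.randn(10, 128)   # 10 phonemes from a new language

dists = torch.cdist(phoneme_emb, codebook)  # (10, 64) pairwise distances
nearest = dists.argmin(dim=1)               # index of closest code
quantized = codebook[nearest]               # (10, 128) shared-space input
print(nearest.tolist())
```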
- Wav2Seq: Pre-training Speech-to-Text Encoder-Decoder Models Using Pseudo Languages [58.43299730989809]
We introduce Wav2Seq, the first self-supervised approach to pre-train both parts of encoder-decoder models for speech data.
We induce a pseudo language as a compact discrete representation, and formulate a self-supervised pseudo speech recognition task.
This process stands on its own, or can be applied as low-cost second-stage pre-training.
arXiv Detail & Related papers (2022-05-02T17:59:02Z)
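Wav2Seq's "pseudo language" is a compact discrete token sequence induced from the speech itself. Loosely following that recipe with random stand-in features: cluster frame features with k-means, then collapse runs of repeated cluster IDs (Wav2Seq additionally applies BPE, omitted here).

```python
# Sketch of pseudo-language induction: cluster frame features, then
# deduplicate consecutive cluster IDs into compact pseudo tokens.
# Features here are random stand-ins; Wav2Seq's full recipe also uses BPE.
import numpy as np
from sklearn.cluster import KMeans

frames = np.random.randn(200, 39)  # stand-in acoustic feature frames
ids = KMeans(n_clusters=25, n_init=10).fit_predict(frames)

# Collapse runs: [3, 3, 3, 7, 7, 1] -> [3, 7, 1]
pseudo_tokens = [ids[0]] + [b for a, b in zip(ids, ids[1:]) if a != b]
print(pseudo_tokens[:20])  # targets for a pseudo speech recognition task
```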
- Textless Speech-to-Speech Translation on Real Data [49.134208897722246]
We present a textless speech-to-speech translation (S2ST) system that can translate speech from one language directly into another.
We tackle the challenge in modeling multi-speaker target speech and train the systems with real-world S2ST data.
arXiv Detail & Related papers (2021-12-15T18:56:35Z)
- Intent Classification Using Pre-Trained Embeddings For Low Resource Languages [67.40810139354028]
Building Spoken Language Understanding systems that do not rely on language-specific Automatic Speech Recognition is an important yet less explored problem in language processing.
We present a comparative study aimed at employing a pre-trained acoustic model to perform Spoken Language Understanding in low-resource scenarios.
We perform experiments across three different languages, English, Sinhala, and Tamil, each with different data sizes to simulate high, medium, and low resource scenarios.
arXiv Detail & Related papers (2021-10-18T13:06:59Z)
- Neural Machine Translation for Multilingual Grapheme-to-Phoneme Conversion [13.543705472805431]
We present a single end-to-end trained neural G2P model that shares the same encoder and decoder across multiple languages.
We show a 7.2% average improvement in phoneme error rate over low resource languages and no degradation over high resource ones compared to monolingual baselines.
arXiv Detail & Related papers (2020-06-25T06:16:29Z)
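A single shared encoder-decoder across languages is typically steered with a language token prepended to the grapheme sequence. A minimal sketch of that input convention (token names invented; the paper's exact scheme may differ):

```python
# Minimal sketch of the shared-model input convention for multilingual
# G2P: prepend a language token so one model can serve many languages.
def make_g2p_input(word, lang):
    """('hynder', 'fy') -> ['<fy>', 'h', 'y', 'n', 'd', 'e', 'r']"""
    return [f"<{lang}>"] + list(word)

print(make_g2p_input("hynder", "fy"))  # West Frisian
print(make_g2p_input("horse", "en"))   # English, same model
```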
- Universal Phone Recognition with a Multilingual Allophone System [135.2254086165086]
We propose a joint model of language-independent phone and language-dependent phoneme distributions.
In multilingual ASR experiments over 11 languages, we find that this model improves testing performance by 2% phoneme error rate absolute.
Our recognizer achieves phone accuracy improvements of more than 17%, moving a step closer to speech recognition for all languages in the world.
arXiv Detail & Related papers (2020-02-26T21:28:57Z)
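The joint phone/phoneme model above can be pictured as an allophone layer: the network predicts language-independent phone probabilities, and a per-language allophone mapping reduces them to phoneme probabilities. A toy sketch with an invented mapping and invented probabilities (the paper's released Allosaurus recognizer implements the trained version):

```python
# Toy allophone layer: reduce language-independent phone probabilities
# to language-dependent phoneme probabilities by taking the max over
# each phoneme's allophones. Mapping and probabilities are invented.
import numpy as np

phones = ["p", "pʰ", "t", "tʰ"]              # language-independent phones
phone_probs = np.array([0.1, 0.5, 0.3, 0.1])

# e.g. English /p/ surfaces as [p] or [pʰ]; /t/ as [t] or [tʰ].
allophones = {"p": ["p", "pʰ"], "t": ["t", "tʰ"]}

idx = {ph: i for i, ph in enumerate(phones)}
phoneme_probs = {
    phoneme: max(phone_probs[idx[a]] for a in allos)
    for phoneme, allos in allophones.items()
}
print(phoneme_probs)  # {'p': 0.5, 't': 0.3}
```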