Revisiting IPA-based Cross-lingual Text-to-speech
- URL: http://arxiv.org/abs/2110.07187v2
- Date: Mon, 18 Oct 2021 10:36:08 GMT
- Title: Revisiting IPA-based Cross-lingual Text-to-speech
- Authors: Haitong Zhang, Haoyue Zhan, Yang Zhang, Xinyuan Yu, Yue Lin
- Abstract summary: The International Phonetic Alphabet (IPA) has been widely used in cross-lingual text-to-speech (TTS) to achieve cross-lingual voice cloning (CL VC).
In this paper, we report some empirical findings of building a cross-lingual TTS model using IPA as inputs.
Experiments show that how the IPA and suprasegmental sequences are processed has a negligible impact on CL VC performance.
- Score: 11.010299086810994
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The International Phonetic Alphabet (IPA) has been widely used in
cross-lingual text-to-speech (TTS) to achieve cross-lingual voice cloning (CL
VC). However, IPA itself has been understudied in cross-lingual TTS. In this
paper, we report some empirical findings from building a cross-lingual TTS
model using IPA as input. Experiments show that how the IPA and suprasegmental
sequences are processed has a negligible impact on CL VC performance.
Furthermore, we find that building an IPA-based TTS system on a dataset with
only one speaker per language fails at CL VC, because the language-unique IPA
and tone/stress symbols can leak speaker identity. In addition, we experiment
with different combinations of speakers in the training dataset to further
investigate how the number of speakers affects CL VC performance.
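The abstract's point about how IPA and suprasegmental sequences are processed can be made concrete with a minimal, hypothetical sketch of a TTS front end. The symbol inventory and function names below are illustrative assumptions, not taken from the paper: one strategy keeps stress and length marks as separate tokens, the other merges them with adjacent phone segments.

```python
# Two illustrative ways to tokenize an IPA string carrying suprasegmentals
# (stress and length marks) for a TTS text front end. The symbol set and
# merging rules are simplified assumptions for this sketch.

SUPRASEGMENTALS = {"ˈ", "ˌ", "ː"}  # primary stress, secondary stress, length

def tokenize_separate(ipa: str) -> list[str]:
    """Keep every suprasegmental mark as its own token in the sequence."""
    return list(ipa)

def tokenize_merged(ipa: str) -> list[str]:
    """Attach each suprasegmental to its neighboring segment, so stress
    and length marks become part of a combined token (e.g. 'ˈk', 'aː')."""
    tokens: list[str] = []
    for ch in ipa:
        if ch in {"ˈ", "ˌ"}:                     # stress marks precede a segment
            tokens.append(ch)                     # start a pending stress token
        elif ch == "ː" and tokens:                # length marks follow a segment
            tokens[-1] += ch
        elif tokens and tokens[-1] in {"ˈ", "ˌ"}:
            tokens[-1] += ch                      # attach segment to its stress mark
        else:
            tokens.append(ch)
    return tokens

print(tokenize_separate("ˈkaːt"))  # ['ˈ', 'k', 'a', 'ː', 't']
print(tokenize_merged("ˈkaːt"))    # ['ˈk', 'aː', 't']
```

Either tokenization yields a different input sequence length and vocabulary for the TTS model; the paper's finding is that this choice has a negligible effect on CL VC performance.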
Related papers
- Cross-Dialect Text-To-Speech in Pitch-Accent Language Incorporating Multi-Dialect Phoneme-Level BERT [29.167336994990542]
Cross-dialect text-to-speech (CD-TTS) is the task of synthesizing learned speakers' voices in non-native dialects.
We present a novel TTS model comprising three sub-modules to perform competitively at this task.
arXiv Detail & Related papers (2024-09-11T13:40:27Z)
- An Initial Investigation of Language Adaptation for TTS Systems under Low-resource Scenarios [76.11409260727459]
This paper explores the language adaptation capability of ZMM-TTS, a recent SSL-based multilingual TTS system.
We demonstrate that the similarity in phonetics between the pre-training and target languages, as well as the language category, affects the target language's adaptation performance.
arXiv Detail & Related papers (2024-06-13T08:16:52Z)
- Whistle: Data-Efficient Multilingual and Crosslingual Speech Recognition via Weakly Phonetic Supervision [16.992058149317753]
This paper explores pre-training with weakly phonetic supervision for data-efficient multilingual and crosslingual automatic speech recognition (MCL-ASR).
We relax the requirement of gold-standard human-validated phonetic transcripts, and obtain International Phonetic Alphabet (IPA) based transcriptions by leveraging the LanguageNet grapheme-to-phoneme (G2P) models.
Experiments demonstrate the advantages of phoneme-based models for MCL-ASR, in terms of speech recognition for seen languages, crosslingual performance for unseen languages with different amounts of few-shot data, overcoming catastrophic forgetting, and training efficiency.
arXiv Detail & Related papers (2024-06-04T09:56:05Z)
- The taste of IPA: Towards open-vocabulary keyword spotting and forced alignment in any language [7.0944623704102625]
We show that phoneme-based models for speech processing can achieve strong crosslinguistic generalizability to unseen languages.
We propose CLAP-IPA, a multi-lingual phoneme-speech contrastive embedding model capable of open-vocabulary matching between arbitrary speech signals and phonemic sequences.
arXiv Detail & Related papers (2023-11-14T17:09:07Z)
- Towards a Deep Understanding of Multilingual End-to-End Speech Translation [52.26739715012842]
We analyze representations learnt in a multilingual end-to-end speech translation model trained over 22 languages.
We derive three major findings from our analysis.
arXiv Detail & Related papers (2023-10-31T13:50:55Z)
- Efficiently Aligned Cross-Lingual Transfer Learning for Conversational Tasks using Prompt-Tuning [98.60739735409243]
Cross-lingual transfer of language models trained on high-resource languages like English has been widely studied for many NLP tasks.
We introduce XSGD for cross-lingual alignment pretraining, a parallel and large-scale multilingual conversation dataset.
To facilitate aligned cross-lingual representations, we develop an efficient prompt-tuning-based method for learning alignment prompts.
arXiv Detail & Related papers (2023-04-03T18:46:01Z)
- ERNIE-SAT: Speech and Text Joint Pretraining for Cross-Lingual Multi-Speaker Text-to-Speech [58.93395189153713]
We extend the pretraining method for cross-lingual multi-speaker speech synthesis tasks.
We propose a speech-text joint pretraining framework, where we randomly mask the spectrogram and the phonemes.
Our model shows great improvements over speaker-embedding-based multi-speaker TTS methods.
arXiv Detail & Related papers (2022-11-07T13:35:16Z)
- Applying Feature Underspecified Lexicon Phonological Features in Multilingual Text-to-Speech [1.9688095374610102]
We present a mapping of ARPABET/pinyin to SAMPA/SAMPA-SC and then to phonological features.
This mapping was tested for whether it could lead to the successful generation of native, non-native, and code-switched speech in the two languages.
arXiv Detail & Related papers (2022-04-14T21:04:55Z)
- Phonological Features for 0-shot Multilingual Speech Synthesis [50.591267188664666]
We show that code-switching is possible for languages unseen during training, even within monolingual models.
We generate intelligible, code-switched speech in a new language at test time, including the approximation of sounds never seen in training.
arXiv Detail & Related papers (2020-08-06T18:25:18Z)
- That Sounds Familiar: an Analysis of Phonetic Representations Transfer Across Languages [72.9927937955371]
We use the resources existing in other languages to train a multilingual automatic speech recognition model.
We observe significant improvements across all languages in the multilingual setting, and stark degradation in the crosslingual setting.
Our analysis uncovered that even the phones that are unique to a single language can benefit greatly from adding training data from other languages.
arXiv Detail & Related papers (2020-05-16T22:28:09Z)
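The open-vocabulary keyword spotting described in the CLAP-IPA entry above can be illustrated with a minimal sketch: a speech clip and candidate phoneme sequences are embedded into a shared space and ranked by cosine similarity. The "encoders" below are stand-in random vectors and the keywords are invented examples; only the matching step reflects the technique.

```python
# Hypothetical sketch of contrastive phoneme-speech matching: embed both
# modalities into one space, then pick the phoneme sequence whose embedding
# is most similar to the speech clip's. Embeddings here are random stand-ins.
import math
import random

random.seed(0)
DIM = 16

def normalize(v):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Stand-in embedding for one speech clip.
speech = normalize([random.gauss(0, 1) for _ in range(DIM)])

# Candidate keyword embeddings; "kæt" is constructed near the clip on purpose.
keywords = {
    "kæt": normalize([s + 0.05 * random.gauss(0, 1) for s in speech]),
    "dɔɡ": normalize([random.gauss(0, 1) for _ in range(DIM)]),
    "bɜːd": normalize([random.gauss(0, 1) for _ in range(DIM)]),
}

# Open-vocabulary spotting: rank candidates by cosine similarity to the clip.
scores = {kw: dot(speech, emb) for kw, emb in keywords.items()}
best = max(scores, key=scores.get)
print(best)  # "kæt", since its embedding was built as a small perturbation of the clip
```

Because the matching is done over arbitrary phoneme sequences rather than a fixed label set, unseen keywords, and by extension unseen languages, can be scored the same way.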
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.