Cross-lingual Low Resource Speaker Adaptation Using Phonological
Features
- URL: http://arxiv.org/abs/2111.09075v1
- Date: Wed, 17 Nov 2021 12:33:42 GMT
- Title: Cross-lingual Low Resource Speaker Adaptation Using Phonological
Features
- Authors: Georgia Maniati, Nikolaos Ellinas, Konstantinos Markopoulos, Georgios
Vamvoukakis, June Sig Sung, Hyoungmin Park, Aimilios Chalamandaris and Pirros
Tsiakoulis
- Abstract summary: We train a language-agnostic multispeaker model conditioned on a set of phonologically derived features common across different languages.
With as few as 32 and 8 utterances of target speaker data, we obtain high speaker similarity scores and naturalness comparable to the corresponding literature.
- Score: 2.8080708404213373
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The idea of using phonological features instead of phonemes as input to
sequence-to-sequence TTS has been recently proposed for zero-shot multilingual
speech synthesis. This approach is useful for code-switching, as it facilitates
the seamless uttering of foreign text embedded in a stream of native text. In
our work, we train a language-agnostic multispeaker model conditioned on a set
of phonologically derived features common across different languages, with the
goal of achieving cross-lingual speaker adaptation. We first experiment with
the effect of language phonological similarity on cross-lingual TTS of several
source-target language combinations. Subsequently, we fine-tune the model with
very limited data of a new speaker's voice in either a seen or an unseen
language, and achieve synthetic speech of equal quality, while preserving the
target speaker's identity. With as few as 32 and 8 utterances of target speaker
data, we obtain high speaker similarity scores and naturalness comparable to
the corresponding literature. In the extreme case of only 2 available
adaptation utterances, we find that our model behaves as a few-shot learner, as
the performance is similar in both the seen and unseen adaptation language
scenarios.
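The core idea, replacing language-specific phoneme identities with language-independent phonological feature vectors as model input, can be illustrated with a minimal sketch. The feature names and values below are an illustrative subset chosen for this example, not the paper's actual phonologically derived feature set, and a real system would cover the full IPA inventory:

```python
# Minimal sketch: encode phonemes as binary phonological feature vectors
# rather than one-hot phoneme identities. Because the features are shared
# across languages, phonemes of an unseen language map into the same input
# space as phonemes seen in training.

# Illustrative feature inventory (a real system uses a richer set).
FEATURES = ["syllabic", "voiced", "nasal", "labial", "coronal", "high", "back"]

# Illustrative feature values for a handful of IPA phonemes.
PHONEME_FEATURES = {
    "p": {"voiced": 0, "labial": 1},
    "b": {"voiced": 1, "labial": 1},
    "m": {"voiced": 1, "nasal": 1, "labial": 1},
    "t": {"voiced": 0, "coronal": 1},
    "i": {"syllabic": 1, "voiced": 1, "high": 1},
    "u": {"syllabic": 1, "voiced": 1, "high": 1, "back": 1},
}

def phoneme_to_vector(phoneme: str) -> list[int]:
    """Map one phoneme to a fixed-length binary feature vector."""
    values = PHONEME_FEATURES[phoneme]
    return [values.get(feature, 0) for feature in FEATURES]

def encode(phonemes: list[str]) -> list[list[int]]:
    """Encode a phoneme sequence as the feature matrix fed to the TTS model."""
    return [phoneme_to_vector(p) for p in phonemes]

if __name__ == "__main__":
    print(encode(["b", "i"]))
```

Representations of this kind are what makes the cross-lingual transfer in the abstract possible: fine-tuning on a few utterances of a new speaker, even in an unseen language, updates a model whose input space already covers that language's phonemes.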
Related papers
- An Initial Investigation of Language Adaptation for TTS Systems under Low-resource Scenarios [76.11409260727459]
This paper explores the language adaptation capability of ZMM-TTS, a recent SSL-based multilingual TTS system.
We demonstrate that the similarity in phonetics between the pre-training and target languages, as well as the language category, affects the target language's adaptation performance.
arXiv Detail & Related papers (2024-06-13T08:16:52Z)
- Learning Phonotactics from Linguistic Informants [54.086544221761486]
Our model iteratively selects or synthesizes a data-point according to one of a range of information-theoretic policies.
We find that the information-theoretic policies that our model uses to select items to query the informant achieve sample efficiency comparable to, or greater than, fully supervised approaches.
arXiv Detail & Related papers (2024-05-08T00:18:56Z)
- Textless Unit-to-Unit training for Many-to-Many Multilingual Speech-to-Speech Translation [65.13824257448564]
This paper proposes a textless training method for many-to-many multilingual speech-to-speech translation.
By treating the speech units as pseudo-text, we can focus on the linguistic content of the speech.
We demonstrate that the proposed UTUT model can be effectively utilized not only for Speech-to-Speech Translation (S2ST) but also for multilingual Text-to-Speech Synthesis (T2S) and Text-to-Speech Translation (T2ST).
arXiv Detail & Related papers (2023-08-03T15:47:04Z)
- MParrotTTS: Multilingual Multi-speaker Text to Speech Synthesis in Low Resource Setting [16.37243395952266]
MParrotTTS is a unified multilingual, multi-speaker text-to-speech (TTS) synthesis model.
It adapts to a new language with minimal supervised data and generalizes to languages not seen while training the self-supervised backbone.
We present extensive results on six languages in terms of speech naturalness and speaker similarity in parallel and cross-lingual synthesis.
arXiv Detail & Related papers (2023-05-19T13:43:36Z)
- ERNIE-SAT: Speech and Text Joint Pretraining for Cross-Lingual Multi-Speaker Text-to-Speech [58.93395189153713]
We extend the pretraining method for cross-lingual multi-speaker speech synthesis tasks.
We propose a speech-text joint pretraining framework, where we randomly mask the spectrogram and the phonemes.
Our model shows great improvements over speaker-embedding-based multi-speaker TTS methods.
arXiv Detail & Related papers (2022-11-07T13:35:16Z)
- Cross-lingual Text-To-Speech with Flow-based Voice Conversion for Improved Pronunciation [11.336431583289382]
This paper presents a method for end-to-end cross-lingual text-to-speech.
It aims to preserve the target language's pronunciation regardless of the original speaker's language.
arXiv Detail & Related papers (2022-10-31T12:44:53Z)
- Cross-Lingual Text-to-Speech Using Multi-Task Learning and Speaker Classifier Joint Training [6.256271702518489]
In cross-lingual speech synthesis, speech in various languages can be synthesized in the voice of a monolingual speaker.
This paper studies a multi-task learning framework to improve the cross-lingual speaker similarity.
arXiv Detail & Related papers (2022-01-20T12:02:58Z)
- Towards Language Modelling in the Speech Domain Using Sub-word Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM based on linguistic units including syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
arXiv Detail & Related papers (2021-10-31T22:48:30Z)
- Improve Cross-lingual Voice Cloning Using Low-quality Code-switched Data [11.18504333789534]
We propose to use low-quality code-switched data found from non-target speakers to achieve cross-lingual voice cloning for the target speakers.
Experiments show that our proposed method can generate high-quality code-switched speech in the target voices in terms of both naturalness and speaker consistency.
arXiv Detail & Related papers (2021-10-14T08:16:06Z)
- Exploring Teacher-Student Learning Approach for Multi-lingual Speech-to-Intent Classification [73.5497360800395]
We develop an end-to-end system that supports multiple languages.
We exploit knowledge from a pre-trained multi-lingual natural language processing model.
arXiv Detail & Related papers (2021-09-28T04:43:11Z)
- Cross-lingual Multispeaker Text-to-Speech under Limited-Data Scenario [10.779568857641928]
This paper presents an extension on Tacotron2 to achieve bilingual multispeaker speech synthesis.
We achieve cross-lingual synthesis, including code-switching cases, between English and Mandarin for monolingual speakers.
arXiv Detail & Related papers (2020-05-21T03:03:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.