Data-augmented cross-lingual synthesis in a teacher-student framework
- URL: http://arxiv.org/abs/2204.00061v1
- Date: Thu, 31 Mar 2022 20:01:32 GMT
- Title: Data-augmented cross-lingual synthesis in a teacher-student framework
- Authors: Marcel de Korte, Jaebok Kim, Aki Kunikoshi, Adaeze Adigwe, Esther
Klabbers
- Abstract summary: Cross-lingual synthesis is the task of letting a speaker generate fluent synthetic speech in another language.
Previous research shows that many models appear to have insufficient generalization capabilities.
We propose to apply the teacher-student paradigm to cross-lingual synthesis.
- Score: 3.2548794659022398
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Cross-lingual synthesis can be defined as the task of letting a speaker
generate fluent synthetic speech in another language. This is a challenging
task, and resulting speech can suffer from reduced naturalness, accented
speech, and/or loss of essential voice characteristics. Previous research shows
that many models appear to have insufficient generalization capabilities to
perform well on all of these cross-lingual aspects. To overcome these
generalization problems, we propose to apply the teacher-student paradigm to
cross-lingual synthesis. While a teacher model is commonly used to produce
teacher-forced data, we propose to also use it to produce augmented data for
unseen speaker-language pairs, where the aim is to retain essential speaker
characteristics. Both sets of data are then used to train the student model,
which learns to retain the naturalness and prosodic variation present in the
teacher-forced data while learning the speaker identity from the augmented
data. Some modifications to the student model are proposed to make the
separation of teacher-forced and augmented data more straightforward. Results
show that the proposed approach improves the retention of speaker
characteristics in the speech while maintaining high levels of naturalness and
prosodic variation.
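As a rough illustration of the training setup described above, the sketch below mixes teacher-forced and augmented examples in a single student update and weights their losses separately. Everything here is an assumption for illustration (the toy model, the loss weights, the `is_augmented` flag); the paper instead separates the two data types through modifications to the student model itself.

```python
import torch
import torch.nn as nn

class StudentTTS(nn.Module):
    """Toy stand-in for a student acoustic model: text + speaker in, mel out."""
    def __init__(self, text_dim=64, spk_dim=32, mel_dim=80):
        super().__init__()
        self.proj = nn.Linear(text_dim + spk_dim, mel_dim)

    def forward(self, text_emb, spk_emb):
        return self.proj(torch.cat([text_emb, spk_emb], dim=-1))

model = StudentTTS()
optim = torch.optim.Adam(model.parameters(), lr=1e-4)

def training_step(batch):
    # One update over a batch mixing teacher-forced and augmented examples,
    # distinguished by a per-example flag (the 0.5/1.0 weighting is assumed,
    # not taken from the paper).
    pred = model(batch["text_emb"], batch["spk_emb"])
    per_example = ((pred - batch["target_mel"]) ** 2).mean(dim=-1)
    weights = torch.where(batch["is_augmented"],
                          torch.full_like(per_example, 0.5),
                          torch.ones_like(per_example))
    loss = (weights * per_example).mean()
    optim.zero_grad()
    loss.backward()
    optim.step()
    return loss.item()

batch = {
    "text_emb": torch.randn(8, 64),    # e.g. pooled text encodings
    "spk_emb": torch.randn(8, 32),     # speaker embeddings
    "target_mel": torch.randn(8, 80),  # teacher-forced or augmented targets
    "is_augmented": torch.tensor([False] * 4 + [True] * 4),
}
print(training_step(batch))
```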
Related papers
- DisfluencySpeech -- Single-Speaker Conversational Speech Dataset with Paralanguage [7.096838107088313]
DisfluencySpeech is a studio-quality labeled English speech dataset with paralanguage.
A single speaker recreates nearly 10 hours of expressive utterances from the Switchboard-1 Telephone Speech Corpus (Switchboard).
arXiv Detail & Related papers (2024-06-13T05:23:22Z)
- TransVIP: Speech to Speech Translation System with Voice and Isochrony Preservation [97.54885207518946]
We introduce a novel model framework TransVIP that leverages diverse datasets in a cascade fashion.
We propose two separate encoders to preserve the speaker's voice characteristics and isochrony from the source speech during the translation process.
Our experiments on the French-English language pair demonstrate that our model outperforms the current state-of-the-art speech-to-speech translation model.
arXiv Detail & Related papers (2024-05-28T04:11:37Z)
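The TransVIP summary above mentions two separate encoders, one preserving voice characteristics and one preserving isochrony. Below is a minimal sketch of that dual-encoder conditioning idea, with invented dimensions and a toy decoder rather than the actual TransVIP architecture:

```python
import torch
import torch.nn as nn

class DualConditioner(nn.Module):
    """Two encoders over source speech: one summarizes voice (utterance-level),
    one tracks isochrony/pacing (frame-level); both condition a decoder."""
    def __init__(self, feat_dim=80, hid=128):
        super().__init__()
        self.voice_enc = nn.GRU(feat_dim, hid, batch_first=True)
        self.isochrony_enc = nn.GRU(feat_dim, hid, batch_first=True)
        self.decoder = nn.Linear(2 * hid, feat_dim)

    def forward(self, src_speech):
        _, voice = self.voice_enc(src_speech)           # (1, B, hid): global voice code
        pace, _ = self.isochrony_enc(src_speech)        # (B, T, hid): per-frame pacing
        voice = voice[-1].unsqueeze(1).expand_as(pace)  # broadcast voice over time
        return self.decoder(torch.cat([voice, pace], dim=-1))

x = torch.randn(2, 100, 80)        # batch of source-speech features
print(DualConditioner()(x).shape)  # torch.Size([2, 100, 80])
```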
- EXPRESSO: A Benchmark and Analysis of Discrete Expressive Speech Resynthesis [49.04496602282718]
We introduce Expresso, a high-quality expressive speech dataset for textless speech synthesis.
This dataset includes both read speech and improvised dialogues rendered in 26 spontaneous expressive styles.
We evaluate resynthesis quality with automatic metrics for different self-supervised discrete encoders.
arXiv Detail & Related papers (2023-08-10T17:41:19Z)
- Text is All You Need: Personalizing ASR Models using Controllable Speech Synthesis [17.172909510518814]
Adapting generic speech recognition models to specific individuals is a challenging problem due to the scarcity of personalized data.
Recent works have proposed boosting the amount of training data using personalized text-to-speech synthesis.
arXiv Detail & Related papers (2023-03-27T02:50:02Z)
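The entry above boosts personalized ASR training data with text-to-speech synthesis. A toy sketch of that augmentation loop follows, with placeholder `tts_synthesize` and `fine_tune_asr` stubs standing in for real systems (all names hypothetical):

```python
import random

def tts_synthesize(text, speaker_profile):
    # Placeholder for a personalized TTS call; returns fake audio samples.
    random.seed(hash((text, speaker_profile)))
    return [random.random() for _ in range(16000)]  # 1 s of fake 16 kHz audio

def fine_tune_asr(asr_state, pairs):
    # Placeholder fine-tune: real code would run gradient steps on (audio, text).
    asr_state["seen"] += len(pairs)
    return asr_state

personal_texts = ["call mom", "navigate home", "play my playlist"]
profile = "user_42"  # hypothetical speaker profile / embedding id
synthetic = [(tts_synthesize(t, profile), t) for t in personal_texts]
asr = fine_tune_asr({"seen": 0}, synthetic)
print(asr)  # {'seen': 3}
```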
- Learning Cross-lingual Visual Speech Representations [108.68531445641769]
Cross-lingual self-supervised visual representation learning has been a growing research topic in the last few years.
We use the recently proposed Raw Audio-Visual Speech Encoders (RAVEn) framework to pre-train an audio-visual model with unlabelled data.
Our experiments show that multi-lingual models with more data outperform monolingual ones but, when the amount of data is held fixed, monolingual models tend to reach better performance.
arXiv Detail & Related papers (2023-03-14T17:05:08Z)
- Multilingual Multiaccented Multispeaker TTS with RADTTS [21.234787964238645]
We present a multilingual, multiaccented, multispeaker speech synthesis model based on RADTTS.
We demonstrate the ability to control the synthesized accent for any speaker in an open-source dataset comprising 7 accents.
arXiv Detail & Related papers (2023-01-24T22:39:04Z)
- Speaker Information Can Guide Models to Better Inductive Biases: A Case Study On Predicting Code-Switching [27.68274308680201]
We show that adding sociolinguistically-grounded speaker features as prepended prompts significantly improves accuracy.
We are the first to incorporate speaker characteristics in a neural model for code-switching.
arXiv Detail & Related papers (2022-03-16T22:56:58Z)
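The code-switching entry above prepends sociolinguistically grounded speaker features as prompts. A minimal sketch of how such prompted inputs might be built; the feature set shown is an assumption, not the paper's exact one:

```python
# Hypothetical speaker features; the paper's exact feature set may differ.
def build_prompted_input(utterance, speaker):
    prompt = (f"<gender={speaker['gender']}> "
              f"<age={speaker['age_band']}> "
              f"<dominant_lang={speaker['dominant_lang']}> ")
    return prompt + utterance

speaker = {"gender": "f", "age_band": "25-34", "dominant_lang": "es"}
text = "voy a la store despues del trabajo"
print(build_prompted_input(text, speaker))
# <gender=f> <age=25-34> <dominant_lang=es> voy a la store despues del trabajo
```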
- Towards Language Modelling in the Speech Domain Using Sub-word Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM based on linguistic units including syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
arXiv Detail & Related papers (2021-10-31T22:48:30Z)
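The entry above describes an LSTM-based generative speech LM over sub-word linguistic units such as syllables and phonemes. Here is a bare-bones next-unit LSTM language model in that spirit, with illustrative sizes:

```python
import torch
import torch.nn as nn

class UnitLM(nn.Module):
    """Next-unit prediction over a small inventory of linguistic units
    (e.g., phonemes or syllables); all sizes are illustrative."""
    def __init__(self, n_units=64, emb=64, hid=256):
        super().__init__()
        self.emb = nn.Embedding(n_units, emb)
        self.lstm = nn.LSTM(emb, hid, batch_first=True)
        self.out = nn.Linear(hid, n_units)

    def forward(self, units):
        h, _ = self.lstm(self.emb(units))
        return self.out(h)  # logits for the next unit at each step

lm = UnitLM()
units = torch.randint(0, 64, (4, 50))  # batch of unit-id sequences
logits = lm(units[:, :-1])
loss = nn.functional.cross_entropy(
    logits.reshape(-1, 64), units[:, 1:].reshape(-1))
print(loss.item())
```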
- Ctrl-P: Temporal Control of Prosodic Variation for Speech Synthesis [68.76620947298595]
Text does not fully specify the spoken form, so text-to-speech models must be able to learn from speech data that vary in ways not explained by the corresponding text.
We propose a model that generates speech explicitly conditioned on the three primary acoustic correlates of prosody.
arXiv Detail & Related papers (2021-06-15T18:03:48Z)
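The Ctrl-P entry above conditions generation on the three primary acoustic correlates of prosody, commonly taken to be F0, energy, and duration. Below is a toy decoder fusing per-frame prosody tracks with phone embeddings; the fusion scheme and dimensions are assumptions, not the paper's architecture:

```python
import torch
import torch.nn as nn

class ProsodyConditionedDecoder(nn.Module):
    """Toy decoder that takes per-frame prosody tracks (F0, energy, duration)
    alongside phone embeddings."""
    def __init__(self, phone_dim=64, mel_dim=80):
        super().__init__()
        self.fuse = nn.Linear(phone_dim + 3, mel_dim)

    def forward(self, phone_emb, f0, energy, duration):
        prosody = torch.stack([f0, energy, duration], dim=-1)  # (B, T, 3)
        return self.fuse(torch.cat([phone_emb, prosody], dim=-1))

B, T = 2, 120
dec = ProsodyConditionedDecoder()
mel = dec(torch.randn(B, T, 64), torch.rand(B, T), torch.rand(B, T), torch.rand(B, T))
print(mel.shape)  # torch.Size([2, 120, 80])
```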
- From Speaker Verification to Multispeaker Speech Synthesis, Deep Transfer with Feedback Constraint [11.982748481062542]
This paper presents a system involving feedback constraint for multispeaker speech synthesis.
We enhance knowledge transfer from speaker verification to speech synthesis by engaging the speaker verification network.
The model is trained and evaluated on publicly available datasets.
arXiv Detail & Related papers (2020-05-10T06:11:37Z)
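The entry above engages a speaker verification network as a feedback constraint on multispeaker synthesis. One plausible reading, sketched below with stand-in modules: add a loss term pulling the verifier embedding of generated speech toward that of reference speech. The loss form, the weight, and the module shapes are all assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-ins: any mel-producing synthesizer and any pretrained speaker
# verification encoder would slot in here (names are hypothetical).
synthesizer = nn.Linear(64, 80)
spk_verifier = nn.Linear(80, 192)
for p in spk_verifier.parameters():
    p.requires_grad_(False)  # verifier stays frozen; it only provides feedback

def feedback_loss(text_emb, target_mel, ref_mel, alpha=0.1):
    pred_mel = synthesizer(text_emb)
    recon = F.mse_loss(pred_mel, target_mel)
    # Feedback constraint: generated speech should embed close to the
    # reference speaker under the verification network.
    sim = F.cosine_similarity(spk_verifier(pred_mel), spk_verifier(ref_mel), dim=-1)
    return recon + alpha * (1.0 - sim.mean())

loss = feedback_loss(torch.randn(8, 64), torch.randn(8, 80), torch.randn(8, 80))
print(loss.item())
```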
- Disentangled Speech Embeddings using Cross-modal Self-supervision [119.94362407747437]
We develop a self-supervised learning objective that exploits the natural cross-modal synchrony between faces and audio in video.
We construct a two-stream architecture which: (1) shares low-level features common to both representations; and (2) provides a natural mechanism for explicitly disentangling these factors.
arXiv Detail & Related papers (2020-02-20T14:13:12Z)
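The final entry's two-stream architecture shares low-level features and then splits into separate factors. A minimal sketch of a shared trunk with identity and content heads is below; the real model is cross-modal (faces and audio supply the synchrony signal that supervises the split), which this toy version omits:

```python
import torch
import torch.nn as nn

class TwoStream(nn.Module):
    """Shared low-level trunk with two heads: one embedding meant to capture
    identity, one meant to capture content; sizes are illustrative."""
    def __init__(self, in_dim=80, trunk=128, emb=64):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(in_dim, trunk), nn.ReLU())
        self.identity_head = nn.Linear(trunk, emb)
        self.content_head = nn.Linear(trunk, emb)

    def forward(self, x):
        h = self.trunk(x)
        return self.identity_head(h), self.content_head(h)

audio_feats = torch.randn(4, 80)
ident, content = TwoStream()(audio_feats)
print(ident.shape, content.shape)  # torch.Size([4, 64]) torch.Size([4, 64])
```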