DART: Disentanglement of Accent and Speaker Representation in Multispeaker Text-to-Speech
- URL: http://arxiv.org/abs/2410.13342v1
- Date: Thu, 17 Oct 2024 08:51:46 GMT
- Title: DART: Disentanglement of Accent and Speaker Representation in Multispeaker Text-to-Speech
- Authors: Jan Melechovsky, Ambuj Mehrish, Berrak Sisman, Dorien Herremans
- Abstract summary: We propose a novel approach to disentangle speaker and accent representations using multi-level variational autoencoders (ML-VAE) and vector quantization (VQ).
Our proposed method addresses the challenge of effectively separating speaker and accent characteristics, enabling more fine-grained control over the synthesized speech.
- Score: 14.323313455208183
- License:
- Abstract: Recent advancements in Text-to-Speech (TTS) systems have enabled the generation of natural and expressive speech from textual input. Accented TTS aims to enhance user experience by making the synthesized speech more relatable to minority group listeners, and useful across various applications and contexts. Speech synthesis can further be made more flexible by allowing users to choose any combination of speaker identity and accent, resulting in a wide range of personalized speech outputs. Current models struggle to disentangle speaker and accent representations, making it difficult to accurately imitate different accents while maintaining the same speaker characteristics. We propose a novel approach to disentangle speaker and accent representations using multi-level variational autoencoders (ML-VAE) and vector quantization (VQ) to improve flexibility and enhance personalization in speech synthesis. Our proposed method addresses the challenge of effectively separating speaker and accent characteristics, enabling more fine-grained control over the synthesized speech. Code and speech samples are publicly available.
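The abstract describes the architecture only at a high level (ML-VAE plus VQ). As a rough, hedged sketch of that general idea, and not the authors' DART implementation, the PyTorch snippet below encodes a reference mel-spectrogram into a continuous speaker latent (Gaussian, reparameterized) and a quantized accent latent drawn from a small codebook; all module names, dimensions, and loss weights are illustrative assumptions.

```python
# Minimal sketch (not the authors' code) of disentangling speaker and accent
# latents with a VAE branch plus vector quantization, in PyTorch.
# All names, sizes, and loss weights are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class VectorQuantizer(nn.Module):
    """Nearest-neighbour codebook lookup with a straight-through estimator."""

    def __init__(self, num_codes: int = 16, dim: int = 64, beta: float = 0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)
        self.codebook.weight.data.uniform_(-1.0 / num_codes, 1.0 / num_codes)
        self.beta = beta

    def forward(self, z):                                 # z: (batch, dim)
        dists = torch.cdist(z, self.codebook.weight)      # (batch, num_codes)
        idx = dists.argmin(dim=-1)
        z_q = self.codebook(idx)
        # Codebook + commitment losses; straight-through gradient for z_q.
        vq_loss = F.mse_loss(z_q, z.detach()) + self.beta * F.mse_loss(z, z_q.detach())
        z_q = z + (z_q - z).detach()
        return z_q, vq_loss, idx


class SpeakerAccentEncoder(nn.Module):
    """Encodes a reference mel-spectrogram into a continuous speaker latent
    (Gaussian, reparameterized) and a quantized accent latent."""

    def __init__(self, n_mels: int = 80, spk_dim: int = 128, acc_dim: int = 64):
        super().__init__()
        self.backbone = nn.GRU(n_mels, 256, batch_first=True)
        self.spk_mu = nn.Linear(256, spk_dim)
        self.spk_logvar = nn.Linear(256, spk_dim)
        self.acc_proj = nn.Linear(256, acc_dim)
        self.acc_vq = VectorQuantizer(num_codes=16, dim=acc_dim)

    def forward(self, mel):                               # mel: (batch, frames, n_mels)
        _, h = self.backbone(mel)                         # h: (1, batch, 256)
        h = h.squeeze(0)
        mu, logvar = self.spk_mu(h), self.spk_logvar(h)
        spk = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterize
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        acc, vq_loss, code = self.acc_vq(self.acc_proj(h))
        return spk, acc, kl + vq_loss, code


if __name__ == "__main__":
    enc = SpeakerAccentEncoder()
    mel = torch.randn(4, 200, 80)                         # 4 utterances, 200 frames each
    spk, acc, reg_loss, code = enc(mel)
    print(spk.shape, acc.shape, code.tolist(), float(reg_loss))
```

In an ML-VAE framing, the accent latent would additionally be shared across all utterances of an accent group while the speaker latent varies per utterance or speaker; the discrete VQ bottleneck above is one simple way to encourage such a coarse, group-level code.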
Related papers
- Accent conversion using discrete units with parallel data synthesized from controllable accented TTS [56.18382038512251]
The goal of accent conversion (AC) is to convert speech accents while preserving content and speaker identity.
Previous methods either required reference utterances during inference, did not preserve speaker identity well, or used one-to-one systems that could only be trained for each non-native accent.
This paper presents a promising AC model that can convert many accents to a native accent, overcoming these issues.
arXiv Detail & Related papers (2024-09-30T19:52:10Z)
- Accent Conversion in Text-To-Speech Using Multi-Level VAE and Adversarial Training [14.323313455208183]
Inclusive speech technology aims to erase any biases towards specific groups, such as people with a certain accent.
We propose a TTS model that utilizes a Multi-Level Variational Autoencoder with adversarial learning to address accented speech synthesis and conversion.
arXiv Detail & Related papers (2024-06-03T05:56:02Z)
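The Multi-Level VAE paper summarized above pairs the autoencoder with adversarial learning but gives no details here. Purely as an illustration of how such an adversarial constraint is commonly implemented, and not that paper's architecture, the sketch below applies a gradient-reversal layer and an accent classifier to a speaker embedding, so the speaker branch is discouraged from encoding accent; the names, dimensions, and number of accent classes are assumptions.

```python
# Illustrative sketch of adversarial disentanglement with a gradient-reversal
# layer (GRL): an accent classifier is trained on the speaker embedding, while
# reversed gradients discourage the speaker encoder from encoding accent.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.clone()

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None


class AccentAdversary(nn.Module):
    """Predicts the accent label from the speaker embedding through a GRL."""

    def __init__(self, spk_dim: int = 128, num_accents: int = 7, lambd: float = 1.0):
        super().__init__()
        self.lambd = lambd
        self.classifier = nn.Sequential(
            nn.Linear(spk_dim, 64), nn.ReLU(), nn.Linear(64, num_accents)
        )

    def forward(self, spk_embedding, accent_labels):
        reversed_emb = GradReverse.apply(spk_embedding, self.lambd)
        logits = self.classifier(reversed_emb)
        # Minimizing this loss trains the classifier, while the reversed
        # gradient *maximizes* it w.r.t. the speaker encoder.
        return F.cross_entropy(logits, accent_labels)


if __name__ == "__main__":
    spk = torch.randn(8, 128, requires_grad=True)   # speaker embeddings
    labels = torch.randint(0, 7, (8,))              # accent class labels
    adv_loss = AccentAdversary()(spk, labels)
    adv_loss.backward()
    print(float(adv_loss), spk.grad.shape)
```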
- SpeechX: Neural Codec Language Model as a Versatile Speech Transformer [57.82364057872905]
SpeechX is a versatile speech generation model capable of zero-shot TTS and various speech transformation tasks.
Experimental results show SpeechX's efficacy in various tasks, including zero-shot TTS, noise suppression, target speaker extraction, speech removal, and speech editing with or without background noise.
arXiv Detail & Related papers (2023-08-14T01:01:19Z)
- Textless Unit-to-Unit training for Many-to-Many Multilingual Speech-to-Speech Translation [65.13824257448564]
This paper proposes a textless training method for many-to-many multilingual speech-to-speech translation.
By treating the speech units as pseudo-text, we can focus on the linguistic content of the speech.
We demonstrate that the proposed UTUT model can be effectively utilized not only for Speech-to-Speech Translation (S2ST) but also for multilingual Text-to-Speech Synthesis (T2S) and Text-to-Speech Translation (T2ST).
arXiv Detail & Related papers (2023-08-03T15:47:04Z)
- Visual-Aware Text-to-Speech [101.89332968344102]
We present a new visual-aware text-to-speech (VA-TTS) task to synthesize speech conditioned on both textual inputs and visual feedback of the listener in face-to-face communication.
We devise a baseline model to fuse phoneme linguistic information and listener visual signals for speech synthesis.
arXiv Detail & Related papers (2023-06-21T05:11:39Z)
- ERNIE-SAT: Speech and Text Joint Pretraining for Cross-Lingual Multi-Speaker Text-to-Speech [58.93395189153713]
We extend the pretraining method for cross-lingual multi-speaker speech synthesis tasks.
We propose a speech-text joint pretraining framework, where we randomly mask the spectrogram and the phonemes.
Our model shows great improvements over speaker-embedding-based multi-speaker TTS methods.
arXiv Detail & Related papers (2022-11-07T13:35:16Z)
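As a rough illustration of the masked speech-text pretraining idea in the ERNIE-SAT summary above (randomly masking the spectrogram and the phonemes), and not that paper's actual code, the snippet below masks spectrogram frames and phoneme tokens so a model can be trained to reconstruct them; the mask ratios and reserved token ID are assumptions.

```python
# Illustrative joint masking of spectrogram frames and phoneme tokens,
# in the spirit of masked speech-text pretraining (not ERNIE-SAT's code).
import torch

MASK_PHONEME_ID = 0          # assumed id reserved for a [MASK] token


def mask_inputs(mel, phonemes, frame_ratio=0.15, phoneme_ratio=0.15):
    """Returns masked copies plus boolean masks marking reconstruction targets."""
    mel_mask = torch.rand(mel.shape[:2]) < frame_ratio          # (batch, frames)
    ph_mask = torch.rand(phonemes.shape) < phoneme_ratio        # (batch, tokens)
    mel_masked = mel.clone()
    mel_masked[mel_mask] = 0.0                                  # zero out masked frames
    ph_masked = phonemes.clone()
    ph_masked[ph_mask] = MASK_PHONEME_ID
    return mel_masked, ph_masked, mel_mask, ph_mask


if __name__ == "__main__":
    mel = torch.randn(2, 120, 80)                # (batch, frames, n_mels)
    phonemes = torch.randint(1, 60, (2, 40))     # (batch, tokens)
    mel_m, ph_m, f_mask, p_mask = mask_inputs(mel, phonemes)
    print(f_mask.float().mean().item(), p_mask.float().mean().item())
```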
- A$^3$T: Alignment-Aware Acoustic and Text Pretraining for Speech Synthesis and Editing [31.666920933058144]
We propose our framework, Alignment-Aware Acoustic-Text Pretraining (A$^3$T), which reconstructs masked acoustic signals with text input and acoustic-text alignment during training.
Experiments show A$^3$T outperforms SOTA models on speech editing, and improves multi-speaker speech synthesis without the external speaker verification model.
arXiv Detail & Related papers (2022-03-18T01:36:25Z)
- Meta-StyleSpeech : Multi-Speaker Adaptive Text-to-Speech Generation [63.561944239071615]
StyleSpeech is a new TTS model which synthesizes high-quality speech and adapts to new speakers.
With Style-Adaptive Layer Normalization (SALN), our model effectively synthesizes speech in the style of the target speaker even from a single audio sample.
We extend it to Meta-StyleSpeech by introducing two discriminators trained with style prototypes, and performing episodic training.
arXiv Detail & Related papers (2021-06-06T15:34:11Z)
- Few Shot Adaptive Normalization Driven Multi-Speaker Speech Synthesis [18.812696623555855]
We present a novel few-shot multi-speaker speech synthesis approach (FSM-SS).
Given an input text and a reference speech sample of an unseen person, FSM-SS can generate speech in that person's style in a few-shot manner.
We demonstrate how the affine parameters of normalization help in capturing the prosodic features such as energy and fundamental frequency.
arXiv Detail & Related papers (2020-12-14T04:37:07Z)
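The FSM-SS summary above attributes prosody capture (energy, fundamental frequency) to the affine parameters of normalization layers. The sketch below shows the generic mechanism, conditional affine normalization in which a reference embedding predicts the scale and shift, purely as an illustration under assumed shapes and not that paper's implementation.

```python
# Illustrative conditional affine normalization: a reference embedding
# predicts the scale (gamma) and shift (beta) applied after layer norm,
# the generic mechanism behind "adaptive normalization" TTS models.
import torch
import torch.nn as nn


class ConditionalLayerNorm(nn.Module):
    def __init__(self, hidden_dim: int = 256, ref_dim: int = 128):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_dim, elementwise_affine=False)
        self.to_gamma = nn.Linear(ref_dim, hidden_dim)
        self.to_beta = nn.Linear(ref_dim, hidden_dim)

    def forward(self, x, ref):                    # x: (B, T, H), ref: (B, ref_dim)
        gamma = self.to_gamma(ref).unsqueeze(1)   # (B, 1, H), broadcasts over time
        beta = self.to_beta(ref).unsqueeze(1)
        return gamma * self.norm(x) + beta


if __name__ == "__main__":
    layer = ConditionalLayerNorm()
    x = torch.randn(4, 50, 256)                   # decoder hidden states
    ref = torch.randn(4, 128)                     # embedding of the reference utterance
    print(layer(x, ref).shape)                    # torch.Size([4, 50, 256])
```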
- From Speaker Verification to Multispeaker Speech Synthesis, Deep Transfer with Feedback Constraint [11.982748481062542]
This paper presents a system involving feedback constraint for multispeaker speech synthesis.
We enhance the knowledge transfer from speaker verification to speech synthesis by engaging the speaker verification network.
The model is trained and evaluated on publicly available datasets.
arXiv Detail & Related papers (2020-05-10T06:11:37Z)
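The feedback-constraint paper above ties the synthesizer to a speaker verification network. A minimal illustration of that kind of coupling, and not that paper's system, is a cosine-similarity loss between verification embeddings of the synthesized and reference speech; the `SpeakerVerifier` below is a stand-in for a pretrained model.

```python
# Illustrative speaker-verification feedback loss: embed the reference and the
# synthesized mel-spectrograms with a (stand-in) verification network and
# penalize low cosine similarity. Not the cited paper's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SpeakerVerifier(nn.Module):
    """Stand-in for a pretrained speaker verification network."""

    def __init__(self, n_mels: int = 80, emb_dim: int = 256):
        super().__init__()
        self.rnn = nn.GRU(n_mels, emb_dim, batch_first=True)

    def forward(self, mel):                       # mel: (B, T, n_mels)
        _, h = self.rnn(mel)
        return F.normalize(h.squeeze(0), dim=-1)  # unit-norm embeddings


def feedback_loss(verifier, mel_ref, mel_synth):
    """1 - cosine similarity between verification embeddings (lower is better)."""
    with torch.no_grad():                         # reference side needs no gradient
        e_ref = verifier(mel_ref)
    e_syn = verifier(mel_synth)                   # gradient flows to the synthesizer
    return (1.0 - F.cosine_similarity(e_ref, e_syn, dim=-1)).mean()


if __name__ == "__main__":
    verifier = SpeakerVerifier()
    mel_ref = torch.randn(2, 150, 80)
    mel_synth = torch.randn(2, 150, 80, requires_grad=True)
    loss = feedback_loss(verifier, mel_ref, mel_synth)
    loss.backward()
    print(float(loss))
```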
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.