We Need Variations in Speech Synthesis: Sub-center Modelling for Speaker Embeddings
- URL: http://arxiv.org/abs/2407.04291v1
- Date: Fri, 5 Jul 2024 06:54:24 GMT
- Title: We Need Variations in Speech Synthesis: Sub-center Modelling for Speaker Embeddings
- Authors: Ismail Rasim Ulgen, Carlos Busso, John H. L. Hansen, Berrak Sisman
- Abstract summary: In speech synthesis, modeling the rich emotions and prosodic variations present in the human voice is crucial to synthesizing natural speech.
We propose a novel speaker embedding network which utilizes multiple class centers per speaker in the speaker classification training, rather than the single class center used by traditional embeddings.
- Score: 47.2515056854372
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In speech synthesis, modeling the rich emotions and prosodic variations present in the human voice is crucial to synthesizing natural speech. Although speaker embeddings have been widely used in personalized speech synthesis as conditioning inputs, they are designed to discard variation in order to optimize speaker recognition accuracy. Thus, they are suboptimal for speech synthesis in terms of modeling the rich variations of the output speech distribution. In this work, we propose a novel speaker embedding network which utilizes multiple class centers in the speaker classification training rather than a single class center as traditional embeddings. The proposed approach introduces variations in the speaker embedding while retaining speaker recognition performance, since the model does not have to map all of the utterances of a speaker onto a single class center. We apply the proposed embedding to the voice conversion task and show that our method provides better naturalness and prosody in synthesized speech.
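The following is a minimal PyTorch sketch of the sub-center idea described in the abstract, assuming a classification head in which each speaker owns several centers and the closest one supplies that speaker's logit; the pooling rule, number of sub-centers, and scale factor are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SubCenterSpeakerHead(nn.Module):
    """Speaker classification head with K sub-centers per speaker."""

    def __init__(self, embed_dim: int, num_speakers: int,
                 num_subcenters: int = 3, scale: float = 30.0):
        super().__init__()
        # One trainable center per (speaker, sub-center) pair.
        self.centers = nn.Parameter(torch.randn(num_speakers, num_subcenters, embed_dim))
        self.scale = scale

    def forward(self, embeddings: torch.Tensor) -> torch.Tensor:
        # embeddings: (batch, embed_dim) utterance-level speaker embeddings.
        emb = F.normalize(embeddings, dim=-1)
        centers = F.normalize(self.centers, dim=-1)        # (S, K, D)
        sims = torch.einsum("bd,skd->bsk", emb, centers)   # cosine similarity to every sub-center
        logits, _ = sims.max(dim=-1)                       # closest sub-center represents the speaker
        return self.scale * logits                         # (batch, S), fed to cross-entropy

# Usage sketch:
#   head = SubCenterSpeakerHead(embed_dim=256, num_speakers=1000)
#   loss = F.cross_entropy(head(speaker_embeddings), speaker_ids)
```

Because an utterance only needs to be close to one of its speaker's sub-centers, intra-speaker variation (emotion, prosody) can spread across sub-centers instead of being collapsed away.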
Related papers
- ELF: Encoding Speaker-Specific Latent Speech Feature for Speech Synthesis [5.824018496599849]
We propose a novel method for modeling numerous speakers.
It enables expressing the overall characteristics of speakers in detail, much like a trained multi-speaker model.
arXiv Detail & Related papers (2023-11-20T13:13:24Z)
- Improving Speaker Diarization using Semantic Information: Joint Pairwise Constraints Propagation [53.01238689626378]
We propose a novel approach to leverage semantic information in speaker diarization systems.
We introduce spoken language understanding modules to extract speaker-related semantic information.
We present a novel framework to integrate these constraints into the speaker diarization pipeline.
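As a hedged sketch of one way pairwise constraints derived from semantic information could enter a diarization pipeline, the snippet below edits a segment-similarity matrix with must-link/cannot-link pairs before clustering; the constraint-extraction step and the clustering choice are assumptions, not the paper's method.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def apply_pairwise_constraints(similarity: np.ndarray, must_link, cannot_link) -> np.ndarray:
    """Push constrained segment pairs toward maximal/minimal similarity."""
    sim = similarity.copy()
    for i, j in must_link:        # semantics says: same speaker
        sim[i, j] = sim[j, i] = 1.0
    for i, j in cannot_link:      # semantics says: different speakers
        sim[i, j] = sim[j, i] = 0.0
    return sim

def cluster_segments(similarity: np.ndarray, num_speakers: int) -> np.ndarray:
    """Agglomerative clustering over a precomputed distance matrix."""
    distance = 1.0 - similarity
    model = AgglomerativeClustering(n_clusters=num_speakers,
                                    metric="precomputed", linkage="average")
    return model.fit_predict(distance)
```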
arXiv Detail & Related papers (2023-09-19T09:13:30Z)
- Improving Prosody for Cross-Speaker Style Transfer by Semi-Supervised Style Extractor and Hierarchical Modeling in Speech Synthesis [37.65745551401636]
Cross-speaker style transfer in speech synthesis aims at transferring a style from a source speaker to synthesized speech that carries a target speaker's timbre.
In most previous methods, the synthesized fine-grained prosody features often represent the source speaker's average style.
A strength-controlled semi-supervised style extractor is proposed to disentangle the style from content and timbre.
arXiv Detail & Related papers (2023-03-14T08:52:58Z)
- ACE-VC: Adaptive and Controllable Voice Conversion using Explicitly Disentangled Self-supervised Speech Representations [12.20522794248598]
We propose a zero-shot voice conversion method using speech representations trained with self-supervised learning.
We develop a multi-task model to decompose a speech utterance into features such as linguistic content, speaker characteristics, and speaking style.
Next, we develop a synthesis model with pitch and duration predictors that can effectively reconstruct the speech signal from its representation.
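A rough sketch of the decompose-then-resynthesize flow described above, assuming SSL frame features, simple linear heads for content, speaker, and style, and a toy decoder with pitch and duration predictors; none of this mirrors the ACE-VC implementation.

```python
import torch
import torch.nn as nn

class DecomposeResynthesize(nn.Module):
    def __init__(self, ssl_dim=768, content_dim=256, spk_dim=192, style_dim=64, mel_dim=80):
        super().__init__()
        self.content_head = nn.Linear(ssl_dim, content_dim)   # linguistic content (frame level)
        self.speaker_head = nn.Linear(ssl_dim, spk_dim)        # speaker characteristics (utterance level)
        self.style_head = nn.Linear(ssl_dim, style_dim)        # speaking style (utterance level)
        cond_dim = content_dim + spk_dim + style_dim
        self.pitch_predictor = nn.Linear(cond_dim, 1)
        self.duration_predictor = nn.Linear(cond_dim, 1)
        self.decoder = nn.Linear(cond_dim + 1, mel_dim)        # toy reconstruction decoder

    def forward(self, ssl_frames):                              # (batch, frames, ssl_dim)
        content = self.content_head(ssl_frames)
        speaker = self.speaker_head(ssl_frames).mean(dim=1, keepdim=True)
        style = self.style_head(ssl_frames).mean(dim=1, keepdim=True)
        cond = torch.cat([content,
                          speaker.expand(-1, content.size(1), -1),
                          style.expand(-1, content.size(1), -1)], dim=-1)
        pitch = self.pitch_predictor(cond)                      # frame-level pitch estimate
        duration = self.duration_predictor(cond)                # per-unit duration (length regulation omitted)
        mel = self.decoder(torch.cat([cond, pitch], dim=-1))
        return mel, pitch, duration
```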
arXiv Detail & Related papers (2023-02-16T08:10:41Z)
- AdaSpeech 4: Adaptive Text to Speech in Zero-Shot Scenarios [143.47967241972995]
We develop AdaSpeech 4, a zero-shot adaptive TTS system for high-quality speech synthesis.
We model the speaker characteristics systematically to improve the generalization on new speakers.
Without any fine-tuning, AdaSpeech 4 achieves better voice quality and similarity than baselines in multiple datasets.
arXiv Detail & Related papers (2022-04-01T13:47:44Z)
- Facetron: Multi-speaker Face-to-Speech Model based on Cross-modal Latent Representations [22.14238843571225]
We propose an effective method to synthesize speaker-specific speech waveforms by conditioning on videos of an individual's face.
The linguistic features are extracted from lip movements using a lip-reading model, and the speaker characteristic features are predicted from face images.
We show the superiority of our proposed model over conventional methods in terms of both objective and subjective evaluation results.
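A very rough sketch of the cross-modal conditioning described above, assuming precomputed lip-reading features for linguistic content and face-image features for speaker characteristics; encoder choices and dimensions are illustrative, not the Facetron architecture.

```python
import torch
import torch.nn as nn

class FaceToSpeechSketch(nn.Module):
    def __init__(self, lip_dim: int = 512, face_dim: int = 512, mel_dim: int = 80):
        super().__init__()
        self.lip_proj = nn.Linear(lip_dim, 256)    # linguistic features from a lip-reading model
        self.face_proj = nn.Linear(face_dim, 256)  # speaker characteristics from face images
        self.decoder = nn.GRU(512, mel_dim, batch_first=True)

    def forward(self, lip_feats: torch.Tensor, face_feats: torch.Tensor) -> torch.Tensor:
        # lip_feats: (batch, frames, lip_dim); face_feats: (batch, face_dim)
        ling = self.lip_proj(lip_feats)
        spk = self.face_proj(face_feats).unsqueeze(1).expand(-1, ling.size(1), -1)
        mel, _ = self.decoder(torch.cat([ling, spk], dim=-1))
        return mel
```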
arXiv Detail & Related papers (2021-07-26T07:36:02Z)
- Meta-StyleSpeech : Multi-Speaker Adaptive Text-to-Speech Generation [63.561944239071615]
StyleSpeech is a new TTS model which synthesizes high-quality speech and adapts to new speakers.
With style-adaptive layer normalization (SALN), our model effectively synthesizes speech in the style of the target speaker even from a single speech sample.
We extend it to Meta-StyleSpeech by introducing two discriminators trained with style prototypes, and performing episodic training.
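A minimal sketch of style-adaptive layer normalization (SALN), where the normalization gain and bias are predicted from a style vector so a single style embedding can modulate the generator; the exact parameterization below is an assumption rather than the released Meta-StyleSpeech code.

```python
import torch
import torch.nn as nn

class StyleAdaptiveLayerNorm(nn.Module):
    def __init__(self, hidden_dim: int, style_dim: int):
        super().__init__()
        # Plain layer norm without its own affine parameters; they come from the style vector.
        self.norm = nn.LayerNorm(hidden_dim, elementwise_affine=False)
        self.to_gain = nn.Linear(style_dim, hidden_dim)
        self.to_bias = nn.Linear(style_dim, hidden_dim)

    def forward(self, x: torch.Tensor, style: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, hidden_dim), style: (batch, style_dim)
        gain = self.to_gain(style).unsqueeze(1)   # (batch, 1, hidden_dim)
        bias = self.to_bias(style).unsqueeze(1)
        return (1.0 + gain) * self.norm(x) + bias
```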
arXiv Detail & Related papers (2021-06-06T15:34:11Z)
- Investigating on Incorporating Pretrained and Learnable Speaker Representations for Multi-Speaker Multi-Style Text-to-Speech [54.75722224061665]
In this work, we investigate different speaker representations and propose to integrate pretrained and learnable speaker representations.
The FastSpeech 2 model combined with both pretrained and learnable speaker representations shows great generalization ability on few-shot speakers.
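A minimal sketch, assuming a concatenate-and-project fusion, of combining a pretrained speaker vector (e.g., an x-vector extracted offline) with a learnable per-speaker embedding before injecting it into a TTS encoder output; names and dimensions are illustrative.

```python
import torch
import torch.nn as nn

class FusedSpeakerConditioning(nn.Module):
    def __init__(self, num_speakers: int, pretrained_dim: int = 192,
                 learnable_dim: int = 64, hidden_dim: int = 256):
        super().__init__()
        self.table = nn.Embedding(num_speakers, learnable_dim)      # learnable representation
        self.proj = nn.Linear(pretrained_dim + learnable_dim, hidden_dim)

    def forward(self, encoder_out, speaker_id, pretrained_emb):
        # encoder_out: (batch, time, hidden_dim); pretrained_emb: (batch, pretrained_dim)
        fused = self.proj(torch.cat([pretrained_emb, self.table(speaker_id)], dim=-1))
        return encoder_out + fused.unsqueeze(1)                     # broadcast over time
```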
arXiv Detail & Related papers (2021-03-06T10:14:33Z)
- Learning Explicit Prosody Models and Deep Speaker Embeddings for Atypical Voice Conversion [60.808838088376675]
We propose a VC system with explicit prosodic modelling and deep speaker embedding learning.
A prosody corrector takes in phoneme embeddings to infer typical phoneme duration and pitch values.
A conversion model takes phoneme embeddings and typical prosody features as inputs to generate the converted speech.
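A compact sketch of the two-stage data flow described above: a prosody corrector predicts typical duration and pitch from phoneme embeddings, and a conversion model consumes the phoneme embeddings plus those prosody features; the layer choices are assumptions, not the authors' architecture.

```python
import torch
import torch.nn as nn

class ProsodyCorrector(nn.Module):
    def __init__(self, phoneme_dim: int = 256):
        super().__init__()
        self.duration = nn.Linear(phoneme_dim, 1)   # typical duration per phoneme
        self.pitch = nn.Linear(phoneme_dim, 1)      # typical pitch per phoneme

    def forward(self, phoneme_emb):                  # (batch, phones, phoneme_dim)
        return self.duration(phoneme_emb), self.pitch(phoneme_emb)

class ConversionModel(nn.Module):
    def __init__(self, phoneme_dim: int = 256, mel_dim: int = 80):
        super().__init__()
        self.decoder = nn.GRU(phoneme_dim + 2, mel_dim, batch_first=True)

    def forward(self, phoneme_emb, duration, pitch):
        x = torch.cat([phoneme_emb, duration, pitch], dim=-1)
        mel, _ = self.decoder(x)                     # simplified: length regulation omitted
        return mel
```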
arXiv Detail & Related papers (2020-11-03T13:08:53Z)
- Noise Robust TTS for Low Resource Speakers using Pre-trained Model and Speech Enhancement [31.33429812278942]
The proposed end-to-end speech synthesis model uses both speaker embedding and noise representation as conditional inputs to model speaker and noise information respectively.
Experimental results show that the speech generated by the proposed approach achieves better subjective evaluation results than directly fine-tuning a multi-speaker speech synthesis model.
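A minimal sketch of conditioning a synthesis decoder on both a speaker embedding and a noise representation, so that speaker identity and noise condition are controlled by separate inputs; the encoders producing those vectors are abstracted away and the fusion is an assumption.

```python
import torch
import torch.nn as nn

class ConditionedDecoder(nn.Module):
    def __init__(self, text_dim=256, spk_dim=192, noise_dim=64, mel_dim=80):
        super().__init__()
        self.rnn = nn.GRU(text_dim + spk_dim + noise_dim, mel_dim, batch_first=True)

    def forward(self, text_hidden, speaker_emb, noise_emb):
        # text_hidden: (batch, time, text_dim); speaker_emb/noise_emb: (batch, dim)
        cond = torch.cat([speaker_emb, noise_emb], dim=-1)
        cond = cond.unsqueeze(1).expand(-1, text_hidden.size(1), -1)
        mel, _ = self.rnn(torch.cat([text_hidden, cond], dim=-1))
        return mel
```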
arXiv Detail & Related papers (2020-05-26T06:14:06Z)