Cross-speaker Style Transfer with Prosody Bottleneck in Neural Speech
Synthesis
- URL: http://arxiv.org/abs/2107.12562v1
- Date: Tue, 27 Jul 2021 02:43:57 GMT
- Title: Cross-speaker Style Transfer with Prosody Bottleneck in Neural Speech
Synthesis
- Authors: Shifeng Pan and Lei He
- Abstract summary: Cross-speaker style transfer is crucial to the applications of multi-style and expressive speech synthesis at scale.
Existing style transfer methods are still far behind real application needs.
We propose a cross-speaker style transfer text-to-speech model with explicit prosody bottleneck.
- Score: 8.603535906880937
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Cross-speaker style transfer is crucial to the applications of
multi-style and expressive speech synthesis at scale. It does not require the
target speakers to be experts in expressing all styles, nor does it require
collecting corresponding recordings from them for model training. However,
the performance of existing style transfer methods is still far behind real
application needs. The root causes are mainly twofold. Firstly, the style
embedding extracted from a single reference speech sample can hardly provide
fine-grained and appropriate prosody information for arbitrary text to be
synthesized. Secondly, in these models the content/text, prosody, and speaker
timbre are usually highly entangled, so it is not realistic to expect a
satisfactory result when freely combining these components, for example when
transferring speaking style between speakers. In this paper, we propose a
cross-speaker style transfer text-to-speech (TTS) model with an explicit
prosody bottleneck. The prosody bottleneck robustly builds up the kernels
that account for speaking style and disentangles prosody from content and
speaker timbre, thereby guaranteeing high-quality cross-speaker style
transfer. Evaluation results show that the proposed method even achieves
on-par performance with the source speaker's speaker-dependent (SD) model in
objective measurements of prosody, and significantly outperforms the
cycle-consistency and GMVAE-based baselines in both objective and subjective
evaluations.
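To make the idea concrete, the following is a minimal sketch of how a prosody
bottleneck might be wired into a TTS decoder: frame-level prosody features are
squeezed through a narrow projection so that only coarse, style-related
information survives, and the result is combined with a separately encoded
content representation and a target speaker embedding. All module names,
feature choices, and dimensions below are illustrative assumptions, not the
authors' implementation.

```python
# Illustrative prosody-bottleneck sketch (PyTorch). Module names, feature
# choices, and dimensions are assumptions, not the paper's architecture.
import torch
import torch.nn as nn


class ProsodyBottleneck(nn.Module):
    """Compress frame-level prosody features (e.g., F0 and energy) into a
    narrow style representation, kept separate from content and timbre."""

    def __init__(self, prosody_dim=4, hidden_dim=128, bottleneck_dim=8):
        super().__init__()
        self.encoder = nn.GRU(prosody_dim, hidden_dim, batch_first=True)
        # The narrow projection is the "bottleneck": it forces the model to
        # keep only coarse, style-related prosody information.
        self.project = nn.Linear(hidden_dim, bottleneck_dim)

    def forward(self, prosody_feats):                # (B, T, prosody_dim)
        hidden, _ = self.encoder(prosody_feats)      # (B, T, hidden_dim)
        return torch.tanh(self.project(hidden))      # (B, T, bottleneck_dim)


class StyleTransferTTS(nn.Module):
    """Toy decoder conditioned on separate content, prosody-style, and
    speaker-timbre representations, so they can be recombined freely."""

    def __init__(self, content_dim=256, bottleneck_dim=8,
                 speaker_dim=64, mel_dim=80):
        super().__init__()
        self.prosody_bottleneck = ProsodyBottleneck(bottleneck_dim=bottleneck_dim)
        self.decoder = nn.Sequential(
            nn.Linear(content_dim + bottleneck_dim + speaker_dim, 512),
            nn.ReLU(),
            nn.Linear(512, mel_dim),
        )

    def forward(self, content, prosody_feats, speaker_emb):
        # content:       (B, T, content_dim) -- text/phoneme encoding
        # prosody_feats: (B, T, 4)           -- e.g., F0, energy, duration cues
        # speaker_emb:   (B, speaker_dim)    -- target speaker's timbre
        style = self.prosody_bottleneck(prosody_feats)
        spk = speaker_emb.unsqueeze(1).expand(-1, content.size(1), -1)
        return self.decoder(torch.cat([content, style, spk], dim=-1))


# Cross-speaker transfer: prosody features come from the source speaker's
# reference speech, while the timbre embedding comes from the target speaker.
model = StyleTransferTTS()
mel = model(torch.randn(2, 100, 256),   # content encoding
            torch.randn(2, 100, 4),     # source-speaker prosody features
            torch.randn(2, 64))         # target-speaker embedding
print(mel.shape)                        # torch.Size([2, 100, 80])
```

In this sketch, swapping the speaker embedding while keeping the source
speaker's prosody features is what realizes the cross-speaker transfer, and
the bottleneck width is the knob that limits how much content- or
speaker-specific detail can leak through the prosody path.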
Related papers
- StyleSpeech: Self-supervised Style Enhancing with VQ-VAE-based Pre-training for Expressive Audiobook Speech Synthesis [63.019962126807116]
The expressive quality of synthesized speech for audiobooks is limited by the generalized model architecture and the unbalanced style distribution.
We propose a self-supervised style enhancing method with VQ-VAE-based pre-training for expressive audiobook speech synthesis.
arXiv Detail & Related papers (2023-12-19T14:13:26Z)
- Disentangling Voice and Content with Self-Supervision for Speaker Recognition [57.446013973449645]
This paper proposes a disentanglement framework that simultaneously models speaker traits and content variability in speech.
It is validated by experiments on the VoxCeleb and SITW datasets, with average reductions of 9.56% in EER and 8.24% in minDCF.
arXiv Detail & Related papers (2023-10-02T12:02:07Z)
- Speech-to-Speech Translation with Discrete-Unit-Based Style Transfer [53.72998363956454]
Direct speech-to-speech translation (S2ST) with discrete self-supervised representations has achieved remarkable accuracy.
The scarcity of high-quality speaker-parallel data poses a challenge for learning style transfer during translation.
We design an S2ST pipeline with style-transfer capability on the basis of discrete self-supervised speech representations and timbre units.
arXiv Detail & Related papers (2023-09-14T09:52:08Z)
- Stylebook: Content-Dependent Speaking Style Modeling for Any-to-Any Voice Conversion using Only Speech Data [2.6217304977339473]
We propose a novel method to extract rich style information from target utterances and to efficiently transfer it to source speech content.
Our proposed approach introduces an attention mechanism utilizing a self-supervised learning (SSL) model.
Experimental results show that our proposed method, combined with a diffusion-based generative model, can achieve better speaker similarity in any-to-any voice conversion tasks.
arXiv Detail & Related papers (2023-09-06T05:33:54Z)
- Improving Prosody for Cross-Speaker Style Transfer by Semi-Supervised Style Extractor and Hierarchical Modeling in Speech Synthesis [37.65745551401636]
Cross-speaker style transfer in speech synthesis aims at transferring a style from a source speaker to synthesized speech in a target speaker's timbre.
In most previous methods, the synthesized fine-grained prosody features often represent the source speaker's average style.
A strength-controlled semi-supervised style extractor is proposed to disentangle the style from content and timbre.
arXiv Detail & Related papers (2023-03-14T08:52:58Z)
- Style-Label-Free: Cross-Speaker Style Transfer by Quantized VAE and Speaker-wise Normalization in Speech Synthesis [37.19266733527613]
Cross-speaker style transfer in speech synthesis aims at transferring a style from a source speaker to synthesised speech in a target speaker's timbre.
Most previous approaches rely on data with style labels, but manually-annotated labels are expensive and not always reliable.
We propose Style-Label-Free, a cross-speaker style transfer method, which can realize the style transfer from source speaker to target speaker without style labels.
arXiv Detail & Related papers (2022-12-13T06:26:25Z)
- GenerSpeech: Towards Style Transfer for Generalizable Out-Of-Domain Text-to-Speech Synthesis [68.42632589736881]
This paper proposes GenerSpeech, a text-to-speech model for high-fidelity zero-shot style transfer of out-of-domain (OOD) custom voice.
GenerSpeech decomposes the speech variation into the style-agnostic and style-specific parts by introducing two components.
Our evaluations on zero-shot style transfer demonstrate that GenerSpeech surpasses the state-of-the-art models in terms of audio quality and style similarity.
arXiv Detail & Related papers (2022-05-15T08:16:02Z)
- Using multiple reference audios and style embedding constraints for speech synthesis [68.62945852651383]
The proposed model can improve the speech naturalness and content quality with multiple reference audios.
The model can also outperform the baseline model in ABX preference tests of style similarity.
arXiv Detail & Related papers (2021-10-09T04:24:29Z)
- Meta-StyleSpeech : Multi-Speaker Adaptive Text-to-Speech Generation [63.561944239071615]
StyleSpeech is a new TTS model which synthesizes high-quality speech and adapts to new speakers.
With SALN, our model effectively synthesizes speech in the style of the target speaker even from a single speech audio sample.
We extend it to Meta-StyleSpeech by introducing two discriminators trained with style prototypes, and performing episodic training.
arXiv Detail & Related papers (2021-06-06T15:34:11Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.