Style-Label-Free: Cross-Speaker Style Transfer by Quantized VAE and
Speaker-wise Normalization in Speech Synthesis
- URL: http://arxiv.org/abs/2212.06397v1
- Date: Tue, 13 Dec 2022 06:26:25 GMT
- Title: Style-Label-Free: Cross-Speaker Style Transfer by Quantized VAE and
Speaker-wise Normalization in Speech Synthesis
- Authors: Chunyu Qiang, Peng Yang, Hao Che, Xiaorui Wang, Zhongyuan Wang
- Abstract summary: Cross-speaker style transfer in speech synthesis aims at transferring a style from a source speaker to the synthesised speech of a target speaker's timbre.
Most previous approaches rely on data with style labels, but manually annotated labels are expensive and not always reliable.
We propose Style-Label-Free, a cross-speaker style transfer method that can realize style transfer from a source speaker to a target speaker without style labels.
- Score: 37.19266733527613
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Cross-speaker style transfer in speech synthesis aims at transferring a style
from a source speaker to the synthesised speech of a target speaker's timbre. Most
previous approaches rely on data with style labels, but manually annotated
labels are expensive and not always reliable. In response to this problem, we
propose Style-Label-Free, a cross-speaker style transfer method that can
realize style transfer from a source speaker to a target speaker without style
labels. Firstly, a reference encoder based on a quantized variational
autoencoder (Q-VAE) and a style bottleneck is designed to extract discrete style
representations. Secondly, a speaker-wise batch normalization layer is proposed
to reduce source-speaker leakage. To improve the style extraction ability of the
reference encoder, a style-invariant and contrastive data augmentation method is
proposed. Experimental results show that the method outperforms the baseline.
We provide a website with audio samples.
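
The abstract names two components that are concrete enough to sketch: a quantized-VAE (Q-VAE) reference encoder with a style bottleneck that yields discrete style representations, and a speaker-wise batch normalization layer intended to reduce source-speaker leakage. Below is a minimal illustrative sketch in PyTorch, not the authors' implementation: the GRU-based encoder, the class names, and all hyperparameters (codebook size, channel widths, number of speakers) are assumptions chosen for readability, and the paper's style-invariant contrastive data augmentation and TTS backbone are omitted. Tying separate normalization statistics and affine parameters to each speaker is one plausible reading of "speaker-wise normalization": the style vector is normalized and re-scaled per speaker so that source-speaker information is suppressed before quantization.

# Illustrative sketch only; layer sizes and structure are assumptions,
# not the configuration reported in the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F


class VectorQuantizer(nn.Module):
    """Q-VAE style codebook: maps each input vector to its nearest
    codeword and passes gradients straight through."""

    def __init__(self, num_codes: int = 64, dim: int = 128, beta: float = 0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)
        nn.init.uniform_(self.codebook.weight, -1.0 / num_codes, 1.0 / num_codes)
        self.beta = beta

    def forward(self, z):                      # z: (batch, time, dim)
        flat = z.reshape(-1, z.size(-1))       # (batch*time, dim)
        dist = torch.cdist(flat, self.codebook.weight)   # distance to each codeword
        idx = dist.argmin(dim=-1)
        q = self.codebook(idx).view_as(z)      # quantized latents
        # standard VQ-VAE objective: codebook loss + commitment loss
        loss = F.mse_loss(q, z.detach()) + self.beta * F.mse_loss(z, q.detach())
        q = z + (q - z).detach()               # straight-through estimator
        return q, loss, idx.view(z.shape[:-1])


class SpeakerWiseBatchNorm(nn.Module):
    """Speaker-wise batch normalization: shared normalization followed by
    per-speaker scale and bias (an assumed shape for the layer)."""

    def __init__(self, num_speakers: int, dim: int):
        super().__init__()
        self.bn = nn.BatchNorm1d(dim, affine=False)
        self.gamma = nn.Embedding(num_speakers, dim)
        self.bias = nn.Embedding(num_speakers, dim)
        nn.init.ones_(self.gamma.weight)
        nn.init.zeros_(self.bias.weight)

    def forward(self, x, speaker_id):          # x: (batch, dim)
        x = self.bn(x)
        return self.gamma(speaker_id) * x + self.bias(speaker_id)


class ReferenceEncoder(nn.Module):
    """Reference mel frames -> GRU summary -> style bottleneck ->
    speaker-wise normalization -> discrete style token via the quantizer."""

    def __init__(self, n_mels: int = 80, dim: int = 128,
                 num_codes: int = 64, num_speakers: int = 10):
        super().__init__()
        self.rnn = nn.GRU(n_mels, dim, batch_first=True)
        self.bottleneck = nn.Linear(dim, dim)
        self.speaker_bn = SpeakerWiseBatchNorm(num_speakers, dim)
        self.vq = VectorQuantizer(num_codes, dim)

    def forward(self, mel, speaker_id):        # mel: (batch, time, n_mels)
        _, h = self.rnn(mel)                   # h: (1, batch, dim)
        style = torch.tanh(self.bottleneck(h.squeeze(0)))
        style = self.speaker_bn(style, speaker_id)
        q, vq_loss, code = self.vq(style.unsqueeze(1))   # treat as length-1 sequence
        return q.squeeze(1), vq_loss, code.squeeze(1)


if __name__ == "__main__":
    enc = ReferenceEncoder()
    mel = torch.randn(4, 200, 80)              # toy batch of reference mels
    spk = torch.tensor([0, 1, 2, 3])
    style, vq_loss, code = enc(mel, spk)
    print(style.shape, vq_loss.item(), code)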
Related papers
- Pureformer-VC: Non-parallel One-Shot Voice Conversion with Pure Transformer Blocks and Triplet Discriminative Training [3.9306467064810438]
One-shot voice conversion aims to change the timbre of any source speech to match that of the target speaker with only one speech sample.
Existing style-transfer-based VC methods rely on speech representation disentanglement.
We propose Pureformer-VC, which utilizes Conformer blocks to build a disentangled encoder, and Zipformer blocks to build a style transfer decoder.
arXiv Detail & Related papers (2024-09-03T07:21:19Z)
- StyleSpeech: Self-supervised Style Enhancing with VQ-VAE-based Pre-training for Expressive Audiobook Speech Synthesis [63.019962126807116]
The expressive quality of synthesized speech for audiobooks is limited by generalized model architecture and unbalanced style distribution.
We propose a self-supervised style enhancing method with VQ-VAE-based pre-training for expressive audiobook speech synthesis.
arXiv Detail & Related papers (2023-12-19T14:13:26Z)
- Speech-to-Speech Translation with Discrete-Unit-Based Style Transfer [53.72998363956454]
Direct speech-to-speech translation (S2ST) with discrete self-supervised representations has achieved remarkable accuracy.
The scarcity of high-quality speaker-parallel data poses a challenge for learning style transfer during translation.
We design an S2ST pipeline with style-transfer capability on the basis of discrete self-supervised speech representations and timbre units.
arXiv Detail & Related papers (2023-09-14T09:52:08Z)
- Stylebook: Content-Dependent Speaking Style Modeling for Any-to-Any Voice Conversion using Only Speech Data [2.6217304977339473]
We propose a novel method to extract rich style information from target utterances and to efficiently transfer it to source speech content.
Our proposed approach introduces an attention mechanism utilizing a self-supervised learning (SSL) model.
Experiment results show that our proposed method combined with a diffusion-based generative model can achieve better speaker similarity in any-to-any voice conversion tasks.
arXiv Detail & Related papers (2023-09-06T05:33:54Z)
- ParaGuide: Guided Diffusion Paraphrasers for Plug-and-Play Textual Style Transfer [57.6482608202409]
Textual style transfer is the task of transforming stylistic properties of text while preserving meaning.
We introduce a novel diffusion-based framework for general-purpose style transfer that can be flexibly adapted to arbitrary target styles.
We validate the method on the Enron Email Corpus, with both human and automatic evaluations, and find that it outperforms strong baselines on formality, sentiment, and even authorship style transfer.
arXiv Detail & Related papers (2023-08-29T17:36:02Z)
- Improving Prosody for Cross-Speaker Style Transfer by Semi-Supervised Style Extractor and Hierarchical Modeling in Speech Synthesis [37.65745551401636]
Cross-speaker style transfer in speech synthesis aims at transferring a style from source speaker to synthesized speech of a target speaker's timbre.
In most previous methods, the synthesized fine-grained prosody features often represent the source speaker's average style.
A strength-controlled semi-supervised style extractor is proposed to disentangle the style from content and timbre.
arXiv Detail & Related papers (2023-03-14T08:52:58Z)
- Towards Cross-speaker Reading Style Transfer on Audiobook Dataset [43.99232352300273]
Cross-speaker style transfer aims to extract the speech style of a given reference speech.
Audiobook datasets are typically characterized by both local prosody and global genre.
arXiv Detail & Related papers (2022-08-10T14:08:35Z)
- Using multiple reference audios and style embedding constraints for speech synthesis [68.62945852651383]
The proposed model can improve the speech naturalness and content quality with multiple reference audios.
The model can also outperform the baseline model in ABX preference tests of style similarity.
arXiv Detail & Related papers (2021-10-09T04:24:29Z)
- Cross-speaker Style Transfer with Prosody Bottleneck in Neural Speech Synthesis [8.603535906880937]
Cross-speaker style transfer is crucial to the applications of multi-style and expressive speech synthesis at scale.
Existing style transfer methods are still far behind real application needs.
We propose a cross-speaker style transfer text-to-speech model with an explicit prosody bottleneck.
arXiv Detail & Related papers (2021-07-27T02:43:57Z)
- Exploring Contextual Word-level Style Relevance for Unsupervised Style Transfer [60.07283363509065]
Unsupervised style transfer aims to change the style of an input sentence while preserving its original content.
We propose a novel attentional sequence-to-sequence model that exploits the relevance of each output word to the target style.
Experimental results show that our proposed model achieves state-of-the-art performance in terms of both transfer accuracy and content preservation.
arXiv Detail & Related papers (2020-05-05T10:24:28Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information and is not responsible for any consequences of its use.