Improving Prosody for Cross-Speaker Style Transfer by Semi-Supervised
Style Extractor and Hierarchical Modeling in Speech Synthesis
- URL: http://arxiv.org/abs/2303.07711v1
- Date: Tue, 14 Mar 2023 08:52:58 GMT
- Title: Improving Prosody for Cross-Speaker Style Transfer by Semi-Supervised
Style Extractor and Hierarchical Modeling in Speech Synthesis
- Authors: Chunyu Qiang, Peng Yang, Hao Che, Ying Zhang, Xiaorui Wang, Zhongyuan
Wang
- Abstract summary: Cross-speaker style transfer in speech synthesis aims to transfer a style from a source speaker to synthesized speech in a target speaker's timbre.
In most previous methods, the synthesized fine-grained prosody features often represent the source speaker's average style.
A strength-controlled semi-supervised style extractor is proposed to disentangle style from content and timbre.
- Score: 37.65745551401636
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Cross-speaker style transfer in speech synthesis aims to transfer a style
from a source speaker to synthesized speech in a target speaker's timbre. In most
previous methods, the synthesized fine-grained prosody features often represent
the source speaker's average style, similar to the one-to-many problem (i.e.,
multiple prosody variations correspond to the same text). In response to this
problem, a strength-controlled semi-supervised style extractor is proposed to
disentangle the style from content and timbre, improving the representation and
interpretability of the global style embedding, which can alleviate the
one-to-many mapping and data imbalance problems in prosody prediction. A
hierarchical prosody predictor is proposed to improve prosody modeling. We find
that better style transfer can be achieved by using the source speaker's
prosody features that are easily predicted. Additionally, a
speaker-transfer-wise cycle consistency loss is proposed to assist the model in
learning unseen style-timbre combinations during the training phase.
Experimental results show that the method outperforms the baseline. We provide
a website with audio samples.
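The abstract names a speaker-transfer-wise cycle consistency loss but does not spell it out. The PyTorch sketch below is one plausible reading of such a term, assuming hypothetical `style_extractor` and `synthesize` modules (their names, signatures, and the L1 distance are illustrative assumptions, not the paper's implementation):

```python
import torch.nn.functional as F

def cycle_consistency_loss(style_extractor, synthesize,
                           src_mel, text_emb, tgt_speaker_emb):
    """Hypothetical speaker-transfer-wise cycle consistency term.

    Idea: synthesize speech combining the source speaker's style with
    the target speaker's timbre (a style-timbre pairing unseen during
    training), re-extract the style from the result, and require it to
    match the style of the source utterance.
    """
    # Global style embedding of the source utterance; the extractor is
    # assumed to disentangle style from content and timbre.
    src_style = style_extractor(src_mel)

    # Synthesize with the source style but the target speaker's timbre.
    transferred_mel = synthesize(text_emb, src_style, tgt_speaker_emb)

    # Re-extract the style from the transferred speech.
    cycled_style = style_extractor(transferred_mel)

    # Penalize drift of the style embedding across the speaker transfer.
    return F.l1_loss(cycled_style, src_style)
```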
Related papers
- Speech-to-Speech Translation with Discrete-Unit-Based Style Transfer [53.72998363956454]
Direct speech-to-speech translation (S2ST) with discrete self-supervised representations has achieved remarkable accuracy.
The scarcity of high-quality speaker-parallel data poses a challenge for learning style transfer during translation.
We design an S2ST pipeline with style-transfer capability on the basis of discrete self-supervised speech representations and timbre units.
arXiv Detail & Related papers (2023-09-14T09:52:08Z)
- Zero-shot text-to-speech synthesis conditioned using self-supervised speech representation model [13.572330725278066]
A novel point of the proposed method is the direct use of the SSL model to obtain embedding vectors from speech representations trained with a large amount of data.
The disentangled embeddings enable better reproduction performance for unseen speakers and rhythm transfer conditioned on different utterances.
arXiv Detail & Related papers (2023-04-24T10:15:58Z)
- Style-Label-Free: Cross-Speaker Style Transfer by Quantized VAE and Speaker-wise Normalization in Speech Synthesis [37.19266733527613]
Cross-speaker style transfer in speech synthesis aims to transfer a style from a source speaker to synthesized speech in a target speaker's timbre.
Most previous approaches rely on data with style labels, but manually-annotated labels are expensive and not always reliable.
We propose Style-Label-Free, a cross-speaker style transfer method that can realize style transfer from a source speaker to a target speaker without style labels.
arXiv Detail & Related papers (2022-12-13T06:26:25Z)
- GenerSpeech: Towards Style Transfer for Generalizable Out-Of-Domain Text-to-Speech Synthesis [68.42632589736881]
This paper proposes GenerSpeech, a text-to-speech model towards high-fidelity zero-shot style transfer of OOD custom voice.
GenerSpeech decomposes the speech variation into the style-agnostic and style-specific parts by introducing two components.
Our evaluations on zero-shot style transfer demonstrate that GenerSpeech surpasses the state-of-the-art models in terms of audio quality and style similarity.
arXiv Detail & Related papers (2022-05-15T08:16:02Z)
- Speaker Adaption with Intuitive Prosodic Features for Statistical Parametric Speech Synthesis [50.5027550591763]
We propose a method of speaker adaption with intuitive prosodic features for statistical parametric speech synthesis.
The intuitive prosodic features are extracted at utterance-level or speaker-level, and are further integrated into the existing speaker-encoding-based and speaker-embedding-based adaptation frameworks respectively.
arXiv Detail & Related papers (2022-03-02T09:00:31Z)
- Discretization and Re-synthesis: an alternative method to solve the Cocktail Party Problem [65.25725367771075]
This study demonstrates, for the first time, that the synthesis-based approach can also perform well on this problem.
Specifically, we propose a novel speech separation/enhancement model based on the recognition of discrete symbols.
After predicting the discrete symbol sequence, each target speech can be re-synthesized by feeding the symbols to the synthesis model.
arXiv Detail & Related papers (2021-12-17T08:35:40Z)
- Cross-speaker Style Transfer with Prosody Bottleneck in Neural Speech Synthesis [8.603535906880937]
Cross-speaker style transfer is crucial to the applications of multi-style and expressive speech synthesis at scale.
Existing style transfer methods are still far behind real application needs.
We propose a cross-speaker style transfer text-to-speech model with an explicit prosody bottleneck.
arXiv Detail & Related papers (2021-07-27T02:43:57Z)
- Towards Multi-Scale Style Control for Expressive Speech Synthesis [60.08928435252417]
The proposed method employs a multi-scale reference encoder to extract both the global-scale utterance-level and the local-scale quasi-phoneme-level style features of the target speech.
During training, the multi-scale style model can be jointly trained with the speech synthesis model in an end-to-end fashion.
arXiv Detail & Related papers (2021-04-08T05:50:09Z)
- Few Shot Adaptive Normalization Driven Multi-Speaker Speech Synthesis [18.812696623555855]
We present a novel few-shot multi-speaker speech synthesis approach (FSM-SS).
Given an input text and a reference speech sample of an unseen person, FSM-SS can generate speech in that person's style in a few-shot manner.
We demonstrate how the affine parameters of normalization help capture prosodic features such as energy and fundamental frequency (see the sketch after this list).
arXiv Detail & Related papers (2020-12-14T04:37:07Z)
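The FSM-SS entry above credits the affine parameters of normalization with capturing prosodic features. As a rough illustration of that idea, here is a minimal FiLM/AdaIN-style sketch of conditional affine normalization in PyTorch; the class name, the use of LayerNorm, and the reference-embedding interface are assumptions for illustration, not FSM-SS's actual architecture:

```python
import torch
import torch.nn as nn

class AdaptiveNorm(nn.Module):
    """Hypothetical FiLM/AdaIN-style conditional normalization.

    The per-channel scale (gamma) and shift (beta) applied after
    normalization are predicted from a reference embedding, so
    speaker/prosody characteristics (e.g. energy level, F0 range)
    can be injected through the affine parameters alone.
    """

    def __init__(self, channels: int, ref_dim: int):
        super().__init__()
        # Normalize without learned affine terms; the affine part
        # comes from the reference embedding instead.
        self.norm = nn.LayerNorm(channels, elementwise_affine=False)
        self.affine = nn.Linear(ref_dim, 2 * channels)

    def forward(self, x: torch.Tensor, ref: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, channels); ref: (batch, ref_dim)
        gamma, beta = self.affine(ref).chunk(2, dim=-1)
        return self.norm(x) * (1 + gamma.unsqueeze(1)) + beta.unsqueeze(1)
```

Scaling by (1 + gamma) rather than gamma alone keeps the layer close to an identity transform at initialization, a common stabilization choice in FiLM-style conditioning.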