Stylebook: Content-Dependent Speaking Style Modeling for Any-to-Any
Voice Conversion using Only Speech Data
- URL: http://arxiv.org/abs/2309.02730v3
- Date: Fri, 15 Dec 2023 04:36:41 GMT
- Title: Stylebook: Content-Dependent Speaking Style Modeling for Any-to-Any
Voice Conversion using Only Speech Data
- Authors: Hyungseob Lim, Kyungguen Byun, Sunkuk Moon, Erik Visser
- Abstract summary: We propose a novel method to extract rich style information from target utterances and to efficiently transfer it to source speech content.
Our proposed approach introduces an attention mechanism utilizing a self-supervised learning (SSL) model.
Experimental results show that our proposed method, combined with a diffusion-based generative model, achieves better speaker similarity in any-to-any voice conversion tasks.
- Score: 2.6217304977339473
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While many recent any-to-any voice conversion models succeed in
transferring some of the target speech's style information to the converted
speech, they still lack the ability to faithfully reproduce the target
speaker's speaking style.
In this work, we propose a novel method to extract rich style information from
target utterances and to efficiently transfer it to source speech content
without requiring text transcriptions or speaker labeling. Our proposed
approach introduces an attention mechanism that uses a self-supervised learning
(SSL) model to collect a target speaker's speaking styles, each corresponding
to different phonetic content. The styles are represented as a set of
embeddings called a stylebook. Next, the source speech's phonetic content
attends to the stylebook to determine the final target style for each piece of
source content. Finally, content information extracted
from the source speech and content-dependent target style embeddings are fed
into a diffusion-based decoder to generate the converted speech
mel-spectrogram. Experimental results show that our proposed method, combined
with a diffusion-based generative model, achieves better speaker similarity
than baseline models in any-to-any voice conversion tasks, while suppressing
the growth in computational complexity for longer utterances.
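The content-dependent style lookup described above can be sketched in a few lines of numpy. This is a minimal illustration, not the paper's implementation: a naive k-means pass stands in for the paper's attention-based collection of target styles, all feature matrices are random placeholders for SSL/content features, and every function name and dimension is an assumption for the sketch.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def build_stylebook(target_feats, num_styles, rng):
    """Condense target-utterance features into a small set of style
    embeddings (the 'stylebook'). Naive k-means here is a stand-in for
    the paper's attention-based style collection."""
    centroids = target_feats[rng.choice(len(target_feats), num_styles, replace=False)]
    for _ in range(10):
        dists = ((target_feats[:, None, :] - centroids[None]) ** 2).sum(-1)
        assign = dists.argmin(axis=1)
        for k in range(num_styles):
            members = target_feats[assign == k]
            if len(members):
                centroids[k] = members.mean(axis=0)
    return centroids

def content_dependent_style(source_content, stylebook):
    """Cross-attention: each source content frame (query) attends over
    the stylebook entries (keys/values) to pick its own target style."""
    scores = source_content @ stylebook.T / np.sqrt(stylebook.shape[1])
    weights = softmax(scores, axis=-1)   # (T_src, K), rows sum to 1
    return weights @ stylebook           # (T_src, D): one style per frame

rng = np.random.default_rng(0)
target_feats = rng.normal(size=(200, 16))  # stand-in SSL features, target utterance
source_feats = rng.normal(size=(120, 16))  # stand-in content features, source speech
stylebook = build_stylebook(target_feats, num_styles=8, rng=rng)
styles = content_dependent_style(source_feats, stylebook)
print(styles.shape)  # (120, 16)
```

Note the complexity angle: attending over K stylebook entries costs O(T_src x K) rather than the O(T_src x T_tgt) of attending over every target frame, which matches the abstract's claim that compute growth is suppressed for longer utterances.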
Related papers
- Natural language guidance of high-fidelity text-to-speech with synthetic
annotations [13.642358232817342]
We propose a scalable method for labeling various aspects of speaker identity, style, and recording conditions.
We then apply this method to a 45k hour dataset, which we use to train a speech language model.
Our results demonstrate high-fidelity speech generation in a diverse range of accents, prosodic styles, channel conditions, and acoustic conditions.
arXiv Detail & Related papers (2024-02-02T21:29:34Z)
- StyleSpeech: Self-supervised Style Enhancing with VQ-VAE-based
Pre-training for Expressive Audiobook Speech Synthesis [63.019962126807116]
The expressive quality of synthesized speech for audiobooks is limited by a generalized model architecture and an unbalanced style distribution.
We propose a self-supervised style enhancing method with VQ-VAE-based pre-training for expressive audiobook speech synthesis.
arXiv Detail & Related papers (2023-12-19T14:13:26Z)
- Self-supervised Fine-tuning for Improved Content Representations by
Speaker-invariant Clustering [78.2927924732142]
We propose speaker-invariant clustering (Spin) as a novel self-supervised learning method.
Spin disentangles speaker information and preserves content representations with just 45 minutes of fine-tuning on a single GPU.
arXiv Detail & Related papers (2023-05-18T15:59:36Z)
- Cross-lingual Text-To-Speech with Flow-based Voice Conversion for
Improved Pronunciation [11.336431583289382]
This paper presents a method for end-to-end cross-lingual text-to-speech.
It aims to preserve the target language's pronunciation regardless of the original speaker's language.
arXiv Detail & Related papers (2022-10-31T12:44:53Z)
- Zero-Shot Style Transfer for Gesture Animation driven by Text and Speech
using Adversarial Disentanglement of Multimodal Style Encoding [3.2116198597240846]
We propose an efficient yet effective machine learning approach to synthesize gestures driven by prosodic features and text in the style of different speakers.
Our model performs zero-shot multimodal style transfer driven by multimodal data from the PATS database, which contains videos of various speakers.
arXiv Detail & Related papers (2022-08-03T08:49:55Z)
- Self-supervised Context-aware Style Representation for Expressive Speech
Synthesis [23.460258571431414]
We propose a novel framework for learning style representation from plain text in a self-supervised manner.
It leverages an emotion lexicon and uses contrastive learning and deep clustering.
Our method achieves improved results according to subjective evaluations on both in-domain and out-of-domain test sets in audiobook speech.
arXiv Detail & Related papers (2022-06-25T05:29:48Z)
- GenerSpeech: Towards Style Transfer for Generalizable Out-Of-Domain
Text-to-Speech Synthesis [68.42632589736881]
This paper proposes GenerSpeech, a text-to-speech model towards high-fidelity zero-shot style transfer of OOD custom voice.
GenerSpeech decomposes the speech variation into the style-agnostic and style-specific parts by introducing two components.
Our evaluations on zero-shot style transfer demonstrate that GenerSpeech surpasses the state-of-the-art models in terms of audio quality and style similarity.
arXiv Detail & Related papers (2022-05-15T08:16:02Z)
- VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised
Speech Representation Disentanglement for One-shot Voice Conversion [54.29557210925752]
One-shot voice conversion can be effectively achieved by speech representation disentanglement.
We employ vector quantization (VQ) for content encoding and introduce mutual information (MI) as the correlation metric during training.
Experimental results reflect the superiority of the proposed method in learning effective disentangled speech representations.
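The VQ content-encoding step named in this summary can be sketched as nearest-codebook lookup. This is a toy illustration under stated assumptions: the codebook and encoder outputs are random placeholders, the sizes are arbitrary, and the mutual-information training objective from the paper is not shown.

```python
import numpy as np

def vector_quantize(frames, codebook):
    """Map each encoder frame to its nearest codebook entry (VQ).
    Discretizing this way discards fine-grained variation, which is
    one reason VQ is used to encode content while shedding speaker
    detail in VQMIVC-style models."""
    dists = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (T, K)
    codes = dists.argmin(axis=1)          # discrete content code per frame
    return codebook[codes], codes

rng = np.random.default_rng(1)
codebook = rng.normal(size=(64, 8))  # 64 discrete content codes (illustrative)
frames = rng.normal(size=(50, 8))    # stand-in content-encoder outputs
quantized, codes = vector_quantize(frames, codebook)
print(quantized.shape)  # (50, 8)
```

In the full model, quantized content codes are combined with separately encoded speaker and pitch representations, and the MI term penalizes residual correlation between them during training.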
arXiv Detail & Related papers (2021-06-18T13:50:38Z)
- Meta-StyleSpeech : Multi-Speaker Adaptive Text-to-Speech Generation [63.561944239071615]
StyleSpeech is a new TTS model which synthesizes high-quality speech and adapts to new speakers.
With SALN, our model effectively synthesizes speech in the style of the target speaker even from single speech audio.
We extend it to Meta-StyleSpeech by introducing two discriminators trained with style prototypes, and performing episodic training.
arXiv Detail & Related papers (2021-06-06T15:34:11Z)
- Towards Multi-Scale Style Control for Expressive Speech Synthesis [60.08928435252417]
The proposed method employs a multi-scale reference encoder to extract both the global-scale utterance-level and the local-scale quasi-phoneme-level style features of the target speech.
During training time, the multi-scale style model could be jointly trained with the speech synthesis model in an end-to-end fashion.
arXiv Detail & Related papers (2021-04-08T05:50:09Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.