Stylebook: Content-Dependent Speaking Style Modeling for Any-to-Any
Voice Conversion using Only Speech Data
- URL: http://arxiv.org/abs/2309.02730v3
- Date: Fri, 15 Dec 2023 04:36:41 GMT
- Title: Stylebook: Content-Dependent Speaking Style Modeling for Any-to-Any
Voice Conversion using Only Speech Data
- Authors: Hyungseob Lim, Kyungguen Byun, Sunkuk Moon, Erik Visser
- Abstract summary: We propose a novel method to extract rich style information from target utterances and to efficiently transfer it to source speech content.
Our proposed approach introduces an attention mechanism utilizing a self-supervised learning (SSL) model.
Experimental results show that our proposed method, combined with a diffusion-based generative model, achieves better speaker similarity in any-to-any voice conversion tasks.
- Score: 2.6217304977339473
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While many recent any-to-any voice conversion models succeed in transferring
some of the target speech's style information to the converted speech, they still lack
the ability to faithfully reproduce the speaking style of the target speaker.
In this work, we propose a novel method to extract rich style information from
target utterances and to efficiently transfer it to source speech content
without requiring text transcriptions or speaker labels. Our proposed
approach introduces an attention mechanism that uses a self-supervised learning
(SSL) model to collect a target speaker's speaking styles, each
corresponding to different phonetic content. These styles are represented
as a set of embeddings called a stylebook. The stylebook is then attended
over with the source speech's phonetic content to determine the final
target style for each piece of source content. Finally, the content information
extracted from the source speech and the content-dependent target style
embeddings are fed into a diffusion-based decoder to generate the converted
speech mel-spectrogram. Experimental results show that, combined with a
diffusion-based generative model, our proposed method achieves better speaker
similarity in any-to-any voice conversion tasks than baseline models, while
suppressing the growth in computational complexity for longer utterances.
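A minimal sketch of the stylebook idea described in the abstract, under stated assumptions: the paper does not publish code, so the k-means pooling and dot-product attention below merely stand in for the authors' actual style-extraction and attention modules, and `build_stylebook`, `content_dependent_style`, and all shapes are hypothetical illustrative choices. Target SSL features are pooled into a small, fixed-size set of style embeddings, and each source content frame then attends over this stylebook to obtain a content-dependent style.

```python
import torch
import torch.nn.functional as F

def build_stylebook(target_feats: torch.Tensor, n_styles: int = 32,
                    iters: int = 10) -> torch.Tensor:
    """Pool target SSL features (T, D) into n_styles style embeddings (N, D)
    via a few rounds of k-means; each centroid summarizes the style of one
    region of phonetic content in the target speech. (Stand-in for the
    paper's extraction method.)"""
    idx = torch.randperm(target_feats.size(0))[:n_styles]
    centroids = target_feats[idx].clone()
    for _ in range(iters):
        # Assign every frame to its nearest centroid, then re-average.
        assign = torch.cdist(target_feats, centroids).argmin(dim=1)
        for k in range(centroids.size(0)):
            mask = assign == k
            if mask.any():
                centroids[k] = target_feats[mask].mean(dim=0)
    return centroids

def content_dependent_style(source_content: torch.Tensor,
                            stylebook: torch.Tensor,
                            temperature: float = 1.0) -> torch.Tensor:
    """For each source frame (T, D), attend over the stylebook (N, D) and
    return a per-frame target style embedding (T, D)."""
    d = stylebook.size(1)
    scores = source_content @ stylebook.t() / (temperature * d ** 0.5)
    weights = F.softmax(scores, dim=-1)   # (T, N) attention over styles
    return weights @ stylebook            # (T, D) content-dependent styles

# Dummy SSL features: 200 source frames, 600 target frames, dim 768.
src = torch.randn(200, 768)
tgt = torch.randn(600, 768)
book = build_stylebook(tgt)                 # (32, 768) fixed-size stylebook
style = content_dependent_style(src, book)  # (200, 768); paired with the
                                            # source content for the decoder
```

Because the stylebook has a fixed number of entries regardless of target utterance length, the per-frame attention cost stays constant as utterances grow, which is consistent with the abstract's claim that the increase in computational complexity with longer utterances is suppressed.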