Stylebook: Content-Dependent Speaking Style Modeling for Any-to-Any
Voice Conversion using Only Speech Data
- URL: http://arxiv.org/abs/2309.02730v3
- Date: Fri, 15 Dec 2023 04:36:41 GMT
- Title: Stylebook: Content-Dependent Speaking Style Modeling for Any-to-Any
Voice Conversion using Only Speech Data
- Authors: Hyungseob Lim, Kyungguen Byun, Sunkuk Moon, Erik Visser
- Abstract summary: We propose a novel method to extract rich style information from target utterances and to efficiently transfer it to source speech content.
Our proposed approach introduces an attention mechanism utilizing a self-supervised learning (SSL) model.
Experimental results show that our proposed method, combined with a diffusion-based generative model, achieves better speaker similarity in any-to-any voice conversion tasks.
- Score: 2.6217304977339473
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While many recent any-to-any voice conversion models succeed in transferring
some of the target speech's style information to the converted speech, they still lack
the ability to faithfully reproduce the speaking style of the target speaker.
In this work, we propose a novel method to extract rich style information from
target utterances and to efficiently transfer it to source speech content
without requiring text transcriptions or speaker labels. Our proposed
approach introduces an attention mechanism that uses a self-supervised learning
(SSL) model to collect a target speaker's speaking styles, each
corresponding to different phonetic content. These styles are represented
as a set of embeddings called a stylebook. The stylebook is then attended
over with the source speech's phonetic content to determine the final
target style for each piece of source content. Finally, the content information
extracted from the source speech and the content-dependent target style
embeddings are fed into a diffusion-based decoder to generate the converted
speech mel-spectrogram. Experimental results show that, combined with a
diffusion-based generative model, our proposed method achieves better speaker
similarity in any-to-any voice conversion tasks than baseline models, while
suppressing the growth in computational complexity for longer utterances.
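A minimal sketch of the stylebook idea described in the abstract, under stated assumptions: the paper does not publish code, so the k-means pooling and dot-product attention below merely stand in for the authors' actual style-extraction and attention modules, and `build_stylebook`, `content_dependent_style`, and all shapes are hypothetical illustrative choices. Target SSL features are pooled into a small, fixed-size set of style embeddings, and each source content frame then attends over this stylebook to obtain a content-dependent style.

```python
import torch
import torch.nn.functional as F

def build_stylebook(target_feats: torch.Tensor, n_styles: int = 32,
                    iters: int = 10) -> torch.Tensor:
    """Pool target SSL features (T, D) into n_styles style embeddings (N, D)
    via a few rounds of k-means; each centroid summarizes the style of one
    region of phonetic content in the target speech. (Stand-in for the
    paper's extraction method.)"""
    idx = torch.randperm(target_feats.size(0))[:n_styles]
    centroids = target_feats[idx].clone()
    for _ in range(iters):
        # Assign every frame to its nearest centroid, then re-average.
        assign = torch.cdist(target_feats, centroids).argmin(dim=1)
        for k in range(centroids.size(0)):
            mask = assign == k
            if mask.any():
                centroids[k] = target_feats[mask].mean(dim=0)
    return centroids

def content_dependent_style(source_content: torch.Tensor,
                            stylebook: torch.Tensor,
                            temperature: float = 1.0) -> torch.Tensor:
    """For each source frame (T, D), attend over the stylebook (N, D) and
    return a per-frame target style embedding (T, D)."""
    d = stylebook.size(1)
    scores = source_content @ stylebook.t() / (temperature * d ** 0.5)
    weights = F.softmax(scores, dim=-1)   # (T, N) attention over styles
    return weights @ stylebook            # (T, D) content-dependent styles

# Dummy SSL features: 200 source frames, 600 target frames, dim 768.
src = torch.randn(200, 768)
tgt = torch.randn(600, 768)
book = build_stylebook(tgt)                 # (32, 768) fixed-size stylebook
style = content_dependent_style(src, book)  # (200, 768); paired with the
                                            # source content for the decoder
```

Because the stylebook has a fixed number of entries regardless of target utterance length, the per-frame attention cost stays constant as utterances grow, which is consistent with the abstract's claim that the increase in computational complexity with longer utterances is suppressed.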