Towards Cross-speaker Reading Style Transfer on Audiobook Dataset
- URL: http://arxiv.org/abs/2208.05359v1
- Date: Wed, 10 Aug 2022 14:08:35 GMT
- Title: Towards Cross-speaker Reading Style Transfer on Audiobook Dataset
- Authors: Xiang Li, Changhe Song, Xianhao Wei, Zhiyong Wu, Jia Jia, Helen Meng
- Abstract summary: Cross-speaker style transfer aims to extract the speech style of a given reference speech so that it can be reproduced in the timbre of arbitrary target speakers. Audiobook datasets, however, are typically characterized by both local prosody and global genre.
- Score: 43.99232352300273
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Cross-speaker style transfer aims to extract the speech style of a given reference speech so that it can be reproduced in the timbre of arbitrary target speakers. Existing methods on this topic have explored utilizing utterance-level style labels to perform style transfer via either global- or local-scale style representations. However, audiobook datasets are typically characterized by both local prosody and global genre, and are rarely accompanied by utterance-level style labels. Thus, properly transferring the reading style across different speakers remains a challenging task. This paper introduces a chunk-wise multi-scale cross-speaker style model to capture both the global genre and the local prosody in audiobook speech. Moreover, by disentangling speaker timbre and style with the proposed switchable adversarial classifiers, the extracted reading style is made adaptable to the timbre of different speakers. Experimental results confirm that the model manages to transfer a given reading style to new target speakers. With the support of local prosody and global genre type predictors, the potential of the proposed method for multi-speaker audiobook generation is further demonstrated.
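The disentanglement step is the most implementation-sensitive part of the abstract above. As a rough, hypothetical sketch (not the paper's exact architecture, and omitting the "switchable" aspect), the standard way to make a style embedding speaker-independent is an adversarial speaker classifier behind a gradient-reversal layer; all module names and dimensions below are invented for illustration.

```python
# Hypothetical sketch: speaker-adversarial training of a style embedding
# via gradient reversal. Names and sizes are illustrative, not the paper's.
import torch
import torch.nn as nn


class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies gradients by -lambda."""

    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None


class AdversarialSpeakerClassifier(nn.Module):
    """Predicts speaker ID from the style embedding through gradient
    reversal, so the upstream style encoder learns to hide speaker cues."""

    def __init__(self, style_dim=128, n_speakers=10, lambd=1.0):
        super().__init__()
        self.lambd = lambd
        self.net = nn.Sequential(
            nn.Linear(style_dim, 256), nn.ReLU(), nn.Linear(256, n_speakers)
        )

    def forward(self, style_emb):
        return self.net(GradReverse.apply(style_emb, self.lambd))


# Usage: the adversarial loss is simply added to the synthesis loss.
style_emb = torch.randn(4, 128, requires_grad=True)  # from a style encoder
speaker_ids = torch.tensor([0, 1, 2, 3])
adv_loss = nn.CrossEntropyLoss()(
    AdversarialSpeakerClassifier()(style_emb), speaker_ids
)
adv_loss.backward()  # style_emb now receives reversed gradients
```

In training, the classifier learns to identify the speaker from the style embedding while the reversed gradients push the style encoder to erase speaker identity, leaving a style representation that transfers across timbres.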
Related papers
- Speech-to-Speech Translation with Discrete-Unit-Based Style Transfer [53.72998363956454]
Direct speech-to-speech translation (S2ST) with discrete self-supervised representations has achieved remarkable accuracy.
The scarcity of high-quality speaker-parallel data poses a challenge for learning style transfer during translation.
We design an S2ST pipeline with style-transfer capability based on discrete self-supervised speech representations and timbre units.
arXiv Detail & Related papers (2023-09-14T09:52:08Z)
- Stylebook: Content-Dependent Speaking Style Modeling for Any-to-Any Voice Conversion using Only Speech Data [2.6217304977339473]
We propose a novel method to extract rich style information from target utterances and to efficiently transfer it to source speech content.
Our proposed approach introduces an attention mechanism utilizing a self-supervised learning (SSL) model.
Experimental results show that our proposed method, combined with a diffusion-based generative model, can achieve better speaker similarity in any-to-any voice conversion tasks.
arXiv Detail & Related papers (2023-09-06T05:33:54Z)
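For the Stylebook entry above, the attention mechanism over self-supervised (SSL) features can be pictured as cross-attention pooling: learned query tokens attend over the target utterance's SSL frames to produce a set of style tokens. The sketch below assumes precomputed SSL features (e.g. HuBERT-like, 768-dimensional) and illustrates the general mechanism, not the paper's model.

```python
# Hypothetical cross-attention style pooling over precomputed SSL features;
# the general mechanism only, not the Stylebook architecture itself.
import torch
import torch.nn as nn


class StylePooler(nn.Module):
    def __init__(self, ssl_dim=768, n_tokens=8, n_heads=8):
        super().__init__()
        # Learned query tokens that summarize the target utterance.
        self.queries = nn.Parameter(torch.randn(n_tokens, ssl_dim))
        self.attn = nn.MultiheadAttention(ssl_dim, n_heads, batch_first=True)

    def forward(self, ssl_feats):  # (batch, frames, ssl_dim)
        q = self.queries.expand(ssl_feats.size(0), -1, -1)
        style_tokens, _ = self.attn(q, ssl_feats, ssl_feats)
        return style_tokens  # (batch, n_tokens, ssl_dim)


feats = torch.randn(2, 200, 768)   # e.g. 200 frames of SSL features
print(StylePooler()(feats).shape)  # torch.Size([2, 8, 768])
```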
- ParaGuide: Guided Diffusion Paraphrasers for Plug-and-Play Textual Style Transfer [57.6482608202409]
Textual style transfer is the task of transforming stylistic properties of text while preserving meaning.
We introduce a novel diffusion-based framework for general-purpose style transfer that can be flexibly adapted to arbitrary target styles.
We validate the method on the Enron Email Corpus, with both human and automatic evaluations, and find that it outperforms strong baselines on formality, sentiment, and even authorship style transfer.
arXiv Detail & Related papers (2023-08-29T17:36:02Z)
- Improving Prosody for Cross-Speaker Style Transfer by Semi-Supervised Style Extractor and Hierarchical Modeling in Speech Synthesis [37.65745551401636]
Cross-speaker style transfer in speech synthesis aims to transfer a style from a source speaker to synthesized speech in a target speaker's timbre.
In most previous methods, the synthesized fine-grained prosody features often represent the source speaker's average style.
A strength-controlled semi-supervised style extractor is proposed to disentangle the style from content and timbre.
arXiv Detail & Related papers (2023-03-14T08:52:58Z)
- Style-Label-Free: Cross-Speaker Style Transfer by Quantized VAE and Speaker-wise Normalization in Speech Synthesis [37.19266733527613]
Cross-speaker style transfer in speech synthesis aims to transfer a style from a source speaker to synthesized speech in a target speaker's timbre.
Most previous approaches rely on data with style labels, but manually annotated labels are expensive and not always reliable.
We propose Style-Label-Free, a cross-speaker style transfer method that realizes style transfer from a source speaker to a target speaker without style labels.
arXiv Detail & Related papers (2022-12-13T06:26:25Z)
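For the Style-Label-Free entry above, the two named ingredients, a quantized VAE and speaker-wise normalization, can be sketched as a nearest-neighbour codebook lookup plus per-speaker feature standardization. Both details below are assumptions made for illustration; the paper's actual quantizer and normalization statistics may differ.

```python
# Toy vector quantization plus speaker-wise normalization; illustrative
# only, not the Style-Label-Free implementation.
import torch
import torch.nn as nn


class VectorQuantizer(nn.Module):
    def __init__(self, n_codes=64, dim=128):
        super().__init__()
        self.codebook = nn.Embedding(n_codes, dim)

    def forward(self, z):  # (batch, frames, dim)
        # Nearest codebook entry per frame (straight-through estimator
        # and commitment loss omitted for brevity).
        codes = self.codebook.weight.expand(z.size(0), -1, -1)
        idx = torch.cdist(z, codes).argmin(dim=-1)  # (batch, frames)
        return self.codebook(idx), idx


def speaker_normalize(feats, speaker_ids):
    """Remove per-speaker mean/std so the residual carries style rather
    than timbre; statistics are computed on the fly here for brevity."""
    out = torch.empty_like(feats)
    for s in speaker_ids.unique():
        m = speaker_ids == s
        mu, sigma = feats[m].mean(), feats[m].std()
        out[m] = (feats[m] - mu) / (sigma + 1e-5)
    return out


z = torch.randn(4, 50, 128)        # hypothetical encoder outputs
spk = torch.tensor([0, 0, 1, 1])   # speaker ID per utterance
quantized, idx = VectorQuantizer()(speaker_normalize(z, spk))
print(quantized.shape, idx.shape)  # (4, 50, 128) (4, 50)
```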
- Controllable speech synthesis by learning discrete phoneme-level prosodic representations [53.926969174260705]
We present a novel method for phoneme-level prosody control of F0 and duration using intuitive discrete labels.
We propose an unsupervised prosodic clustering process which is used to discretize phoneme-level F0 and duration features from a multispeaker speech dataset.
arXiv Detail & Related papers (2022-11-29T15:43:36Z)
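The prosodic clustering in the entry above amounts to discretizing per-phoneme prosody into a small label set. Below is a minimal sketch using scikit-learn k-means on synthetic data, assuming phoneme-level F0 and duration have already been extracted; the paper's feature set and clustering procedure may differ.

```python
# Illustrative k-means discretization of phoneme-level prosody on
# synthetic data; the paper's actual procedure may differ.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical per-phoneme features: mean log-F0 and duration (seconds).
feats = np.column_stack([
    rng.normal(5.0, 0.3, 1000),   # log-F0
    rng.gamma(2.0, 0.05, 1000),   # duration
])

# Standardize so F0 and duration contribute comparably, then cluster;
# the cluster index becomes the discrete prosodic label of each phoneme.
scaled = StandardScaler().fit_transform(feats)
labels = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(scaled)
print(np.bincount(labels))  # sizes of the 8 prosodic classes
```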
- GenerSpeech: Towards Style Transfer for Generalizable Out-Of-Domain Text-to-Speech Synthesis [68.42632589736881]
This paper proposes GenerSpeech, a text-to-speech model for high-fidelity zero-shot style transfer of out-of-domain (OOD) custom voice.
GenerSpeech decomposes the speech variation into the style-agnostic and style-specific parts by introducing two components.
Our evaluations on zero-shot style transfer demonstrate that GenerSpeech surpasses the state-of-the-art models in terms of audio quality and style similarity.
arXiv Detail & Related papers (2022-05-15T08:16:02Z)
- Cross-speaker Style Transfer with Prosody Bottleneck in Neural Speech Synthesis [8.603535906880937]
Cross-speaker style transfer is crucial to the applications of multi-style and expressive speech synthesis at scale.
Existing style transfer methods still fall far short of real application needs.
We propose a cross-speaker style transfer text-to-speech model with explicit prosody bottleneck.
arXiv Detail & Related papers (2021-07-27T02:43:57Z)
- Speaker Diarization with Lexical Information [59.983797884955]
This work presents a novel approach for speaker diarization that leverages lexical information provided by automatic speech recognition.
We propose a speaker diarization system that can incorporate word-level speaker turn probabilities with speaker embeddings into a speaker clustering process to improve the overall diarization accuracy.
arXiv Detail & Related papers (2020-04-13T17:16:56Z)
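For the diarization entry above, one plausible way (an assumption, not the paper's stated algorithm) to fuse word-level speaker-turn probabilities with speaker embeddings is to damp the affinity between temporally adjacent segments that the lexical model flags as likely turn changes, then cluster the adjusted matrix:

```python
# Hypothetical fusion of speaker embeddings with word-level turn
# probabilities for diarization; not the paper's exact algorithm.
import numpy as np
from sklearn.cluster import AgglomerativeClustering  # scikit-learn >= 1.2


def cluster_with_turns(embeddings, turn_probs, n_speakers=2, weight=0.9):
    """embeddings: (n_segments, dim) speaker embeddings.
    turn_probs: (n_segments - 1,) probability of a speaker change between
    consecutive segments, e.g. from an ASR-driven lexical model."""
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    affinity = e @ e.T                     # cosine similarity
    for i, p in enumerate(turn_probs):     # damp likely cross-turn links
        affinity[i, i + 1] *= 1.0 - weight * p
        affinity[i + 1, i] = affinity[i, i + 1]
    distance = 1.0 - affinity
    np.fill_diagonal(distance, 0.0)
    return AgglomerativeClustering(
        n_clusters=n_speakers, metric="precomputed", linkage="average"
    ).fit_predict(distance)


emb = np.random.default_rng(1).normal(size=(6, 32))
turns = np.array([0.1, 0.9, 0.2, 0.1, 0.8])
print(cluster_with_turns(emb, turns))  # one speaker label per segment
```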
This list is automatically generated from the titles and abstracts of the papers on this site.