GenerSpeech: Towards Style Transfer for Generalizable Out-Of-Domain Text-to-Speech Synthesis
- URL: http://arxiv.org/abs/2205.07211v1
- Date: Sun, 15 May 2022 08:16:02 GMT
- Title: GenerSpeech: Towards Style Transfer for Generalizable Out-Of-Domain Text-to-Speech Synthesis
- Authors: Rongjie Huang, Yi Ren, Jinglin Liu, Chenye Cui, Zhou Zhao
- Abstract summary: This paper proposes GenerSpeech, a text-to-speech model towards high-fidelity zero-shot style transfer of OOD custom voice.
GenerSpeech decomposes the speech variation into the style-agnostic and style-specific parts by introducing two components.
Our evaluations on zero-shot style transfer demonstrate that GenerSpeech surpasses the state-of-the-art models in terms of audio quality and style similarity.
- Score: 68.42632589736881
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Style transfer for out-of-domain (OOD) speech synthesis aims to generate
speech samples with unseen style (e.g., speaker identity, emotion, and prosody)
derived from an acoustic reference, while facing the following challenges: 1)
The highly dynamic style features in expressive voice are difficult to model
and transfer; and 2) the TTS models should be robust enough to handle diverse
OOD conditions that differ from the source data. This paper proposes
GenerSpeech, a text-to-speech model towards high-fidelity zero-shot style
transfer of OOD custom voice. GenerSpeech decomposes the speech variation into
the style-agnostic and style-specific parts by introducing two components: 1) a
multi-level style adaptor to efficiently model a large range of style
conditions, including global speaker and emotion characteristics, and the local
(utterance, phoneme, and word-level) fine-grained prosodic representations; and
2) a generalizable content adaptor with Mix-Style Layer Normalization to
eliminate style information in the linguistic content representation and thus
improve model generalization. Our evaluations on zero-shot style transfer
demonstrate that GenerSpeech surpasses the state-of-the-art models in terms of
audio quality and style similarity. The extension studies to adaptive style
transfer further show that GenerSpeech performs robustly in the few-shot data
setting. Audio samples are available at https://GenerSpeech.github.io/.
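The abstract names Mix-Style Layer Normalization only at a high level. Below is a minimal PyTorch sketch of one way such a layer could work, combining a conditional layer norm with MixStyle-like mixing of style vectors across the batch; the class name, tensor shapes, and the Beta(0.2, 0.2) mixing distribution are illustrative assumptions, not the paper's released implementation.

```python
# Illustrative sketch only: a conditional LayerNorm whose scale/shift come from
# a style embedding that is randomly mixed with the style of another utterance
# in the batch, so the content encoder is discouraged from encoding style.
# Names, shapes, and the Beta(0.2, 0.2) mixing weight are assumptions, not the
# authors' code.
import torch
import torch.nn as nn


class MixStyleLayerNorm(nn.Module):
    def __init__(self, hidden_dim: int, style_dim: int, alpha: float = 0.2):
        super().__init__()
        # Plain LayerNorm without learnable affine; the affine parameters are
        # predicted from the (possibly mixed) style embedding instead.
        self.norm = nn.LayerNorm(hidden_dim, elementwise_affine=False)
        self.to_scale = nn.Linear(style_dim, hidden_dim)
        self.to_shift = nn.Linear(style_dim, hidden_dim)
        self.beta = torch.distributions.Beta(alpha, alpha)

    def forward(self, x: torch.Tensor, style: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, hidden_dim); style: (batch, style_dim)
        if self.training:
            # Mix each utterance's style vector with that of a randomly
            # permuted utterance, using a Beta-sampled interpolation weight.
            perm = torch.randperm(style.size(0), device=style.device)
            lam = self.beta.sample((style.size(0), 1)).to(style.device)
            style = lam * style + (1.0 - lam) * style[perm]
        scale = self.to_scale(style).unsqueeze(1)  # (batch, 1, hidden_dim)
        shift = self.to_shift(style).unsqueeze(1)
        return self.norm(x) * (1.0 + scale) + shift
```

Mixing the style condition across utterances during training perturbs the style cues that reach the content branch, which matches the stated goal of keeping the linguistic content representation style-agnostic.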
Related papers
- TransVIP: Speech to Speech Translation System with Voice and Isochrony Preservation [97.54885207518946]
We introduce a novel model framework TransVIP that leverages diverse datasets in a cascade fashion.
We propose two separate encoders to preserve the speaker's voice characteristics and isochrony from the source speech during the translation process.
Our experiments on the French-English language pair demonstrate that our model outperforms the current state-of-the-art speech-to-speech translation model.
arXiv Detail & Related papers (2024-05-28T04:11:37Z)
- Natural language guidance of high-fidelity text-to-speech with synthetic annotations [13.642358232817342]
We propose a scalable method for labeling various aspects of speaker identity, style, and recording conditions.
We then apply this method to a 45k hour dataset, which we use to train a speech language model.
Our results demonstrate high-fidelity speech generation in a diverse range of accents, prosodic styles, channel conditions, and acoustic conditions.
arXiv Detail & Related papers (2024-02-02T21:29:34Z)
- Speech-to-Speech Translation with Discrete-Unit-Based Style Transfer [53.72998363956454]
Direct speech-to-speech translation (S2ST) with discrete self-supervised representations has achieved remarkable accuracy.
The scarcity of high-quality speaker-parallel data poses a challenge for learning style transfer during translation.
We design an S2ST pipeline with style-transfer capability on the basis of discrete self-supervised speech representations and timbre units.
arXiv Detail & Related papers (2023-09-14T09:52:08Z)
- TextrolSpeech: A Text Style Control Speech Corpus With Codec Language Text-to-Speech Models [51.529485094900934]
We propose TextrolSpeech, which is the first large-scale speech emotion dataset annotated with rich text attributes.
We introduce a multi-stage prompt programming approach that effectively utilizes the GPT model for generating natural style descriptions in large volumes.
To address the need for generating audio with greater style diversity, we propose an efficient architecture called Salle.
arXiv Detail & Related papers (2023-08-28T09:06:32Z)
- Audio is all in one: speech-driven gesture synthetics using WavLM pre-trained model [2.827070255699381]
diffmotion-v2 is a speech-conditional diffusion-based generative model built on the WavLM pre-trained model.
It can produce individual, stylized, full-body co-speech gestures using only raw speech audio.
arXiv Detail & Related papers (2023-08-11T08:03:28Z)
- NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers [90.83782600932567]
We develop NaturalSpeech 2, a TTS system that leverages a neural audio codec with residual vector quantizers to obtain quantized latent vectors.
We scale NaturalSpeech 2 to large-scale datasets with 44K hours of speech and singing data and evaluate its voice quality on unseen speakers.
NaturalSpeech 2 outperforms previous TTS systems by a large margin in terms of prosody/timbre similarity, robustness, and voice quality in a zero-shot setting.
arXiv Detail & Related papers (2023-04-18T16:31:59Z)
- Referee: Towards reference-free cross-speaker style transfer with low-quality data for expressive speech synthesis [39.730034713382736]
Cross-speaker style transfer (CSST) in text-to-speech (TTS) aims at transferring a speaking style to the synthesised speech in a target speaker's voice.
This paper presents Referee, a robust reference-free CSST approach for expressive TTS, which fully leverages low-quality data to learn speaking styles from text.
arXiv Detail & Related papers (2021-09-08T05:39:34Z)
- Meta-StyleSpeech: Multi-Speaker Adaptive Text-to-Speech Generation [63.561944239071615]
StyleSpeech is a new TTS model which synthesizes high-quality speech and adapts to new speakers.
With Style-Adaptive Layer Normalization (SALN), our model effectively synthesizes speech in the style of the target speaker even from a single speech sample.
We extend it to Meta-StyleSpeech by introducing two discriminators trained with style prototypes, and performing episodic training.
arXiv Detail & Related papers (2021-06-06T15:34:11Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences.