Few Shot Adaptive Normalization Driven Multi-Speaker Speech Synthesis
- URL: http://arxiv.org/abs/2012.07252v1
- Date: Mon, 14 Dec 2020 04:37:07 GMT
- Title: Few Shot Adaptive Normalization Driven Multi-Speaker Speech Synthesis
- Authors: Neeraj Kumar, Srishti Goel, Ankur Narang, Brejesh Lall
- Abstract summary: We present a novel few shot multi-speaker speech synthesis approach (FSM-SS).
Given an input text and a reference speech sample of an unseen person, FSM-SS can generate speech in that person's style in a few shot manner.
We demonstrate how the affine parameters of normalization help in capturing the prosodic features such as energy and fundamental frequency.
- Score: 18.812696623555855
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: The style of the speech varies from person to person and every person
exhibits his or her own style of speaking that is determined by the language,
geography, culture and other factors. Style is best captured by prosody of a
signal. High quality multi-speaker speech synthesis while considering prosody
and in a few shot manner is an area of active research with many real-world
applications. While multiple efforts have been made in this direction, it
remains an interesting and challenging problem. In this paper, we present a
novel few shot multi-speaker speech synthesis approach (FSM-SS) that leverages
adaptive normalization architecture with a non-autoregressive multi-head
attention model. Given an input text and a reference speech sample of an unseen
person, FSM-SS can generate speech in that person's style in a few shot manner.
Additionally, we demonstrate how the affine parameters of normalization help in
capturing the prosodic features such as energy and fundamental frequency in a
disentangled fashion and can be used to generate morphed speech output. We
demonstrate the efficacy of our proposed architecture on multi-speaker VCTK and
LibriTTS datasets, using multiple quantitative metrics that measure generated
speech distortion and mean opinion score (MOS), along with speaker embedding
analysis of the generated speech vs the actual speech samples.
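
To make the idea of adaptive normalization concrete, the following is a minimal sketch, not the FSM-SS architecture itself. It assumes a hidden feature vector is first normalized to zero mean and unit variance, and that the affine parameters (scale `gamma` and shift `beta`) are then predicted from a reference "style" vector by simple linear projections. The function names, the style vector, and the projection weights are all hypothetical and chosen only for illustration.

```python
import math
from typing import List, Sequence


def normalize(x: Sequence[float], eps: float = 1e-5) -> List[float]:
    """Normalize one feature vector to zero mean and (near) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def linear(weights: Sequence[Sequence[float]],
           bias: Sequence[float],
           style: Sequence[float]) -> List[float]:
    """Plain linear projection: one output per row of `weights`."""
    return [sum(w * s for w, s in zip(row, style)) + b
            for row, b in zip(weights, bias)]


def adaptive_norm(x: Sequence[float],
                  style: Sequence[float],
                  w_gamma: Sequence[Sequence[float]],
                  b_gamma: Sequence[float],
                  w_beta: Sequence[Sequence[float]],
                  b_beta: Sequence[float]) -> List[float]:
    """Normalize `x`, then apply a scale and shift predicted from `style`.

    Assumption: gamma and beta come from linear projections of the style
    vector; a real model may use a deeper or differently shaped predictor.
    """
    gamma = linear(w_gamma, b_gamma, style)  # per-feature scale
    beta = linear(w_beta, b_beta, style)     # per-feature shift
    x_hat = normalize(x)
    return [g * h + b for g, h, b in zip(gamma, x_hat, beta)]


# Hypothetical example: 4 hidden features, 2-dimensional style vector.
hidden = [1.0, 2.0, 3.0, 4.0]
style_vec = [0.5, -1.0]
w_g = [[0.1, 0.0], [0.0, 0.1], [0.2, 0.0], [0.0, 0.2]]
w_b = [[0.0, 0.3], [0.3, 0.0], [0.0, 0.1], [0.1, 0.0]]
styled = adaptive_norm(hidden, style_vec, w_g, [1.0] * 4, w_b, [0.0] * 4)
```

Because the normalized features carry no information about their original mean or scale, whatever the output retains of those statistics must come from `gamma` and `beta`. In this sketch, that is the sense in which the affine parameters can carry information taken from the reference input.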
Related papers
- VoiceTextBlender: Augmenting Large Language Models with Speech Capabilities via Single-Stage Joint Speech-Text Supervised Fine-Tuning [64.56272011710735]
We propose a novel single-stage joint speech-text SFT approach that applies low-rank adaptation (LoRA) to the large language model (LLM) backbone.
Compared to previous SpeechLMs with 7B or 13B parameters, our 3B model demonstrates superior performance across various speech benchmarks.
arXiv Detail & Related papers (2024-10-23T00:36:06Z)
- DART: Disentanglement of Accent and Speaker Representation in Multispeaker Text-to-Speech [14.323313455208183]
We propose a novel approach to disentangle speaker and accent representations using multi-level variational autoencoders (ML-VAE) and vector quantization (VQ).
Our proposed method addresses the challenge of effectively separating speaker and accent characteristics, enabling more fine-grained control over the synthesized speech.
arXiv Detail & Related papers (2024-10-17T08:51:46Z)
- Generative Pre-trained Speech Language Model with Efficient Hierarchical Transformer [39.31849739010572]
We introduce the Generative Pre-trained Speech Transformer (GPST).
GPST is a hierarchical transformer designed for efficient speech language modeling.
arXiv Detail & Related papers (2024-06-03T04:16:30Z)
- TransVIP: Speech to Speech Translation System with Voice and Isochrony Preservation [97.54885207518946]
We introduce a novel model framework, TransVIP, that leverages diverse datasets in a cascade fashion.
We propose two separate encoders to preserve the speaker's voice characteristics and isochrony from the source speech during the translation process.
Our experiments on the French-English language pair demonstrate that our model outperforms the current state-of-the-art speech-to-speech translation model.
arXiv Detail & Related papers (2024-05-28T04:11:37Z)
- A Discourse-level Multi-scale Prosodic Model for Fine-grained Emotion Analysis [19.271542595753267]
This paper explores predicting suitable prosodic features for fine-grained emotion analysis from the discourse-level text.
We propose a Discourse-level Multi-scale text Prosodic Model (D-MPM) that exploits multi-scale text to predict these two prosodic features.
arXiv Detail & Related papers (2023-09-21T07:45:44Z)
- SpeechX: Neural Codec Language Model as a Versatile Speech Transformer [57.82364057872905]
SpeechX is a versatile speech generation model capable of zero-shot TTS and various speech transformation tasks.
Experimental results show SpeechX's efficacy in various tasks, including zero-shot TTS, noise suppression, target speaker extraction, speech removal, and speech editing with or without background noise.
arXiv Detail & Related papers (2023-08-14T01:01:19Z)
- Improving Prosody for Cross-Speaker Style Transfer by Semi-Supervised Style Extractor and Hierarchical Modeling in Speech Synthesis [37.65745551401636]
Cross-speaker style transfer in speech synthesis aims at transferring a style from a source speaker to speech synthesized in a target speaker's timbre.
In most previous methods, the synthesized fine-grained prosody features often represent the source speaker's average style.
A strength-controlled semi-supervised style extractor is proposed to disentangle the style from content and timbre.
arXiv Detail & Related papers (2023-03-14T08:52:58Z)
- ERNIE-SAT: Speech and Text Joint Pretraining for Cross-Lingual Multi-Speaker Text-to-Speech [58.93395189153713]
We extend the pretraining method for cross-lingual multi-speaker speech synthesis tasks.
We propose a speech-text joint pretraining framework, where we randomly mask the spectrogram and the phonemes.
Our model shows great improvements over speaker-embedding-based multi-speaker TTS methods.
arXiv Detail & Related papers (2022-11-07T13:35:16Z)
- GenerSpeech: Towards Style Transfer for Generalizable Out-Of-Domain Text-to-Speech Synthesis [68.42632589736881]
This paper proposes GenerSpeech, a text-to-speech model for high-fidelity zero-shot style transfer of out-of-domain (OOD) custom voices.
GenerSpeech decomposes the speech variation into the style-agnostic and style-specific parts by introducing two components.
Our evaluations on zero-shot style transfer demonstrate that GenerSpeech surpasses the state-of-the-art models in terms of audio quality and style similarity.
arXiv Detail & Related papers (2022-05-15T08:16:02Z)
- Meta-StyleSpeech : Multi-Speaker Adaptive Text-to-Speech Generation [63.561944239071615]
StyleSpeech is a new TTS model which synthesizes high-quality speech and adapts to new speakers.
With Style-Adaptive Layer Normalization (SALN), our model effectively synthesizes speech in the style of the target speaker even from a single speech audio sample.
We extend it to Meta-StyleSpeech by introducing two discriminators trained with style prototypes, and performing episodic training.
arXiv Detail & Related papers (2021-06-06T15:34:11Z)
- Multi Modal Adaptive Normalization for Audio to Video Generation [18.812696623555855]
We propose a multi-modal adaptive normalization (MAN) based architecture to synthesize a talking person video of arbitrary length from two inputs: an audio signal and a single image of a person.
The architecture uses multi-modal adaptive normalization, a keypoint heatmap predictor, an optical flow predictor and class activation map based layers to learn movements of expressive facial components.
arXiv Detail & Related papers (2020-12-14T07:39:45Z)