Style Description based Text-to-Speech with Conditional Prosodic Layer Normalization based Diffusion GAN
- URL: http://arxiv.org/abs/2310.18169v1
- Date: Fri, 27 Oct 2023 14:28:41 GMT
- Title: Style Description based Text-to-Speech with Conditional Prosodic Layer Normalization based Diffusion GAN
- Authors: Neeraj Kumar and Ankur Narang and Brejesh Lall
- Abstract summary: We present a Diffusion GAN based approach (Prosodic Diff-TTS) that generates high-fidelity speech from a style description and content text as input, using only 4 denoising steps.
We demonstrate the efficacy of our proposed architecture on the multi-speaker LibriTTS and PromptSpeech datasets, using multiple quantitative metrics that measure generation accuracy and MOS.
- Score: 17.876323494898536
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we present a Diffusion GAN based approach (Prosodic Diff-TTS)
that generates high-fidelity speech from a style description and content text as
input, using only 4 denoising steps. It leverages a novel conditional prosodic layer
normalization to incorporate the style embeddings into the multi-head attention
based phoneme encoder and the mel-spectrogram decoder based generator architecture
used to generate the speech. The style embedding is generated by fine-tuning a
pretrained BERT model on auxiliary tasks such as pitch, speaking speed,
emotion, and gender classification. We demonstrate the efficacy of our proposed
architecture on the multi-speaker LibriTTS and PromptSpeech datasets, using
multiple quantitative metrics that measure generation accuracy and MOS.
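To make the core mechanism concrete, the sketch below shows a conditional layer normalization module of the kind the abstract describes, written in PyTorch. It is an illustration only: the class name ConditionalLayerNorm and its interface are hypothetical, and the paper's exact prosodic variant, its placement inside the multi-head attention encoder and decoder, and the diffusion GAN training with the 4-step denoising schedule are not reproduced here.

```python
import torch
import torch.nn as nn


class ConditionalLayerNorm(nn.Module):
    """Layer norm whose scale and shift are predicted from a style embedding.

    Minimal sketch (assumed interface, not the paper's code): instead of fixed
    affine parameters, two linear projections map the style embedding (e.g.,
    produced by a BERT encoder fine-tuned on pitch, speaking speed, emotion,
    and gender classification) to per-channel scale and bias, so the style
    modulates every normalized hidden state in the encoder/decoder blocks.
    """

    def __init__(self, hidden_dim: int, style_dim: int):
        super().__init__()
        # Plain LayerNorm without its own learnable affine parameters.
        self.norm = nn.LayerNorm(hidden_dim, elementwise_affine=False)
        self.to_scale = nn.Linear(style_dim, hidden_dim)
        self.to_bias = nn.Linear(style_dim, hidden_dim)

    def forward(self, x: torch.Tensor, style_emb: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, hidden_dim), style_emb: (batch, style_dim)
        scale = self.to_scale(style_emb).unsqueeze(1)  # (batch, 1, hidden_dim)
        bias = self.to_bias(style_emb).unsqueeze(1)
        return self.norm(x) * (1.0 + scale) + bias


# Usage: modulate phoneme-encoder hidden states with a style embedding.
hidden = torch.randn(2, 50, 256)   # dummy phoneme-encoder output
style = torch.randn(2, 128)        # dummy style embedding
cln = ConditionalLayerNorm(hidden_dim=256, style_dim=128)
out = cln(hidden, style)           # shape: (2, 50, 256)
```

In the full architecture, layers of this kind would replace the standard layer normalization inside each encoder and decoder block, with the style embedding supplied by the fine-tuned BERT style encoder.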
Related papers
- VQ-CTAP: Cross-Modal Fine-Grained Sequence Representation Learning for Speech Processing [81.32613443072441]
For tasks such as text-to-speech (TTS), voice conversion (VC), and automatic speech recognition (ASR), a cross-modal fine-grained (frame-level) sequence representation is desired.
We propose a method called Quantized Contrastive Token-Acoustic Pre-training (VQ-CTAP), which uses the cross-modal sequence transcoder to bring text and speech into a joint space.
arXiv Detail & Related papers (2024-08-11T12:24:23Z) - StyleSpeech: Self-supervised Style Enhancing with VQ-VAE-based
Pre-training for Expressive Audiobook Speech Synthesis [63.019962126807116]
The expressive quality of synthesized speech for audiobooks is limited by a generalized model architecture and an unbalanced style distribution.
We propose a self-supervised style enhancing method with VQ-VAE-based pre-training for expressive audiobook speech synthesis.
arXiv Detail & Related papers (2023-12-19T14:13:26Z) - Text-Driven Foley Sound Generation With Latent Diffusion Model [33.4636070590045]
Foley sound generation aims to synthesise the background sound for multimedia content.
We propose a diffusion model based system for Foley sound generation with text conditions.
arXiv Detail & Related papers (2023-06-17T14:16:24Z) - Zero-shot text-to-speech synthesis conditioned using self-supervised
speech representation model [13.572330725278066]
A novel point of the proposed method is the direct use of the SSL model to obtain embedding vectors from speech representations trained with a large amount of data.
The disentangled embeddings will enable us to achieve better reproduction performance for unseen speakers and rhythm transfer conditioned on different speech samples.
arXiv Detail & Related papers (2023-04-24T10:15:58Z) - DiffVoice: Text-to-Speech with Latent Diffusion [18.150627638754923]
We present DiffVoice, a novel text-to-speech model based on latent diffusion.
Subjective evaluations on LJSpeech and LibriTTS datasets demonstrate that our method beats the best publicly available systems in naturalness.
arXiv Detail & Related papers (2023-04-23T21:05:33Z) - Controllable speech synthesis by learning discrete phoneme-level
prosodic representations [53.926969174260705]
We present a novel method for phoneme-level prosody control of F0 and duration using intuitive discrete labels.
We propose an unsupervised prosodic clustering process which is used to discretize phoneme-level F0 and duration features from a multispeaker speech dataset.
arXiv Detail & Related papers (2022-11-29T15:43:36Z) - Speaker Embedding-aware Neural Diarization: a Novel Framework for
Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
arXiv Detail & Related papers (2022-03-18T06:40:39Z) - Speaker Embedding-aware Neural Diarization for Flexible Number of
Speakers with Textual Information [55.75018546938499]
We propose the speaker embedding-aware neural diarization (SEND) method, which predicts the power set encoded labels.
Our method achieves a lower diarization error rate than target-speaker voice activity detection.
arXiv Detail & Related papers (2021-11-28T12:51:04Z) - Any-to-Many Voice Conversion with Location-Relative Sequence-to-Sequence
Modeling [61.351967629600594]
This paper proposes an any-to-many location-relative, sequence-to-sequence (seq2seq), non-parallel voice conversion approach.
In this approach, we combine a bottle-neck feature extractor (BNE) with a seq2seq synthesis module.
Objective and subjective evaluations show that the proposed any-to-many approach has superior voice conversion performance in terms of both naturalness and speaker similarity.
arXiv Detail & Related papers (2020-09-06T13:01:06Z) - End-to-End Whisper to Natural Speech Conversion using Modified
Transformer Network [0.8399688944263843]
We introduce whisper-to-natural-speech conversion using a sequence-to-sequence approach.
We investigate different features such as Mel frequency cepstral coefficients and smoothed spectral features.
The proposed networks are trained end-to-end using a supervised approach for feature-to-feature transformation.
arXiv Detail & Related papers (2020-04-20T14:47:46Z)