ComedicSpeech: Text To Speech For Stand-up Comedies in Low-Resource
Scenarios
- URL: http://arxiv.org/abs/2305.12200v1
- Date: Sat, 20 May 2023 14:24:45 GMT
- Title: ComedicSpeech: Text To Speech For Stand-up Comedies in Low-Resource
Scenarios
- Authors: Yuyue Wang, Huan Xiao, Yihan Wu, Ruihua Song
- Abstract summary: We develop ComedicSpeech, a TTS system tailored for stand-up comedy synthesis in low-resource scenarios.
We extract a prosody representation with a prosody encoder and condition the TTS model on it in a flexible way.
Experiments show that ComedicSpeech achieves better expressiveness than baselines with only ten minutes of training data per comedian.
- Score: 5.06044403956839
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text to Speech (TTS) models can generate natural and high-quality speech,
but they are not expressive enough when synthesizing speech that demands dramatic
expressiveness, such as stand-up comedy. Because comedians have diverse
personal speech styles, including personal prosody, rhythm, and fillers, the task
requires real-world datasets and strong speech style modeling capabilities,
which brings challenges. In this paper, we construct a new dataset and develop
ComedicSpeech, a TTS system tailored for stand-up comedy synthesis in
low-resource scenarios. First, we extract a prosody representation with a prosody
encoder and condition the TTS model on it in a flexible way. Second, we enhance
personal rhythm modeling with a conditional duration predictor. Third, we
model personal fillers by introducing comedian-related special tokens.
Experiments show that ComedicSpeech achieves better expressiveness than
baselines with only ten minutes of training data for each comedian. The audio
samples are available at https://xh621.github.io/stand-up-comedy-demo/
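The third technique above, modeling personal fillers with comedian-related special tokens, can be illustrated with a small sketch. The code below extends a text-frontend vocabulary with one special token per (comedian, filler) pair; all names, the token format, and the vocabulary layout are illustrative assumptions, not the authors' implementation.

```python
# Sketch: comedian-specific special tokens for filler modeling.
# The vocabulary structure and token naming are hypothetical; the paper
# only states that comedian-related special tokens are introduced.

BASE_VOCAB = {"<pad>": 0, "<unk>": 1}

def build_vocab(comedians, fillers_per_comedian):
    """Extend a base vocabulary with one token per (comedian, filler) pair,
    e.g. "<A:uh>", so a TTS model can learn personal filler styles."""
    vocab = dict(BASE_VOCAB)
    for comedian in comedians:
        for filler in fillers_per_comedian.get(comedian, []):
            vocab[f"<{comedian}:{filler}>"] = len(vocab)
    return vocab

def encode(tokens, vocab):
    """Map a pre-split token sequence to IDs, falling back to <unk>."""
    return [vocab.get(tok, vocab["<unk>"]) for tok in tokens]

vocab = build_vocab(["A", "B"], {"A": ["uh", "like"], "B": ["um"]})
ids = encode(["hello", "<A:uh>", "world"], vocab)
```

During training, filler occurrences in each comedian's transcripts would be replaced by that comedian's special tokens, letting the model associate each token with its personal acoustic realization.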
Related papers
- VoiceTextBlender: Augmenting Large Language Models with Speech Capabilities via Single-Stage Joint Speech-Text Supervised Fine-Tuning [64.56272011710735]
We propose a novel single-stage joint speech-text SFT approach based on low-rank adaptation (LoRA) of the large language model (LLM) backbone.
Compared to previous SpeechLMs with 7B or 13B parameters, our 3B model demonstrates superior performance across various speech benchmarks.
arXiv Detail & Related papers (2024-10-23T00:36:06Z) - PauseSpeech: Natural Speech Synthesis via Pre-trained Language Model and
Pause-based Prosody Modeling [25.966328901566815]
We propose PauseSpeech, a speech synthesis system with a pre-trained language model and pause-based prosody modeling.
Experimental results show PauseSpeech outperforms previous models in terms of naturalness.
arXiv Detail & Related papers (2023-06-13T01:36:55Z) - NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot
Speech and Singing Synthesizers [90.83782600932567]
We develop NaturalSpeech 2, a TTS system that leverages a neural audio codec with residual vector quantizers to obtain quantized latent vectors.
We scale NaturalSpeech 2 to large-scale datasets with 44K hours of speech and singing data and evaluate its voice quality on unseen speakers.
NaturalSpeech 2 outperforms previous TTS systems by a large margin in terms of prosody/timbre similarity, synthesis robustness, and voice quality in a zero-shot setting.
arXiv Detail & Related papers (2023-04-18T16:31:59Z) - Imaginary Voice: Face-styled Diffusion Model for Text-to-Speech [33.01930038988336]
We introduce a face-styled diffusion text-to-speech (TTS) model within a unified framework, called Face-TTS.
We jointly train cross-modal biometric and TTS models to preserve speaker identity between face images and generated speech segments.
Since the biometric information is extracted directly from the face image, our method does not require extra fine-tuning steps to generate speech from unseen and unheard speakers.
arXiv Detail & Related papers (2023-02-27T11:59:28Z) - The Ability of Self-Supervised Speech Models for Audio Representations [53.19715501273934]
Self-supervised learning (SSL) speech models have achieved unprecedented success in speech representation learning.
We conduct extensive experiments on abundant speech and non-speech audio datasets to evaluate the representation ability of state-of-the-art SSL speech models.
Results show that SSL speech models could extract meaningful features of a wide range of non-speech audio, while they may also fail on certain types of datasets.
arXiv Detail & Related papers (2022-09-26T15:21:06Z) - GenerSpeech: Towards Style Transfer for Generalizable Out-Of-Domain
Text-to-Speech Synthesis [68.42632589736881]
This paper proposes GenerSpeech, a text-to-speech model towards high-fidelity zero-shot style transfer of OOD custom voice.
GenerSpeech decomposes the speech variation into the style-agnostic and style-specific parts by introducing two components.
Our evaluations on zero-shot style transfer demonstrate that GenerSpeech surpasses the state-of-the-art models in terms of audio quality and style similarity.
arXiv Detail & Related papers (2022-05-15T08:16:02Z) - AnyoneNet: Synchronized Speech and Talking Head Generation for Arbitrary
Person [21.126759304401627]
We present an automatic method to generate synchronized speech and talking-head videos from text and a single face image of an arbitrary person.
Experiments demonstrate that the proposed method is able to generate synchronized speech and talking head videos for arbitrary persons and non-persons.
arXiv Detail & Related papers (2021-08-09T19:58:38Z) - AdaSpeech 3: Adaptive Text to Speech for Spontaneous Style [111.89762723159677]
We develop AdaSpeech 3, an adaptive TTS system that fine-tunes a well-trained reading-style TTS model for spontaneous-style speech.
AdaSpeech 3 synthesizes speech with natural filled pauses (FPs) and rhythms in spontaneous styles, and achieves much better MOS and SMOS scores than previous adaptive TTS systems.
arXiv Detail & Related papers (2021-07-06T10:40:45Z) - EMOVIE: A Mandarin Emotion Speech Dataset with a Simple Emotional
Text-to-Speech Model [56.75775793011719]
We introduce and publicly release a Mandarin emotion speech dataset comprising 9,724 samples with audio files and human-labeled emotion annotations.
Unlike models that need additional reference audio as input, our model can predict emotion labels from the input text alone and generate more expressive speech conditioned on the emotion embedding.
In the experiment phase, we first validate the effectiveness of our dataset by an emotion classification task. Then we train our model on the proposed dataset and conduct a series of subjective evaluations.
arXiv Detail & Related papers (2021-06-17T08:34:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences arising from its use.