Making Flow-Matching-Based Zero-Shot Text-to-Speech Laugh as You Like
- URL: http://arxiv.org/abs/2402.07383v2
- Date: Mon, 4 Mar 2024 19:15:29 GMT
- Title: Making Flow-Matching-Based Zero-Shot Text-to-Speech Laugh as You Like
- Authors: Naoyuki Kanda, Xiaofei Wang, Sefik Emre Eskimez, Manthan Thakker,
Hemin Yang, Zirun Zhu, Min Tang, Canrun Li, Chung-Hsien Tsai, Zhen Xiao,
Yufei Xia, Jinzhu Li, Yanqing Liu, Sheng Zhao, Michael Zeng
- Abstract summary: ELaTE is a zero-shot TTS system that can generate natural laughing speech for any speaker from a short audio prompt.
We develop our model on the foundation of conditional flow-matching-based zero-shot TTS.
We show that ELaTE generates laughing speech with significantly higher quality and controllability than conventional models.
- Score: 49.2096391012794
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Laughter is one of the most expressive and natural aspects of human speech,
conveying emotions, social cues, and humor. However, most text-to-speech (TTS)
systems lack the ability to produce realistic and appropriate laughter sounds,
limiting their applications and user experience. While prior works have
generated natural laughter, they fell short in controlling the timing and
variety of the laughter they produce. In this work, we propose ELaTE, a
zero-shot TTS system that can generate natural laughing speech for any speaker
from a short audio prompt, with precise control of laughter timing and
expression. Specifically, ELaTE takes an audio prompt to mimic the voice
characteristics, a text prompt to specify the contents of the generated
speech, and a laughter-control input, which can be either the start and end
times of the laughter or an additional audio prompt containing laughter to be
mimicked. We develop our model on the foundation of conditional
flow-matching-based zero-shot TTS and fine-tune it with a frame-level
representation from a laughter detector as additional conditioning.
With a simple scheme to mix small-scale laughter-conditioned data with
large-scale pre-training data, we demonstrate that a pre-trained zero-shot TTS
model can be readily fine-tuned to generate natural laughter with precise
controllability, without degrading the quality of the pre-trained model.
Through objective and subjective evaluations, we show that ELaTE can
generate laughing speech with significantly higher quality and controllability
compared to conventional models. See https://aka.ms/elate/ for demo samples.
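The abstract names two concrete mechanisms: a conditional flow-matching backbone, and frame-level output of a laughter detector injected as additional conditioning during fine-tuning. Below is a minimal PyTorch sketch of how such conditioning could be wired into a flow-matching training step. It is an illustration under stated assumptions, not ELaTE's implementation: the MLP estimator, the dimensions, and the names (LaughterConditionedEstimator, cfm_loss, laughter_frames_from_span) are hypothetical stand-ins for the paper's transformer-based model and detector embeddings.

```python
# Minimal sketch (NOT the authors' code) of frame-level laughter conditioning
# on top of conditional flow matching. All names, shapes, and the MLP
# backbone are illustrative assumptions.
import torch
import torch.nn as nn

class LaughterConditionedEstimator(nn.Module):
    """Vector-field estimator taking noisy mel frames, the flow time t,
    and a per-frame laughter embedding (hypothetical)."""
    def __init__(self, mel_dim=80, laugh_dim=32, hidden=256):
        super().__init__()
        self.laugh_proj = nn.Linear(laugh_dim, hidden)
        # Stand-in for ELaTE's transformer backbone.
        self.backbone = nn.Sequential(
            nn.Linear(mel_dim + hidden + 1, hidden),
            nn.ReLU(),
            nn.Linear(hidden, mel_dim),
        )

    def forward(self, x_t, t, laugh_frames):
        # x_t: (B, T, mel_dim) sample on the flow path at time t
        # t: (B,) flow time in [0, 1]
        # laugh_frames: (B, T, laugh_dim) frame-level laughter features
        cond = self.laugh_proj(laugh_frames)
        t_feat = t[:, None, None].expand(-1, x_t.size(1), 1)
        return self.backbone(torch.cat([x_t, cond, t_feat], dim=-1))

def laughter_frames_from_span(num_frames, start_s, end_s, fps=100, dim=32):
    """Turn a (start, end) laughter span in seconds into a frame-level
    conditioning sequence; a real system would use detector posteriors."""
    frames = torch.zeros(num_frames, dim)
    frames[int(start_s * fps):int(end_s * fps), 0] = 1.0
    return frames

def cfm_loss(model, x1, laugh_frames):
    """One conditional flow-matching step: regress the estimator toward
    the straight-line vector field x1 - x0 between noise and data."""
    x0 = torch.randn_like(x1)              # noise endpoint
    t = torch.rand(x1.size(0))             # random flow time per example
    x_t = (1 - t[:, None, None]) * x0 + t[:, None, None] * x1
    v_pred = model(x_t, t, laugh_frames)
    return ((v_pred - (x1 - x0)) ** 2).mean()

# Usage: a fine-tuning step on mels with a laughter span from 0.5 s to 1.2 s.
model = LaughterConditionedEstimator()
x1 = torch.randn(4, 200, 80)               # stand-in target mel frames
laugh = laughter_frames_from_span(200, 0.5, 1.2).expand(4, -1, -1)
cfm_loss(model, x1, laugh).backward()
```

Under the abstract's mixing scheme, each fine-tuning batch would presumably combine examples from the small laughter-conditioned set with examples from the large pre-training set, with the laughter features zeroed for the latter; the actual mixing ratio and detector are details of the paper not reproduced here.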
Related papers
- Laugh Now Cry Later: Controlling Time-Varying Emotional States of Flow-Matching-Based Zero-Shot Text-to-Speech [51.486112860259595]
EmoCtrl-TTS is an emotion-controllable zero-shot TTS that can generate highly emotional speech with non-verbal vocalizations (NVs) for any speaker.
To achieve high-quality emotional speech generation, EmoCtrl-TTS is trained using more than 27,000 hours of expressive data curated based on pseudo-labeling.
arXiv Detail & Related papers (2024-07-17T00:54:15Z)
- LaughTalk: Expressive 3D Talking Head Generation with Laughter [15.60843963655039]
We introduce a novel task to generate 3D talking heads capable of both articulate speech and authentic laughter.
Our newly curated dataset comprises 2D laughing videos paired with pseudo-annotated and human-validated 3D FLAME parameters.
Our method performs favorably compared to existing approaches in both talking head generation and expressing laughter signals.
arXiv Detail & Related papers (2023-11-02T05:04:33Z)
- ZET-Speech: Zero-shot adaptive Emotion-controllable Text-to-Speech Synthesis with Diffusion and Style-based Models [83.07390037152963]
ZET-Speech is a zero-shot adaptive emotion-controllable TTS model.
It allows users to synthesize any speaker's emotional speech using only a short, neutral speech segment and the target emotion label.
Experimental results demonstrate that ZET-Speech successfully synthesizes natural and emotional speech with the desired emotion for both seen and unseen speakers.
arXiv Detail & Related papers (2023-05-23T08:52:00Z)
- ComedicSpeech: Text To Speech For Stand-up Comedies in Low-Resource Scenarios [5.06044403956839]
We develop ComedicSpeech, a TTS system tailored for stand-up comedy synthesis in low-resource scenarios.
We extract a prosody representation with a prosody encoder and condition the TTS model on it in a flexible way.
Experiments show that ComedicSpeech achieves better expressiveness than baselines with only ten minutes of training data per comedian.
arXiv Detail & Related papers (2023-05-20T14:24:45Z)
- LaughNet: synthesizing laughter utterances from waveform silhouettes and a single laughter example [55.10864476206503]
We propose a model called LaughNet for synthesizing laughter by using waveform silhouettes as inputs.
The results show that LaughNet can synthesize laughter utterances with moderate quality and retain the characteristics of the training example.
arXiv Detail & Related papers (2021-10-11T00:45:07Z)
- Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration [62.75234183218897]
We propose a one-stage context-aware framework to generate natural and coherent target speech without any training data of the speaker.
We generate the mel-spectrogram of the edited speech with a transformer-based decoder.
It outperforms a recent zero-shot TTS engine by a large margin.
arXiv Detail & Related papers (2021-09-12T04:17:53Z)
- AdaSpeech 3: Adaptive Text to Speech for Spontaneous Style [111.89762723159677]
We develop AdaSpeech 3, an adaptive TTS system that fine-tunes a well-trained reading-style TTS model for spontaneous-style speech.
AdaSpeech 3 synthesizes speech with natural filled pauses (FP) and rhythms in spontaneous styles, and achieves much better MOS and SMOS scores than previous adaptive TTS systems.
arXiv Detail & Related papers (2021-07-06T10:40:45Z)
- Laughter Synthesis: Combining Seq2seq modeling with Transfer Learning [6.514358246805895]
We propose an audio laughter synthesis system based on a sequence-to-sequence TTS synthesis system.
We leverage transfer learning by training a deep learning model to generate both speech and laughter from annotations.
arXiv Detail & Related papers (2020-08-20T09:37:28Z)