LaughNet: synthesizing laughter utterances from waveform silhouettes and a single laughter example
- URL: http://arxiv.org/abs/2110.04946v1
- Date: Mon, 11 Oct 2021 00:45:07 GMT
- Title: LaughNet: synthesizing laughter utterances from waveform silhouettes and a single laughter example
- Authors: Hieu-Thi Luong, Junichi Yamagishi
- Abstract summary: We propose a model called LaughNet for synthesizing laughter by using waveform silhouettes as inputs.
The results show that LaughNet can synthesize laughter utterances with moderate quality and retain the characteristics of the training example.
- Score: 55.10864476206503
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Emotional and controllable speech synthesis is a topic that has received much
attention. However, most studies have focused on improving the expressiveness and
controllability in the context of linguistic content, even though natural
verbal human communication is inseparable from spontaneous non-speech
expressions such as laughter, crying, or grunting. We propose a model called
LaughNet for synthesizing laughter by using waveform silhouettes as inputs. The
motivation is not simply synthesizing new laughter utterances but testing a
novel synthesis-control paradigm that uses an abstract representation of the
waveform. We conducted basic listening test experiments, and the results showed
that LaughNet can synthesize laughter utterances with moderate quality and
retain the characteristics of the training example. More importantly, the
generated waveforms have shapes similar to the input silhouettes. For future
work, we will test the same method on other types of human nonverbal
expressions and integrate it into more elaborate synthesis systems.
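The abstract describes the silhouette only as an abstract representation of the waveform's shape, so the following is a minimal sketch under the assumption that a silhouette can be approximated by a smoothed amplitude envelope; the function name and smoothing parameters are illustrative choices, not taken from the paper.
```python
# Minimal sketch: approximate a "waveform silhouette" with a smoothed
# amplitude envelope. This is an illustrative assumption, not the
# paper's exact silhouette definition.
import numpy as np
from scipy.io import wavfile
from scipy.signal import hilbert, medfilt

def waveform_silhouette(path: str, smooth_ms: float = 20.0) -> np.ndarray:
    """Return a coarse amplitude envelope ("silhouette") of a waveform."""
    sr, wav = wavfile.read(path)               # sample rate, raw samples
    wav = wav.astype(np.float32)
    if wav.ndim > 1:                           # mix stereo down to mono
        wav = wav.mean(axis=1)
    wav /= np.max(np.abs(wav)) + 1e-9          # normalize to [-1, 1]
    envelope = np.abs(hilbert(wav))            # analytic-signal magnitude
    win = int(sr * smooth_ms / 1000) | 1       # odd-length smoothing window
    return medfilt(envelope, kernel_size=win)  # median-smoothed outline
```
The sketch is only meant to make the control paradigm concrete: the synthesis model is conditioned on a coarse outline of the target waveform rather than on linguistic content.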
Related papers
- Making Flow-Matching-Based Zero-Shot Text-to-Speech Laugh as You Like [49.2096391012794]
ELaTE is a zero-shot TTS that can generate natural laughing speech of any speaker based on a short audio prompt.
We develop our model based on the foundation of conditional flow-matching-based zero-shot TTS.
We show that ELaTE can generate laughing speech with significantly higher quality and controllability compared to conventional models.
arXiv Detail & Related papers (2024-02-12T02:58:10Z)
- Visual-Aware Text-to-Speech [101.89332968344102]
We present a new visual-aware text-to-speech (VA-TTS) task to synthesize speech conditioned on both textual inputs and visual feedback of the listener in face-to-face communication.
We devise a baseline model to fuse phoneme linguistic information and listener visual signals for speech synthesis.
arXiv Detail & Related papers (2023-06-21T05:11:39Z)
- Diff-TTSG: Denoising probabilistic integrated speech and gesture synthesis [19.35266496960533]
We present the first diffusion-based probabilistic model, called Diff-TTSG, that jointly learns to synthesise speech and gestures together.
We describe a set of careful uni- and multi-modal subjective tests for evaluating integrated speech and gesture synthesis systems.
arXiv Detail & Related papers (2023-06-15T18:02:49Z)
- How Generative Spoken Language Modeling Encodes Noisy Speech: Investigation from Phonetics to Syntactics [33.070158866023]
Generative spoken language modeling (GSLM) involves using learned symbols derived from data rather than phonemes for speech analysis and synthesis.
This paper presents the findings of GSLM's encoding and decoding effectiveness at the spoken-language and speech levels.
arXiv Detail & Related papers (2023-06-01T14:07:19Z)
- Speech Synthesis with Mixed Emotions [77.05097999561298]
We propose a novel formulation that measures the relative difference between the speech samples of different emotions.
We then incorporate our formulation into a sequence-to-sequence emotional text-to-speech framework.
At run-time, we control the model to produce the desired emotion mixture by manually defining an emotion attribute vector.
arXiv Detail & Related papers (2022-08-11T15:45:58Z)
- Textless Speech Emotion Conversion using Decomposed and Discrete Representations [49.55101900501656]
We decompose speech into discrete and disentangled learned representations, consisting of content units, F0, speaker, and emotion.
First, we modify the speech content by translating the content units to a target emotion, and then predict the prosodic features based on these units.
Finally, the speech waveform is generated by feeding the predicted representations into a neural vocoder (a structural sketch of this pipeline follows this list).
arXiv Detail & Related papers (2021-11-14T18:16:42Z)
- EMOVIE: A Mandarin Emotion Speech Dataset with a Simple Emotional Text-to-Speech Model [56.75775793011719]
We introduce and publicly release a Mandarin emotion speech dataset of 9,724 samples with audio files and human-labeled emotion annotations.
Unlike models that need additional reference audio as input, our model can predict emotion labels from the input text alone and generate more expressive speech conditioned on the emotion embedding.
In the experiments, we first validate the effectiveness of the dataset with an emotion classification task, then train our model on the proposed dataset and conduct a series of subjective evaluations.
arXiv Detail & Related papers (2021-06-17T08:34:21Z)
- Laughter Synthesis: Combining Seq2seq modeling with Transfer Learning [6.514358246805895]
We propose an audio laughter synthesis system based on a sequence-to-sequence TTS synthesis system.
We leverage transfer learning by training a deep learning model to learn to generate both speech and laughs from annotations.
arXiv Detail & Related papers (2020-08-20T09:37:28Z)
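The Textless Speech Emotion Conversion entry above outlines a decompose-translate-resynthesize pipeline. The sketch below illustrates that structure only; DecomposedSpeech, translate_units, predict_prosody, and vocoder are hypothetical placeholders, not the authors' actual components.
```python
# Hypothetical structural sketch of the pipeline summarized in "Textless
# Speech Emotion Conversion using Decomposed and Discrete Representations".
# All names are illustrative placeholders, not the authors' code.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class DecomposedSpeech:
    content_units: List[int]   # discrete learned content tokens
    f0: List[float]            # fundamental-frequency contour
    speaker: int               # speaker identity code
    emotion: int               # source emotion label

def convert_emotion(speech: DecomposedSpeech,
                    target_emotion: int,
                    translate_units: Callable,
                    predict_prosody: Callable,
                    vocoder: Callable) -> List[float]:
    """Keep speaker and content, change only the expressed emotion."""
    # 1. Translate the content units toward the target emotion.
    units = translate_units(speech.content_units, target_emotion)
    # 2. Predict prosodic features (e.g. F0) from the translated units.
    f0 = predict_prosody(units, target_emotion, speech.speaker)
    # 3. Feed the predicted representations into a neural vocoder.
    return vocoder(units, f0, speech.speaker, target_emotion)
```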