Laughter Synthesis: Combining Seq2seq modeling with Transfer Learning
- URL: http://arxiv.org/abs/2008.09483v1
- Date: Thu, 20 Aug 2020 09:37:28 GMT
- Title: Laughter Synthesis: Combining Seq2seq modeling with Transfer Learning
- Authors: Noé Tits, Kevin El Haddad, Thierry Dutoit
- Abstract summary: We propose an audio laughter synthesis system based on a sequence-to-sequence TTS synthesis system.
We leverage transfer learning by training a deep learning model to learn to generate both speech and laughs from annotations.
- Score: 6.514358246805895
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite the growing interest in expressive speech synthesis, the
synthesis of nonverbal expressions remains an under-explored area. In this
paper we propose an audio laughter synthesis system based on a
sequence-to-sequence TTS synthesis system. We leverage transfer learning by
training a deep learning model to generate both speech and laughs from
annotations. We evaluate our model with a listening test, comparing its
performance to an HMM-based laughter synthesis system, and find that it
reaches higher perceived naturalness. Our solution is a first step towards a
TTS system able to synthesize speech with control over the amusement level
through laughter integration.
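The abstract gives no implementation details, but the transfer-learning recipe it describes (train a model on plain speech first, then continue training the same weights on data whose annotations include laughter) can be mimicked in miniature. The sketch below is purely illustrative: the one-parameter "model", the toy data, and the two target relations are all hypothetical stand-ins for the seq2seq TTS model and its speech/laughter corpora.

```python
# Toy illustration of two-stage transfer learning: phase 1 "pretrains" on
# speech-only targets, phase 2 fine-tunes the SAME weight on a mixed
# speech+laughter objective instead of training from scratch. The model is a
# single scalar weight; data, loss, and learning rate are hypothetical.

def train(weight, data, lr=0.1, steps=200):
    """Minimise mean squared error of y ~ weight * x by gradient descent."""
    for _ in range(steps):
        grad = sum(2 * (weight * x - y) * x for x, y in data) / len(data)
        weight -= lr * grad
    return weight

# Phase 1 (pretraining): "speech" data drawn from the relation y = 2x.
speech_data = [(x, 2.0 * x) for x in (0.5, 1.0, 1.5, 2.0)]
w_pretrained = train(0.0, speech_data)

# Phase 2 (fine-tuning): "speech + laughter" data from y = 2.5x, starting
# from the pretrained weight rather than from zero, so far fewer steps
# are needed to adapt.
laughter_data = [(x, 2.5 * x) for x in (0.5, 1.0, 1.5, 2.0)]
w_finetuned = train(w_pretrained, laughter_data, steps=50)

print(round(w_pretrained, 2), round(w_finetuned, 2))
```

The point of the second stage is that adaptation starts from an already useful representation, which is why the fine-tuning loop above converges in a fraction of the pretraining steps.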
Related papers
- Making Flow-Matching-Based Zero-Shot Text-to-Speech Laugh as You Like [49.2096391012794]
ELaTE is a zero-shot TTS that can generate natural laughing speech of any speaker based on a short audio prompt.
We develop our model based on the foundation of conditional flow-matching-based zero-shot TTS.
We show that ELaTE can generate laughing speech with significantly higher quality and controllability compared to conventional models.
arXiv Detail & Related papers (2024-02-12T02:58:10Z)
- Improved Child Text-to-Speech Synthesis through Fastpitch-based Transfer Learning [3.5032870024762386]
This paper presents a novel approach that leverages the Fastpitch text-to-speech (TTS) model for generating high-quality synthetic child speech.
The approach involved finetuning a multi-speaker TTS model to work with child speech.
We conducted an objective assessment that showed a significant correlation between real and synthetic child voices.
arXiv Detail & Related papers (2023-11-07T19:31:44Z)
- Diff-TTSG: Denoising probabilistic integrated speech and gesture synthesis [19.35266496960533]
We present the first diffusion-based probabilistic model, called Diff-TTSG, that jointly learns to synthesise speech and gestures together.
We describe a set of careful uni- and multi-modal subjective tests for evaluating integrated speech and gesture synthesis systems.
arXiv Detail & Related papers (2023-06-15T18:02:49Z)
- How Generative Spoken Language Modeling Encodes Noisy Speech: Investigation from Phonetics to Syntactics [33.070158866023]
Generative spoken language modeling (GSLM) uses learned symbols derived from data, rather than phonemes, for speech analysis and synthesis.
This paper presents findings on GSLM's encoding and decoding effectiveness at the spoken-language and speech levels.
arXiv Detail & Related papers (2023-06-01T14:07:19Z)
- A Vector Quantized Approach for Text to Speech Synthesis on Real-World Spontaneous Speech [94.64927912924087]
We train TTS systems using real-world speech from YouTube and podcasts.
The proposed text-to-speech architecture is designed for multiple code generation and monotonic alignment.
We show that it outperforms existing TTS systems in several objective and subjective measures.
arXiv Detail & Related papers (2023-02-08T17:34:32Z)
- Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers [92.55131711064935]
We introduce a language modeling approach for text to speech synthesis (TTS)
Specifically, we train a neural language model (called VALL-E) using discrete codes derived from an off-the-shelf neural audio codec model.
VALL-E exhibits in-context learning capabilities and can be used to synthesize high-quality personalized speech.
arXiv Detail & Related papers (2023-01-05T15:37:15Z)
- LaughNet: synthesizing laughter utterances from waveform silhouettes and a single laughter example [55.10864476206503]
We propose a model called LaughNet for synthesizing laughter by using waveform silhouettes as inputs.
The results show that LaughNet can synthesize laughter utterances with moderate quality and retain the characteristics of the training example.
arXiv Detail & Related papers (2021-10-11T00:45:07Z)
- On the Interplay Between Sparsity, Naturalness, Intelligibility, and Prosody in Speech Synthesis [102.80458458550999]
We investigate the tradeoffs between sparsity and its subsequent effects on synthetic speech.
Our findings suggest that not only are end-to-end TTS models highly prunable, but also, perhaps surprisingly, pruned TTS models can produce synthetic speech with equal or higher naturalness and intelligibility.
arXiv Detail & Related papers (2021-10-04T02:03:28Z)
- Integrated Speech and Gesture Synthesis [26.267738299876314]
Text-to-speech and co-speech gesture synthesis have until now been treated as separate areas by two different research communities.
We propose to synthesize the two modalities in a single model, a new problem we call integrated speech and gesture synthesis (ISG).
The model achieves this with faster synthesis time and a greatly reduced parameter count compared to the pipeline system.
arXiv Detail & Related papers (2021-08-25T19:04:00Z)
- EMOVIE: A Mandarin Emotion Speech Dataset with a Simple Emotional Text-to-Speech Model [56.75775793011719]
We introduce and publicly release a Mandarin emotion speech dataset comprising 9,724 samples with audio files and human-labeled emotion annotations.
Unlike models that need additional reference audio as input, our model can predict emotion labels directly from the input text and generate more expressive speech conditioned on the emotion embedding.
In the experiment phase, we first validate the effectiveness of our dataset by an emotion classification task. Then we train our model on the proposed dataset and conduct a series of subjective evaluations.
arXiv Detail & Related papers (2021-06-17T08:34:21Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.