Humane Speech Synthesis through Zero-Shot Emotion and Disfluency Generation
- URL: http://arxiv.org/abs/2404.01339v1
- Date: Sun, 31 Mar 2024 00:38:02 GMT
- Title: Humane Speech Synthesis through Zero-Shot Emotion and Disfluency Generation
- Authors: Rohan Chaudhury, Mihir Godbole, Aakash Garg, Jinsil Hwaryoung Seo
- Abstract summary: Modern conversational systems lack the emotional depth and disfluent characteristics of human interactions.
To address this shortcoming, we have designed an innovative speech synthesis pipeline.
Within this framework, a cutting-edge language model introduces both human-like emotion and disfluencies in a zero-shot setting.
- Score: 0.6964027823688135
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Contemporary conversational systems often present a significant limitation: their responses lack the emotional depth and disfluent characteristics of human interactions. This absence becomes particularly noticeable when users seek more personalized and empathetic interactions. Consequently, this makes them seem mechanical and less relatable to human users. Recognizing this gap, we embarked on a journey to humanize machine communication, to ensure AI systems not only comprehend but also resonate. To address this shortcoming, we have designed an innovative speech synthesis pipeline. Within this framework, a cutting-edge language model introduces both human-like emotion and disfluencies in a zero-shot setting. These intricacies are seamlessly integrated into the generated text by the language model during text generation, allowing the system to mirror human speech patterns better, promoting more intuitive and natural user interactions. These generated elements are then adeptly transformed into corresponding speech patterns and emotive sounds using a rule-based approach during the text-to-speech phase. Based on our experiments, our novel system produces synthesized speech that's almost indistinguishable from genuine human communication, making each interaction feel more personal and authentic.
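The pipeline in the abstract has two separable stages: a zero-shot LLM pass that rewrites a response with inline emotion and disfluency markers, and a rule-based pass that turns those markers into speech events at synthesis time. Below is a minimal sketch of that division of labor; the prompt wording, the tag set, and the tag-to-markup rules are illustrative assumptions, not the authors' exact design.

```python
# Stage 1 (zero-shot): an instruction like this would be sent to a general-purpose
# LLM so that fillers, repetitions, and emotion tags appear during text generation.
DISFLUENCY_PROMPT = (
    "Rewrite the reply below as natural spoken language. Insert fillers (um, uh), "
    "brief repetitions, and tags such as [pause], [laugh], [sigh] where a human "
    "speaker would produce them.\n\nReply: {reply}"
)

# Stage 2 (rule-based): each tag maps to a TTS-level event, e.g. an SSML break
# or a spliced-in emotive sound. The assets and markup here are hypothetical.
TAG_RULES = {
    "[pause]": '<break time="400ms"/>',
    "[laugh]": '<audio src="laugh.wav"/>',
    "[sigh]": '<audio src="sigh.wav"/>',
}

def to_speech_markup(tagged_text: str) -> str:
    """Apply the rule table; fillers like 'um' stay as ordinary spoken words."""
    for tag, markup in TAG_RULES.items():
        tagged_text = tagged_text.replace(tag, markup)
    return f"<speak>{tagged_text}</speak>"

# Example output of Stage 1, fed into Stage 2:
print(to_speech_markup("Well, um... [pause] that's, that's a great idea! [laugh]"))
```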
Related papers
- Moshi: a speech-text foundation model for real-time dialogue [78.88479749811376]
Current systems for spoken dialogue rely on pipelines of independent components, such as voice activity detection, speech recognition, and text-to-speech.
We show how Moshi can provide streaming speech recognition and text-to-speech.
Our resulting model is the first real-time full-duplex spoken large language model.
arXiv Detail & Related papers (2024-09-17T17:55:39Z)
- No More Mumbles: Enhancing Robot Intelligibility through Speech Adaptation [7.675340768192281]
We conduct a speech comprehension study involving 39 participants.
The experiment's primary outcome shows that spaces with good acoustic quality positively correlate with intelligibility and user experience.
We develop a convolutional neural network model to adapt the robot's speech parameters to different users and spaces.
arXiv Detail & Related papers (2024-05-15T21:28:55Z)
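As a rough sketch of the adaptation model in the entry above: a small convolutional network that maps an acoustic measurement of the current space to normalized speech parameters. The architecture, input features, and three-parameter output are assumptions for illustration, not the paper's exact network.

```python
import torch
import torch.nn as nn

class SpeechParamAdapter(nn.Module):
    """Maps an acoustic measurement of the space to TTS parameters."""
    def __init__(self, n_mels: int = 40, n_params: int = 3):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(32, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),                # pool over the time axis
        )
        self.head = nn.Linear(32, n_params)

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, n_mels, frames) from a short recording in the current room
        z = self.conv(mel).squeeze(-1)              # (batch, 32)
        return torch.sigmoid(self.head(z))          # gain/rate/pitch in [0, 1]

mel = torch.randn(1, 40, 200)                       # one ambient measurement
gain, rate, pitch = SpeechParamAdapter()(mel)[0]
```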
- Speech-Gesture GAN: Gesture Generation for Robots and Embodied Agents [5.244401764969407]
Embodied agents, in the form of virtual agents or social robots, are rapidly becoming more widespread.
We propose a novel framework that can generate sequences of joint angles from the speech text and speech audio utterances.
arXiv Detail & Related papers (2023-09-17T18:46:25Z)
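A toy version of the generator interface described in the Speech-Gesture GAN entry: per-frame audio features and a text embedding go in, a sequence of joint angles comes out. In the actual GAN this generator would be trained against a discriminator on real motion data; all dimensions and layers below are placeholders.

```python
import torch
import torch.nn as nn

class GestureGenerator(nn.Module):
    """Speech text embedding + per-frame audio features -> joint-angle sequence."""
    def __init__(self, audio_dim: int = 26, text_dim: int = 64, n_joints: int = 15):
        super().__init__()
        self.rnn = nn.GRU(audio_dim + text_dim, 128, batch_first=True)
        self.out = nn.Linear(128, n_joints)          # one angle per joint per frame

    def forward(self, audio_feats, text_emb):
        # audio_feats: (B, T, audio_dim); text_emb: (B, text_dim), repeated over time
        t = audio_feats.size(1)
        text_seq = text_emb.unsqueeze(1).expand(-1, t, -1)
        h, _ = self.rnn(torch.cat([audio_feats, text_seq], dim=-1))
        return self.out(h)                           # (B, T, n_joints)

angles = GestureGenerator()(torch.randn(2, 100, 26), torch.randn(2, 64))
```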
- ZET-Speech: Zero-shot adaptive Emotion-controllable Text-to-Speech Synthesis with Diffusion and Style-based Models [83.07390037152963]
ZET-Speech is a zero-shot adaptive emotion-controllable TTS model.
It allows users to synthesize any speaker's emotional speech using only a short, neutral speech segment and the target emotion label.
Experimental results demonstrate that ZET-Speech successfully synthesizes natural and emotional speech with the desired emotion for both seen and unseen speakers.
arXiv Detail & Related papers (2023-05-23T08:52:00Z)
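The usage implied by the ZET-Speech entry is compact enough to state as an interface: text, one short neutral reference clip for the speaker, and a target emotion label. The class below is a hypothetical stand-in to show that call shape only, not the released model.

```python
class ZetSpeechLike:
    """Hypothetical interface mirroring the zero-shot usage described above."""
    EMOTIONS = ("neutral", "happy", "sad", "angry")

    def synthesize(self, text: str, neutral_ref_wav: bytes, emotion: str) -> bytes:
        assert emotion in self.EMOTIONS, f"unsupported emotion: {emotion}"
        # 1. Derive speaker style from the short *neutral* reference clip.
        # 2. Condition a diffusion-based decoder on (text, style, emotion label).
        # 3. Vocode to a waveform. All three steps are elided in this sketch.
        raise NotImplementedError("interface sketch only")

# tts = ZetSpeechLike()
# wav = tts.synthesize("See you soon!", neutral_ref_wav=ref_bytes, emotion="happy")
```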
- Speech Synthesis with Mixed Emotions [77.05097999561298]
We propose a novel formulation that measures the relative difference between the speech samples of different emotions.
We then incorporate our formulation into a sequence-to-sequence emotional text-to-speech framework.
At run-time, we control the model to produce the desired emotion mixture by manually defining an emotion attribute vector.
arXiv Detail & Related papers (2022-08-11T15:45:58Z)
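One plausible reading of the run-time control described above, as a hedged sketch: the user supplies an emotion attribute vector, and the conditioning input becomes a convex combination of learned emotion embeddings. The embedding table and mixing rule here are illustrative, not the paper's relative-difference formulation.

```python
import numpy as np

EMOTIONS = ["neutral", "happy", "sad", "angry", "surprise"]
rng = np.random.default_rng(0)
emotion_table = rng.normal(size=(len(EMOTIONS), 16))   # stand-in learned embeddings

def mixed_emotion_embedding(attributes: dict) -> np.ndarray:
    """Blend emotion embeddings by a non-negative attribute vector,
    e.g. {'happy': 0.7, 'surprise': 0.3} for pleasantly surprised speech."""
    w = np.array([attributes.get(e, 0.0) for e in EMOTIONS])
    assert (w >= 0).all() and w.sum() > 0
    w = w / w.sum()                        # normalize to a convex combination
    return w @ emotion_table               # (16,) vector conditioning the decoder

cond = mixed_emotion_embedding({"happy": 0.7, "surprise": 0.3})
```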
- Whither the Priors for (Vocal) Interactivity? [6.709659274527638]
Speech-based communication is often cited as one of the most 'natural' ways in which humans and robots might interact.
Despite this, the resulting interactions are anything but 'natural'.
It is argued here that such communication failures are indicative of a deeper malaise.
arXiv Detail & Related papers (2022-03-16T12:06:46Z)
- Emotional Prosody Control for Speech Generation [7.66200737962746]
We propose a text-to-speech (TTS) system where a user can choose the emotion of the generated speech from a continuous and meaningful emotion space.
The proposed TTS system can generate speech from the text in any speaker's style, with fine control of emotion.
arXiv Detail & Related papers (2021-11-07T08:52:04Z)
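A sketch of what choosing an emotion from a continuous space can look like downstream, assuming a two-dimensional arousal-valence plane (a common choice; the paper's space and its learned mapping may differ) translated into coarse prosody controls.

```python
def prosody_from_av(arousal: float, valence: float) -> dict:
    """Map a point in the [-1, 1]^2 arousal-valence plane to coarse prosody
    multipliers. The linear mapping is illustrative, not learned."""
    assert -1 <= arousal <= 1 and -1 <= valence <= 1
    return {
        "rate":   1.0 + 0.3 * arousal,              # excited speech is faster
        "pitch":  1.0 + 0.2 * arousal + 0.1 * valence,
        "energy": 1.0 + 0.4 * arousal,
    }

print(prosody_from_av(arousal=0.8, valence=0.6))    # bright, excited
print(prosody_from_av(arousal=-0.5, valence=-0.7))  # subdued, sad
```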
- Exemplars-guided Empathetic Response Generation Controlled by the Elements of Human Communication [88.52901763928045]
We propose an approach that relies on exemplars to cue the generative model on fine stylistic properties that signal empathy to the interlocutor.
We empirically show that these approaches yield significant improvements in empathetic response quality in terms of both automated and human-evaluated metrics.
arXiv Detail & Related papers (2021-06-22T14:02:33Z)
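A minimal sketch of the exemplar idea in the entry above: retrieve the closest exemplar response from a small corpus and prepend it to the generation context so the model can pick up its stylistic, empathy-signaling cues. The bag-of-words similarity and prompt format are stand-ins for whatever retriever and generator the paper actually uses.

```python
import math
from collections import Counter

EXEMPLARS = [
    "Oh no, that sounds really hard. I'm so sorry you're going through this.",
    "Wow, congratulations! You must be thrilled; you earned it.",
    "Hmm, I see what you mean. Do you want to talk about it a bit more?",
]

def bow(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def build_prompt(user_turn: str) -> str:
    # The closest exemplar cues the generator's style without being copied verbatim.
    best = max(EXEMPLARS, key=lambda e: cosine(bow(e), bow(user_turn)))
    return f"Exemplar: {best}\nUser: {user_turn}\nEmpathetic reply:"

print(build_prompt("I failed my driving test again today"))
```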
- EMOVIE: A Mandarin Emotion Speech Dataset with a Simple Emotional Text-to-Speech Model [56.75775793011719]
We introduce and publicly release a Mandarin emotion speech dataset comprising 9,724 samples with audio files and human-labeled emotion annotations.
Unlike models that require additional reference audio as input, our model predicts emotion labels directly from the input text and generates more expressive speech conditioned on the emotion embedding.
In the experiment phase, we first validate the effectiveness of our dataset with an emotion classification task. Then we train our model on the proposed dataset and conduct a series of subjective evaluations.
arXiv Detail & Related papers (2021-06-17T08:34:21Z)
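The EMOVIE entry describes predicting an emotion label directly from the input text and conditioning synthesis on an emotion embedding. A schematic sketch with stand-in modules (the real classifier and TTS are of course not two tiny layers):

```python
import torch
import torch.nn as nn

EMOTIONS = ["neutral", "happy", "sad", "angry", "fear"]

class TextEmotionPredictor(nn.Module):
    """Stand-in for a text -> emotion-label classifier."""
    def __init__(self, vocab_size: int = 5000, dim: int = 64):
        super().__init__()
        self.embed = nn.EmbeddingBag(vocab_size, dim)   # averages token embeddings
        self.cls = nn.Linear(dim, len(EMOTIONS))

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        return self.cls(self.embed(token_ids))          # (B, n_emotions) logits

emotion_table = nn.Embedding(len(EMOTIONS), 32)         # embedding consumed by the TTS

tokens = torch.randint(0, 5000, (1, 12))                # a tokenized input sentence
label = TextEmotionPredictor()(tokens).argmax(dim=-1)   # predicted emotion id
conditioning = emotion_table(label)                     # (1, 32) vector for the decoder
```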
- Emotion-aware Chat Machine: Automatic Emotional Response Generation for Human-like Emotional Interaction [55.47134146639492]
This article proposes a unified end-to-end neural architecture capable of simultaneously encoding the semantics and the emotions in a post.
Experiments on real-world data demonstrate that the proposed method outperforms the state-of-the-art methods in terms of both content coherence and emotion appropriateness.
arXiv Detail & Related papers (2021-06-06T06:26:15Z)
- Generating Emotionally Aligned Responses in Dialogues using Affect Control Theory [15.848210524718219]
Affect Control Theory (ACT) is a socio-mathematical model of emotions for human-human interactions.
We investigate how ACT can be used to develop affect-aware neural conversational agents.
arXiv Detail & Related papers (2020-03-07T19:31:08Z)
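Affect Control Theory, as used in the entry above, represents identities and behaviors as three-number Evaluation-Potency-Activity (EPA) profiles and prefers actions that minimize "deflection": the squared distance between culturally fundamental sentiments and the transient impressions an event creates. A toy deflection computation (the EPA values are invented for illustration):

```python
import numpy as np

def deflection(fundamental: np.ndarray, transient: np.ndarray) -> float:
    """Sum of squared EPA differences over the actor, behavior, object roles."""
    return float(((fundamental - transient) ** 2).sum())

# Invented EPA (Evaluation, Potency, Activity) profiles on the usual ~[-4.3, 4.3] scale.
fundamental = np.array([
    [1.5, 1.2, 0.8],    # actor: "friend"
    [2.0, 1.0, 1.1],    # behavior: "comfort"
    [1.5, 1.2, 0.8],    # object: another "friend"
])
transient = fundamental + np.array([[0.2, -0.1, 0.0],
                                    [-0.4, 0.3, 0.2],
                                    [0.1, 0.0, -0.2]])
print(deflection(fundamental, transient))  # low deflection = affectively unsurprising
```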
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.