EmoNews: A Spoken Dialogue System for Expressive News Conversations
- URL: http://arxiv.org/abs/2506.13894v1
- Date: Mon, 16 Jun 2025 18:16:04 GMT
- Title: EmoNews: A Spoken Dialogue System for Expressive News Conversations
- Authors: Ryuki Matsuura, Shikhar Bharadwaj, Jiarui Liu, Dhatchi Kunde Govindarajan,
- Abstract summary: We develop a task-oriented spoken dialogue system (SDS) that regulates emotional speech based on contextual cues. We also propose a subjective evaluation scale for emotional SDSs and judge the emotion regulation performance of the proposed and baseline systems.
- Score: 2.6036734225145133
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We develop a task-oriented spoken dialogue system (SDS) that regulates emotional speech based on contextual cues to enable more empathetic news conversations. Despite advancements in emotional text-to-speech (TTS) techniques, task-oriented emotional SDSs remain underexplored due to the compartmentalized nature of SDS and emotional TTS research, as well as the lack of standardized evaluation metrics for social goals. We address these challenges by developing an emotional SDS for news conversations that utilizes a large language model (LLM)-based sentiment analyzer to identify appropriate emotions and PromptTTS to synthesize context-appropriate emotional speech. We also propose a subjective evaluation scale for emotional SDSs and judge the emotion regulation performance of the proposed and baseline systems. Experiments showed that our emotional SDS outperformed a baseline system in terms of emotion regulation and engagement. These results suggest the critical role of speech emotion in making conversations more engaging. All our source code is open-sourced at https://github.com/dhatchi711/espnet-emotional-news/tree/emo-sds/egs2/emo_news_sds/sds1
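The abstract describes a two-stage pipeline: an LLM-based sentiment analyzer selects an emotion from the news content and dialogue context, and a PromptTTS-style model renders the reply with a matching natural-language style prompt. The sketch below illustrates one plausible wiring of these stages under those assumptions; the function names and the injected `llm_generate` / `prompt_tts_synthesize` callables are hypothetical placeholders, not the actual EmoNews/ESPnet API.

```python
# Hypothetical sketch of the two-stage pipeline described in the abstract:
# (1) an LLM-based sentiment analyzer picks an emotion for the next reply,
# (2) a PromptTTS-style model synthesizes speech conditioned on a
#     natural-language style prompt. All names here are placeholders.
from typing import Literal

Emotion = Literal["happy", "sad", "angry", "surprised", "neutral"]
EMOTIONS = ("happy", "sad", "angry", "surprised", "neutral")

def analyze_emotion(news_summary: str, dialogue_history: list[str],
                    llm_generate) -> Emotion:
    """Ask an LLM which emotion the system's next utterance should carry."""
    prompt = (
        "Given the news summary and the conversation so far, choose the most "
        "appropriate emotion for the assistant's next spoken reply.\n"
        f"News: {news_summary}\n"
        f"Dialogue: {' | '.join(dialogue_history)}\n"
        f"Answer with one word from: {', '.join(EMOTIONS)}."
    )
    answer = llm_generate(prompt).strip().lower()
    # Fall back to a neutral reading if the LLM output is not a known label.
    return answer if answer in EMOTIONS else "neutral"

def build_style_prompt(emotion: Emotion) -> str:
    """Map the chosen emotion to a natural-language prompt for PromptTTS."""
    templates = {
        "happy": "A cheerful voice speaking at a lively pace.",
        "sad": "A soft, low voice speaking slowly with a somber tone.",
        "angry": "A tense, forceful voice with strong emphasis.",
        "surprised": "An excited voice with rising pitch.",
        "neutral": "A calm, even voice at a moderate pace.",
    }
    return templates[emotion]

def speak(reply_text: str, emotion: Emotion, prompt_tts_synthesize) -> bytes:
    """Synthesize the reply with emotion-appropriate prosody."""
    return prompt_tts_synthesize(text=reply_text,
                                 style_prompt=build_style_prompt(emotion))
```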
Related papers
- EmoSpeech: A Corpus of Emotionally Rich and Contextually Detailed Speech Annotations [1.9827837167752067]
The development of text-to-speech (TTS) systems capable of controlling subtle emotional differences remains a formidable challenge.
Existing emotional speech databases often suffer from overly simplistic labelling schemes that fail to capture a wide range of emotional states.
We propose a novel process aimed at building databases by systematically extracting emotion-rich speech segments and annotating them with detailed natural language descriptions.
arXiv Detail & Related papers (2024-12-09T15:36:37Z)
- Emotional Dimension Control in Language Model-Based Text-to-Speech: Spanning a Broad Spectrum of Human Emotions [37.075331767703986]
Current emotional text-to-speech systems face challenges in conveying the full spectrum of human emotions.
This paper introduces a TTS framework that provides flexible user control over three emotional dimensions: pleasure, arousal, and dominance.
arXiv Detail & Related papers (2024-09-25T07:16:16Z)
- UMETTS: A Unified Framework for Emotional Text-to-Speech Synthesis with Multimodal Prompts [64.02363948840333]
UMETTS is a novel framework that leverages emotional cues from multiple modalities to generate highly expressive and emotionally resonant speech.
EP-Align employs contrastive learning to align emotional features across text, audio, and visual modalities, ensuring a coherent fusion of multimodal information.
EMI-TTS integrates the aligned emotional embeddings with state-of-the-art TTS models to synthesize speech that accurately reflects the intended emotions.
arXiv Detail & Related papers (2024-04-29T03:19:39Z)
- Personality-affected Emotion Generation in Dialog Systems [67.40609683389947]
We propose a new task, Personality-affected Emotion Generation, to generate emotion based on the personality given to the dialog system.
We analyze the challenges in this task, i.e., (1) heterogeneously integrating personality and emotional factors and (2) extracting multi-granularity emotional information in the dialog context.
Results suggest that by adopting our method, the emotion generation performance is improved by 13% in macro-F1 and 5% in weighted-F1 over the BERT-base model.
arXiv Detail & Related papers (2024-04-03T08:48:50Z)
- Enhancing Emotional Generation Capability of Large Language Models via Emotional Chain-of-Thought [50.13429055093534]
Large Language Models (LLMs) have shown remarkable performance in various emotion recognition tasks.
We propose the Emotional Chain-of-Thought (ECoT) to enhance the performance of LLMs on various emotional generation tasks.
arXiv Detail & Related papers (2024-01-12T16:42:10Z)
- Emotion Rendering for Conversational Speech Synthesis with Heterogeneous Graph-Based Context Modeling [50.99252242917458]
Conversational Speech Synthesis (CSS) aims to accurately express an utterance with the appropriate prosody and emotional inflection within a conversational setting.
To address the issue of data scarcity, we meticulously create emotional labels in terms of category and intensity.
Our model outperforms the baseline models in understanding and rendering emotions.
arXiv Detail & Related papers (2023-12-19T08:47:50Z)
- Learning Emotional Representations from Imbalanced Speech Data for Speech Emotion Recognition and Emotional Text-to-Speech [1.4986031916712106]
Speech emotional representations play a key role in Speech Emotion Recognition (SER) and Emotional Text-To-Speech (TTS) tasks.
Models might overfit to the majority Neutral class and fail to produce robust and effective emotional representations.
We use augmentation approaches to train the model and enable it to extract effective and generalizable emotional representations from imbalanced datasets.
arXiv Detail & Related papers (2023-06-09T07:04:56Z)
- ZET-Speech: Zero-shot adaptive Emotion-controllable Text-to-Speech Synthesis with Diffusion and Style-based Models [83.07390037152963]
ZET-Speech is a zero-shot adaptive emotion-controllable TTS model.
It allows users to synthesize any speaker's emotional speech using only a short, neutral speech segment and the target emotion label.
Experimental results demonstrate that ZET-Speech successfully synthesizes natural and emotional speech with the desired emotion for both seen and unseen speakers.
arXiv Detail & Related papers (2023-05-23T08:52:00Z)
- Analysis of impact of emotions on target speech extraction and speech separation [30.06415464303977]
We investigate the influence of emotions on blind speech separation (BSS) and target speech extraction (TSE).
We observe that BSS is relatively robust to emotions, while TSE, which requires identifying and extracting the speech of a target speaker, is much more sensitive to emotions.
arXiv Detail & Related papers (2022-08-15T09:47:13Z)
- Emotion Intensity and its Control for Emotional Voice Conversion [77.05097999561298]
Emotional voice conversion (EVC) seeks to convert the emotional state of an utterance while preserving the linguistic content and speaker identity.
In this paper, we aim to explicitly characterize and control the intensity of emotion.
We propose to disentangle the speaker style from linguistic content and encode the speaker style into a style embedding in a continuous space that forms the prototype of emotion embedding.
arXiv Detail & Related papers (2022-01-10T02:11:25Z)
- Simulated Annealing for Emotional Dialogue Systems [22.96717845092991]
We consider the task of expressing a specific emotion for dialogue generation.
Our proposed method shows 12% improvements in emotion accuracy compared with the previous state-of-the-art method.
arXiv Detail & Related papers (2021-09-22T13:17:17Z)
- Reinforcement Learning for Emotional Text-to-Speech Synthesis with Improved Emotion Discriminability [82.39099867188547]
Emotional text-to-speech synthesis (ETTS) has seen much progress in recent years.
We propose a new interactive training paradigm for ETTS, denoted as i-ETTS.
We formulate an iterative training strategy with reinforcement learning to ensure the quality of i-ETTS optimization.
arXiv Detail & Related papers (2021-04-03T13:52:47Z)