A Unified Spoken Language Model with Injected Emotional-Attribution Thinking for Human-like Interaction
- URL: http://arxiv.org/abs/2601.04960v1
- Date: Thu, 08 Jan 2026 14:07:30 GMT
- Title: A Unified Spoken Language Model with Injected Emotional-Attribution Thinking for Human-like Interaction
- Authors: Qing Wang, Zehan Li, Yaodong Song, Hongjie Chen, Jian Kang, Jie Lian, Jie Li, Yongxiang Li, Xuelong Li,
- Abstract summary: This paper presents a unified spoken language model for emotional intelligence, enhanced by a novel data construction strategy termed Injected Emotional-Attribution Thinking (IEAT). IEAT incorporates user emotional states and their underlying causes into the model's internal reasoning process, enabling emotion-aware reasoning to be internalized rather than treated as explicit supervision. Experiments on the Human-like Spoken Dialogue Systems Challenge (HumDial) Emotional Intelligence benchmark demonstrate that the proposed approach achieves top-ranked performance across emotional trajectory modeling, emotional reasoning, and empathetic response generation.
- Score: 50.05919688888947
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: This paper presents a unified spoken language model for emotional intelligence, enhanced by a novel data construction strategy termed Injected Emotional-Attribution Thinking (IEAT). IEAT incorporates user emotional states and their underlying causes into the model's internal reasoning process, enabling emotion-aware reasoning to be internalized rather than treated as explicit supervision. The model is trained with a two-stage progressive strategy. The first stage performs speech-text alignment and emotional attribute modeling via self-distillation, while the second stage conducts end-to-end cross-modal joint optimization to ensure consistency between textual and spoken emotional expressions. Experiments on the Human-like Spoken Dialogue Systems Challenge (HumDial) Emotional Intelligence benchmark demonstrate that the proposed approach achieves top-ranked performance across emotional trajectory modeling, emotional reasoning, and empathetic response generation under both LLM-based and human evaluations.
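The abstract describes IEAT as a data-construction strategy rather than an architectural change: emotion states and their causes are written into the model's reasoning trace so attribution is learned implicitly. As a rough illustration, the sketch below builds one such training sample with the attribution placed in a thinking span inside the assistant turn; the tag format, field names, and wording are assumptions, not the authors' published recipe.

```python
# Hypothetical sketch of IEAT-style data construction. The user's emotional
# state and its inferred cause are injected into the assistant's internal
# "thinking" span, so emotion attribution becomes part of the reasoning
# trace rather than a separate supervised label. Tags and fields here are
# illustrative assumptions only.

def build_ieat_sample(user_utterance: str,
                      emotion: str,
                      cause: str,
                      empathetic_reply: str) -> dict:
    """Wrap emotion attribution in a thinking span preceding the reply."""
    thinking = (
        f"The user sounds {emotion}. "
        f"Likely cause: {cause}. "
        "Acknowledge the feeling before addressing the content."
    )
    return {
        "messages": [
            {"role": "user", "content": user_utterance},
            # The reasoning sits inside the assistant turn, so the model
            # internalizes it instead of predicting an explicit label.
            {"role": "assistant",
             "content": f"<think>{thinking}</think> {empathetic_reply}"},
        ]
    }

sample = build_ieat_sample(
    "I practiced for weeks and still failed the audition.",
    emotion="disappointed",
    cause="sustained effort did not lead to the hoped-for outcome",
    empathetic_reply="That's really hard after all the work you put in.",
)
print(sample["messages"][1]["content"])
```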
Related papers
- Detecting Emotional Dynamic Trajectories: An Evaluation Framework for Emotional Support in Language Models [6.810484095299127]
Emotional support is a core capability in human-AI interaction, with applications including psychological counseling, role play, and companionship. Existing evaluations of large language models (LLMs) often rely on short, static dialogues and fail to capture the dynamic and long-term nature of emotional support. Our framework constructs a large-scale benchmark consisting of 328 emotional contexts and 1,152 disturbance events, simulating realistic emotional shifts under evolving dialogue scenarios.
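To make the benchmark's structure concrete, here is a minimal sketch of how one emotional context with its disturbance events might be represented; the schema and field names are illustrative assumptions, not the paper's release format.

```python
# Hypothetical schema for a trajectory-style benchmark entry: an emotional
# context plus timed disturbance events that shift the user's state.
from dataclasses import dataclass, field

@dataclass
class DisturbanceEvent:
    turn: int            # dialogue turn at which the event fires
    description: str     # e.g. "the interview is moved up by two days"
    expected_shift: str  # e.g. "hopeful -> stressed"

@dataclass
class EmotionalContext:
    scenario: str
    initial_emotion: str
    events: list[DisturbanceEvent] = field(default_factory=list)

ctx = EmotionalContext(
    scenario="User is preparing for a job interview next week.",
    initial_emotion="anxious but hopeful",
    events=[
        DisturbanceEvent(3, "the interview is moved up by two days",
                         "hopeful -> stressed"),
        DisturbanceEvent(7, "a friend offers a mock interview",
                         "stressed -> reassured"),
    ],
)
```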
arXiv Detail & Related papers (2025-11-12T05:47:28Z)
- From Rational Answers to Emotional Resonance: The Role of Controllable Emotion Generation in Language Models [16.350658746140788]
Large language models (LLMs) struggle to express emotions in a consistent, controllable, and contextually appropriate manner. We propose a controllable emotion generation framework based on Emotion Vectors (EVs). Our method enables fine-grained, continuous modulation of emotional tone without any additional training or architectural modification.
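Emotion Vectors read like an activation-steering method. Under that assumption, the sketch below computes an EV as the mean hidden-state difference between emotionally phrased and neutral prompts and adds it to one layer's output at inference; the pooling, layer choice, and scale are assumptions, not the paper's exact recipe.

```python
# Minimal activation-steering sketch for Emotion Vectors (assumed recipe).
import torch

def emotion_vector(hidden_emotional: torch.Tensor,
                   hidden_neutral: torch.Tensor) -> torch.Tensor:
    """hidden_*: (num_prompts, hidden_dim) pooled hidden states."""
    return hidden_emotional.mean(dim=0) - hidden_neutral.mean(dim=0)

def add_steering_hook(layer: torch.nn.Module,
                      ev: torch.Tensor,
                      scale: float = 1.0):
    """Shift the layer's output along the EV; scale sets intensity."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + scale * ev          # continuous modulation
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return layer.register_forward_hook(hook)
```

Because the vector is added at inference time, intensity can be dialed continuously through `scale`, which matches the abstract's claim of training-free, fine-grained modulation.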
arXiv Detail & Related papers (2025-02-06T13:38:57Z)
- ECR-Chain: Advancing Generative Language Models to Better Emotion-Cause Reasoners through Reasoning Chains [61.50113532215864]
Causal Emotion Entailment (CEE) aims to identify the causal utterances in a conversation that stimulate the emotions expressed in a target utterance.
Current works in CEE mainly focus on modeling semantic and emotional interactions in conversations.
We introduce a step-by-step reasoning method, Emotion-Cause Reasoning Chain (ECR-Chain), to infer the stimulus from the target emotional expressions in conversations.
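As a concrete illustration, such a chain can be phrased as a prompt that works backward from the target utterance's emotion to its stimulus; the step wording below is an assumption, not the authors' published chain.

```python
# Hypothetical ECR-style prompt: reason from the expressed emotion back to
# the causal utterance(s) in the conversation.
ECR_PROMPT = """\
Conversation:
{history}

Target utterance (#{index}): "{target}"

Reason step by step:
1. Theme: what is the conversation about?
2. Reaction: what emotion does the target utterance express?
3. Appraisal: why would the speaker feel this way?
4. Stimulus: which earlier utterance(s) caused this emotion? Give indices.
"""

def build_ecr_prompt(history: list, index: int) -> str:
    numbered = "\n".join(f"#{i}: {u}" for i, u in enumerate(history, 1))
    return ECR_PROMPT.format(history=numbered,
                             index=index,
                             target=history[index - 1])
```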
arXiv Detail & Related papers (2024-05-17T15:45:08Z)
- Attention-based Interactive Disentangling Network for Instance-level Emotional Voice Conversion [81.1492897350032]
Emotional Voice Conversion aims to manipulate speech according to a given emotion while preserving non-emotion components.
We propose an Attention-based Interactive diseNtangling Network (AINN) that leverages instance-wise emotional knowledge for voice conversion.
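As a rough sketch of the disentangle-then-recombine idea, the toy module below separates content and emotion representations and decodes their combination; the layout and dimensions are illustrative assumptions, not AINN's actual architecture.

```python
# Toy content/emotion disentanglement for voice conversion (assumed design).
import torch
import torch.nn as nn

class DisentangleVC(nn.Module):
    def __init__(self, feat_dim=80, content_dim=128, emotion_dim=64):
        super().__init__()
        self.content_enc = nn.GRU(feat_dim, content_dim, batch_first=True)
        self.emotion_enc = nn.GRU(feat_dim, emotion_dim, batch_first=True)
        self.decoder = nn.Linear(content_dim + emotion_dim, feat_dim)

    def forward(self, source_mel, reference_mel):
        content, _ = self.content_enc(source_mel)      # what is said
        _, emo = self.emotion_enc(reference_mel)       # how it should sound
        emo = emo[-1].unsqueeze(1).expand(-1, content.size(1), -1)
        return self.decoder(torch.cat([content, emo], dim=-1))
```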
arXiv Detail & Related papers (2023-12-29T08:06:45Z)
- Emotion Rendering for Conversational Speech Synthesis with Heterogeneous Graph-Based Context Modeling [50.99252242917458]
Conversational Speech Synthesis (CSS) aims to accurately express an utterance with the appropriate prosody and emotional inflection within a conversational setting.
To address the issue of data scarcity, we meticulously create emotional labels in terms of category and intensity.
Our model outperforms the baseline models in understanding and rendering emotions.
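To illustrate the context-modeling idea, the sketch below builds a toy graph over a dialogue history, with category/intensity emotion labels on nodes and typed edges for temporal order and speaker identity; the node and edge types are assumptions, not the paper's graph definition.

```python
# Toy heterogeneous context graph for conversational TTS (assumed schema).

def build_context_graph(history):
    """history: list of (speaker, text, emotion_category, intensity)."""
    nodes, edges = [], []
    for i, (speaker, text, emotion, intensity) in enumerate(history):
        nodes.append({"id": i, "speaker": speaker, "text": text,
                      "emotion": emotion, "intensity": intensity})
        if i > 0:
            edges.append((i - 1, i, "temporal"))      # dialogue order
        for j in range(i):
            if history[j][0] == speaker:
                edges.append((j, i, "same_speaker"))  # intra-speaker link
    return nodes, edges

nodes, edges = build_context_graph([
    ("A", "I got the job!", "happy", "high"),
    ("B", "That's wonderful news!", "happy", "medium"),
    ("A", "But I have to move away.", "sad", "low"),
])
```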
arXiv Detail & Related papers (2023-12-19T08:47:50Z)
- Think Twice: A Human-like Two-stage Conversational Agent for Emotional Response Generation [16.659457455269127]
We propose a two-stage conversational agent for the generation of emotional dialogue.
First, a dialogue model trained without an emotion-annotated dialogue corpus generates a prototype response that matches the contextual semantics.
Second, the first-stage prototype is revised by a controllable emotion refiner based on the empathy hypothesis.
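The two-stage pipeline reduces to a few lines of glue; the callables below are stand-ins for the paper's trained prototype generator and emotion refiner.

```python
# Two-stage emotional response generation (schematic stand-ins).
from typing import Callable

def respond(context: str,
            target_emotion: str,
            prototype_model: Callable[[str], str],
            emotion_refiner: Callable[[str, str], str]) -> str:
    draft = prototype_model(context)               # stage 1: semantics
    return emotion_refiner(draft, target_emotion)  # stage 2: emotion

reply = respond(
    "I just lost my keys again.",
    "empathetic",
    prototype_model=lambda ctx: "You could retrace your steps.",
    emotion_refiner=lambda d, e: f"Oh no, that sounds frustrating. {d}",
)
```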
arXiv Detail & Related papers (2023-01-12T10:03:56Z)
- Constructing Emotion Consensus and Utilizing Unpaired Data for Empathetic Dialogue Generation [22.2430593119389]
We propose a dual-generative model, Dual-Emp, to simultaneously construct an emotion consensus and utilize external unpaired data.
Our method outperforms competitive baselines in producing coherent and empathetic responses.
arXiv Detail & Related papers (2021-09-16T07:57:01Z)
- Emotion-aware Chat Machine: Automatic Emotional Response Generation for Human-like Emotional Interaction [55.47134146639492]
This article proposes a unified end-to-end neural architecture capable of simultaneously encoding the semantics and the emotions in a post.
Experiments on real-world data demonstrate that the proposed method outperforms the state-of-the-art methods in terms of both content coherence and emotion appropriateness.
arXiv Detail & Related papers (2021-06-06T06:26:15Z)
- Reinforcement Learning for Emotional Text-to-Speech Synthesis with Improved Emotion Discriminability [82.39099867188547]
Emotional text-to-speech synthesis (ETTS) has seen much progress in recent years.
We propose a new interactive training paradigm for ETTS, denoted as i-ETTS.
We formulate an iterative training strategy with reinforcement learning to ensure the quality of i-ETTS optimization.
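The described loop resembles policy-gradient training against an emotion-recognition reward. Below is a hedged REINFORCE-style sketch under that assumption; `tts_model.sample` and the reward definition are hypothetical stand-ins, not the i-ETTS implementation.

```python
# REINFORCE-style sketch: reward the TTS policy with an emotion
# classifier's confidence in the target emotion (assumed setup).
import torch

def rl_step(tts_model, emotion_classifier, text, target_emotion_id, optimizer):
    audio, log_prob = tts_model.sample(text)  # stochastic synthesis (hypothetical API)
    with torch.no_grad():
        probs = emotion_classifier(audio)     # (num_emotions,) posterior
        reward = probs[target_emotion_id]     # emotion discriminability reward
    loss = -reward * log_prob                 # policy-gradient objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return reward.item()
```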
arXiv Detail & Related papers (2021-04-03T13:52:47Z)