EchoMind: An Interrelated Multi-level Benchmark for Evaluating Empathetic Speech Language Models
- URL: http://arxiv.org/abs/2510.22758v1
- Date: Sun, 26 Oct 2025 17:15:56 GMT
- Title: EchoMind: An Interrelated Multi-level Benchmark for Evaluating Empathetic Speech Language Models
- Authors: Li Zhou, Lutong Yu, You Lyu, Yihang Lin, Zefeng Zhao, Junyi Ao, Yuhao Zhang, Benyou Wang, Haizhou Li
- Abstract summary: Speech Language Models (SLMs) have made significant progress in spoken language understanding. It remains unclear whether SLMs can fully perceive non-lexical vocal cues alongside spoken words, and respond with empathy that aligns with both emotional and contextual factors. We present EchoMind, the first interrelated, multi-level benchmark that simulates the cognitive process of empathetic dialogue.
- Score: 47.41816926003011
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Speech Language Models (SLMs) have made significant progress in spoken language understanding. Yet it remains unclear whether they can fully perceive non-lexical vocal cues alongside spoken words, and respond with empathy that aligns with both emotional and contextual factors. Existing benchmarks typically evaluate linguistic, acoustic, reasoning, or dialogue abilities in isolation, overlooking the integration of these skills that is crucial for human-like, emotionally intelligent conversation. We present EchoMind, the first interrelated, multi-level benchmark that simulates the cognitive process of empathetic dialogue through sequential, context-linked tasks: spoken-content understanding, vocal-cue perception, integrated reasoning, and response generation. All tasks share identical and semantically neutral scripts that are free of explicit emotional or contextual cues, and controlled variations in vocal style are used to test the effect of delivery independent of the transcript. EchoMind is grounded in an empathy-oriented framework spanning 3 coarse and 12 fine-grained dimensions, encompassing 39 vocal attributes, and is evaluated using both objective and subjective metrics. Testing 12 advanced SLMs reveals that even state-of-the-art models struggle with highly expressive vocal cues, limiting empathetic response quality. Analyses of prompt strength, speech source, and ideal vocal-cue recognition reveal persistent weaknesses in instruction-following, resilience to natural speech variability, and effective use of vocal cues for empathy. These results underscore the need for SLMs that integrate linguistic content with diverse vocal cues to achieve truly empathetic conversational ability.
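As a rough illustration of the benchmark's interrelated design, the sketch below chains the four context-linked levels over a single neutral script. The `EchoMindItem` fields and the `model.run`/`model.score` interface are hypothetical stand-ins, not EchoMind's actual API.

```python
# Hypothetical sketch of EchoMind's four context-linked levels; the
# item fields and `model` interface are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class EchoMindItem:
    script: str        # semantically neutral transcript, shared by all tasks
    audio_path: str    # same script rendered with a controlled vocal style
    vocal_attrs: dict  # gold labels, e.g. {"emotion": "sad", "pitch": "low"}

LEVELS = [
    "spoken_content_understanding",  # Level 1: what was said
    "vocal_cue_perception",          # Level 2: how it was said
    "integrated_reasoning",          # Level 3: what the combination implies
    "response_generation",           # Level 4: empathetic reply
]

def evaluate_item(model, item: EchoMindItem) -> dict:
    """Run one item through the sequential, context-linked tasks."""
    context, scores = [], {}
    for level in LEVELS:
        # Each level sees the audio plus the outputs of earlier levels,
        # mirroring the benchmark's "interrelated" design.
        output = model.run(task=level, audio=item.audio_path, context=list(context))
        context.append(output)
        scores[level] = model.score(task=level, output=output, gold=item)
    return scores
```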
Related papers
- Reflecting Twice before Speaking with Empathy: Self-Reflective Alternating Inference for Empathy-Aware End-to-End Spoken Dialogue [53.95386201009769]
We introduce EmpathyEval, a descriptive natural-language-based evaluation model for assessing empathetic quality in spoken dialogues. We propose ReEmpathy, an end-to-end spoken language model that enhances empathetic dialogue through a novel Empathetic Self-Reflective Alternating Inference mechanism.
arXiv Detail & Related papers (2026-01-26T09:04:50Z)
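The abstract names the mechanism without detailing it; one plausible reading is an alternation between drafting a reply and self-critiquing its empathy. In the hedged sketch below, `generate`, `reflect`, and the stopping rule are assumptions, not the paper's algorithm.

```python
# Hedged sketch of "self-reflective alternating inference" as one might
# read it from the abstract: alternate draft -> reflect -> revise steps.
def alternating_inference(generate, reflect, user_turn, max_rounds=3):
    """generate(prompt) -> reply; reflect(user_turn, reply) -> (ok, critique)."""
    reply = generate(f"User said: {user_turn}\nRespond empathetically.")
    for _ in range(max_rounds):
        ok, critique = reflect(user_turn, reply)  # self-assessment of empathy
        if ok:
            break
        # Fold the critique back into the next generation pass.
        reply = generate(
            f"User said: {user_turn}\nDraft reply: {reply}\n"
            f"Critique: {critique}\nRevise the reply to address the critique."
        )
    return reply
```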
- ES4R: Speech Encoding Based on Prepositive Affective Modeling for Empathetic Response Generation [30.006550552714938]
Empathetic speech dialogue requires not only understanding linguistic content but also perceiving rich paralinguistic information. Existing speech-to-speech large language models either rely on ASR transcription or use encoders to extract latent representations. We propose ES4R, a framework for speech-based empathetic response generation.
arXiv Detail & Related papers (2026-01-16T10:26:50Z)
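A minimal sketch of one way to read "prepositive affective modeling": derive an affect vector from speech first and prepend it to the LLM's token embeddings. The module, layer sizes, and mean pooling are illustrative assumptions, not the paper's architecture.

```python
# Illustrative reading of "prepositive affective modeling": an affect
# vector is computed from speech and placed *before* the text tokens.
import torch
import torch.nn as nn

class PrepositiveAffectPrefix(nn.Module):
    def __init__(self, speech_dim=512, llm_dim=1024, n_prefix=4):
        super().__init__()
        self.affect_encoder = nn.Sequential(        # stand-in affect model
            nn.Linear(speech_dim, llm_dim), nn.Tanh()
        )
        self.to_prefix = nn.Linear(llm_dim, n_prefix * llm_dim)
        self.n_prefix, self.llm_dim = n_prefix, llm_dim

    def forward(self, speech_feats, token_embeds):
        # speech_feats: (B, T, speech_dim); token_embeds: (B, L, llm_dim)
        affect = self.affect_encoder(speech_feats.mean(dim=1))  # (B, llm_dim)
        prefix = self.to_prefix(affect).view(-1, self.n_prefix, self.llm_dim)
        # Affective prefix precedes the text tokens, hence "prepositive".
        return torch.cat([prefix, token_embeds], dim=1)
```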
- Detecting Mental Manipulation in Speech via Synthetic Multi-Speaker Dialogue [12.181747090385612]
Mental manipulation is the strategic use of language to covertly influence or exploit others. We present the first study of mental manipulation detection in spoken dialogues. Using few-shot large audio-language models and human annotation, we evaluate how modality affects detection accuracy and perception.
arXiv Detail & Related papers (2026-01-13T09:02:08Z)
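A minimal sketch of few-shot prompting for this task; the exemplar dialogues, binary label set, and `ask_model` callable are invented for illustration.

```python
# Few-shot prompt builder for manipulation detection with an
# audio-language model; exemplars and labels are invented examples.
FEW_SHOT = [
    ("A: You'd be nothing without me. B: ...I guess you're right.", "manipulative"),
    ("A: I disagree, but I see your point. B: Thanks for hearing me out.", "benign"),
]

def build_prompt(dialogue: str) -> str:
    shots = "\n".join(f"Dialogue: {d}\nLabel: {y}" for d, y in FEW_SHOT)
    return (
        "Decide whether the speaker uses covert manipulation.\n"
        f"{shots}\nDialogue: {dialogue}\nLabel:"
    )

def detect(ask_model, dialogue: str) -> str:
    # ask_model(prompt) -> raw text; normalize to a binary label.
    raw = ask_model(build_prompt(dialogue)).strip().lower()
    return "manipulative" if "manip" in raw else "benign"
```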
- Evaluating Emotion Recognition in Spoken Language Models on Emotionally Incongruent Speech [0.13048920509133805]
We evaluate four spoken language models (SLMs) on the task of speech emotion recognition. Our results indicate that SLMs rely predominantly on textual semantics rather than speech emotion to perform the task.
arXiv Detail & Related papers (2025-10-29T00:45:36Z)
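The incongruence test logic can be pictured as below: pair transcripts whose words carry one emotion with audio acted in another, then tally which source the model's prediction follows. The items and field layout are invented examples, not the paper's data.

```python
# Sketch of the incongruence analysis: does the predicted emotion track
# the text's emotion or the audio's emotion?
from collections import Counter

items = [
    # (transcript, text_emotion, audio_emotion, model_prediction)
    ("I just won the lottery!", "happy", "angry", "happy"),
    ("Everything is ruined.",   "sad",   "happy", "sad"),
]

follows = Counter()
for _, text_emo, audio_emo, pred in items:
    if pred == text_emo and pred != audio_emo:
        follows["text"] += 1
    elif pred == audio_emo and pred != text_emo:
        follows["audio"] += 1
    else:
        follows["ambiguous"] += 1

# A dominant "text" count reproduces the paper's finding that SLMs
# lean on textual semantics rather than vocal emotion.
print(dict(follows))
```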
- Towards Inclusive Communication: A Unified Framework for Generating Spoken Language from Sign, Lip, and Audio [52.859261069569165]
We propose the first unified framework capable of handling diverse combinations of sign language, lip movements, and audio for spoken-language text generation. We focus on three main objectives: (i) designing a unified, modality-agnostic architecture capable of effectively processing heterogeneous inputs; (ii) exploring the underexamined synergy among modalities, particularly the role of lip movements as non-manual cues in sign language comprehension; and (iii) achieving performance on par with or better than state-of-the-art models specialized for individual tasks.
arXiv Detail & Related papers (2025-08-28T06:51:42Z)
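One plausible shape for such a modality-agnostic design, sketched under assumed feature dimensions: per-modality encoders project into a shared space, and whatever subset of modalities is present is concatenated for a single downstream decoder.

```python
# Modality-agnostic encoder sketch: any subset of {sign, lip, audio}
# streams is projected to one shared width. Dimensions are placeholders.
import torch
import torch.nn as nn

class UnifiedEncoder(nn.Module):
    def __init__(self, dims=None, d_model=512):
        super().__init__()
        dims = dims or {"sign": 256, "lip": 128, "audio": 80}
        self.proj = nn.ModuleDict(
            {m: nn.Linear(d, d_model) for m, d in dims.items()}
        )

    def forward(self, inputs: dict) -> torch.Tensor:
        # inputs: modality name -> (B, T_m, dim_m); any subset may be
        # present, which is what makes the encoder modality-agnostic.
        streams = [self.proj[m](x) for m, x in inputs.items()]
        return torch.cat(streams, dim=1)  # (B, sum T_m, d_model) for a decoder

enc = UnifiedEncoder()
out = enc({"lip": torch.randn(2, 30, 128), "audio": torch.randn(2, 100, 80)})
```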
- Marco-Voice Technical Report [35.01600797874603]
The goal of this work is to address longstanding challenges in achieving highly expressive, controllable, and natural speech generation. Our approach introduces an effective speaker-emotion disentanglement mechanism with in-batch contrastive learning. To support comprehensive training and evaluation, we construct CSEMOTIONS, a high-quality emotional speech dataset.
arXiv Detail & Related papers (2025-08-04T04:08:22Z)
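A hedged sketch of in-batch contrastive disentanglement in this spirit: speaker embeddings of the same speaker are pulled together across emotions while different speakers are pushed apart. The loss form and temperature are standard supervised-contrastive choices, not necessarily Marco-Voice's exact recipe.

```python
# In-batch supervised contrastive loss over speaker embeddings, so that
# speaker identity stays constant while emotion varies within the batch.
import torch
import torch.nn.functional as F

def speaker_contrastive_loss(spk_emb, speaker_ids, temperature=0.1):
    """spk_emb: (B, D) speaker embeddings from emotionally varied utterances."""
    z = F.normalize(spk_emb, dim=-1)
    sim = z @ z.T / temperature                   # (B, B) similarity logits
    same = speaker_ids.unsqueeze(0) == speaker_ids.unsqueeze(1)
    mask = ~torch.eye(len(z), dtype=torch.bool)   # drop self-pairs
    pos = (same & mask).float()
    log_prob = sim - torch.logsumexp(
        sim.masked_fill(~mask, -1e9), dim=1, keepdim=True
    )
    # Average log-likelihood of positive (same-speaker) pairs per anchor.
    return -(log_prob * pos).sum(1).div(pos.sum(1).clamp(min=1)).mean()

loss = speaker_contrastive_loss(torch.randn(8, 32),
                                torch.tensor([0, 0, 1, 1, 2, 2, 3, 3]))
```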
- SpeechIQ: Speech Intelligence Quotient Across Cognitive Levels in Voice Understanding Large Language Models [76.07833875692722]
Speech-based Intelligence Quotient (SIQ) is a new, human cognition-inspired evaluation pipeline for voice understanding large language models. Our framework represents a first-of-its-kind intelligence examination that bridges cognitive principles with voice-oriented benchmarks.
arXiv Detail & Related papers (2025-07-25T15:12:06Z)
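The abstract does not give the scoring formula; the sketch below shows only one illustrative aggregation, where the level names, uniform weights, and IQ-style scaling (mean 100, SD 15) are all assumptions.

```python
# Illustrative IQ-style aggregation of per-level scores; every constant
# here is an assumption, not the paper's definition.
import statistics

def speech_iq(per_level: dict, population: list, weights=None) -> float:
    """per_level: model's score per cognitive level in [0, 1];
    population: other models' aggregate scores, for normalization."""
    weights = weights or {k: 1.0 for k in per_level}
    raw = sum(per_level[k] * weights[k] for k in per_level) / sum(weights.values())
    mu, sd = statistics.mean(population), statistics.stdev(population)
    return 100 + 15 * (raw - mu) / sd   # z-score mapped to an IQ-like scale

print(speech_iq({"remember": 0.9, "understand": 0.7, "apply": 0.5},
                population=[0.5, 0.6, 0.7, 0.8]))
```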
- MultiVox: A Benchmark for Evaluating Voice Assistants for Multimodal Interactions [70.93364531054273]
We introduce MultiVox, the first benchmark to evaluate the ability of voice assistants to integrate spoken and visual cues. Specifically, MultiVox includes 1000 human-annotated and recorded speech dialogues that encompass diverse paralinguistic features. Our evaluation of 10 state-of-the-art models reveals that, although humans excel at these tasks, current models consistently struggle to produce contextually grounded responses.
arXiv Detail & Related papers (2025-07-14T23:20:42Z)
- VocalBench: Benchmarking the Vocal Conversational Abilities for Speech Interaction Models [31.584937435966253]
We propose VocalBench to assess the vocal conversational abilities of speech interaction models. It comprises 9,400 carefully curated instances across four key dimensions, covering a broad range of fundamental skills essential for effective vocal interactions.
arXiv Detail & Related papers (2025-05-21T16:34:07Z)
- Paralinguistics-Enhanced Large Language Modeling of Spoken Dialogue [71.15186328127409]
The Paralinguistics-enhanced Generative Pretrained Transformer (ParalinGPT) takes the conversational context of text, speech embeddings, and paralinguistic attributes as input prompts within a serialized multitasking framework.
We utilize the Switchboard-1 corpus, including its sentiment labels as the paralinguistic attribute, as our spoken dialogue dataset.
arXiv Detail & Related papers (2023-12-23T18:14:56Z)
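A hedged sketch of what a serialized multitask prompt in ParalinGPT's spirit could look like: dialogue history, slots for speech embeddings, and a sentiment field laid out in one autoregressive sequence. The tags and field order are assumptions, not the paper's tokenization.

```python
# Serialized multitask layout sketch: history text with speech-embedding
# slots, then sentiment, then the response, in one sequence.
def serialize_turn(history, cur_sentiment=None, response=None):
    parts = []
    for speaker, text in history:
        parts.append(f"<{speaker}> {text} <speech_emb>")  # embeddings injected here
    parts.append(f"<sentiment> {cur_sentiment or '?'}")   # predicted if unknown
    parts.append(f"<response> {response or ''}")          # then generate the reply
    return " ".join(parts)

print(serialize_turn([("A", "I lost my keys again."), ("B", "Oh no.")],
                     cur_sentiment="negative"))
```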
- Deep learning of segment-level feature representation for speech emotion recognition in conversations [9.432208348863336]
We propose a conversational speech emotion recognition method that captures attentive contextual dependencies and speaker-sensitive interactions.
First, we use a pretrained VGGish model to extract segment-based audio representation in individual utterances.
Second, an attentive bi-directional gated recurrent unit (GRU) models context-sensitive information and explores intra- and inter-speaker dependencies jointly.
arXiv Detail & Related papers (2023-02-05T16:15:46Z)
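A compact PyTorch sketch of the described pipeline: 128-dimensional VGGish-style segment embeddings feed a bi-directional GRU, and additive attention pools the contextualized states into emotion logits. The attention form and layer sizes are assumptions, not the paper's exact configuration.

```python
# Segment-level SER sketch: BiGRU over per-segment embeddings with
# attention pooling; sizes are placeholders.
import torch
import torch.nn as nn

class AttentiveBiGRU(nn.Module):
    def __init__(self, feat_dim=128, hidden=64, n_classes=4):
        super().__init__()
        self.gru = nn.GRU(feat_dim, hidden, bidirectional=True, batch_first=True)
        self.attn = nn.Linear(2 * hidden, 1)      # score each segment
        self.cls = nn.Linear(2 * hidden, n_classes)

    def forward(self, segments):                  # (B, T, feat_dim)
        states, _ = self.gru(segments)            # (B, T, 2*hidden)
        weights = torch.softmax(self.attn(states), dim=1)
        pooled = (weights * states).sum(dim=1)    # attention-weighted pooling
        return self.cls(pooled)

logits = AttentiveBiGRU()(torch.randn(2, 10, 128))  # two utterances, ten segments
```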
- Limited Data Emotional Voice Conversion Leveraging Text-to-Speech: Two-stage Sequence-to-Sequence Training [91.95855310211176]
Emotional voice conversion aims to change the emotional state of an utterance while preserving the linguistic content and speaker identity.
We propose a novel 2-stage training strategy for sequence-to-sequence emotional voice conversion with a limited amount of emotional speech data.
The proposed framework can perform both spectrum and prosody conversion and achieves significant improvement over the state-of-the-art baselines in both objective and subjective evaluation.
arXiv Detail & Related papers (2021-03-31T04:56:14Z)
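A skeleton of the two-stage strategy consistent with the abstract: pretrain the sequence-to-sequence model on abundant neutral TTS data, then adapt it on the limited emotional set, supervising both spectrum and prosody. The loaders, model interface, loss terms, and learning rates are placeholders, not the paper's exact configuration.

```python
# Two-stage seq2seq training skeleton: TTS pretraining, then emotional
# fine-tuning; the model is assumed to predict spectrum and F0 (prosody).
import torch

def train(model, loader, optimizer, epochs):
    for _ in range(epochs):
        for src, tgt in loader:                  # (input features, targets)
            optimizer.zero_grad()
            spec, f0 = model(src)                # predict spectrum and prosody
            loss = torch.nn.functional.l1_loss(spec, tgt["spec"]) \
                 + torch.nn.functional.l1_loss(f0, tgt["f0"])
            loss.backward()
            optimizer.step()

def two_stage(model, tts_loader, emo_loader):
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    train(model, tts_loader, opt, epochs=50)     # stage 1: learn content/speaker
    for g in opt.param_groups:
        g["lr"] = 1e-4                           # smaller steps for adaptation
    train(model, emo_loader, opt, epochs=10)     # stage 2: limited emotional data
```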