HEART: A Unified Benchmark for Assessing Humans and LLMs in Emotional Support Dialogue
- URL: http://arxiv.org/abs/2601.19922v1
- Date: Fri, 09 Jan 2026 06:48:08 GMT
- Title: HEART: A Unified Benchmark for Assessing Humans and LLMs in Emotional Support Dialogue
- Authors: Laya Iyer, Kriti Aggarwal, Sanmi Koyejo, Gail Heyman, Desmond C. Ong, Subhabrata Mukherjee,
- Abstract summary: We introduce HEART, the first framework that directly compares humans and LLMs on the same multi-turn emotional-support conversations. Several frontier models approach or surpass the average human responses in perceived empathy and consistency. HEART reframes supportive dialogue as a distinct capability axis, separable from general reasoning or linguistic fluency.
- Score: 29.077783997591037
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Supportive conversation depends on skills that go beyond language fluency, including reading emotions, adjusting tone, and navigating moments of resistance, frustration, or distress. Despite rapid progress in language models, we still lack a clear way to understand how their abilities in these interpersonal domains compare to those of humans. We introduce HEART, the first-ever framework that directly compares humans and LLMs on the same multi-turn emotional-support conversations. For each dialogue history, we pair human and model responses and evaluate them through blinded human raters and an ensemble of LLM-as-judge evaluators. All assessments follow a rubric grounded in interpersonal communication science across five dimensions: Human Alignment, Empathic Responsiveness, Attunement, Resonance, and Task-Following. HEART uncovers striking behavioral patterns. Several frontier models approach or surpass the average human responses in perceived empathy and consistency. At the same time, humans maintain advantages in adaptive reframing, tension-naming, and nuanced tone shifts, particularly in adversarial turns. Human and LLM-as-judge preferences align on about 80 percent of pairwise comparisons, matching inter-human agreement, and their written rationales emphasize similar HEART dimensions. This pattern suggests an emerging convergence in the criteria used to assess supportive quality. By placing humans and models on equal footing, HEART reframes supportive dialogue as a distinct capability axis, separable from general reasoning or linguistic fluency. It provides a unified empirical foundation for understanding where model-generated support aligns with human social judgment, where it diverges, and how affective conversational competence scales with model size.
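The abstract reports that human and LLM-as-judge preferences align on about 80 percent of pairwise comparisons. A minimal sketch of that agreement computation, with illustrative labels that are not taken from the HEART release:

```python
# Hypothetical sketch: agreement rate between blinded human raters and
# LLM-as-judge evaluators on pairwise preferences. For each dialogue
# history, each evaluator prefers either response "A" or response "B".
# The data below is a toy example, not HEART's actual annotations.

def agreement_rate(human_prefs, judge_prefs):
    """Fraction of pairwise comparisons where both evaluators pick the same response."""
    assert len(human_prefs) == len(judge_prefs), "evaluators must cover the same comparisons"
    matches = sum(h == j for h, j in zip(human_prefs, judge_prefs))
    return matches / len(human_prefs)

# Toy example: 10 dialogue histories, each with a preferred response.
human_prefs = ["A", "A", "B", "A", "B", "B", "A", "A", "B", "A"]
judge_prefs = ["A", "A", "B", "B", "B", "B", "A", "A", "B", "B"]

print(agreement_rate(human_prefs, judge_prefs))  # 0.8
```

The same statistic applied between two human raters would give the inter-human agreement baseline the abstract compares against.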
Related papers
- Can LLMs Truly Embody Human Personality? Analyzing AI and Human Behavior Alignment in Dispute Resolution [7.599497643290519]
Large language models (LLMs) are increasingly used to simulate human behavior in social settings. It remains unclear whether these simulations reproduce the personality-behavior patterns observed in humans.
arXiv Detail & Related papers (2026-02-07T07:20:24Z)
- The ICASSP 2026 HumDial Challenge: Benchmarking Human-like Spoken Dialogue Systems in the LLM Era [95.35748535806744]
We launch the first Human-like Spoken Dialogue Systems Challenge (HumDial) at ICASSP 2026. This paper summarizes the dataset, track configurations, and the final results.
arXiv Detail & Related papers (2026-01-09T06:32:30Z)
- A Unified Spoken Language Model with Injected Emotional-Attribution Thinking for Human-like Interaction [50.05919688888947]
This paper presents a unified spoken language model for emotional intelligence, enhanced by a novel data construction strategy termed Injected Emotional-Attribution Thinking (IEAT). IEAT incorporates user emotional states and their underlying causes into the model's internal reasoning process, enabling emotion-aware reasoning to be internalized rather than treated as explicit supervision. Experiments on the Human-like Spoken Dialogue Systems Challenge (HumDial) Emotional Intelligence benchmark demonstrate that the proposed approach achieves top-ranked performance across emotional trajectory modeling, emotional reasoning, and empathetic response generation.
arXiv Detail & Related papers (2026-01-08T14:07:30Z)
- Computational Analysis of Conversation Dynamics through Participant Responsivity [18.116125865284666]
We develop and evaluate methods for quantifying responsivity. We evaluate both methods against a ground-truth set of human-annotated conversations. We then develop conversation-level derived metrics to address various aspects of conversational discourse.
arXiv Detail & Related papers (2025-09-19T23:13:13Z)
- Are You Listening to Me? Fine-Tuning Chatbots for Empathetic Dialogue [0.5849783371898033]
We explore how Large Language Models (LLMs) respond when tasked with generating emotionally rich interactions. We analyzed the emotional progression of the dialogues using both sentiment analysis (via VADER) and expert assessments.
arXiv Detail & Related papers (2025-07-03T11:32:41Z)
- Sentient Agent as a Judge: Evaluating Higher-Order Social Cognition in Large Language Models [75.85319609088354]
Sentient Agent as a Judge (SAGE) is an evaluation framework for large language models. SAGE instantiates a Sentient Agent that simulates human-like emotional changes and inner thoughts during interaction. SAGE provides a principled, scalable, and interpretable tool for tracking progress toward genuinely empathetic and socially adept language agents.
arXiv Detail & Related papers (2025-05-01T19:06:10Z)
- Rel-A.I.: An Interaction-Centered Approach To Measuring Human-LM Reliance [73.19687314438133]
We study how reliance is affected by contextual features of an interaction.
We find that contextual characteristics significantly affect human reliance behavior.
Our results show that calibration and language quality alone are insufficient in evaluating the risks of human-LM interactions.
arXiv Detail & Related papers (2024-07-10T18:00:05Z)
- Multi-dimensional Evaluation of Empathetic Dialog Responses [4.580983642743026]
We propose a multi-dimensional empathy evaluation framework to measure both expressed intents from the speaker's perspective and perceived empathy from the listener's perspective.
We find the two dimensions are inter-connected, while perceived empathy has high correlations with dialogue satisfaction levels.
arXiv Detail & Related papers (2024-02-18T00:32:33Z)
- Perspective-taking and Pragmatics for Generating Empathetic Responses Focused on Emotion Causes [50.569762345799354]
We argue that two issues must be tackled at the same time: (i) identifying which word is the cause for the other's emotion from his or her utterance and (ii) reflecting those specific words in the response generation.
Taking inspiration from social cognition, we leverage a generative estimator to infer emotion cause words from utterances with no word-level label.
arXiv Detail & Related papers (2021-09-18T04:22:49Z)
- You Impress Me: Dialogue Generation via Mutual Persona Perception [62.89449096369027]
The research in cognitive science suggests that understanding is an essential signal for a high-quality chit-chat conversation.
Motivated by this, we propose P2 Bot, a transmitter-receiver based framework with the aim of explicitly modeling understanding.
arXiv Detail & Related papers (2020-04-11T12:51:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.