ESC-Eval: Evaluating Emotion Support Conversations in Large Language Models
- URL: http://arxiv.org/abs/2406.14952v3
- Date: Mon, 28 Oct 2024 13:25:49 GMT
- Title: ESC-Eval: Evaluating Emotion Support Conversations in Large Language Models
- Authors: Haiquan Zhao, Lingyu Li, Shisong Chen, Shuqi Kong, Jiaan Wang, Kexin Huang, Tianle Gu, Yixu Wang, Wang Jian, Dandan Liang, Zhixu Li, Yan Teng, Yanghua Xiao, Yingchun Wang,
- Abstract summary: Emotion Support Conversation (ESC) aims to reduce human stress, offer emotional guidance, and enhance human mental and physical well-being.
We propose an ESC Evaluation framework (ESC-Eval), which uses a role-playing agent to interact with ESC models.
We conduct comprehensive human annotations on interactive multi-turn dialogues of different ESC models.
- Score: 55.301188787490545
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Emotion Support Conversation (ESC) is a crucial application, which aims to reduce human stress, offer emotional guidance, and ultimately enhance human mental and physical well-being. With the advancement of Large Language Models (LLMs), many researchers have employed LLMs as the ESC models. However, the evaluation of these LLM-based ESCs remains uncertain. Inspired by the awesome development of role-playing agents, we propose an ESC Evaluation framework (ESC-Eval), which uses a role-playing agent to interact with ESC models, followed by a manual evaluation of the interactive dialogues. In detail, we first re-organize 2,801 role-playing cards from seven existing datasets to define the roles of the role-playing agent. Second, we train a specific role-playing model called ESC-Role which behaves more like a confused person than GPT-4. Third, through ESC-Role and organized role cards, we systematically conduct experiments using 14 LLMs as the ESC models, including general AI-assistant LLMs (ChatGPT) and ESC-oriented LLMs (ExTES-Llama). We conduct comprehensive human annotations on interactive multi-turn dialogues of different ESC models. The results show that ESC-oriented LLMs exhibit superior ESC abilities compared to general AI-assistant LLMs, but there is still a gap behind human performance. Moreover, to automate the scoring process for future ESC models, we developed ESC-RANK, which trained on the annotated data, achieving a scoring performance surpassing 35 points of GPT-4. Our data and code are available at https://github.com/AIFlames/Esc-Eval.
Related papers
- RMTBench: Benchmarking LLMs Through Multi-Turn User-Centric Role-Playing [111.06936588273868]
RMTBench is a comprehensive textbfuser-centric bilingual role-playing benchmark featuring 80 diverse characters and over 8,000 dialogue rounds.<n>Our benchmark constructs dialogues based on explicit user motivations rather than character descriptions, ensuring alignment with practical user applications.<n>By shifting focus from character background to user intention fulfillment, RMTBench bridges the gap between academic evaluation and practical deployment requirements.
arXiv Detail & Related papers (2025-07-27T16:49:47Z) - ESC-Judge: A Framework for Comparing Emotional Support Conversational Agents [2.3020018305241337]
We present ESC-Judge, the first end-to-end evaluation framework for large language models (LLMs)<n> ESC-Judge grounds head-to-head comparisons of emotional-support LLMs in Clara Hill's established Exploration-Insight-Action counseling model.<n>All code, prompts, synthetic roles, transcripts, and judgment scripts are released to promote transparent progress in emotionally supportive AI.
arXiv Detail & Related papers (2025-05-18T20:04:59Z) - Convert Language Model into a Value-based Strategic Planner [11.070654717643816]
Emotional support conversation (ESC) aims to alleviate the emotional distress of individuals through effective conversations.<n>We propose a framework called straQ* to define the diagram from the state model perspective.<n>Our framework allows a plug-and-play LLM to bootstrap the planning during ESC, determine the optimal strategy based on long-term returns, and finally guide the LLM to response.
arXiv Detail & Related papers (2025-05-11T14:13:58Z) - FiSMiness: A Finite State Machine Based Paradigm for Emotional Support Conversations [11.718316719735832]
Emotional support conversation (ESC) aims to alleviate the emotional distress of individuals through effective conversations.
We leverage the Finite State Machine (FSM) on large language models and propose a framework called FiSMiness.
Our framework allows a single LLM to bootstrap the planning during ESC, and self-reason the seeker's emotion, support strategy and the final response upon each conversational turn.
arXiv Detail & Related papers (2025-04-16T07:52:06Z) - MIRAGE: Exploring How Large Language Models Perform in Complex Social Interactive Environments [0.0]
This paper introduces the Multiverse Interactive Role-play Ability General Evaluation (MIRAGE)
MIRAGE is a framework designed to assess Large Language Models' proficiency in portraying advanced human behaviors through murder mystery games.
Our experiments indicate that even popular models like GPT-4 face significant challenges in navigating the complexities presented by the MIRAGE.
arXiv Detail & Related papers (2025-01-03T06:07:48Z) - Thinking Before Speaking: A Role-playing Model with Mindset [0.6428333375712125]
Large Language Models (LLMs) are skilled at simulating human behaviors.
These models tend to perform poorly when confronted with knowledge that the assumed role does not possess.
We propose a Thinking Before Speaking (TBS) model in this paper.
arXiv Detail & Related papers (2024-09-14T02:41:48Z) - DuetSim: Building User Simulator with Dual Large Language Models for Task-Oriented Dialogues [7.765092134290888]
This paper introduces DuetSim, a novel framework designed to address the intricate demands of task-oriented dialogues by leveraging large language models.
DuetSim stands apart from conventional approaches by employing two LLMs in tandem: one dedicated to response generation and the other focused on verification.
We validate the efficacy of our method through extensive experiments conducted on the MultiWOZ dataset, highlighting improvements in response quality and correctness.
arXiv Detail & Related papers (2024-05-16T06:24:31Z) - From Persona to Personalization: A Survey on Role-Playing Language Agents [52.783043059715546]
Recent advancements in large language models (LLMs) have boosted the rise of Role-Playing Language Agents (RPLAs)
RPLAs achieve a remarkable sense of human likeness and vivid role-playing performance.
They have catalyzed numerous AI applications, such as emotional companions, interactive video games, personalized assistants and copilots.
arXiv Detail & Related papers (2024-04-28T15:56:41Z) - Knowledgeable Agents by Offline Reinforcement Learning from Large Language Model Rollouts [10.929547354171723]
This paper introduces Knowledgeable Agents from Language Model Rollouts (KALM)
It extracts knowledge from large language models (LLMs) in the form of imaginary rollouts that can be easily learned by the agent through offline reinforcement learning methods.
It achieves a success rate of 46% in executing tasks with unseen goals, substantially surpassing the 26% success rate achieved by baseline methods.
arXiv Detail & Related papers (2024-04-14T13:19:40Z) - FEEL: A Framework for Evaluating Emotional Support Capability with Large Language Models [14.894922829587841]
Emotional Support Conversation (ESC) is a typical dialogue that can effectively assist the user in mitigating emotional pressures.
Current non-artificial methodologies face challenges in effectively appraising the emotional support capability.
We propose a novel model FEEL, employing Large Language Models (LLMs) as evaluators to assess emotional support capabilities.
arXiv Detail & Related papers (2024-03-23T03:32:26Z) - LanguageMPC: Large Language Models as Decision Makers for Autonomous
Driving [87.1164964709168]
This work employs Large Language Models (LLMs) as a decision-making component for complex autonomous driving scenarios.
Extensive experiments demonstrate that our proposed method not only consistently surpasses baseline approaches in single-vehicle tasks, but also helps handle complex driving behaviors even multi-vehicle coordination.
arXiv Detail & Related papers (2023-10-04T17:59:49Z) - How Can Recommender Systems Benefit from Large Language Models: A Survey [82.06729592294322]
Large language models (LLM) have shown impressive general intelligence and human-like capabilities.
We conduct a comprehensive survey on this research direction from the perspective of the whole pipeline in real-world recommender systems.
arXiv Detail & Related papers (2023-06-09T11:31:50Z) - Rethinking the Evaluation for Conversational Recommendation in the Era
of Large Language Models [115.7508325840751]
The recent success of large language models (LLMs) has shown great potential to develop more powerful conversational recommender systems (CRSs)
In this paper, we embark on an investigation into the utilization of ChatGPT for conversational recommendation, revealing the inadequacy of the existing evaluation protocol.
We propose an interactive Evaluation approach based on LLMs named iEvaLM that harnesses LLM-based user simulators.
arXiv Detail & Related papers (2023-05-22T15:12:43Z) - Improving Multi-turn Emotional Support Dialogue Generation with
Lookahead Strategy Planning [81.79431311952656]
We propose a novel system MultiESC to provide Emotional Support.
For strategy planning, we propose lookaheads to estimate the future user feedback after using particular strategies.
For user state modeling, MultiESC focuses on capturing users' subtle emotional expressions and understanding their emotion causes.
arXiv Detail & Related papers (2022-10-09T12:23:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.