Related papers: FEEL: A Framework for Evaluating Emotional Support Capability with Large Language Models

FEEL: A Framework for Evaluating Emotional Support Capability with Large Language Models

URL: http://arxiv.org/abs/2403.15699v3
Date: Sun, 21 Jul 2024 13:27:02 GMT
Title: FEEL: A Framework for Evaluating Emotional Support Capability with Large Language Models
Authors: Huaiwen Zhang, Yu Chen, Ming Wang, Shi Feng,
Abstract summary: Emotional Support Conversation (ESC) is a typical dialogue that can effectively assist the user in mitigating emotional pressures. Current non-artificial methodologies face challenges in effectively appraising the emotional support capability. We propose a novel model FEEL, employing Large Language Models (LLMs) as evaluators to assess emotional support capabilities.
Score: 14.894922829587841
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Emotional Support Conversation (ESC) is a typical dialogue that can effectively assist the user in mitigating emotional pressures. However, owing to the inherent subjectivity involved in analyzing emotions, current non-artificial methodologies face challenges in effectively appraising the emotional support capability. These metrics exhibit a low correlation with human judgments. Concurrently, manual evaluation methods extremely will cause high costs. To solve these problems, we propose a novel model FEEL (Framework for Evaluating Emotional Support Capability with Large Lan-guage Models), employing Large Language Models (LLMs) as evaluators to assess emotional support capabilities. The model meticulously considers various evaluative aspects of ESC to apply a more comprehensive and accurate evaluation method for ESC. Additionally, it employs a probability distribution approach for a more stable result and integrates an ensemble learning strategy, leveraging multiple LLMs with assigned weights to enhance evaluation accuracy. To appraise the performance of FEEL, we conduct extensive experiments on existing ESC model dialogues. Experimental results demonstrate our model exhibits a substantial enhancement in alignment with human evaluations compared to the baselines. Our source code is available at https://github.com/Ansisy/FEEL.

Related papers

Detecting Emotional Dynamic Trajectories: An Evaluation Framework for Emotional Support in Language Models [6.810484095299127]
Emotional support is a core capability in human-AI interaction, with applications including psychological counseling, role play, and companionship.<n>Existing evaluations of large language models (LLMs) often rely on short, static dialogues and fail to capture the dynamic and long-term nature of emotional support.<n>Our framework constructs a large-scale benchmark consisting of 328 emotional contexts and 1,152 disturbance events, simulating realistic emotional shifts under evolving dialogue scenarios.
arXiv Detail & Related papers (2025-11-12T05:47:28Z)
Towards Open-Ended Emotional Support Conversations in LLMs via Reinforcement Learning with Future-Oriented Rewards [13.938394655357916]
Emotional Support Conversation systems aim to alleviate users' emotional difficulties and provide long-term, systematic support for emotional well-being.<n>Most large language model (LLM)-based ESC systems rely on predefined strategies, which limits their effectiveness in complex, real-life scenarios.<n>This paper introduces a novel end-to-end framework (RLFF-ESC) that directly learns enduring emotionally supportive response skills using reinforcement learning.
arXiv Detail & Related papers (2025-08-18T14:04:26Z)
Emotional Support with LLM-based Empathetic Dialogue Generation [5.289702620838033]
This paper presents our solution for the NLPCC 2025 Task 8 ESC evaluation.<n>We leverage large-scale language models enhanced by prompt engineering and finetuning techniques.
arXiv Detail & Related papers (2025-07-17T06:24:20Z)
IntentionESC: An Intention-Centered Framework for Enhancing Emotional Support in Dialogue Systems [74.0855067343594]
In emotional support conversations, unclear intentions can lead supporters to employ inappropriate strategies.<n>We propose the Intention-centered Emotional Support Conversation framework.<n>It defines the possible intentions of supporters, identifies key emotional state aspects for inferring these intentions, and maps them to appropriate support strategies.
arXiv Detail & Related papers (2025-06-06T10:14:49Z)
HREF: Human Response-Guided Evaluation of Instruction Following in Language Models [61.273153125847166]
We develop a new evaluation benchmark, Human Response-Guided Evaluation of Instruction Following (HREF) In addition to providing reliable evaluation, HREF emphasizes individual task performance and is free from contamination. We study the impact of key design choices in HREF, including the size of the evaluation set, the judge model, the baseline model, and the prompt template.
arXiv Detail & Related papers (2024-12-20T03:26:47Z)
NLP and Education: using semantic similarity to evaluate filled gaps in a large-scale Cloze test in the classroom [0.0]
Using data from Cloze tests administered to students in Brazil, WE models for Brazilian Portuguese (PT-BR) were employed to measure semantic similarity. A comparative analysis between the WE models' scores and the judges' evaluations revealed that GloVe was the most effective model.
arXiv Detail & Related papers (2024-11-02T15:22:26Z)
AERA Chat: An Interactive Platform for Automated Explainable Student Answer Assessment [15.969280805269976]
AERA Chat is an interactive visualization platform designed for automated explainable student answer assessment.<n>AERA Chat leverages multiple language models (LLMs) to concurrently score student answers and generate explanatory rationales.<n>We demonstrate the effectiveness of our platform through evaluations of multiple rationale-generation methods on several datasets.
arXiv Detail & Related papers (2024-10-12T11:57:53Z)
Adversarial Multi-Agent Evaluation of Large Language Models through Iterative Debates [0.0]
We propose a framework that interprets large language models (LLMs) as advocates within an ensemble of interacting agents. This approach offers a more dynamic and comprehensive evaluation process compared to traditional human-based assessments or automated metrics.
arXiv Detail & Related papers (2024-10-07T00:22:07Z)
ConSiDERS-The-Human Evaluation Framework: Rethinking Human Evaluation for Generative Large Language Models [53.00812898384698]
We argue that human evaluation of generative large language models (LLMs) should be a multidisciplinary undertaking. We highlight how cognitive biases can conflate fluent information and truthfulness, and how cognitive uncertainty affects the reliability of rating scores such as Likert. We propose the ConSiDERS-The-Human evaluation framework consisting of 6 pillars -- Consistency, Scoring Criteria, Differentiating, User Experience, Responsible, and Scalability.
arXiv Detail & Related papers (2024-05-28T22:45:28Z)
F-Eval: Assessing Fundamental Abilities with Refined Evaluation Methods [102.98899881389211]
We propose F-Eval, a bilingual evaluation benchmark to evaluate the fundamental abilities, including expression, commonsense and logic. For reference-free subjective tasks, we devise new evaluation methods, serving as alternatives to scoring by API models.
arXiv Detail & Related papers (2024-01-26T13:55:32Z)
Harnessing the Power of Large Language Models for Empathetic Response Generation: Empirical Investigations and Improvements [28.630542719519855]
This work empirically investigates the performance of large language models (LLMs) in generating empathetic responses. Extensive experiments show that LLMs can significantly benefit from our proposed methods and is able to achieve state-of-the-art performance in both automatic and human evaluations.
arXiv Detail & Related papers (2023-10-08T12:21:24Z)
Building Emotional Support Chatbots in the Era of LLMs [64.06811786616471]
We introduce an innovative methodology that synthesizes human insights with the computational prowess of Large Language Models (LLMs) By utilizing the in-context learning potential of ChatGPT, we generate an ExTensible Emotional Support dialogue dataset, named ExTES. Following this, we deploy advanced tuning techniques on the LLaMA model, examining the impact of diverse training strategies, ultimately yielding an LLM meticulously optimized for emotional support interactions.
arXiv Detail & Related papers (2023-08-17T10:49:18Z)
Rethinking Model Evaluation as Narrowing the Socio-Technical Gap [47.632123167141245]
We argue that model evaluation practices must take on a critical task to cope with the challenges and responsibilities brought by this homogenization. We urge the community to develop evaluation methods based on real-world contexts and human requirements.
arXiv Detail & Related papers (2023-06-01T00:01:43Z)
Rethinking the Evaluation for Conversational Recommendation in the Era of Large Language Models [115.7508325840751]
The recent success of large language models (LLMs) has shown great potential to develop more powerful conversational recommender systems (CRSs) In this paper, we embark on an investigation into the utilization of ChatGPT for conversational recommendation, revealing the inadequacy of the existing evaluation protocol. We propose an interactive Evaluation approach based on LLMs named iEvaLM that harnesses LLM-based user simulators.
arXiv Detail & Related papers (2023-05-22T15:12:43Z)
On the Robustness of Aspect-based Sentiment Analysis: Rethinking Model, Data, and Training [109.9218185711916]
Aspect-based sentiment analysis (ABSA) aims at automatically inferring the specific sentiment polarities toward certain aspects of products or services behind social media texts or reviews. We propose to enhance the ABSA robustness by systematically rethinking the bottlenecks from all possible angles, including model, data, and training.
arXiv Detail & Related papers (2023-04-19T11:07:43Z)
Towards Automatic Evaluation of Dialog Systems: A Model-Free Off-Policy Evaluation Approach [84.02388020258141]
We propose a new framework named ENIGMA for estimating human evaluation scores based on off-policy evaluation in reinforcement learning. ENIGMA only requires a handful of pre-collected experience data, and therefore does not involve human interaction with the target policy during the evaluation. Our experiments show that ENIGMA significantly outperforms existing methods in terms of correlation with human evaluation scores.
arXiv Detail & Related papers (2021-02-20T03:29:20Z)

This list is automatically generated from the titles and abstracts of the papers in this site.