ESC-Judge: A Framework for Comparing Emotional Support Conversational Agents
- URL: http://arxiv.org/abs/2505.12531v1
- Date: Sun, 18 May 2025 20:04:59 GMT
- Title: ESC-Judge: A Framework for Comparing Emotional Support Conversational Agents
- Authors: Navid Madani, Rohini Srihari
- Abstract summary: We present ESC-Judge, the first end-to-end evaluation framework for emotional-support large language models (LLMs). ESC-Judge grounds head-to-head comparisons of emotional-support LLMs in Clara Hill's established Exploration-Insight-Action counseling model. All code, prompts, synthetic roles, transcripts, and judgment scripts are released to promote transparent progress in emotionally supportive AI.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) increasingly power mental-health chatbots, yet the field still lacks a scalable, theory-grounded way to decide which model is most effective to deploy. We present ESC-Judge, the first end-to-end evaluation framework that (i) grounds head-to-head comparisons of emotional-support LLMs in Clara Hill's established Exploration-Insight-Action counseling model, providing a structured and interpretable view of performance, and (ii) fully automates the evaluation pipeline at scale. ESC-Judge operates in three stages: first, it synthesizes realistic help-seeker roles by sampling empirically salient attributes such as stressors, personality, and life history; second, it has two candidate support agents conduct separate sessions with the same role, isolating model-specific strategies; and third, it asks a specialized judge LLM to express pairwise preferences across rubric-anchored skills that span the Exploration, Insight, and Action spectrum. In our study, ESC-Judge matched PhD-level annotators on 85 percent of Exploration, 83 percent of Insight, and 86 percent of Action decisions, demonstrating human-level reliability at a fraction of the cost. All code, prompts, synthetic roles, transcripts, and judgment scripts are released to promote transparent progress in emotionally supportive AI.
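The three-stage pipeline described in the abstract can be sketched in code. This is a minimal illustrative mock-up, not the released implementation: the attribute pools, function names, and the stubbed judge are hypothetical stand-ins for the paper's role synthesis, session, and judging prompts.

```python
import random

# Hypothetical attribute pools; the paper samples empirically salient
# attributes such as stressors, personality, and life history.
STRESSORS = ["job loss", "a breakup", "exam anxiety"]
PERSONALITIES = ["introverted", "outgoing", "avoidant"]

def synthesize_role(rng):
    # Stage 1: sample attributes into a synthetic help-seeker role.
    return {"stressor": rng.choice(STRESSORS),
            "personality": rng.choice(PERSONALITIES)}

def run_session(agent, role, turns=3):
    # Stage 2: each candidate agent holds its own session with the SAME role,
    # isolating model-specific support strategies.
    transcript = []
    for _ in range(turns):
        seeker_msg = (f"[seeker/{role['personality']}] "
                      f"I'm struggling with {role['stressor']}.")
        transcript.append(("seeker", seeker_msg))
        transcript.append(("agent", agent(seeker_msg)))
    return transcript

def judge(transcript_a, transcript_b,
          skills=("Exploration", "Insight", "Action")):
    # Stage 3: a judge model expresses pairwise preferences per rubric skill.
    # Stubbed to a fixed preference here; the real judge prompts an LLM with
    # rubric-anchored criteria for each skill.
    return {skill: "A" for skill in skills}

rng = random.Random(0)
role = synthesize_role(rng)
agent_a = lambda msg: "Tell me more about how that feels."  # exploration-heavy
agent_b = lambda msg: "Here is a step-by-step plan."        # action-heavy
verdict = judge(run_session(agent_a, role), run_session(agent_b, role))
print(verdict)
```

The key design point the sketch preserves is that both agents face the identical synthetic role, so any preference the judge expresses reflects the agents' strategies rather than differences in the help-seeker.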
Related papers
- DICE: Dynamic In-Context Example Selection in LLM Agents via Efficient Knowledge Transfer [50.64531021352504]
Large language model-based agents, empowered by in-context learning (ICL), have demonstrated strong capabilities in complex reasoning and tool-use tasks. Existing approaches typically rely on example selection, including in some agentic or multi-step settings. We propose DICE, a theoretically grounded ICL framework for agentic tasks that selects the most relevant demonstrations at each step of reasoning.
arXiv Detail & Related papers (2025-07-31T13:42:14Z)
- Towards a Cascaded LLM Framework for Cost-effective Human-AI Decision-Making [55.2480439325792]
We present a cascaded LLM decision framework that adaptively delegates tasks across multiple tiers of expertise. First, a deferral policy determines whether to accept the base model's answer or regenerate it with the large model. Second, an abstention policy decides whether the cascade model response is sufficiently certain or requires human intervention.
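The two-policy cascade described above can be sketched as follows. This is an illustrative routing skeleton under stated assumptions: the confidence scores, thresholds, and toy models are hypothetical, not taken from the paper.

```python
def cascade(query, base_model, large_model, human,
            defer_threshold=0.7, abstain_threshold=0.9):
    # Each model returns (answer, confidence); confidences are assumed
    # calibrated for this sketch.
    answer, conf = base_model(query)
    # Deferral policy: regenerate with the large model if the base is unsure.
    if conf < defer_threshold:
        answer, conf = large_model(query)
    # Abstention policy: escalate to a human if the cascade is still unsure.
    if conf < abstain_threshold:
        return human(query), "human"
    return answer, "model"

# Toy models with fixed confidences to show the routing.
base = lambda q: ("base answer", 0.5)
large = lambda q: ("large answer", 0.95)
person = lambda q: "human answer"
print(cascade("q", base, large, person))  # ('large answer', 'model')
```

Here the base model's low confidence (0.5) triggers deferral to the large model, whose answer clears the abstention threshold, so no human intervention is needed.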
arXiv Detail & Related papers (2025-06-13T15:36:22Z)
- The Pragmatic Mind of Machines: Tracing the Emergence of Pragmatic Competence in Large Language Models [6.187227278086245]
Large language models (LLMs) have demonstrated emerging capabilities in social intelligence, including implicature resolution and theory-of-mind reasoning. In this work, we evaluate whether LLMs at different training stages can accurately infer speaker intentions. We systematically evaluate 22 LLMs across 3 key training stages: after pre-training, supervised fine-tuning (SFT), and preference optimization.
arXiv Detail & Related papers (2025-05-24T04:24:59Z)
- Ψ-Arena: Interactive Assessment and Optimization of LLM-based Psychological Counselors with Tripartite Feedback [51.26493826461026]
We propose Psi-Arena, an interactive framework for comprehensive assessment and optimization of LLM-based psychological counselors. Psi-Arena features realistic arena interactions that simulate real-world counseling through multi-stage dialogues with psychologically profiled NPC clients. Experiments across eight state-of-the-art LLMs show significant performance variations across different real-world scenarios and evaluation perspectives.
arXiv Detail & Related papers (2025-05-06T08:22:51Z)
- A Survey of Frontiers in LLM Reasoning: Inference Scaling, Learning to Reason, and Agentic Systems [93.8285345915925]
Reasoning is a fundamental cognitive process that enables logical inference, problem-solving, and decision-making. With the rapid advancement of large language models (LLMs), reasoning has emerged as a key capability that distinguishes advanced AI systems. We categorize existing methods along two dimensions: (1) Regimes, which define the stage at which reasoning is achieved; and (2) Architectures, which determine the components involved in the reasoning process.
arXiv Detail & Related papers (2025-04-12T01:27:49Z)
- EssayJudge: A Multi-Granular Benchmark for Assessing Automated Essay Scoring Capabilities of Multimodal Large Language Models [19.271790170055375]
EssayJudge offers precise, context-rich evaluations without manual feature engineering, addressing longstanding AES limitations. Our experiments with 18 representative MLLMs reveal gaps in AES performance compared to human evaluation, particularly in discourse-level traits.
arXiv Detail & Related papers (2025-02-17T15:31:59Z)
- Reflection-Bench: Evaluating Epistemic Agency in Large Language Models [10.801745760525838]
Epistemic agency is the ability to flexibly construct, adapt, and monitor beliefs about dynamic environments. We propose Reflection-Bench, a benchmark consisting of seven tasks with long-term relevance and minimization of data leakage. Our findings suggest several promising research directions, including enhancing core cognitive functions, improving cross-functional coordination, and developing adaptive processing mechanisms.
arXiv Detail & Related papers (2024-10-21T17:59:50Z)
- ESC-Eval: Evaluating Emotion Support Conversations in Large Language Models [55.301188787490545]
Emotion Support Conversation (ESC) aims to reduce human stress, offer emotional guidance, and enhance human mental and physical well-being.
We propose an ESC Evaluation framework (ESC-Eval), which uses a role-playing agent to interact with ESC models.
We conduct comprehensive human annotations on interactive multi-turn dialogues of different ESC models.
arXiv Detail & Related papers (2024-06-21T08:03:33Z)
- FEEL: A Framework for Evaluating Emotional Support Capability with Large Language Models [14.894922829587841]
Emotional Support Conversation (ESC) is a typical dialogue that can effectively assist the user in mitigating emotional pressures.
Current human-based evaluation methodologies face challenges in effectively appraising emotional support capability.
We propose a novel model FEEL, employing Large Language Models (LLMs) as evaluators to assess emotional support capabilities.
arXiv Detail & Related papers (2024-03-23T03:32:26Z)
- AgentBoard: An Analytical Evaluation Board of Multi-turn LLM Agents [74.16170899755281]
We introduce AgentBoard, a pioneering comprehensive benchmark and accompanying open-source evaluation framework tailored to the analytical evaluation of LLM agents. AgentBoard offers a fine-grained progress-rate metric that captures incremental advancements, as well as a comprehensive evaluation toolkit. This not only sheds light on the capabilities and limitations of LLM agents but also brings the interpretability of their performance to the forefront.
arXiv Detail & Related papers (2024-01-24T01:51:00Z)
- Better Zero-Shot Reasoning with Role-Play Prompting [10.90357246745529]
Role-play prompting consistently surpasses the standard zero-shot approach across most datasets.
This highlights its potential to augment the reasoning capabilities of large language models.
arXiv Detail & Related papers (2023-08-15T11:08:30Z)