CARE-Bench: A Benchmark of Diverse Client Simulations Guided by Expert Principles for Evaluating LLMs in Psychological Counseling
- URL: http://arxiv.org/abs/2511.09407v1
- Date: Thu, 13 Nov 2025 01:52:38 GMT
- Title: CARE-Bench: A Benchmark of Diverse Client Simulations Guided by Expert Principles for Evaluating LLMs in Psychological Counseling
- Authors: Bichen Wang, Yixin Sun, Junzhe Wang, Hao Yang, Xing Fu, Yanyan Zhao, Si Wei, Shijin Wang, Bing Qin
- Abstract summary: We introduce CARE-Bench, a dynamic and interactive automated benchmark. It is built upon diverse client profiles derived from real-world counseling cases and simulated according to expert guidelines. CARE-Bench provides a multidimensional performance evaluation grounded in established psychological scales.
- Score: 44.86705916946909
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The mismatch between the growing demand for psychological counseling and the limited availability of services has motivated research into the application of Large Language Models (LLMs) in this domain. Consequently, there is a need for a robust and unified benchmark to assess the counseling competence of various LLMs. Existing works, however, are limited by unprofessional client simulation, static question-and-answer evaluation formats, and unidimensional metrics. These limitations hinder their effectiveness in assessing a model's comprehensive ability to handle diverse and complex clients. To address this gap, we introduce CARE-Bench, a dynamic and interactive automated benchmark. It is built upon diverse client profiles derived from real-world counseling cases and simulated according to expert guidelines. CARE-Bench provides a multidimensional performance evaluation grounded in established psychological scales. Using CARE-Bench, we evaluate several general-purpose LLMs and specialized counseling models, revealing their current limitations. In collaboration with psychologists, we conduct a detailed analysis of the reasons for LLMs' failures when interacting with clients of different types, which provides directions for developing more comprehensive, universal, and effective counseling models.
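To make the kind of dynamic, interactive evaluation described in the abstract concrete, the sketch below shows one plausible way such a loop could be wired up: a simulator LLM plays a client built from a profile and expert simulation rules, the model under test plays the counselor, and a judge model scores the resulting transcript along several dimensions. All identifiers (`ClientProfile`, `chat`, `DIMENSIONS`, `run_session`, `score_session`) are hypothetical illustrations and not CARE-Bench's actual interface or scoring rubric.

```python
# A minimal, hypothetical sketch of a dynamic client-simulation evaluation loop,
# assuming a generic chat-completion backend. Names and dimensions are illustrative.
from dataclasses import dataclass

@dataclass
class ClientProfile:
    background: str        # case description derived from a real-world counseling case
    presenting_issue: str  # e.g. "work-related anxiety"
    simulation_rules: str  # expert guidelines constraining how the simulated client behaves

# Illustrative scoring dimensions standing in for items from psychological scales.
DIMENSIONS = ["empathy", "goal_alignment", "technique_use", "safety"]

def chat(system_prompt: str, history: list[dict]) -> str:
    """Placeholder for a real LLM chat call (API client or local model); returns a stub reply."""
    return "(model reply)"

def run_session(profile: ClientProfile, counselor_system: str, max_turns: int = 10) -> list[dict]:
    """Alternate client-simulator and counselor turns to produce a counseling transcript."""
    client_system = (
        f"You are a counseling client.\nBackground: {profile.background}\n"
        f"Presenting issue: {profile.presenting_issue}\n"
        f"Follow these simulation rules: {profile.simulation_rules}"
    )
    history: list[dict] = []
    for _ in range(max_turns):
        client_msg = chat(client_system, history)
        history.append({"role": "client", "content": client_msg})
        counselor_msg = chat(counselor_system, history)
        history.append({"role": "counselor", "content": counselor_msg})
    return history

def score_session(history: list[dict]) -> dict[str, float]:
    """Have a judge model rate the counselor on each dimension (placeholder parsing)."""
    judge_prompt = "Rate the counselor's performance (1-5) on: " + ", ".join(DIMENSIONS)
    _ = chat(judge_prompt, history)  # in practice, parse the judge's structured reply
    return {dim: 0.0 for dim in DIMENSIONS}
```

A driver script would iterate `run_session` over a pool of diverse client profiles and aggregate the per-dimension outputs of `score_session` to compare counselor models.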
Related papers
- Can Small Language Models Handle Context-Summarized Multi-Turn Customer-Service QA? A Synthetic Data-Driven Comparative Evaluation [0.0]
Small Language Models (SLMs) provide a more efficient alternative to Large Language Models (LLMs). This study investigates instruction-tuned SLMs for context-summarized multi-turn customer-service QA.
arXiv Detail & Related papers (2026-01-31T11:27:25Z) - Automated Benchmark Generation from Domain Guidelines Informed by Bloom's Taxonomy [28.293009223912602]
Open-ended question answering (QA) evaluates a model's ability to perform contextualized reasoning beyond factual recall. This challenge is especially acute in practice-based domains, where knowledge is procedural and grounded in professional judgment. We introduce a framework for automated benchmark generation from expert-authored guidelines informed by Bloom's taxonomy.
arXiv Detail & Related papers (2026-01-28T05:01:11Z) - PsyCLIENT: Client Simulation via Conversational Trajectory Modeling for Trainee Practice and Model Evaluation in Mental Health Counseling [26.381095576860925]
PsyCLIENT is a novel simulation framework grounded in conversational trajectory modeling. We introduce PsyCLIENT-CP, the first open-source Chinese client profile dataset. Code and data will be released to facilitate future research in mental health counseling.
arXiv Detail & Related papers (2026-01-12T08:33:05Z) - OutboundEval: A Dual-Dimensional Benchmark for Expert-Level Intelligent Outbound Evaluation of Xbench's Professional-Aligned Series [36.88936933010042]
OutboundEval is a comprehensive benchmark for evaluating large language models (LLMs) in intelligent outbound calling scenarios. We design a benchmark spanning six major business domains and 30 representative sub-scenarios, each with scenario-specific process decomposition, weighted scoring, and domain-adaptive metrics. We introduce a dynamic evaluation method that adapts to task variations, integrating automated and human-in-the-loop assessment to measure task execution accuracy, professional knowledge application, adaptability, and user experience quality.
arXiv Detail & Related papers (2025-10-24T08:27:58Z) - ProfBench: Multi-Domain Rubrics requiring Professional Knowledge to Answer and Judge [94.40918390309186]
Evaluating progress in large language models (LLMs) is often constrained by the challenge of verifying responses. We introduce ProfBench: a set of over 7000 response-criterion pairs as evaluated by human experts. Our findings reveal that ProfBench poses significant challenges even for state-of-the-art LLMs.
arXiv Detail & Related papers (2025-10-21T17:59:44Z) - Expert Preference-based Evaluation of Automated Related Work Generation [54.29459509574242]
We propose GREP, a multi-turn evaluation framework that integrates classical related work evaluation criteria with expert-specific preferences. For better accessibility, we design two variants of GREP: a more precise variant with proprietary LLMs as evaluators, and a cheaper alternative with open-weight LLMs.
arXiv Detail & Related papers (2025-08-11T13:08:07Z) - Ψ-Arena: Interactive Assessment and Optimization of LLM-based Psychological Counselors with Tripartite Feedback [51.26493826461026]
We propose Psi-Arena, an interactive framework for comprehensive assessment and optimization of large language models (LLMs) as psychological counselors. Psi-Arena features realistic arena interactions that simulate real-world counseling through multi-stage dialogues with psychologically profiled NPC clients. Experiments across eight state-of-the-art LLMs show significant performance variations in different real-world scenarios and evaluation perspectives.
arXiv Detail & Related papers (2025-05-06T08:22:51Z) - One for All: A General Framework of LLMs-based Multi-Criteria Decision Making on Human Expert Level [7.755152930120769]
We propose an evaluation framework to automatically deal with general complex MCDM problems. Within the framework, we assess the performance of various typical open-source models, as well as commercial models such as Claude and ChatGPT. The experimental results show that the accuracy rates for different applications improve significantly to around 95%, and the performance differences between models are trivial.
arXiv Detail & Related papers (2025-02-17T06:47:20Z) - ACEBench: Who Wins the Match Point in Tool Usage? [86.79310356779108]
ACEBench is a comprehensive benchmark for assessing tool usage in Large Language Models (LLMs). It categorizes data into three primary types based on evaluation methodology: Normal, Special, and Agent. It provides a more granular examination of error causes across different data types.
arXiv Detail & Related papers (2025-01-22T12:59:08Z) - LlaMADRS: Prompting Large Language Models for Interview-Based Depression Assessment [75.44934940580112]
This study introduces LlaMADRS, a novel framework leveraging open-source Large Language Models (LLMs) to automate depression severity assessment. We employ a zero-shot prompting strategy with carefully designed cues to guide the model in interpreting and scoring transcribed clinical interviews. Our approach, tested on 236 real-world interviews, demonstrates strong correlations with clinician assessments.
arXiv Detail & Related papers (2025-01-07T08:49:04Z) - Optimizing Large Language Models for Dynamic Constraints through Human-in-the-Loop Discriminators [0.0]
Large Language Models (LLMs) have recently demonstrated impressive capabilities across various real-world applications.
We propose a flexible framework that enables LLMs to interact with system interfaces, summarize constraint concepts, and continually optimize performance metrics.
Our framework achieved a 7.78% pass rate with the human discriminator and a 6.11% pass rate with the LLM-based discriminator.
arXiv Detail & Related papers (2024-10-19T17:27:38Z) - Interactive Agents: Simulating Counselor-Client Psychological Counseling via Role-Playing LLM-to-LLM Interactions [12.455050661682051]
We propose a framework that employs two large language models (LLMs) via role-playing for simulating counselor-client interactions.
Our framework involves two LLMs, one acting as a client equipped with a specific and real-life user profile and the other playing the role of an experienced counselor.
arXiv Detail & Related papers (2024-08-28T13:29:59Z) - Evaluating Large Language Models with Psychometrics [59.821829073478376]
This paper offers a comprehensive benchmark for quantifying psychological constructs of Large Language Models (LLMs). Our work identifies five key psychological constructs -- personality, values, emotional intelligence, theory of mind, and self-efficacy -- assessed through a suite of 13 datasets. We uncover significant discrepancies between LLMs' self-reported traits and their response patterns in real-world scenarios, revealing complexities in their behaviors.
arXiv Detail & Related papers (2024-06-25T16:09:08Z)