Ψ-Arena: Interactive Assessment and Optimization of LLM-based Psychological Counselors with Tripartite Feedback
- URL: http://arxiv.org/abs/2505.03293v1
- Date: Tue, 06 May 2025 08:22:51 GMT
- Title: Ψ-Arena: Interactive Assessment and Optimization of LLM-based Psychological Counselors with Tripartite Feedback
- Authors: Shijing Zhu, Zhuang Chen, Guanqun Bi, Binghang Li, Yaxi Deng, Dazhen Wan, Libiao Peng, Xiyao Xiao, Rongsheng Zhang, Tangjie Lv, Zhipeng Hu, FangFang Li, Minlie Huang
- Abstract summary: We propose Ψ-Arena, an interactive framework for comprehensive assessment and optimization of large language models (LLMs). Ψ-Arena features realistic arena interactions that simulate real-world counseling through multi-stage dialogues with psychologically profiled NPC clients. Experiments across eight state-of-the-art LLMs show significant performance variations in different real-world scenarios and evaluation perspectives.
- Score: 51.26493826461026
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) have shown promise in providing scalable mental health support, while evaluating their counseling capability remains crucial to ensure both efficacy and safety. Existing evaluations are limited by the static assessment that focuses on knowledge tests, the single perspective that centers on user experience, and the open-loop framework that lacks actionable feedback. To address these issues, we propose Ψ-Arena, an interactive framework for comprehensive assessment and optimization of LLM-based counselors, featuring three key characteristics: (1) Realistic arena interactions that simulate real-world counseling through multi-stage dialogues with psychologically profiled NPC clients, (2) Tripartite evaluation that integrates assessments from the client, counselor, and supervisor perspectives, and (3) Closed-loop optimization that iteratively improves LLM counselors using diagnostic feedback. Experiments across eight state-of-the-art LLMs show significant performance variations in different real-world scenarios and evaluation perspectives. Moreover, reflection-based optimization results in up to a 141% improvement in counseling performance. We hope PsychoArena provides a foundational resource for advancing reliable and human-aligned LLM applications in mental healthcare.
Related papers
- Multi-dimensional Assessment and Explainable Feedback for Counselor Responses to Client Resistance in Text-based Counseling with LLMs [28.919083157390464]
We present a comprehensive pipeline for the multi-dimensional evaluation of human counselors' interventions targeting client resistance in text-based therapy. We introduce a theory-driven framework that decomposes counselor responses into four distinct communication mechanisms. We show that our approach can effectively distinguish the quality of different communication mechanisms.
arXiv Detail & Related papers (2026-02-25T07:05:05Z)
- Conversation for Non-verifiable Learning: Self-Evolving LLMs through Meta-Evaluation [56.84819098277464]
CoNL is a framework that unifies generation, evaluation, and meta-evaluation through multi-agent self-play. CoNL achieves consistent improvements over self-rewarding baselines while maintaining stable training.
arXiv Detail & Related papers (2026-01-29T09:41:14Z)
- Assessing the Quality of Mental Health Support in LLM Responses through Multi-Attribute Human Evaluation [14.243791046586347]
The escalating global mental health crisis, marked by persistent treatment gaps, availability, and a shortage of qualified therapists, positions Large Language Models (LLMs) as a promising avenue for scalable support. This paper introduces a human-grounded evaluation methodology designed to assess LLM generated responses in therapeutic dialogue.
arXiv Detail & Related papers (2026-01-26T16:04:19Z)
- Responsible Evaluation of AI for Mental Health [72.85175110624736]
Current approaches to evaluating AI tools in mental health care are fragmented and poorly aligned with clinical practice, social context, and first-hand user experience. This paper argues for a rethinking of responsible evaluation by introducing an interdisciplinary framework that integrates clinical soundness, social context, and equity.
arXiv Detail & Related papers (2026-01-20T12:55:10Z)
- MindChat: A Privacy-preserving Large Language Model for Mental Health Support [10.332226758787277]
We present MindChat, a privacy-preserving large language model for mental health support. We also present MindCorpus, a synthetic multi-turn counseling dataset constructed via a multi-agent role-playing framework.
arXiv Detail & Related papers (2026-01-05T10:54:18Z)
- MedAlign: A Synergistic Framework of Multimodal Preference Optimization and Federated Meta-Cognitive Reasoning [52.064286116035134]
We develop MedAlign, a framework to ensure visually accurate LVLM responses for Medical Visual Question Answering (Med-VQA). We first propose a multimodal Direct Preference Optimization (mDPO) objective to align preference learning with visual context. We then design a Retrieval-Aware Mixture-of-Experts (RA-MoE) architecture that utilizes image and text similarity to route queries to a specialized and context-augmented LVLM.
arXiv Detail & Related papers (2025-10-24T02:11:05Z)
- Reframe Your Life Story: Interactive Narrative Therapist and Innovative Moment Assessment with Large Language Models [92.93521294357058]
Narrative therapy helps individuals transform problematic life stories into empowering alternatives. Current approaches lack realism in specialized psychotherapy and fail to capture therapeutic progression over time. INT (Interactive Narrative Therapist) simulates expert narrative therapists by planning therapeutic stages, guiding reflection levels, and generating contextually appropriate expert-like responses.
arXiv Detail & Related papers (2025-07-27T11:52:09Z)
- Can You Share Your Story? Modeling Clients' Metacognition and Openness for LLM Therapist Evaluation [8.701508400127342]
Existing evaluation methods rely on client simulators that clearly disclose internal states to the therapist. We introduce MindVoyager, a novel evaluation framework featuring a controllable and realistic client simulator. We further introduce evaluation metrics that assess the exploration ability of LLM therapists by measuring their thorough understanding of client's beliefs and thoughts.
arXiv Detail & Related papers (2025-07-25T19:32:05Z)
- Substance over Style: Evaluating Proactive Conversational Coaching Agents [30.981220308858983]
We describe and implement five multi-turn coaching agents that exhibit distinct conversational styles. We find that users highly value core functionality, and that stylistic components in absence of core components are viewed negatively.
arXiv Detail & Related papers (2025-03-25T03:44:31Z)
- Quantifying the Reasoning Abilities of LLMs on Real-world Clinical Cases [48.87360916431396]
We introduce MedR-Bench, a benchmarking dataset of 1,453 structured patient cases, annotated with reasoning references. We propose a framework encompassing three critical stages: examination recommendation, diagnostic decision-making, and treatment planning, simulating the entire patient care journey. Using this benchmark, we evaluate five state-of-the-art reasoning LLMs, including DeepSeek-R1, OpenAI-o3-mini, and Gemini-2.0-Flash Thinking.
arXiv Detail & Related papers (2025-03-06T18:35:39Z)
- HREF: Human Response-Guided Evaluation of Instruction Following in Language Models [61.273153125847166]
We develop a new evaluation benchmark, Human Response-Guided Evaluation of Instruction Following (HREF). In addition to providing reliable evaluation, HREF emphasizes individual task performance and is free from contamination. We study the impact of key design choices in HREF, including the size of the evaluation set, the judge model, the baseline model, and the prompt template.
arXiv Detail & Related papers (2024-12-20T03:26:47Z)
- Interactive Agents: Simulating Counselor-Client Psychological Counseling via Role-Playing LLM-to-LLM Interactions [12.455050661682051]
We propose a framework that employs two large language models (LLMs) via role-playing for simulating counselor-client interactions.
Our framework involves two LLMs, one acting as a client equipped with a specific and real-life user profile and the other playing the role of an experienced counselor.
arXiv Detail & Related papers (2024-08-28T13:29:59Z)
- CPsyCoun: A Report-based Multi-turn Dialogue Reconstruction and Evaluation Framework for Chinese Psychological Counseling [27.193022503592342]
We propose CPsyCoun, a report-based multi-turn dialogue reconstruction and evaluation framework for Chinese psychological counseling.
To fully exploit psychological counseling reports, a two-phase approach is devised to construct high-quality dialogues.
A comprehensive evaluation benchmark is developed for the effective automatic evaluation of multi-turn psychological consultations.
arXiv Detail & Related papers (2024-05-26T05:18:00Z)
- Exploring the Efficacy of Large Language Models in Summarizing Mental Health Counseling Sessions: A Benchmark Study [17.32433545370711]
Comprehensive summaries of sessions enable an effective continuity in mental health counseling.
Manual summarization presents a significant challenge, diverting experts' attention from the core counseling process.
This study evaluates the effectiveness of state-of-the-art Large Language Models (LLMs) in selectively summarizing various components of therapy sessions.
arXiv Detail & Related papers (2024-02-29T11:29:47Z)
- Understanding the Therapeutic Relationship between Counselors and Clients in Online Text-based Counseling using LLMs [18.605352662843575]
We present an automatic approach using large language models (LLMs) to understand the development of therapeutic alliance in text-based counseling.
We collect a comprehensive counseling dataset and conduct multiple expert evaluations on a subset based on this framework.
Our findings underscore the challenges counselors face in cultivating strong online relationships with clients.
arXiv Detail & Related papers (2024-02-19T09:00:10Z)
- AgentBoard: An Analytical Evaluation Board of Multi-turn LLM Agents [74.16170899755281]
We introduce AgentBoard, a pioneering comprehensive benchmark and accompanying open-source evaluation framework tailored to analytical evaluation of LLM agents. AgentBoard offers a fine-grained progress rate metric that captures incremental advancements as well as a comprehensive evaluation toolkit. This not only sheds light on the capabilities and limitations of LLM agents but also propels the interpretability of their performance to the forefront.
arXiv Detail & Related papers (2024-01-24T01:51:00Z)
- Opportunities of a Machine Learning-based Decision Support System for Stroke Rehabilitation Assessment [64.52563354823711]
Rehabilitation assessment is critical to determine an adequate intervention for a patient.
Current practices of assessment mainly rely on therapist's experience, and assessment is infrequently executed due to the limited availability of a therapist.
We developed an intelligent decision support system that can identify salient features of assessment using reinforcement learning.
arXiv Detail & Related papers (2020-02-27T17:04:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.