Related papers: PsychCounsel-Bench: Evaluating the Psychology Intelligence of Large Language Models

PsychCounsel-Bench: Evaluating the Psychology Intelligence of Large Language Models

URL: http://arxiv.org/abs/2510.01611v3
Date: Mon, 20 Oct 2025 14:59:12 GMT
Title: PsychCounsel-Bench: Evaluating the Psychology Intelligence of Large Language Models
Authors: Min Zeng,
Abstract summary: Large Language Models (LLMs) have demonstrated remarkable success across a wide range of industries.<n>Yet, their potential in applications requiring cognitive abilities, such as psychological counseling, remains largely untapped.<n>This paper investigates whether an LLM can effectively take on the role of a psychological counselor.
Score: 7.565556545193657
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Language Models (LLMs) have demonstrated remarkable success across a wide range of industries, primarily due to their impressive generative abilities. Yet, their potential in applications requiring cognitive abilities, such as psychological counseling, remains largely untapped. This paper investigates the key question: \textit{Can LLMs be effectively applied to psychological counseling?} To determine whether an LLM can effectively take on the role of a psychological counselor, the first step is to assess whether it meets the qualifications required for such a role, namely the ability to pass the U.S. National Counselor Certification Exam (NCE). This is because, just as a human counselor must pass a certification exam to practice, an LLM must demonstrate sufficient psychological knowledge to meet the standards required for such a role. To address this, we introduce PsychCounsel-Bench, a benchmark grounded in U.S.national counselor examinations, a licensure test for professional counselors that requires about 70\% accuracy to pass. PsychCounsel-Bench comprises approximately 2,252 carefully curated single-choice questions, crafted to require deep understanding and broad enough to cover various sub-disciplines of psychology. This benchmark provides a comprehensive assessment of an LLM's ability to function as a counselor. Our evaluation shows that advanced models such as GPT-4o, Llama3.3-70B, and Gemma3-27B achieve well above the passing threshold, while smaller open-source models (e.g., Qwen2.5-7B, Mistral-7B) remain far below it. These results suggest that only frontier LLMs are currently capable of meeting counseling exam standards, highlighting both the promise and the challenges of developing psychology-oriented LLMs. We release the proposed dataset for public use: https://github.com/cloversjtu/PsychCounsel-Bench

Related papers

Psychological Counseling Ability of Large Language Models [0.6752538702870792]
This study assessed the psychological counseling ability of mainstream LLMs using 1096 psychological counseling skill questions.<n>The correctness rates of the LLMs for Chinese questions were GLM-3 (46.5%), GPT-4 (46.1%), Gemini (45.0%), ERNIE-3.5 (45.7%) and GPT-3.5 (32.9%)<n>A chi-square test indicated significant differences in the LLMs' performance on Chinese and English questions.
arXiv Detail & Related papers (2025-03-01T08:01:25Z)
Preference Learning Unlocks LLMs' Psycho-Counseling Skills [0.7510165488300369]
We create a preference dataset, PsychoCounsel-Preference, which contains 36k high-quality preference comparison pairs.<n>Our best-aligned model, PsychoCounsel-Llama3-8B, achieves an impressive win rate of 87% against GPT-4o.
arXiv Detail & Related papers (2025-02-27T03:50:25Z)
Do Large Language Models Align with Core Mental Health Counseling Competencies? [19.375161727597536]
Large Language Models (LLMs) are a promising solution to the global shortage of mental health professionals.<n>We introduce CounselingBench, a novel NCMHCE-based benchmark evaluating 22 general-purpose and medical-finetuned LLMs.<n>Our results underscore the need for specialized, fine-tuned models aligned with core mental health counseling competencies.
arXiv Detail & Related papers (2024-10-29T18:27:11Z)
CBT-Bench: Evaluating Large Language Models on Assisting Cognitive Behavior Therapy [67.23830698947637]
We propose a new benchmark, CBT-BENCH, for the systematic evaluation of cognitive behavioral therapy (CBT) assistance.<n>We include three levels of tasks in CBT-BENCH: I: Basic CBT knowledge acquisition, with the task of multiple-choice questions; II: Cognitive model understanding, with the tasks of cognitive distortion classification, primary core belief classification, and fine-grained core belief classification; III: Therapeutic response generation, with the task of generating responses to patient speech in CBT therapy sessions.<n> Experimental results indicate that while LLMs perform well in reciting CBT knowledge, they fall short in complex real-world scenarios
arXiv Detail & Related papers (2024-10-17T04:52:57Z)
Quantifying AI Psychology: A Psychometrics Benchmark for Large Language Models [57.518784855080334]
Large Language Models (LLMs) have demonstrated exceptional task-solving capabilities, increasingly adopting roles akin to human-like assistants. This paper presents a framework for investigating psychology dimension in LLMs, including psychological identification, assessment dataset curation, and assessment with results validation. We introduce a comprehensive psychometrics benchmark for LLMs that covers six psychological dimensions: personality, values, emotion, theory of mind, motivation, and intelligence.
arXiv Detail & Related papers (2024-06-25T16:09:08Z)
LLM Questionnaire Completion for Automatic Psychiatric Assessment [49.1574468325115]
We employ a Large Language Model (LLM) to convert unstructured psychological interviews into structured questionnaires spanning various psychiatric and personality domains. The obtained answers are coded as features, which are used to predict standardized psychiatric measures of depression (PHQ-8) and PTSD (PCL-C)
arXiv Detail & Related papers (2024-06-09T09:03:11Z)
Large Language Models are Capable of Offering Cognitive Reappraisal, if Guided [38.11184388388781]
Large language models (LLMs) have offered new opportunities for emotional support. This work takes a first step by engaging with cognitive reappraisals. We conduct a first-of-its-kind expert evaluation of an LLM's zero-shot ability to generate cognitive reappraisal responses.
arXiv Detail & Related papers (2024-04-01T17:56:30Z)
PsychoGAT: A Novel Psychological Measurement Paradigm through Interactive Fiction Games with LLM Agents [68.50571379012621]
Psychological measurement is essential for mental health, self-understanding, and personal development. PsychoGAT (Psychological Game AgenTs) achieves statistically significant excellence in psychometric metrics such as reliability, convergent validity, and discriminant validity.
arXiv Detail & Related papers (2024-02-19T18:00:30Z)
ConceptPsy:A Benchmark Suite with Conceptual Comprehensiveness in Psychology [25.845704502964143]
ConceptPsy is designed to evaluate Chinese complex reasoning and knowledge abilities in psychology. This paper presents ConceptPsy, designed to evaluate Chinese complex reasoning and knowledge abilities in psychology.
arXiv Detail & Related papers (2023-11-16T12:43:18Z)
Who is ChatGPT? Benchmarking LLMs' Psychological Portrayal Using PsychoBench [83.41621219298489]
We propose a framework, PsychoBench, for evaluating diverse psychological aspects of Large Language Models (LLMs) PsychoBench classifies these scales into four distinct categories: personality traits, interpersonal relationships, motivational tests, and emotional abilities. We employ a jailbreak approach to bypass the safety alignment protocols and test the intrinsic natures of LLMs.
arXiv Detail & Related papers (2023-10-02T17:46:09Z)
Revisiting the Reliability of Psychological Scales on Large Language Models [62.57981196992073]
This study aims to determine the reliability of applying personality assessments to Large Language Models. Analysis of 2,500 settings per model, including GPT-3.5, GPT-4, Gemini-Pro, and LLaMA-3.1, reveals that various LLMs show consistency in responses to the Big Five Inventory.
arXiv Detail & Related papers (2023-05-31T15:03:28Z)
Can ChatGPT Assess Human Personalities? A General Evaluation Framework [70.90142717649785]
Large Language Models (LLMs) have produced impressive results in various areas, but their potential human-like psychology is still largely unexplored. This paper presents a generic evaluation framework for LLMs to assess human personalities based on Myers Briggs Type Indicator (MBTI) tests.
arXiv Detail & Related papers (2023-03-01T06:16:14Z)

This list is automatically generated from the titles and abstracts of the papers in this site.