Psychological Counseling Ability of Large Language Models
- URL: http://arxiv.org/abs/2503.07627v1
- Date: Sat, 01 Mar 2025 08:01:25 GMT
- Title: Psychological Counseling Ability of Large Language Models
- Authors: Fangyu Peng, Jingxin Nie
- Abstract summary: This study assessed the psychological counseling ability of mainstream LLMs using 1096 psychological counseling skill questions. The correctness rates of the LLMs for Chinese questions, in descending order, were GLM-3 (46.5%), GPT-4 (46.1%), ERNIE-3.5 (45.7%), Gemini (45.0%) and GPT-3.5 (32.9%). A chi-square test indicated significant differences in the LLMs' performance on Chinese and English questions.
- Score: 0.6752538702870792
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the development of science and the continuous progress of artificial intelligence technology, Large Language Models (LLMs) have begun to be widely utilized across various fields. However, in the field of psychological counseling, the ability of LLMs has not been systematically assessed. In this study, we assessed the psychological counseling ability of mainstream LLMs using 1096 psychological counseling skill questions selected from the Chinese National Counselor Level 3 Examination, covering Knowledge-based, Analytical-based, and Application-based question types. The analysis showed that the correctness rates of the LLMs for Chinese questions, in descending order, were GLM-3 (46.5%), GPT-4 (46.1%), ERNIE-3.5 (45.7%), Gemini (45.0%) and GPT-3.5 (32.9%). The correctness rates of the LLMs for English questions, in descending order, were ERNIE-3.5 (43.9%), GPT-4 (40.6%), Gemini (36.6%), GLM-3 (29.9%) and GPT-3.5 (29.5%). A chi-square test indicated significant differences in the LLMs' performance on Chinese and English questions. Furthermore, we subsequently provided ERNIE-3.5 with the Counselor's Guidebook (Level 3) as a reference, raising its correctness rate to 59.6%, a 13.8 percentage-point improvement over its initial rate of 45.8%. In conclusion, this study assessed the psychological counseling ability of LLMs for the first time, which may provide insights for the future enhancement and improvement of the psychological counseling ability of LLMs.
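The chi-square comparison described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's actual analysis: the correct/incorrect counts are reconstructed from GPT-4's reported rates (46.1% Chinese, 40.6% English) under the assumption of 1096 questions per language, and the statistic is the standard Pearson chi-square of homogeneity.

```python
# Illustrative sketch of a Pearson chi-square test of homogeneity,
# as the paper's Chinese-vs-English comparison might be set up.
# Counts are reconstructed from reported rates, not taken from the paper.

def chi_square(table):
    """Pearson chi-square statistic for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand
            stat += (obs - expected) ** 2 / expected
    return stat

# [correct, incorrect] counts for GPT-4, assuming 1096 questions per language
n = 1096
chinese = [round(0.461 * n), n - round(0.461 * n)]   # [505, 591]
english = [round(0.406 * n), n - round(0.406 * n)]   # [445, 651]

stat = chi_square([chinese, english])
print(round(stat, 2))  # exceeds the df=1 critical value of 3.84 at alpha=0.05
```

Under these assumed counts the statistic comes out well above the 5% critical value for one degree of freedom, consistent with the significant language difference the abstract reports.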
Related papers
- OmniEduBench: A Comprehensive Chinese Benchmark for Evaluating Large Language Models in Education [72.40048732210055]
We introduce OmniEduBench, a comprehensive Chinese educational benchmark. The data is meticulously divided into two core dimensions: the knowledge dimension and the cultivation dimension. The dataset features a rich variety of question formats, including 11 common exam question types.
arXiv Detail & Related papers (2025-10-30T12:16:29Z)
- PsychCounsel-Bench: Evaluating the Psychology Intelligence of Large Language Models [7.565556545193657]
Large Language Models (LLMs) have demonstrated remarkable success across a wide range of industries. Yet, their potential in applications requiring cognitive abilities, such as psychological counseling, remains largely untapped. This paper investigates whether an LLM can effectively take on the role of a psychological counselor.
arXiv Detail & Related papers (2025-10-02T02:49:06Z)
- AIPsychoBench: Understanding the Psychometric Differences between LLMs and Humans [15.572185318032139]
Large Language Models (LLMs) with hundreds of billions of parameters have exhibited human-like intelligence by learning from vast amounts of internet-scale data. This paper introduces AIPsychoBench, a specialized benchmark tailored to assess the psychological properties of LLMs.
arXiv Detail & Related papers (2025-09-20T04:40:31Z)
- It is Too Many Options: Pitfalls of Multiple-Choice Questions in Generative AI and Medical Education [0.7771252627207672]
The performance of Large Language Models (LLMs) on multiple-choice question (MCQ) benchmarks is frequently cited as proof of their medical capabilities.
We created a novel benchmark of free-response questions with paired MCQs (FreeMedQA).
Using this benchmark, we evaluated three state-of-the-art LLMs (GPT-4o, GPT-3.5, and LLama-3-70B-instruct) and found an average absolute deterioration of 39.43% in performance on free-response questions.
arXiv Detail & Related papers (2025-03-13T19:42:04Z)
- Humans Continue to Outperform Large Language Models in Complex Clinical Decision-Making: A Study with Medical Calculators [20.782328949004434]
Large language models (LLMs) have been assessed for general medical knowledge using medical licensing exams.
We evaluate the capability of both medical trainees and LLMs to recommend medical calculators.
arXiv Detail & Related papers (2024-11-08T15:50:19Z)
- Evaluating the Effectiveness of the Foundational Models for Q&A Classification in Mental Health care [0.18416014644193068]
Pre-trained Language Models (PLMs) have the potential to transform mental health support.
This study evaluates the effectiveness of PLMs for classification of Questions and Answers in the domain of mental health care.
arXiv Detail & Related papers (2024-06-23T00:11:07Z)
- Pragmatic Competence Evaluation of Large Language Models for the Korean Language [0.6757476692230009]
This study evaluates how well Large Language Models (LLMs) understand context-dependent expressions from a pragmatic standpoint, specifically in Korean.
We use both Multiple-Choice Questions (MCQs) for automatic evaluation and Open-Ended Questions (OEQs) assessed by human experts.
arXiv Detail & Related papers (2024-03-19T12:21:20Z)
- LHMKE: A Large-scale Holistic Multi-subject Knowledge Evaluation Benchmark for Chinese Large Language Models [46.77647640464652]
Chinese Large Language Models (LLMs) have recently demonstrated impressive capabilities across various NLP benchmarks and real-world applications.
We propose LHMKE, a Large-scale, Holistic, and Multi-subject Knowledge Evaluation benchmark.
It encompasses 10,465 questions across 75 tasks covering 30 subjects, ranging from primary school to professional certification exams.
arXiv Detail & Related papers (2024-03-19T10:11:14Z)
- Distortions in Judged Spatial Relations in Large Language Models [45.875801135769585]
GPT-4 exhibited superior performance with 55 percent accuracy, followed by GPT-3.5 at 47 percent, and Llama-2 at 45 percent.
The models identified the nearest cardinal direction in most cases, reflecting their associative learning mechanism.
arXiv Detail & Related papers (2024-01-08T20:08:04Z)
- Who is ChatGPT? Benchmarking LLMs' Psychological Portrayal Using PsychoBench [83.41621219298489]
We propose a framework, PsychoBench, for evaluating diverse psychological aspects of Large Language Models (LLMs).
PsychoBench classifies these scales into four distinct categories: personality traits, interpersonal relationships, motivational tests, and emotional abilities.
We employ a jailbreak approach to bypass the safety alignment protocols and test the intrinsic natures of LLMs.
arXiv Detail & Related papers (2023-10-02T17:46:09Z)
- CMMLU: Measuring massive multitask language understanding in Chinese [133.70911295934746]
This paper introduces a comprehensive Chinese benchmark that covers various subjects, including natural science, social sciences, engineering, and humanities.
CMMLU fills the gap in evaluating the knowledge and reasoning capabilities of large language models within the Chinese context.
arXiv Detail & Related papers (2023-06-15T15:49:51Z)
- Xiezhi: An Ever-Updating Benchmark for Holistic Domain Knowledge Evaluation [61.56563631219381]
We present Xiezhi, the most comprehensive evaluation suite designed to assess holistic domain knowledge.
Xiezhi comprises 249,587 multiple-choice questions spanning 516 diverse disciplines across 13 subjects, accompanied by Xiezhi-Specialty and Xiezhi-Interdiscipline, each with 15k questions.
arXiv Detail & Related papers (2023-06-09T09:52:05Z)
- Revisiting the Reliability of Psychological Scales on Large Language Models [62.57981196992073]
This study aims to determine the reliability of applying personality assessments to Large Language Models.
Analysis of 2,500 settings per model, including GPT-3.5, GPT-4, Gemini-Pro, and LLaMA-3.1, reveals that various LLMs show consistency in responses to the Big Five Inventory.
arXiv Detail & Related papers (2023-05-31T15:03:28Z)
- Large Language Models Leverage External Knowledge to Extend Clinical Insight Beyond Language Boundaries [48.48630043740588]
Large Language Models (LLMs) such as ChatGPT and Med-PaLM have excelled in various medical question-answering tasks.
We develop a novel in-context learning framework to enhance their performance.
arXiv Detail & Related papers (2023-05-17T12:31:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.