Do Large Language Models Align with Core Mental Health Counseling Competencies?
- URL: http://arxiv.org/abs/2410.22446v1
- Date: Tue, 29 Oct 2024 18:27:11 GMT
- Title: Do Large Language Models Align with Core Mental Health Counseling Competencies?
- Authors: Viet Cuong Nguyen, Mohammad Taher, Dongwan Hong, Vinicius Konkolics Possobom, Vibha Thirunellayi Gopalakrishnan, Ekta Raj, Zihang Li, Heather J. Soled, Michael L. Birnbaum, Srijan Kumar, Munmun De Choudhury,
- Abstract summary: CounselingBench is a novel NCMHCE-based benchmark evaluating Large Language Models (LLMs)
We find frontier models exceed minimum thresholds but fall short of expert-level performance.
Our findings highlight the complexities of developing AI systems for mental health counseling.
- Score: 19.375161727597536
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: The rapid evolution of Large Language Models (LLMs) offers promising potential to alleviate the global scarcity of mental health professionals. However, LLMs' alignment with essential mental health counseling competencies remains understudied. We introduce CounselingBench, a novel NCMHCE-based benchmark evaluating LLMs across five key mental health counseling competencies. Testing 22 general-purpose and medical-finetuned LLMs, we find frontier models exceed minimum thresholds but fall short of expert-level performance, with significant variations: they excel in Intake, Assessment & Diagnosis yet struggle with Core Counseling Attributes and Professional Practice & Ethics. Medical LLMs surprisingly underperform generalist models accuracy-wise, while at the same time producing slightly higher-quality justifications but making more context-related errors. Our findings highlight the complexities of developing AI systems for mental health counseling, particularly for competencies requiring empathy and contextual understanding. We found that frontier LLMs perform at a level exceeding the minimal required level of aptitude for all key mental health counseling competencies, but fall short of expert-level performance, and that current medical LLMs do not significantly improve upon generalist models in mental health counseling competencies. This underscores the critical need for specialized, mental health counseling-specific fine-tuned LLMs that rigorously aligns with core competencies combined with appropriate human supervision before any responsible real-world deployment can be considered.
Related papers
- Structured Outputs Enable General-Purpose LLMs to be Medical Experts [50.02627258858336]
Large language models (LLMs) often struggle with open-ended medical questions.
We propose a novel approach utilizing structured medical reasoning.
Our approach achieves the highest Factuality Score of 85.8, surpassing fine-tuned models.
arXiv Detail & Related papers (2025-03-05T05:24:55Z) - LlaMADRS: Prompting Large Language Models for Interview-Based Depression Assessment [75.44934940580112]
This study introduces LlaMADRS, a novel framework leveraging open-source Large Language Models (LLMs) to automate depression severity assessment.
We employ a zero-shot prompting strategy with carefully designed cues to guide the model in interpreting and scoring transcribed clinical interviews.
Our approach, tested on 236 real-world interviews, demonstrates strong correlations with clinician assessments.
arXiv Detail & Related papers (2025-01-07T08:49:04Z) - Demystifying Large Language Models for Medicine: A Primer [50.83806796466396]
Large language models (LLMs) represent a transformative class of AI tools capable of revolutionizing various aspects of healthcare.
This tutorial aims to equip healthcare professionals with the tools necessary to effectively integrate LLMs into clinical practice.
arXiv Detail & Related papers (2024-10-24T15:41:56Z) - CBT-Bench: Evaluating Large Language Models on Assisting Cognitive Behavior Therapy [67.23830698947637]
We propose a new benchmark, CBT-BENCH, for the systematic evaluation of cognitive behavioral therapy (CBT) assistance.
We include three levels of tasks in CBT-BENCH: I: Basic CBT knowledge acquisition, with the task of multiple-choice questions; II: Cognitive model understanding, with the tasks of cognitive distortion classification, primary core belief classification, and fine-grained core belief classification; III: Therapeutic response generation, with the task of generating responses to patient speech in CBT therapy sessions.
Experimental results indicate that while LLMs perform well in reciting CBT knowledge, they fall short in complex real-world scenarios
arXiv Detail & Related papers (2024-10-17T04:52:57Z) - MCQG-SRefine: Multiple Choice Question Generation and Evaluation with Iterative Self-Critique, Correction, and Comparison Feedback [6.681247642186701]
We propose a framework for converting medical cases into high-quality USMLE-style questions.
MCQG-SRefine integrates expert-driven prompt engineering with iterative self-critique and self-correction feedback.
We introduce an LLM-as-Judge-based automatic metric to replace the complex and costly expert evaluation process.
arXiv Detail & Related papers (2024-10-17T03:38:29Z) - MentalGLM Series: Explainable Large Language Models for Mental Health Analysis on Chinese Social Media [31.752563319585196]
Black box models are inflexible when switching between tasks, and their results typically lack explanations.
With the rise of large language models (LLMs), their flexibility has introduced new approaches to the field.
In this paper, we introduce the first multi-task Chinese Social Media Interpretable Mental Health Instructions dataset, consisting of 9K samples.
We also propose MentalGLM series models, the first open-source LLMs designed for explainable mental health analysis targeting Chinese social media.
arXiv Detail & Related papers (2024-10-14T09:29:27Z) - MentalArena: Self-play Training of Language Models for Diagnosis and Treatment of Mental Health Disorders [59.515827458631975]
Mental health disorders are one of the most serious diseases in the world.
Privacy concerns limit the accessibility of personalized treatment data.
MentalArena is a self-play framework to train language models.
arXiv Detail & Related papers (2024-10-09T13:06:40Z) - Evaluation of OpenAI o1: Opportunities and Challenges of AGI [112.0812059747033]
o1-preview demonstrated remarkable capabilities, often achieving human-level or superior performance.
The model excelled in tasks requiring intricate reasoning and knowledge integration across various fields.
Overall results indicate significant progress towards artificial general intelligence.
arXiv Detail & Related papers (2024-09-27T06:57:00Z) - RuleAlign: Making Large Language Models Better Physicians with Diagnostic Rule Alignment [54.91736546490813]
We introduce the RuleAlign framework, designed to align Large Language Models with specific diagnostic rules.
We develop a medical dialogue dataset comprising rule-based communications between patients and physicians.
Experimental results demonstrate the effectiveness of the proposed approach.
arXiv Detail & Related papers (2024-08-22T17:44:40Z) - OlympicArena: Benchmarking Multi-discipline Cognitive Reasoning for Superintelligent AI [73.75520820608232]
We introduce OlympicArena, which includes 11,163 bilingual problems across both text-only and interleaved text-image modalities.
These challenges encompass a wide range of disciplines spanning seven fields and 62 international Olympic competitions, rigorously examined for data leakage.
Our evaluations reveal that even advanced models like GPT-4o only achieve a 39.97% overall accuracy, illustrating current AI limitations in complex reasoning and multimodal integration.
arXiv Detail & Related papers (2024-06-18T16:20:53Z) - The Impossibility of Fair LLMs [59.424918263776284]
The need for fair AI is increasingly clear in the era of large language models (LLMs)
We review the technical frameworks that machine learning researchers have used to evaluate fairness.
We develop guidelines for the more realistic goal of achieving fairness in particular use cases.
arXiv Detail & Related papers (2024-05-28T04:36:15Z) - OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems [62.06169250463104]
We present OlympiadBench, an Olympiad-level bilingual multimodal scientific benchmark, featuring 8,476 problems from Olympiad-level mathematics and physics competitions.
The best-performing model, GPT-4V, attains an average score of 17.97% on OlympiadBench, with a mere 10.74% in physics.
Our analysis orienting GPT-4V points out prevalent issues with hallucinations, knowledge omissions, and logical fallacies.
arXiv Detail & Related papers (2024-02-21T18:49:26Z) - Large Language Model for Mental Health: A Systematic Review [2.9429776664692526]
Large language models (LLMs) have attracted significant attention for potential applications in digital health.
This systematic review focuses on their strengths and limitations in early screening, digital interventions, and clinical applications.
arXiv Detail & Related papers (2024-02-19T17:58:41Z) - Asclepius: A Spectrum Evaluation Benchmark for Medical Multi-Modal Large
Language Models [59.60384461302662]
We introduce Asclepius, a novel benchmark for evaluating Medical Multi-Modal Large Language Models (Med-MLLMs)
Asclepius rigorously and comprehensively assesses model capability in terms of distinct medical specialties and different diagnostic capacities.
We also provide an in-depth analysis of 6 Med-MLLMs and compare them with 5 human specialists.
arXiv Detail & Related papers (2024-02-17T08:04:23Z) - Large Language Models in Mental Health Care: a Scoping Review [28.635427491110484]
The integration of large language models (LLMs) in mental health care is an emerging field.
There is a need to systematically review the application outcomes and delineate the advantages and limitations in clinical settings.
This review aims to provide a comprehensive overview of the use of LLMs in mental health care, assessing their efficacy, challenges, and potential for future applications.
arXiv Detail & Related papers (2024-01-01T17:35:52Z) - A Computational Framework for Behavioral Assessment of LLM Therapists [8.373981505033864]
ChatGPT and other large language models (LLMs) have greatly increased interest in utilizing LLMs as therapists.
We propose BOLT, a novel computational framework to study the conversational behavior of LLMs when employed as therapists.
We compare the behavior of LLM therapists against that of high- and low-quality human therapy, and study how their behavior can be modulated to better reflect behaviors observed in high-quality therapy.
arXiv Detail & Related papers (2024-01-01T17:32:28Z) - Challenges of Large Language Models for Mental Health Counseling [4.604003661048267]
The global mental health crisis is looming with a rapid increase in mental disorders, limited resources, and the social stigma of seeking treatment.
The application of large language models (LLMs) in the mental health domain raises concerns regarding the accuracy, effectiveness, and reliability of the information provided.
This paper investigates the major challenges associated with the development of LLMs for psychological counseling, including model hallucination, interpretability, bias, privacy, and clinical effectiveness.
arXiv Detail & Related papers (2023-11-23T08:56:41Z) - Rethinking Large Language Models in Mental Health Applications [42.21805311812548]
Large Language Models (LLMs) have become valuable assets in mental health.
This paper offers a perspective on using LLMs in mental health applications.
arXiv Detail & Related papers (2023-11-19T08:40:01Z) - ChiMed-GPT: A Chinese Medical Large Language Model with Full Training Regime and Better Alignment to Human Preferences [51.66185471742271]
We propose ChiMed-GPT, a benchmark LLM designed explicitly for Chinese medical domain.
ChiMed-GPT undergoes a comprehensive training regime with pre-training, SFT, and RLHF.
We analyze possible biases through prompting ChiMed-GPT to perform attitude scales regarding discrimination of patients.
arXiv Detail & Related papers (2023-11-10T12:25:32Z) - MentaLLaMA: Interpretable Mental Health Analysis on Social Media with
Large Language Models [28.62967557368565]
We build the first multi-task and multi-source interpretable mental health instruction dataset on social media, with 105K data samples.
We use expert-written few-shot prompts and collected labels to prompt ChatGPT and obtain explanations from its responses.
Based on the IMHI dataset and LLaMA2 foundation models, we train MentalLLaMA, the first open-source LLM series for interpretable mental health analysis.
arXiv Detail & Related papers (2023-09-24T06:46:08Z) - Psy-LLM: Scaling up Global Mental Health Psychological Services with
AI-based Large Language Models [3.650517404744655]
Psy-LLM framework is an AI-based tool leveraging Large Language Models for question-answering in psychological consultation settings.
Our framework combines pre-trained LLMs with real-world professional Q&A from psychologists and extensively crawled psychological articles.
It serves as a front-end tool for healthcare professionals, allowing them to provide immediate responses and mindfulness activities to alleviate patient stress.
arXiv Detail & Related papers (2023-07-22T06:21:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.