Can AI Relate: Testing Large Language Model Response for Mental Health Support
- URL: http://arxiv.org/abs/2405.12021v2
- Date: Mon, 07 Oct 2024 18:34:56 GMT
- Title: Can AI Relate: Testing Large Language Model Response for Mental Health Support
- Authors: Saadia Gabriel, Isha Puri, Xuhai Xu, Matteo Malgaroli, Marzyeh Ghassemi
- Abstract summary: Large language models (LLMs) are already being piloted for clinical use in hospital systems like NYU Langone, Dana-Farber and the NHS.
We develop an evaluation framework for determining whether LLM response is a viable and ethical path forward for the automation of mental health treatment.
- Score: 23.97212082563385
- License:
- Abstract: Large language models (LLMs) are already being piloted for clinical use in hospital systems like NYU Langone, Dana-Farber and the NHS. A proposed deployment use case is psychotherapy, where an LLM-powered chatbot can treat a patient undergoing a mental health crisis. Deployment of LLMs for mental health response could hypothetically broaden access to psychotherapy and provide new possibilities for personalizing care. However, recent high-profile failures, like damaging dieting advice offered by the Tessa chatbot to patients with eating disorders, have led to doubt about their reliability in high-stakes and safety-critical settings. In this work, we develop an evaluation framework for determining whether LLM response is a viable and ethical path forward for the automation of mental health treatment. Our framework measures equity in empathy and adherence of LLM responses to motivational interviewing theory. Using human evaluation with trained clinicians and automatic quality-of-care metrics grounded in psychology research, we compare the responses provided by peer-to-peer responders to those provided by a state-of-the-art LLM. We show that LLMs like GPT-4 use implicit and explicit cues to infer patient demographics like race. We then show that there are statistically significant discrepancies between patient subgroups: Responses to Black posters consistently have lower empathy than those for any other demographic group (2%-13% lower than the control group). Promisingly, we do find that the manner in which responses are generated significantly impacts the quality of the response. We conclude by proposing safety guidelines for the potential deployment of LLMs for mental health response.
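To make the subgroup comparison in the abstract concrete, here is a minimal sketch of how an equity check of this kind could be run: per-response empathy scores for one demographic subgroup are compared against a control group, reporting the relative gap and a significance test. The synthetic scores, the group labels, and the use of Welch's t-test are illustrative assumptions, not the authors' actual metrics or pipeline.

```python
# Illustrative sketch (not the paper's code): compare empathy scores of LLM
# responses across patient demographic subgroups against a control group.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical per-response empathy scores (e.g., from clinician ratings or an
# automatic quality-of-care metric), grouped by the demographic inferred for
# the original poster. Real scores would come from the evaluation framework.
scores = {
    "control": rng.normal(loc=0.70, scale=0.10, size=200),
    "subgroup": rng.normal(loc=0.63, scale=0.10, size=200),  # ~10% lower on average
}

control = scores["control"]
for group, vals in scores.items():
    if group == "control":
        continue
    # Relative empathy gap versus the control group.
    gap = (vals.mean() - control.mean()) / control.mean()
    # Welch's t-test for a statistically significant difference in mean empathy.
    t_stat, p_value = stats.ttest_ind(vals, control, equal_var=False)
    print(f"{group}: mean={vals.mean():.3f}, gap vs. control={gap:+.1%}, p={p_value:.4f}")
```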
Related papers
- LLM Internal States Reveal Hallucination Risk Faced With a Query [62.29558761326031]
Humans have a self-awareness process that allows us to recognize what we don't know when faced with queries.
This paper investigates whether Large Language Models can estimate their own hallucination risk before response generation.
Using a probing estimator, we leverage LLM self-assessment and achieve an average hallucination estimation accuracy of 84.32% at run time.
arXiv Detail & Related papers (2024-07-03T17:08:52Z) - Roleplay-doh: Enabling Domain-Experts to Create LLM-simulated Patients via Eliciting and Adhering to Principles [58.82161879559716]
We develop Roleplay-doh, a novel human-LLM collaboration pipeline that elicits qualitative feedback from a domain expert.
We apply this pipeline to enable senior mental health supporters to create customized AI patients as simulated practice partners.
arXiv Detail & Related papers (2024-07-01T00:43:02Z) - Quantifying AI Psychology: A Psychometrics Benchmark for Large Language Models [57.518784855080334]
Large Language Models (LLMs) have demonstrated exceptional task-solving capabilities, increasingly adopting roles akin to human-like assistants.
This paper presents a framework for investigating psychological dimensions in LLMs, including psychological identification, assessment dataset curation, and assessment with results validation.
We introduce a comprehensive psychometrics benchmark for LLMs that covers six psychological dimensions: personality, values, emotion, theory of mind, motivation, and intelligence.
arXiv Detail & Related papers (2024-06-25T16:09:08Z) - WundtGPT: Shaping Large Language Models To Be An Empathetic, Proactive Psychologist [8.476124415001598]
WundtGPT is an empathetic and proactive mental health large language model.
It is designed to assist psychologists in diagnosis and help patients who are reluctant to communicate face-to-face understand their psychological conditions.
arXiv Detail & Related papers (2024-06-16T16:06:38Z) - LLM Questionnaire Completion for Automatic Psychiatric Assessment [49.1574468325115]
We employ a Large Language Model (LLM) to convert unstructured psychological interviews into structured questionnaires spanning various psychiatric and personality domains.
The obtained answers are coded as features, which are used to predict standardized psychiatric measures of depression (PHQ-8) and PTSD (PCL-C).
arXiv Detail & Related papers (2024-06-09T09:03:11Z) - Assessing Empathy in Large Language Models with Real-World Physician-Patient Interactions [9.327472312657392]
The integration of Large Language Models (LLMs) into the healthcare domain has the potential to significantly enhance patient care and support.
This study investigates the question: Can ChatGPT respond with a greater degree of empathy than that typically offered by physicians?
We collect a de-identified dataset of patient messages and physician responses from Mayo Clinic and generate alternative replies using ChatGPT.
arXiv Detail & Related papers (2024-05-26T01:58:57Z) - Large Language Models are Capable of Offering Cognitive Reappraisal, if Guided [38.11184388388781]
Large language models (LLMs) have offered new opportunities for emotional support.
This work takes a first step by engaging with cognitive reappraisals.
We conduct a first-of-its-kind expert evaluation of an LLM's zero-shot ability to generate cognitive reappraisal responses.
arXiv Detail & Related papers (2024-04-01T17:56:30Z) - Aligning Large Language Models for Enhancing Psychiatric Interviews through Symptom Delineation and Summarization [13.77580842967173]
This research contributes to the nascent field of applying Large Language Models to psychiatric interviews.
We analyze counseling data from North Korean defectors with traumatic events and mental health issues.
Our experimental results show that appropriately prompted LLMs can achieve high performance on both the symptom delineation task and the summarization task.
arXiv Detail & Related papers (2024-03-26T06:50:04Z) - A Novel Nuanced Conversation Evaluation Framework for Large Language Models in Mental Health [42.711913023646915]
We propose a novel framework for evaluating the nuanced conversation abilities of Large Language Models (LLMs).
Within it, we develop a series of quantitative metrics drawn from the psychotherapy conversation analysis literature.
We use our framework to evaluate several popular frontier LLMs, including some GPT and Llama models, on a verified mental health dataset.
arXiv Detail & Related papers (2024-03-08T23:46:37Z) - Inducing anxiety in large language models can induce bias [47.85323153767388]
We focus on twelve established large language models (LLMs) and subject them to a questionnaire commonly used in psychiatry.
Our results show that six of the latest LLMs respond robustly to the anxiety questionnaire, producing anxiety scores comparable to those of humans.
Anxiety induction not only influences LLMs' scores on an anxiety questionnaire but also influences their behavior in a previously established benchmark measuring biases such as racism and ageism.
arXiv Detail & Related papers (2023-04-21T16:29:43Z) - Can ChatGPT Assess Human Personalities? A General Evaluation Framework [70.90142717649785]
Large Language Models (LLMs) have produced impressive results in various areas, but their potential human-like psychology is still largely unexplored.
This paper presents a generic evaluation framework for LLMs to assess human personalities based on Myers-Briggs Type Indicator (MBTI) tests.
arXiv Detail & Related papers (2023-03-01T06:16:14Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.