A Novel Nuanced Conversation Evaluation Framework for Large Language Models in Mental Health
- URL: http://arxiv.org/abs/2403.09705v1
- Date: Fri, 8 Mar 2024 23:46:37 GMT
- Title: A Novel Nuanced Conversation Evaluation Framework for Large Language Models in Mental Health
- Authors: Alexander Marrapese, Basem Suleiman, Imdad Ullah, Juno Kim
- Abstract summary: We propose a novel framework for evaluating the nuanced conversation abilities of Large Language Models (LLMs).
Within it, we develop a series of quantitative metrics derived from the psychotherapy conversation analysis literature.
We use our framework to evaluate several popular frontier LLMs, including some GPT and Llama models, on a verified mental health dataset.
- Score: 42.711913023646915
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Understanding the conversation abilities of Large Language Models (LLMs) can lead to their more cautious and appropriate deployment. This is especially important for safety-critical domains like mental health, where someone's life may depend on the exact wording of a response to an urgent question. In this paper, we propose a novel framework for evaluating the nuanced conversation abilities of LLMs. Within it, we develop a series of quantitative metrics derived from the psychotherapy conversation analysis literature. While we ensure that our framework and metrics are transferable to relevant adjacent domains, we apply them to the mental health field. We use our framework to evaluate several popular frontier LLMs, including some GPT and Llama models, on a verified mental health dataset. Our results show that GPT-4 Turbo performs significantly more similarly to verified therapists than the other selected LLMs. We conduct additional analysis to examine how LLM conversation performance varies across specific mental health topics. Our results indicate that GPT-4 Turbo achieves high correlation with verified therapists on particular topics such as Parenting and Relationships. We believe our contributions will help researchers develop better LLMs that, in turn, will more positively support people's lives.
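As an illustration of how such a framework can be operationalised, the sketch below scores paired therapist and LLM replies with a simple quantitative metric and then measures per-topic agreement between the two via correlation. The metric, field names, and data are hypothetical placeholders for illustration only, not the paper's actual metrics or dataset.

```python
# Minimal sketch, not the paper's implementation: score paired therapist/LLM replies
# with a toy conversation-analysis metric and compare them per topic via correlation.
from collections import defaultdict

import numpy as np


def question_ratio(text: str) -> float:
    """Fraction of sentences that are questions (a toy stand-in for a literature-derived metric)."""
    sentences = [s.strip() for s in text.replace("?", "?|").replace(".", ".|").split("|") if s.strip()]
    return sum(s.endswith("?") for s in sentences) / max(len(sentences), 1)


def per_topic_correlation(pairs, metric=question_ratio):
    """pairs: dicts with hypothetical fields 'topic', 'therapist_reply', 'llm_reply'.

    Returns {topic: Pearson r between therapist and LLM metric scores across that topic's prompts}.
    """
    by_topic = defaultdict(lambda: ([], []))
    for p in pairs:
        therapist_scores, llm_scores = by_topic[p["topic"]]
        therapist_scores.append(metric(p["therapist_reply"]))
        llm_scores.append(metric(p["llm_reply"]))
    return {
        topic: float(np.corrcoef(t_scores, l_scores)[0, 1])
        for topic, (t_scores, l_scores) in by_topic.items()
        if len(t_scores) >= 3  # a few prompts per topic are needed for a meaningful correlation
    }


# Made-up usage: each pair holds one prompt's verified-therapist reply and one model reply.
pairs = [
    {"topic": "Parenting",
     "therapist_reply": "That sounds exhausting. What support do you have at home?",
     "llm_reply": "It is understandable to feel tired. Have you been able to rest?"},
    # ... more prompt/reply pairs from a verified dataset ...
]
correlations = per_topic_correlation(pairs)  # empty with only one pair; real use needs many per topic
```

A real evaluation would swap in the paper's literature-derived metrics, the verified dataset's topic labels, and one set of replies per evaluated LLM.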
Related papers
- Quantifying AI Psychology: A Psychometrics Benchmark for Large Language Models [57.518784855080334]
Large Language Models (LLMs) have demonstrated exceptional task-solving capabilities, increasingly adopting roles akin to human-like assistants.
This paper presents a framework for investigating psychological dimensions in LLMs, including psychological identification, assessment dataset curation, and assessment with results validation.
We introduce a comprehensive psychometrics benchmark for LLMs that covers six psychological dimensions: personality, values, emotion, theory of mind, motivation, and intelligence.
arXiv Detail & Related papers (2024-06-25T16:09:08Z) - Optimizing Psychological Counseling with Instruction-Tuned Large Language Models [9.19192059750618]
This paper explores the application of large language models (LLMs) in psychological counseling.
We present a method for instruction tuning LLMs with specialized prompts to enhance their performance in providing empathetic, relevant, and supportive responses.
arXiv Detail & Related papers (2024-06-19T15:13:07Z) - LLM Questionnaire Completion for Automatic Psychiatric Assessment [49.1574468325115]
We employ a Large Language Model (LLM) to convert unstructured psychological interviews into structured questionnaires spanning various psychiatric and personality domains.
The obtained answers are coded as features, which are used to predict standardized psychiatric measures of depression (PHQ-8) and PTSD (PCL-C).
arXiv Detail & Related papers (2024-06-09T09:03:11Z) - Can AI Relate: Testing Large Language Model Response for Mental Health Support [23.97212082563385]
Large language models (LLMs) are already being piloted for clinical use in hospital systems like NYU Langone, Dana-Farber and the NHS.
We develop an evaluation framework for determining whether LLM responses are a viable and ethical path forward for the automation of mental health treatment.
arXiv Detail & Related papers (2024-05-20T13:42:27Z) - A Computational Framework for Behavioral Assessment of LLM Therapists [8.373981505033864]
ChatGPT and other large language models (LLMs) have greatly increased interest in utilizing LLMs as therapists.
We propose BOLT, a novel computational framework to study the conversational behavior of LLMs when employed as therapists.
We compare the behavior of LLM therapists against that of high- and low-quality human therapy, and study how their behavior can be modulated to better reflect behaviors observed in high-quality therapy.
arXiv Detail & Related papers (2024-01-01T17:32:28Z) - MAgIC: Investigation of Large Language Model Powered Multi-Agent in
Cognition, Adaptability, Rationality and Collaboration [102.41118020705876]
Large Language Models (LLMs) have marked a significant advancement in the field of natural language processing.
As their applications extend into multi-agent environments, a need has arisen for a comprehensive evaluation framework.
This work introduces a novel benchmarking framework specifically tailored to assess LLMs within multi-agent settings.
arXiv Detail & Related papers (2023-11-14T21:46:27Z) - Evaluating the Efficacy of Interactive Language Therapy Based on LLM for
High-Functioning Autistic Adolescent Psychological Counseling [1.1780706927049207]
This study investigates the efficacy of Large Language Models (LLMs) in interactive language therapy for high-functioning autistic adolescents.
LLMs present a novel opportunity to augment traditional psychological counseling methods.
arXiv Detail & Related papers (2023-11-12T07:55:39Z) - PsyCoT: Psychological Questionnaire as Powerful Chain-of-Thought for
Personality Detection [50.66968526809069]
We propose a novel personality detection method, called PsyCoT, which mimics the way individuals complete psychological questionnaires in a multi-turn dialogue manner.
Our experiments demonstrate that PsyCoT significantly improves the performance and robustness of GPT-3.5 in personality detection.
arXiv Detail & Related papers (2023-10-31T08:23:33Z) - Mental-LLM: Leveraging Large Language Models for Mental Health
Prediction via Online Text Data [42.965788205842465]
We present a comprehensive evaluation of multiple large language models (LLMs) on various mental health prediction tasks.
We conduct experiments covering zero-shot prompting, few-shot prompting, and instruction fine-tuning.
Our best fine-tuned models, Mental-Alpaca and Mental-FLAN-T5, outperform the best prompt design of GPT-3.5 by 10.9% on balanced accuracy and the best of GPT-4 (250 and 150 times bigger) by 4.8% (balanced accuracy is illustrated in the sketch after this list).
arXiv Detail & Related papers (2023-07-26T06:00:50Z) - Revisiting the Reliability of Psychological Scales on Large Language Models [62.57981196992073]
This study aims to determine the reliability of applying personality assessments to Large Language Models.
Analysis of 2,500 settings per model, including GPT-3.5, GPT-4, Gemini-Pro, and LLaMA-3.1, reveals that various LLMs show consistency in responses to the Big Five Inventory.
arXiv Detail & Related papers (2023-05-31T15:03:28Z)