Mental-LLM: Leveraging Large Language Models for Mental Health Prediction via Online Text Data
- URL: http://arxiv.org/abs/2307.14385v4
- Date: Sun, 28 Jan 2024 16:54:03 GMT
- Authors: Xuhai Xu, Bingsheng Yao, Yuanzhe Dong, Saadia Gabriel, Hong Yu, James
Hendler, Marzyeh Ghassemi, Anind K. Dey, Dakuo Wang
- Abstract summary: We present a comprehensive evaluation of multiple large language models (LLMs) on various mental health prediction tasks.
We conduct experiments covering zero-shot prompting, few-shot prompting, and instruction fine-tuning.
Our best fine-tuned models, Mental-Alpaca and Mental-FLAN-T5, outperform the best prompt design of GPT-3.5 (25 and 15 times bigger) by 10.9% on balanced accuracy and the best of GPT-4 (250 and 150 times bigger) by 4.8%.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Advances in large language models (LLMs) have empowered a variety of
applications. However, there is still a significant gap in research when it
comes to understanding and enhancing the capabilities of LLMs in the field of
mental health. In this work, we present a comprehensive evaluation of multiple
LLMs on various mental health prediction tasks via online text data, including
Alpaca, Alpaca-LoRA, FLAN-T5, GPT-3.5, and GPT-4. We conduct a broad range of
experiments, covering zero-shot prompting, few-shot prompting, and instruction
fine-tuning. The results indicate a promising yet limited performance of LLMs
with zero-shot and few-shot prompt designs for mental health tasks. More
importantly, our experiments show that instruction fine-tuning can
significantly boost the performance of LLMs for all tasks simultaneously. Our
best fine-tuned
models, Mental-Alpaca and Mental-FLAN-T5, outperform the best prompt design of
GPT-3.5 (25 and 15 times bigger) by 10.9% on balanced accuracy and the best of
GPT-4 (250 and 150 times bigger) by 4.8%. They further perform on par with the
state-of-the-art task-specific language model. We also conduct an exploratory
case study of LLMs' capability on mental health reasoning tasks, illustrating
the promising performance of certain models such as GPT-4. We summarize our
findings into a set of action guidelines for potential methods to enhance LLMs'
capability for mental health tasks. Meanwhile, we also emphasize important
limitations that must be addressed before these models can be deployed in
real-world mental health settings, such as known racial and gender biases. We
highlight the significant ethical risks accompanying this line of research.
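The abstract names the paper's three experimental regimes (zero-shot prompting, few-shot prompting, and instruction fine-tuning) and its headline metric, balanced accuracy. As a concrete illustration, the minimal Python sketch below shows one plausible way to build zero-shot and few-shot prompts for a binary mental health prediction task and to compute balanced accuracy; the prompt wording and all helper names are assumptions for illustration, not the authors' actual templates or code.

```python
# Illustrative sketch only: prompt wording and helper names are assumptions,
# not the templates or code used in the Mental-LLM paper.

def zero_shot_prompt(post: str) -> str:
    """Build a zero-shot prompt for binary depression prediction."""
    return (
        "Consider this post from an online forum:\n"
        f'"{post}"\n'
        "Question: Does the poster show signs of depression? "
        "Answer with exactly one word: yes or no."
    )

def few_shot_prompt(post: str, examples: list[tuple[str, str]]) -> str:
    """Prepend labeled (post, answer) pairs before the target post."""
    shots = "\n\n".join(f'Post: "{p}"\nAnswer: {a}' for p, a in examples)
    return f'{shots}\n\nPost: "{post}"\nAnswer:'

def balanced_accuracy(y_true: list[int], y_pred: list[int]) -> float:
    """Mean of per-class recalls; robust to class imbalance."""
    recalls = []
    for cls in set(y_true):
        idx = [i for i, y in enumerate(y_true) if y == cls]
        recalls.append(sum(y_pred[i] == cls for i in idx) / len(idx))
    return sum(recalls) / len(recalls)

if __name__ == "__main__":
    print(zero_shot_prompt("I haven't slept properly in weeks."))
    # Majority-class guessing on an imbalanced set yields only 0.5 here,
    # whereas plain accuracy would report 0.75.
    print(balanced_accuracy([0, 0, 0, 1], [0, 0, 0, 0]))  # 0.5
```

Balanced accuracy averages per-class recall, so a model that simply predicts the majority class on an imbalanced test set scores 0.5 rather than the inflated figure plain accuracy would report, which is presumably why it is preferred for imbalanced mental health datasets.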
Related papers
- MentalGLM Series: Explainable Large Language Models for Mental Health Analysis on Chinese Social Media
Black box models are inflexible when switching between tasks, and their results typically lack explanations.
The rise of large language models (LLMs), with their greater flexibility, has introduced new approaches to the field.
In this paper, we introduce the first multi-task Chinese Social Media Interpretable Mental Health Instructions dataset, consisting of 9K samples.
We also propose MentalGLM series models, the first open-source LLMs designed for explainable mental health analysis targeting Chinese social media.
arXiv Detail & Related papers (2024-10-14T09:29:27Z)
- Towards Evaluating and Building Versatile Large Language Models for Medicine
We present MedS-Bench, a benchmark designed to evaluate the performance of large language models (LLMs) in clinical contexts.
MedS-Bench spans 11 high-level clinical tasks, including clinical report summarization, treatment recommendations, diagnosis, named entity recognition, and medical concept explanation.
The authors also construct MedS-Ins, an instruction-tuning dataset comprising 58 medically oriented language corpora, totaling 13.5 million samples across 122 tasks.
arXiv Detail & Related papers (2024-08-22T17:01:34Z)
- Evaluating the Effectiveness of the Foundational Models for Q&A Classification in Mental Health care
Pre-trained Language Models (PLMs) have the potential to transform mental health support.
This study evaluates the effectiveness of PLMs for classification of Questions and Answers in the domain of mental health care.
arXiv Detail & Related papers (2024-06-23T00:11:07Z)
- WellDunn: On the Robustness and Explainability of Language Models and Large Language Models in Identifying Wellness Dimensions
Language Models (LMs) are being proposed for mental health applications, where the heightened risk of adverse outcomes means that predictive performance alone may not be a sufficient litmus test of a model's utility in clinical practice.
We introduce an evaluation design that focuses on the robustness and explainability of LMs in identifying Wellness Dimensions (WDs).
We reveal four surprising results about LMs/LLMs.
arXiv Detail & Related papers (2024-06-17T19:50:40Z)
- NegativePrompt: Leveraging Psychology for Large Language Models Enhancement via Negative Emotional Stimuli
Large Language Models (LLMs) have become integral to a wide spectrum of applications.
Prior work suggests that LLMs possess emotional intelligence and that their performance can be enhanced through positive emotional stimuli.
We introduce NegativePrompt, a novel approach underpinned by psychological principles that instead leverages negative emotional stimuli.
arXiv Detail & Related papers (2024-05-05T05:06:07Z)
- A Novel Nuanced Conversation Evaluation Framework for Large Language Models in Mental Health
We propose a novel framework for evaluating the nuanced conversation abilities of Large Language Models (LLMs).
Within it, we develop a series of quantitative metrics drawn from the psychotherapy conversation analysis literature.
We use our framework to evaluate several popular frontier LLMs, including some GPT and Llama models, on a verified mental health dataset.
arXiv Detail & Related papers (2024-03-08T23:46:37Z)
- Benchmarking Hallucination in Large Language Models based on Unanswerable Math Word Problem
Large language models (LLMs) are highly effective in various natural language processing (NLP) tasks.
However, they are susceptible to producing unreliable conjectures in ambiguous contexts, a phenomenon known as hallucination.
This paper presents a new method for evaluating LLM hallucination in Question Answering (QA) based on unanswerable math word problems (MWPs).
arXiv Detail & Related papers (2024-03-06T09:06:34Z)
- Language Models Hallucinate, but May Excel at Fact Verification
Large language models (LLMs) frequently "hallucinate," resulting in non-factual outputs.
Even GPT-3.5 produces factual outputs less than 25% of the time.
This underscores the importance of fact verifiers as a means to measure and incentivize progress.
arXiv Detail & Related papers (2023-10-23T04:39:01Z)
- Revisiting the Reliability of Psychological Scales on Large Language Models
This study aims to determine the reliability of applying personality assessments to Large Language Models.
Analysis of 2,500 settings per model, including GPT-3.5, GPT-4, Gemini-Pro, and LLaMA-3.1, reveals that various LLMs show consistency in responses to the Big Five Inventory.
arXiv Detail & Related papers (2023-05-31T15:03:28Z)
- Towards Interpretable Mental Health Analysis with Large Language Models
We evaluate the mental health analysis and emotional reasoning ability of large language models (LLMs) on 11 datasets across 5 tasks.
Using prompting, we explore LLMs for interpretable mental health analysis by instructing them to generate explanations for each of their decisions.
We conduct strict human evaluations to assess the quality of the generated explanations, leading to a novel dataset with 163 human-assessed explanations.
arXiv Detail & Related papers (2023-04-06T19:53:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.