A Comprehensive Evaluation of Large Language Models on Mental Illnesses
- URL: http://arxiv.org/abs/2409.15687v1
- Date: Tue, 24 Sep 2024 02:58:52 GMT
- Title: A Comprehensive Evaluation of Large Language Models on Mental Illnesses
- Authors: Abdelrahman Hanafi, Mohammed Saad, Noureldin Zahran, Radwa J. Hanafy, Mohammed E. Fouda
- Abstract summary: GPT-4 and Llama 3 exhibited superior performance in binary disorder detection, with accuracies reaching up to 85% on certain datasets.
Prompt engineering played a crucial role in enhancing model performance.
Despite promising results, our analysis identified several challenges, including variability in performance across datasets and the need for careful prompt engineering.
- Score: 0.8458496687170665
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models have shown promise in various domains, including healthcare. In this study, we conduct a comprehensive evaluation of LLMs in the context of mental health tasks using social media data. We explore the zero-shot (ZS) and few-shot (FS) capabilities of various LLMs, including GPT-4, Llama 3, Gemini, and others, on tasks such as binary disorder detection, disorder severity evaluation, and psychiatric knowledge assessment. Our evaluation involved 33 models testing 9 main prompt templates across the tasks. Key findings revealed that models like GPT-4 and Llama 3 exhibited superior performance in binary disorder detection, with accuracies reaching up to 85% on certain datasets. Moreover, prompt engineering played a crucial role in enhancing model performance. Notably, the Mixtral 8x22b model showed an improvement of over 20%, while Gemma 7b experienced a similar boost in performance. In the task of disorder severity evaluation, we observed that FS learning significantly improved the model's accuracy, highlighting the importance of contextual examples in complex assessments. Notably, the Phi-3-mini model exhibited a substantial increase in performance, with balanced accuracy improving by over 6.80% and mean average error dropping by nearly 1.3 when moving from ZS to FS learning. In the psychiatric knowledge task, recent models generally outperformed older, larger counterparts, with the Llama 3.1 405b achieving an accuracy of 91.2%. Despite promising results, our analysis identified several challenges, including variability in performance across datasets and the need for careful prompt engineering. Furthermore, the ethical guards imposed by many LLM providers hamper the ability to accurately evaluate their performance, due to their tendency not to respond to potentially sensitive queries.
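To make the evaluation setup concrete, the sketch below illustrates what a zero-shot versus few-shot prompt for binary disorder detection might look like. The template wording and example posts are hypothetical illustrations; the paper's nine actual templates and datasets are not reproduced here.

```python
# Hypothetical ZS/FS prompt templates for binary disorder detection.
# The wording and example posts are illustrative, not the paper's
# actual templates or data.
ZS_TEMPLATE = (
    "Does the author of the following post show signs of depression? "
    "Answer 'yes' or 'no'.\n\n"
    "Post: {post}\n"
    "Answer:"
)

FS_TEMPLATE = (
    "Does the author of each post show signs of depression? "
    "Answer 'yes' or 'no'.\n\n"
    "Post: I haven't left my bed in days and nothing feels worth doing.\n"
    "Answer: yes\n\n"
    "Post: Finished a great hike this morning, feeling energized.\n"
    "Answer: no\n\n"
    "Post: {post}\n"
    "Answer:"
)

def build_prompt(post: str, few_shot: bool = False) -> str:
    """Fill the chosen template with the post under evaluation."""
    template = FS_TEMPLATE if few_shot else ZS_TEMPLATE
    return template.format(post=post)
```

Few-shot prompting simply prepends labeled examples to the same question; this is the contextual-example effect the abstract credits for the gains on the severity task.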
Related papers
- STOP! Benchmarking Large Language Models with Sensitivity Testing on Offensive Progressions [6.19084217044276]
Mitigating explicit and implicit biases in Large Language Models (LLMs) has become a critical focus in the field of natural language processing.
We introduce the Sensitivity Testing on Offensive Progressions dataset, which includes 450 offensive progressions containing 2,700 unique sentences.
Our findings reveal that even the best-performing models detect bias inconsistently, with success rates ranging from 19.3% to 69.8%.
arXiv Detail & Related papers (2024-09-20T18:34:38Z)
- Vision-Language and Large Language Model Performance in Gastroenterology: GPT, Claude, Llama, Phi, Mistral, Gemma, and Quantized Models [0.06555599394344236]
This study evaluates the medical reasoning performance of large language models (LLMs) and vision language models (VLMs) in gastroenterology.
We used 300 gastroenterology board exam-style multiple-choice questions, 138 of which contain images.
arXiv Detail & Related papers (2024-08-25T14:50:47Z)
- Enhanced Prediction of Ventilator-Associated Pneumonia in Patients with Traumatic Brain Injury Using Advanced Machine Learning Techniques [0.0]
Ventilator-associated pneumonia (VAP) in traumatic brain injury (TBI) patients poses a significant mortality risk.
Timely detection and prognostication of VAP in TBI patients are crucial to improve patient outcomes and alleviate the strain on healthcare resources.
We implemented six machine learning models using the MIMIC-III database.
arXiv Detail & Related papers (2024-08-02T09:44:18Z)
- AgEval: A Benchmark for Zero-Shot and Few-Shot Plant Stress Phenotyping with Multimodal LLMs [19.7240633020344]
AgEval is a benchmark comprising 12 diverse plant stress phenotyping tasks.
Our study assesses zero-shot and few-shot in-context learning performance of state-of-the-art models.
arXiv Detail & Related papers (2024-07-29T00:39:51Z)
- ToolBeHonest: A Multi-level Hallucination Diagnostic Benchmark for Tool-Augmented Large Language Models [43.895478182631116]
Tool-augmented large language models (LLMs) are rapidly being integrated into real-world applications.
To address this challenge, we introduce a comprehensive diagnostic benchmark, ToolBH.
For breadth, we consider three scenarios based on the characteristics of the toolset: missing necessary tools, potential tools, and limited functionality tools.
The results show the significant challenges presented by the ToolBH benchmark.
arXiv Detail & Related papers (2024-06-28T16:03:30Z)
- Uncertainty Aware Learning for Language Model Alignment [97.36361196793929]
We propose uncertainty-aware learning (UAL) to improve model alignment across different task scenarios.
We implement UAL in a simple fashion: adaptively setting the label-smoothing value during training according to the uncertainty of individual samples.
Experiments on widely used benchmarks demonstrate that our UAL significantly and consistently outperforms standard supervised fine-tuning.
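A minimal sketch of the adaptive label-smoothing idea described above, assuming a per-sample uncertainty score in [0, 1] is already available; the paper's actual uncertainty estimator and schedule are not shown here.

```python
import torch
import torch.nn.functional as F

def ual_loss(logits, targets, uncertainty, max_smoothing=0.2):
    """Uncertainty-adaptive label smoothing (sketch).

    logits:      (batch, n_classes) model outputs
    targets:     (batch,) gold class indices
    uncertainty: (batch,) per-sample scores in [0, 1] (assumed given;
                 the paper defines its own estimator)
    """
    n_classes = logits.size(-1)
    eps = (max_smoothing * uncertainty).unsqueeze(-1)  # per-sample epsilon
    log_probs = F.log_softmax(logits, dim=-1)
    one_hot = F.one_hot(targets, n_classes).float()
    # Softer targets for more uncertain samples: (1 - eps) on the gold
    # label, eps spread uniformly over all classes.
    soft_targets = one_hot * (1.0 - eps) + eps / n_classes
    return -(soft_targets * log_probs).sum(dim=-1).mean()
```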
arXiv Detail & Related papers (2024-06-07T11:37:45Z)
- On the Calibration of Large Language Models and Alignment [63.605099174744865]
Confidence calibration serves as a crucial tool for gauging the reliability of deep models.
We conduct a systematic examination of the calibration of aligned language models throughout the entire construction process.
Our work sheds light on whether popular LLMs are well-calibrated and how the training process influences model calibration.
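For reference, confidence calibration is typically quantified with Expected Calibration Error (ECE): predictions are binned by confidence, and the gap between empirical accuracy and mean confidence is averaged, weighted by bin frequency. A minimal NumPy sketch of the standard metric (not this paper's specific protocol):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin predictions by confidence, then average the gap between
    empirical accuracy and mean confidence, weighted by bin frequency."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap
    return ece

# Toy check: an overconfident model (high confidence, mixed accuracy)
# yields a clearly nonzero ECE.
print(expected_calibration_error([0.9, 0.95, 0.9, 0.85], [1, 0, 1, 0]))
```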
arXiv Detail & Related papers (2023-11-22T08:57:55Z)
- Tool-Augmented Reward Modeling [58.381678612409]
We propose a tool-augmented preference modeling approach, named Themis, to address limitations by empowering RMs with access to external environments.
Our study delves into the integration of external tools into RMs, enabling them to interact with diverse external sources.
In human evaluations, RLHF trained with Themis attains an average win rate of 32% when compared to baselines.
arXiv Detail & Related papers (2023-10-02T09:47:40Z)
- A Comprehensive Evaluation and Analysis Study for Chinese Spelling Check [53.152011258252315]
We show that making reasonable use of phonetic and graphic information is effective for Chinese Spelling Check.
Models are sensitive to the error distribution of the test set, which exposes their shortcomings.
The commonly used benchmark, SIGHAN, cannot reliably evaluate models' performance.
arXiv Detail & Related papers (2023-07-25T17:02:38Z)
- Let's Verify Step by Step [73.58107073356732]
We show that process supervision significantly outperforms outcome supervision for training models to solve problems.
Our model solves 78% of problems from a representative subset of the MATH test set.
We also release PRM800K, the complete dataset of 800,000 step-level human feedback labels used to train our best reward model.
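Under process supervision, the reward model scores each reasoning step rather than only the final answer. One common way to turn per-step scores into a solution score, assumed here for illustration rather than taken from the paper, is the probability that every step is correct:

```python
import math

def solution_score(step_correct_probs):
    """Score a multi-step solution as the probability that every step
    is correct, given per-step correctness probabilities from a
    process reward model (PRM)."""
    return math.prod(step_correct_probs)

# A single weak step drags the whole solution down, which is the
# leverage process supervision has over outcome-only supervision.
print(solution_score([0.99, 0.98, 0.40, 0.97]))  # ~0.376
```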
arXiv Detail & Related papers (2023-05-31T17:24:00Z)
- How Robust is GPT-3.5 to Predecessors? A Comprehensive Study on Language Understanding Tasks [65.7949334650854]
GPT-3.5 models have demonstrated impressive performance in various Natural Language Processing (NLP) tasks.
However, their robustness and abilities to handle various complexities of the open world have yet to be explored.
We show that GPT-3.5 faces some specific robustness challenges, including instability, prompt sensitivity, and number sensitivity.
arXiv Detail & Related papers (2023-03-01T07:39:01Z)