A Comprehensive Evaluation of Large Language Models on Mental Illnesses
- URL: http://arxiv.org/abs/2409.15687v1
- Date: Tue, 24 Sep 2024 02:58:52 GMT
- Title: A Comprehensive Evaluation of Large Language Models on Mental Illnesses
- Authors: Abdelrahman Hanafi, Mohammed Saad, Noureldin Zahran, Radwa J. Hanafy, Mohammed E. Fouda
- Abstract summary: GPT-4 and Llama 3 exhibited superior performance in binary disorder detection, with accuracies reaching up to 85% on certain datasets.
Prompt engineering played a crucial role in enhancing model performance.
Despite promising results, our analysis identified several challenges, including variability in performance across datasets and the need for careful prompt engineering.
- Score: 0.8458496687170665
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models have shown promise in various domains, including healthcare. In this study, we conduct a comprehensive evaluation of LLMs in the context of mental health tasks using social media data. We explore the zero-shot (ZS) and few-shot (FS) capabilities of various LLMs, including GPT-4, Llama 3, Gemini, and others, on tasks such as binary disorder detection, disorder severity evaluation, and psychiatric knowledge assessment. Our evaluation involved 33 models and 9 main prompt templates across the tasks. Key findings revealed that models like GPT-4 and Llama 3 exhibited superior performance in binary disorder detection, with accuracies reaching up to 85% on certain datasets. Moreover, prompt engineering played a crucial role in enhancing model performance. Notably, the Mixtral 8x22b model showed an improvement of over 20%, while Gemma 7b experienced a similar boost in performance. In the task of disorder severity evaluation, we observed that FS learning significantly improved model accuracy, highlighting the importance of contextual examples in complex assessments. Notably, the Phi-3-mini model exhibited a substantial increase in performance, with balanced accuracy improving by over 6.80% and mean average error dropping by nearly 1.3 when moving from ZS to FS learning. In the psychiatric knowledge task, recent models generally outperformed older, larger counterparts, with the Llama 3.1 405b achieving an accuracy of 91.2%. Despite promising results, our analysis identified several challenges, including variability in performance across datasets and the need for careful prompt engineering. Furthermore, the ethical guards imposed by many LLM providers hamper the ability to accurately evaluate their performance, due to their tendency not to respond to potentially sensitive queries.
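The abstract reports severity-evaluation results as balanced accuracy and "mean average error". As a minimal sketch, assuming the latter refers to the usual mean absolute error, the two metrics can be computed as follows. The 0-3 severity scale, the label lists, and the function names are hypothetical illustrations and are not taken from the paper.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Average of per-class recall, so rare classes count as much as common ones."""
    total = defaultdict(int)
    correct = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    return sum(correct[c] / total[c] for c in total) / len(total)


def mean_absolute_error(y_true, y_pred):
    """Average absolute distance between predicted and gold severity levels."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical gold labels and model outputs on an assumed 0-3 severity scale.
gold = [0, 0, 1, 2, 3, 3]
zero_shot = [0, 1, 1, 0, 3, 1]
few_shot = [0, 0, 1, 1, 3, 2]

for name, preds in [("ZS", zero_shot), ("FS", few_shot)]:
    print(name, balanced_accuracy(gold, preds), mean_absolute_error(gold, preds))
```

Balanced accuracy only rewards exact matches, while the absolute error also credits predictions that land close to the gold severity level, which is why the two are usually reported together for ordinal tasks.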
Related papers
- Self-Generated Critiques Boost Reward Modeling for Language Models [57.60881438647227]
Critic-RM is a framework that improves reward models using self-generated critiques without extra supervision.
Experiments show that Critic-RM improves reward modeling accuracy by 3.7%-7.3% compared to standard reward models and LLM judges.
arXiv Detail & Related papers (2024-11-25T18:28:26Z)
- Effective Predictive Modeling for Emergency Department Visits and Evaluating Exogenous Variables Impact: Using Explainable Meta-learning Gradient Boosting [0.18846515534317265]
We introduce a novel Meta-learning Gradient Booster (Meta-ED) approach for precisely forecasting daily ED visits.
The proposed Meta-ED consists of four foundational learners (Catboost, Random Forest, Extra Tree, and lightGBoost) alongside a dependable top-level learner, a Multi-Layer Perceptron (MLP).
Our study assesses the efficacy of the Meta-ED model through an extensive comparative analysis involving 23 models.
arXiv Detail & Related papers (2024-11-18T04:23:20Z)
- The Surprising Effectiveness of Test-Time Training for Abstract Reasoning [64.36534512742736]
We investigate the effectiveness of test-time training (TTT) as a mechanism for improving models' reasoning capabilities.
TTT significantly improves performance on ARC tasks, achieving up to 6x improvement in accuracy compared to base fine-tuned models.
Our findings suggest that explicit symbolic search is not the only path to improved abstract reasoning in neural language models.
arXiv Detail & Related papers (2024-11-11T18:59:45Z)
- Vision-Language and Large Language Model Performance in Gastroenterology: GPT, Claude, Llama, Phi, Mistral, Gemma, and Quantized Models [0.06555599394344236]
This study evaluates the medical reasoning performance of large language models (LLMs) and vision language models (VLMs) in gastroenterology.
We used 300 gastroenterology board exam-style multiple-choice questions, 138 of which contain images.
arXiv Detail & Related papers (2024-08-25T14:50:47Z)
- Enhanced Prediction of Ventilator-Associated Pneumonia in Patients with Traumatic Brain Injury Using Advanced Machine Learning Techniques [0.0]
Ventilator-associated pneumonia (VAP) in traumatic brain injury (TBI) patients poses a significant mortality risk.
Timely detection and prognostication of VAP in TBI patients are crucial to improve patient outcomes and alleviate the strain on healthcare resources.
We implemented six machine learning models using the MIMIC-III database.
arXiv Detail & Related papers (2024-08-02T09:44:18Z)
- AgEval: A Benchmark for Zero-Shot and Few-Shot Plant Stress Phenotyping with Multimodal LLMs [19.7240633020344]
AgEval is a benchmark comprising 12 diverse plant stress phenotyping tasks.
Our study assesses zero-shot and few-shot in-context learning performance of state-of-the-art models.
arXiv Detail & Related papers (2024-07-29T00:39:51Z)
- ToolBeHonest: A Multi-level Hallucination Diagnostic Benchmark for Tool-Augmented Large Language Models [43.895478182631116]
Tool-augmented large language models (LLMs) are rapidly being integrated into real-world applications.
To address this challenge, we introduce a comprehensive diagnostic benchmark, ToolBH.
For breadth, we consider three scenarios based on the characteristics of the toolset: missing necessary tools, potential tools, and limited functionality tools.
The results show the significant challenges presented by the ToolBH benchmark.
arXiv Detail & Related papers (2024-06-28T16:03:30Z)
- Tool-Augmented Reward Modeling [58.381678612409]
We propose a tool-augmented preference modeling approach, named Themis, to address limitations by empowering RMs with access to external environments.
Our study delves into the integration of external tools into RMs, enabling them to interact with diverse external sources.
In human evaluations, RLHF trained with Themis attains an average win rate of 32% when compared to baselines.
arXiv Detail & Related papers (2023-10-02T09:47:40Z)
- A Comprehensive Evaluation and Analysis Study for Chinese Spelling Check [53.152011258252315]
We show that using phonetic and graphic information reasonably is effective for Chinese Spelling Check.
Models are sensitive to the error distribution of the test set, which reflects the shortcomings of models.
The commonly used benchmark, SIGHAN, cannot reliably evaluate models' performance.
arXiv Detail & Related papers (2023-07-25T17:02:38Z)
- Let's Verify Step by Step [73.58107073356732]
We show that process supervision significantly outperforms outcome supervision for training models to solve problems.
Our model solves 78% of problems from a representative subset of the MATH test set.
We also release PRM800K, the complete dataset of 800,000 step-level human feedback labels used to train our best reward model.
arXiv Detail & Related papers (2023-05-31T17:24:00Z)
- How Robust is GPT-3.5 to Predecessors? A Comprehensive Study on Language Understanding Tasks [65.7949334650854]
GPT-3.5 models have demonstrated impressive performance in various Natural Language Processing (NLP) tasks.
However, their robustness and abilities to handle various complexities of the open world have yet to be explored.
We show that GPT-3.5 faces some specific robustness challenges, including instability, prompt sensitivity, and number sensitivity.
arXiv Detail & Related papers (2023-03-01T07:39:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.