Related papers: WellDunn: On the Robustness and Explainability of Language Models and Large Language Models in Identifying Wellness Dimensions

URL: http://arxiv.org/abs/2406.12058v4
Date: Mon, 07 Oct 2024 14:08:13 GMT
Title: WellDunn: On the Robustness and Explainability of Language Models and Large Language Models in Identifying Wellness Dimensions
Authors: Seyedali Mohammadi, Edward Raff, Jinendra Malekar, Vedant Palit, Francis Ferraro, Manas Gaur
Abstract summary: Language Models (LMs) are being proposed for mental health applications where the heightened risk of adverse outcomes means predictive performance may not be a sufficient litmus test of a model's utility in clinical practice. We introduce an evaluation design that focuses on the robustness and explainability of LMs in identifying Wellness Dimensions (WDs). We reveal four surprising results about LMs/LLMs.
Score: 46.60244609728416
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Language Models (LMs) are being proposed for mental health applications where the heightened risk of adverse outcomes means predictive performance may not be a sufficient litmus test of a model's utility in clinical practice. A model that can be trusted for practice should have a correspondence between explanation and clinical determination, yet no prior research has examined the attention fidelity of these models and their effect on ground truth explanations. We introduce an evaluation design that focuses on the robustness and explainability of LMs in identifying Wellness Dimensions (WDs). We focus on two existing mental health and well-being datasets: (a) Multi-label Classification-based MultiWD, and (b) WellXplain for evaluating attention mechanism veracity against expert-labeled explanations. The labels are based on Halbert Dunn's theory of wellness, which gives grounding to our evaluation. We reveal four surprising results about LMs/LLMs: (1) Despite their human-like capabilities, GPT-3.5/4 lag behind RoBERTa, and MedAlpaca, a fine-tuned LLM on WellXplain, fails to deliver any remarkable improvements in performance or explanations. (2) Re-examining LMs' predictions based on a confidence-oriented loss function reveals a significant performance drop. (3) Across all LMs/LLMs, the alignment between attention and explanations remains low, with LLMs scoring a dismal 0.0. (4) Most mental health-specific LMs/LLMs overlook domain-specific knowledge and undervalue explanations, causing these discrepancies. This study highlights the need for further research into their consistency and explanations in mental health and well-being.

Related papers

A Systematic Evaluation of Large Language Models for PTSD Severity Estimation: The Role of Contextual Knowledge and Modeling Strategies [24.732452865928053]
Large language models (LLMs) are increasingly being used in a zero-shot fashion to assess mental health conditions. This study utilizes a clinical dataset of natural language narratives and self-reported PTSD severity scores from 1,437 individuals to evaluate the performance of 11 state-of-the-art LLMs.
arXiv Detail & Related papers (2026-02-05T18:53:17Z)
Towards Transparent Reasoning: What Drives Faithfulness in Large Language Models? [8.042664286747419]
Large Language Models (LLMs) often produce explanations that do not faithfully reflect the factors driving their predictions. We study how inference and training-time choices shape explanation faithfulness, focusing on factors practitioners can control at deployment.
arXiv Detail & Related papers (2025-10-28T09:43:49Z)
An Empirical Analysis of VLM-based OOD Detection: Mechanisms, Advantages, and Sensitivity [104.05991573442805]
Vision-Language Models (VLMs) have demonstrated remarkable zero-shot out-of-distribution (OOD) detection capabilities. This paper presents a systematic empirical analysis of VLM-based OOD detection using in-distribution (ID) and OOD prompts.
arXiv Detail & Related papers (2025-09-16T06:11:02Z)
A Gold Standard Dataset and Evaluation Framework for Depression Detection and Explanation in Social Media using LLMs [0.0]
Early detection of depression from online social media posts holds promise for providing timely mental health interventions. We present a high-quality, expert-annotated dataset of 1,017 social media posts labeled with depressive spans and mapped to 12 depression symptom categories.
arXiv Detail & Related papers (2025-07-26T10:01:55Z)
DeVisE: Behavioral Testing of Medical Large Language Models [14.832083455439749]
DeVisE is a behavioral testing framework for probing fine-grained clinical understanding. We construct a dataset of ICU discharge notes from MIMIC-IV. We evaluate five LLMs spanning general-purpose and medically fine-tuned variants.
arXiv Detail & Related papers (2025-06-18T10:42:22Z)
Aligned Probing: Relating Toxic Behavior and Model Internals [66.49887503194101]
We introduce aligned probing, a novel interpretability framework that aligns the behavior of language models (LMs) with their internal representations. Using this framework, we examine over 20 OLMo, Llama, and Mistral models, bridging behavioral and internal perspectives for toxicity for the first time. Our results show that LMs strongly encode information about the toxicity level of inputs and subsequent outputs, particularly in lower layers.
arXiv Detail & Related papers (2025-03-17T17:23:50Z)
Cognitive-Mental-LLM: Evaluating Reasoning in Large Language Models for Mental Health Prediction via Online Text [0.0]
This study evaluates structured reasoning techniques, namely Chain-of-Thought (CoT), Self-Consistency (SC-CoT), and Tree-of-Thought (ToT), to improve classification accuracy across multiple mental health datasets sourced from Reddit. We analyze reasoning-driven prompting strategies, including Zero-shot CoT and Few-shot CoT, using key performance metrics such as Balanced Accuracy, F1 score, and Sensitivity/Specificity. Our findings indicate that reasoning-enhanced techniques improve classification performance over direct prediction, particularly in complex cases.
arXiv Detail & Related papers (2025-03-13T06:42:37Z)
Quantifying the Reasoning Abilities of LLMs on Real-world Clinical Cases [48.87360916431396]
We introduce MedR-Bench, a benchmarking dataset of 1,453 structured patient cases, annotated with reasoning references. We propose a framework encompassing three critical stages: examination recommendation, diagnostic decision-making, and treatment planning, simulating the entire patient care journey. Using this benchmark, we evaluate five state-of-the-art reasoning LLMs, including DeepSeek-R1, OpenAI-o3-mini, and Gemini-2.0-Flash Thinking.
arXiv Detail & Related papers (2025-03-06T18:35:39Z)
Quantifying depressive mental states with large language models [0.0]
Large Language Models (LLMs) may have an important role to play in mental health. We outline and evaluate LLM performance on three critical tests.
arXiv Detail & Related papers (2025-02-13T16:52:06Z)
HuDEx: Integrating Hallucination Detection and Explainability for Enhancing the Reliability of LLM responses [0.12499537119440242]
This paper proposes an explanation-enhanced hallucination-detection model, coined HuDEx. The proposed model provides a novel approach to integrating detection with explanations, and enables both users and the LLM itself to understand and reduce errors.
arXiv Detail & Related papers (2025-02-12T04:17:02Z)
LlaMADRS: Prompting Large Language Models for Interview-Based Depression Assessment [75.44934940580112]
This study introduces LlaMADRS, a novel framework leveraging open-source Large Language Models (LLMs) to automate depression severity assessment. We employ a zero-shot prompting strategy with carefully designed cues to guide the model in interpreting and scoring transcribed clinical interviews. Our approach, tested on 236 real-world interviews, demonstrates strong correlations with clinician assessments.
arXiv Detail & Related papers (2025-01-07T08:49:04Z)
Belief in the Machine: Investigating Epistemological Blind Spots of Language Models [51.63547465454027]
Language models (LMs) are essential for reliable decision-making in fields like healthcare, law, and journalism. This study systematically evaluates the capabilities of modern LMs, including GPT-4, Claude-3, and Llama-3, using a new dataset, KaBLE. Our results reveal key limitations. First, while LMs achieve 86% accuracy on factual scenarios, their performance drops significantly with false scenarios. Second, LMs struggle with recognizing and affirming personal beliefs, especially when those beliefs contradict factual data.
arXiv Detail & Related papers (2024-10-28T16:38:20Z)
MentalGLM Series: Explainable Large Language Models for Mental Health Analysis on Chinese Social Media [31.752563319585196]
Black box models are inflexible when switching between tasks, and their results typically lack explanations. With the rise of large language models (LLMs), their flexibility has introduced new approaches to the field. In this paper, we introduce the first multi-task Chinese Social Media Interpretable Mental Health Instructions dataset, consisting of 9K samples. We also propose MentalGLM series models, the first open-source LLMs designed for explainable mental health analysis targeting Chinese social media.
arXiv Detail & Related papers (2024-10-14T09:29:27Z)
SemioLLM: Assessing Large Language Models for Semiological Analysis in Epilepsy Research [45.2233252981348]
Large Language Models have shown promising results in their ability to encode general medical knowledge. We test the ability of state-of-the-art LLMs to leverage their internal knowledge and reasoning for epilepsy diagnosis.
arXiv Detail & Related papers (2024-07-03T11:02:12Z)
Evaluating Human Alignment and Model Faithfulness of LLM Rationale [66.75309523854476]
We study how well large language models (LLMs) explain their generations through rationales. We show that prompting-based methods are less "faithful" than attribution-based explanations.
arXiv Detail & Related papers (2024-06-28T20:06:30Z)
VALOR-EVAL: Holistic Coverage and Faithfulness Evaluation of Large Vision-Language Models [57.43276586087863]
Large Vision-Language Models (LVLMs) suffer from hallucination issues, wherein the models generate plausible-sounding but factually incorrect outputs. Existing benchmarks are often limited in scope, focusing mainly on object hallucinations. We introduce a multi-dimensional benchmark covering objects, attributes, and relations, with challenging images selected based on associative biases.
arXiv Detail & Related papers (2024-04-22T04:49:22Z)
Large Language Models in Medical Term Classification and Unexpected Misalignment Between Response and Reasoning [28.355000184014084]
This study assesses the ability of state-of-the-art large language models (LLMs) to identify patients with mild cognitive impairment (MCI) from discharge summaries. The data was partitioned into training, validation, and testing sets in a 7:2:1 ratio for model fine-tuning and evaluation. Open-source models like Falcon and LLaMA 2 achieved high accuracy but lacked explanatory reasoning.
arXiv Detail & Related papers (2023-12-19T17:36:48Z)
Language Models Hallucinate, but May Excel at Fact Verification [89.0833981569957]
Large language models (LLMs) frequently "hallucinate," resulting in non-factual outputs. Even GPT-3.5 produces factual outputs less than 25% of the time. This underscores the importance of fact verifiers in order to measure and incentivize progress.
arXiv Detail & Related papers (2023-10-23T04:39:01Z)
MentaLLaMA: Interpretable Mental Health Analysis on Social Media with Large Language Models [28.62967557368565]
We build the first multi-task and multi-source interpretable mental health instruction dataset on social media, with 105K data samples. We use expert-written few-shot prompts and collected labels to prompt ChatGPT and obtain explanations from its responses. Based on the IMHI dataset and LLaMA2 foundation models, we train MentaLLaMA, the first open-source LLM series for interpretable mental health analysis.
arXiv Detail & Related papers (2023-09-24T06:46:08Z)
Navigating the Grey Area: How Expressions of Uncertainty and Overconfidence Affect Language Models [74.07684768317705]
LMs are highly sensitive to markers of certainty in prompts, with accuracies varying by more than 80%. We find that expressions of high certainty result in a decrease in accuracy compared to expressions of low certainty; similarly, factive verbs hurt performance, while evidentials benefit performance. These associations may suggest that LM behavior is based on observed language use, rather than truly reflecting uncertainty.
arXiv Detail & Related papers (2023-02-26T23:46:29Z)
Explainability of Traditional and Deep Learning Models on Longitudinal Healthcare Records [0.0]
Rigorous evaluation of explainability is often missing, as comparisons between models and various explainability methods have not been well-studied. Our work is one of the first to evaluate explainability performance between and within traditional (XGBoost) and deep learning (LSTM with Attention) models on both a global and individual per-prediction level.
arXiv Detail & Related papers (2022-11-22T04:39:17Z)

This list is automatically generated from the titles and abstracts of the papers on this site.