Related papers: Optimizing Large Language Models for Detecting Symptoms of Comorbid Depression or Anxiety in Chronic Diseases: Insights from Patient Messages

URL: http://arxiv.org/abs/2503.11384v1
Date: Fri, 14 Mar 2025 13:27:35 GMT
Title: Optimizing Large Language Models for Detecting Symptoms of Comorbid Depression or Anxiety in Chronic Diseases: Insights from Patient Messages
Authors: Jiyeong Kim, Stephen P. Ma, Michael L. Chen, Isaac R. Galatzer-Levy, John Torous, Peter J. van Roessel, Christopher Sharp, Michael A. Pfeffer, Carolyn I. Rodriguez, Eleni Linos, Jonathan H. Chen
Abstract summary: Patients with diabetes are at increased risk of comorbid depression or anxiety, complicating their management. This study evaluated the performance of large language models (LLMs) in detecting these symptoms from secure patient messages.
Score: 4.419296403133379
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Patients with diabetes are at increased risk of comorbid depression or anxiety, complicating their management. This study evaluated the performance of large language models (LLMs) in detecting these symptoms from secure patient messages. We applied multiple approaches, including engineered prompts, systemic persona, temperature adjustments, and zero-shot and few-shot learning, to identify the best-performing model and enhance performance. Three out of five LLMs demonstrated excellent performance (over 90% of F-1 and accuracy), with Llama 3.1 405B achieving 93% in both F-1 and accuracy using a zero-shot approach. While LLMs showed promise in binary classification and handling complex metrics like Patient Health Questionnaire-4, inconsistencies in challenging cases warrant further real-life assessment. The findings highlight the potential of LLMs to assist in timely screening and referrals, providing valuable empirical knowledge for real-world triage systems that could improve mental health care for patients with chronic diseases.
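The abstract reports F-1 and accuracy for a binary classification task. As a minimal sketch of how those two metrics are computed from gold labels and model predictions (the function name `binary_metrics` and the toy labels are illustrative assumptions, not taken from the paper):

```python
def binary_metrics(y_true, y_pred):
    """Return accuracy and F1 for binary labels (1 = symptom present, 0 = absent).

    Hypothetical helper for illustration; not the authors' evaluation code.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and the same length")

    # Count the four confusion-matrix cells.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "f1": f1}


# Toy example with made-up labels (not data from the study).
gold = [1, 1, 0, 0, 1]
pred = [1, 0, 0, 0, 1]
metrics = binary_metrics(gold, pred)
print(metrics)
```

Accuracy and F-1 can diverge when the classes are imbalanced, which is one reason to report both.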

Related papers

Beyond Benchmarks: Dynamic, Automatic And Systematic Red-Teaming Agents For Trustworthy Medical Language Models [87.66870367661342]
Large language models (LLMs) are used in AI applications in healthcare. A red-teaming framework that continuously stress-tests LLMs can reveal significant weaknesses in four safety-critical domains. A suite of adversarial agents is applied to autonomously mutate test cases, identify/evolve unsafe-triggering strategies, and evaluate responses. Our framework delivers an evolvable, scalable, and reliable safeguard for the next generation of medical AI.
arXiv Detail & Related papers (2025-07-30T08:44:22Z)
Beyond Scale: Small Language Models are Comparable to GPT-4 in Mental Health Understanding [12.703061322251093]
Small Language Models (SLMs) are privacy-preserving alternatives to Large Language Models (LLMs). This paper investigates the mental health understanding capabilities of current SLMs through systematic evaluation across classification tasks. Our work highlights the potential of SLMs in mental health understanding, showing they can be effective privacy-preserving tools for analyzing sensitive online text data.
arXiv Detail & Related papers (2025-07-09T02:40:02Z)
LLMs in Disease Diagnosis: A Comparative Study of DeepSeek-R1 and O3 Mini Across Chronic Health Conditions [0.0]
DeepSeek R1 achieved a disease-level accuracy of 76% and an overall accuracy of 82%, outperforming O3 Mini. Both models, however, struggled with Respiratory Disease classification, recording accuracies of only 40% for DeepSeek R1 and 20% for O3 Mini.
arXiv Detail & Related papers (2025-03-13T15:54:26Z)
Quantifying the Reasoning Abilities of LLMs on Real-world Clinical Cases [48.87360916431396]
We introduce MedR-Bench, a benchmarking dataset of 1,453 structured patient cases, annotated with reasoning references. We propose a framework encompassing three critical stages: examination recommendation, diagnostic decision-making, and treatment planning, simulating the entire patient care journey. Using this benchmark, we evaluate five state-of-the-art reasoning LLMs, including DeepSeek-R1, OpenAI-o3-mini, and Gemini-2.0-Flash Thinking.
arXiv Detail & Related papers (2025-03-06T18:35:39Z)
Structured Outputs Enable General-Purpose LLMs to be Medical Experts [50.02627258858336]
Large language models (LLMs) often struggle with open-ended medical questions. We propose a novel approach utilizing structured medical reasoning. Our approach achieves the highest Factuality Score of 85.8, surpassing fine-tuned models.
arXiv Detail & Related papers (2025-03-05T05:24:55Z)
LlaMADRS: Prompting Large Language Models for Interview-Based Depression Assessment [75.44934940580112]
This study introduces LlaMADRS, a novel framework leveraging open-source Large Language Models (LLMs) to automate depression severity assessment. We employ a zero-shot prompting strategy with carefully designed cues to guide the model in interpreting and scoring transcribed clinical interviews. Our approach, tested on 236 real-world interviews, demonstrates strong correlations with clinician assessments.
arXiv Detail & Related papers (2025-01-07T08:49:04Z)
Large Language Models for Patient Comments Multi-Label Classification [3.670008893193884]
This research explores leveraging Large Language Models (LLMs) in conducting Multi-label Text Classification (MLTC) of inpatient comments. GPT-4 Turbo was leveraged to conduct the classification. Using the prompt engineering framework, zero-shot learning, in-context learning, and chain-of-thought prompting were experimented with.
arXiv Detail & Related papers (2024-10-31T00:29:52Z)
AIPatient: Simulating Patients with EHRs and LLM Powered Agentic Workflow [33.8495939261319]
We develop an advanced simulated patient system with AIPatient Knowledge Graph (AIPatient KG) as the input and Reasoning Retrieval-Augmented Generation (Reasoning RAG) as the generation backbone. Reasoning RAG leverages six LLM powered agents spanning tasks including retrieval, KG query generation, abstraction, checker, rewrite, and summarization. Our system also presents high readability (median Flesch Reading Ease 77.23; median Flesch Kincaid Grade 5.6), robustness (ANOVA F-value 0.6126, p>0.1), and stability (ANOVA F-value 0.782, p>0.1).
arXiv Detail & Related papers (2024-09-27T17:17:15Z)
A Comprehensive Evaluation of Large Language Models on Mental Illnesses [0.8458496687170665]
GPT-4 and Llama 3 exhibited superior performance in binary disorder detection, with accuracies reaching up to 85% on certain datasets. Prompt engineering played a crucial role in enhancing model performance. Despite promising results, our analysis identified several challenges, including variability in performance across datasets and the need for careful prompt engineering.
arXiv Detail & Related papers (2024-09-24T02:58:52Z)
Safe and Interpretable Estimation of Optimal Treatment Regimes [54.257304443780434]
We operationalize a safe and interpretable framework to identify optimal treatment regimes. Our findings support personalized treatment strategies based on a patient's medical history and pharmacological features.
arXiv Detail & Related papers (2023-10-23T19:59:10Z)
Large Language Models for Healthcare Data Augmentation: An Example on Patient-Trial Matching [49.78442796596806]
We propose an innovative privacy-aware data augmentation approach for patient-trial matching (LLM-PTM). Our experiments demonstrate a 7.32% average improvement in performance using the proposed LLM-PTM method, and the generalizability to new data is improved by 12.12%.
arXiv Detail & Related papers (2023-03-24T03:14:00Z)
SPeC: A Soft Prompt-Based Calibration on Performance Variability of Large Language Model in Clinical Notes Summarization [50.01382938451978]
We introduce a model-agnostic pipeline that employs soft prompts to diminish variance while preserving the advantages of prompt-based summarization. Experimental findings indicate that our method not only bolsters performance but also effectively curbs variance for various language models.
arXiv Detail & Related papers (2023-03-23T04:47:46Z)

This list is automatically generated from the titles and abstracts of the papers in this site.