Patterns vs. Patients: Evaluating LLMs against Mental Health Professionals on Personality Disorder Diagnosis through First-Person Narratives
- URL: http://arxiv.org/abs/2512.20298v1
- Date: Tue, 23 Dec 2025 12:05:01 GMT
- Title: Patterns vs. Patients: Evaluating LLMs against Mental Health Professionals on Personality Disorder Diagnosis through First-Person Narratives
- Authors: Karolina Drożdż, Kacper Dudzic, Anna Sterna, Marcin Moskalewicz
- Abstract summary: We show that top-performing Gemini Pro models surpassed human professionals in overall diagnostic accuracy by 21.91 percentage points. While both models and human experts excelled at identifying BPD (F1 = 83.4 and F1 = 80.0, respectively), models severely underdiagnosed NPD (F1 = 6.7 vs. 50.0), showing a reluctance toward the value-laden term "narcissism." Our findings demonstrate that while LLMs are highly competent at interpreting complex first-person clinical data, they remain subject to critical reliability and bias issues.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Growing reliance on LLMs for psychiatric self-assessment raises questions about their ability to interpret qualitative patient narratives. We present the first direct comparison between state-of-the-art LLMs and mental health professionals in diagnosing Borderline (BPD) and Narcissistic (NPD) Personality Disorders utilizing Polish-language first-person autobiographical accounts. We show that the top-performing Gemini Pro models surpassed human professionals in overall diagnostic accuracy by 21.91 percentage points (65.48% vs. 43.57%). While both models and human experts excelled at identifying BPD (F1 = 83.4 & F1 = 80.0, respectively), models severely underdiagnosed NPD (F1 = 6.7 vs. 50.0), showing a reluctance toward the value-laden term "narcissism." Qualitatively, models provided confident, elaborate justifications focused on patterns and formal categories, while human experts remained concise and cautious, emphasizing the patient's sense of self and temporal experience. Our findings demonstrate that while LLMs are highly competent at interpreting complex first-person clinical data, they remain subject to critical reliability and bias issues.
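The headline comparisons above are standard binary-classification metrics. As a minimal sketch of how such numbers are derived (the labels below are purely hypothetical, not the paper's data):

```python
# Accuracy and F1 for a binary diagnosis task, computed from scratch.
# Gold labels and predictions are illustrative only.

def f1_score(y_true, y_pred, positive=1):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical gold labels (1 = disorder present) and two raters' calls
gold  = [1, 1, 1, 0, 0, 0, 1, 0]
model = [1, 1, 0, 0, 0, 1, 1, 0]
human = [1, 0, 0, 0, 0, 0, 1, 1]

print(f"model: acc={accuracy(gold, model):.2f}, F1={f1_score(gold, model):.2f}")
print(f"human: acc={accuracy(gold, human):.2f}, F1={f1_score(gold, human):.2f}")
```

Note that F1 (the harmonic mean of precision and recall on the positive class) and overall accuracy can diverge sharply when one class is rarely predicted, which is how a model can look strong overall while scoring F1 = 6.7 on NPD.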
Related papers
- LiveClin: A Live Clinical Benchmark without Leakage [50.45415584327275]
LiveClin is a live benchmark designed for approximating real-world clinical practice. We transform authentic patient cases into complex, multimodal evaluation scenarios that span the entire clinical pathway. Our evaluation of 26 models on LiveClin reveals the profound difficulty of these real-world scenarios, with the top-performing model achieving a Case Accuracy of just 35.7%.
arXiv Detail & Related papers (2026-02-18T03:59:46Z) - Large Language Models as Simulative Agents for Neurodivergent Adult Psychometric Profiles [0.0]
Adult neurodivergence includes Attention-Deficit/Hyperactivity Disorder (ADHD), high-functioning Autism Spectrum Disorder (ASD), and Cognitive Disengagement Syndrome (CDS). It remains unclear whether Large Language Models (LLMs) can accurately and stably model neurodevelopmental traits rather than broad personality characteristics. This study examines whether LLMs can generate psychometric responses that approximate those of real individuals when grounded in a structured qualitative interview.
arXiv Detail & Related papers (2026-01-16T10:16:58Z) - AI-Powered Early Diagnosis of Mental Health Disorders from Real-World Clinical Conversations [7.061237517845673]
Mental health disorders remain among the leading causes of disability worldwide. Conditions such as depression, anxiety, and Post-Traumatic Stress Disorder (PTSD) are frequently underdiagnosed or misdiagnosed. In primary care settings, studies show that providers misidentify depression or anxiety in over 60% of cases.
arXiv Detail & Related papers (2025-10-16T17:50:04Z) - Mapping Patient-Perceived Physician Traits from Nationwide Online Reviews with LLMs [3.364244912862208]
We present a large language model (LLM)-based pipeline that infers Big Five personality traits and five patient subjective judgments. The analysis encompasses 4.1 million patient reviews of 226,999 U.S. physicians from an initial pool of one million.
arXiv Detail & Related papers (2025-10-05T02:16:35Z) - Sense of Self and Time in Borderline Personality. A Comparative Robustness Study with Generative AI [0.0]
This study examines the capacity of large language models (LLMs) to support qualitative analysis of first-person experience in Borderline Personality Disorder (BPD). Three LLMs were compared on their ability to mimic the interpretative style of the original investigators. Results showed variable overlap with the human analysis, from 0 percent in GPT to 42 percent in Claude and 58 percent in Gemini, and a low Jaccard coefficient (0.21-0.28). Gemini's output most closely resembled the human analysis, with validity scores significantly higher than GPT and Claude (p < 0.0001), and was judged as human by blinded experts.
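The Jaccard coefficient reported above quantifies set overlap, here between the theme sets produced by human and model coders. As a minimal illustration (the theme sets below are hypothetical, not drawn from the study):

```python
# Jaccard coefficient: |A ∩ B| / |A ∪ B|, a symmetric measure of
# overlap between two sets. Theme labels are hypothetical examples.

def jaccard(a, b):
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

human_themes = {"identity diffusion", "emptiness", "fragmented time", "abandonment"}
model_themes = {"emptiness", "abandonment", "impulsivity",
                "mood instability", "self-harm", "dissociation"}

# 2 shared themes out of 8 distinct themes -> 0.25
print(f"Jaccard = {jaccard(human_themes, model_themes):.2f}")
```

A value of 0.21-0.28 thus means the human and model theme sets shared roughly a quarter of their combined codes, even where percentage overlap against the human set alone looked higher.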
arXiv Detail & Related papers (2025-08-26T13:13:47Z) - MoodAngels: A Retrieval-augmented Multi-agent Framework for Psychiatry Diagnosis [58.67342568632529]
MoodAngels is the first specialized multi-agent framework for mood disorder diagnosis. MoodSyn is an open-source dataset of 1,173 synthetic psychiatric cases.
arXiv Detail & Related papers (2025-06-04T09:18:25Z) - Detecting PTSD in Clinical Interviews: A Comparative Analysis of NLP Methods and Large Language Models [6.916082619621498]
Post-Traumatic Stress Disorder (PTSD) remains underdiagnosed in clinical settings. This study evaluates natural language processing approaches for detecting PTSD from clinical interview transcripts.
arXiv Detail & Related papers (2025-04-01T22:06:28Z) - LlaMADRS: Prompting Large Language Models for Interview-Based Depression Assessment [75.44934940580112]
This study introduces LlaMADRS, a novel framework leveraging open-source Large Language Models (LLMs) to automate depression severity assessment. We employ a zero-shot prompting strategy with carefully designed cues to guide the model in interpreting and scoring transcribed clinical interviews. Our approach, tested on 236 real-world interviews, demonstrates strong correlations with clinician assessments.
arXiv Detail & Related papers (2025-01-07T08:49:04Z) - MentalArena: Self-play Training of Language Models for Diagnosis and Treatment of Mental Health Disorders [59.515827458631975]
Mental health disorders are among the most serious diseases in the world. Privacy concerns limit the accessibility of personalized treatment data. MentalArena is a self-play framework to train language models.
arXiv Detail & Related papers (2024-10-09T13:06:40Z) - WellDunn: On the Robustness and Explainability of Language Models and Large Language Models in Identifying Wellness Dimensions [46.60244609728416]
Language Models (LMs) are being proposed for mental health applications where the heightened risk of adverse outcomes means predictive performance may not be a litmus test of a model's utility in clinical practice.
We introduce an evaluation design that focuses on the robustness and explainability of LMs in identifying Wellness Dimensions (WDs).
We reveal four surprising results about LMs/LLMs.
arXiv Detail & Related papers (2024-06-17T19:50:40Z) - LLM Questionnaire Completion for Automatic Psychiatric Assessment [49.1574468325115]
We employ a Large Language Model (LLM) to convert unstructured psychological interviews into structured questionnaires spanning various psychiatric and personality domains.
The obtained answers are coded as features, which are used to predict standardized psychiatric measures of depression (PHQ-8) and PTSD (PCL-C).
arXiv Detail & Related papers (2024-06-09T09:03:11Z) - Empowering Psychotherapy with Large Language Models: Cognitive Distortion Detection through Diagnosis of Thought Prompting [82.64015366154884]
We study the task of cognitive distortion detection and propose the Diagnosis of Thought (DoT) prompting.
DoT performs diagnosis on the patient's speech via three stages: subjectivity assessment to separate the facts and the thoughts; contrastive reasoning to elicit the reasoning processes supporting and contradicting the thoughts; and schema analysis to summarize the cognition schemas.
Experiments demonstrate that DoT obtains significant improvements over ChatGPT for cognitive distortion detection, while generating high-quality rationales approved by human experts.
arXiv Detail & Related papers (2023-10-11T02:47:21Z) - The Capability of Large Language Models to Measure Psychiatric Functioning [9.938814639951842]
Med-PaLM 2 is capable of assessing psychiatric functioning across a range of psychiatric conditions.
The strongest performance was the prediction of depression scores based on standardized assessments.
Results show the potential for general clinical language models to flexibly predict psychiatric risk.
arXiv Detail & Related papers (2023-08-03T15:52:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.