Sense of Self and Time in Borderline Personality. A Comparative Robustness Study with Generative AI
- URL: http://arxiv.org/abs/2508.19008v1
- Date: Tue, 26 Aug 2025 13:13:47 GMT
- Title: Sense of Self and Time in Borderline Personality. A Comparative Robustness Study with Generative AI
- Authors: Marcin Moskalewicz, Anna Sterna, Marek Pokropski, Paula Flores,
- Abstract summary: This study examines the capacity of large language models (LLMs) to support qualitative analysis of first-person experience in Borderline Personality Disorder (BPD). Three LLMs, prompted to mimic the interpretative style of the original investigators, were compared. Results showed variable overlap with the human analysis, from 0 percent in GPT to 42 percent in Claude and 58 percent in Gemini, and a low Jaccard coefficient (0.21-0.28). Gemini's output most closely resembled the human analysis, with validity scores significantly higher than GPT and Claude (p < 0.0001), and was judged as human by blinded experts.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This study examines the capacity of large language models (LLMs) to support phenomenological qualitative analysis of first-person experience in Borderline Personality Disorder (BPD), understood as a disorder of temporality and selfhood. Building on a prior human-led thematic analysis of 24 inpatients' life-story interviews, we compared three LLMs (OpenAI GPT-4o, Google Gemini 2.5 Pro, Anthropic Claude Opus 4) prompted to mimic the interpretative style of the original investigators. The models were evaluated with blinded and non-blinded expert judges in phenomenology and clinical psychology. Assessments included semantic congruence, Jaccard coefficients, and multidimensional validity ratings (credibility, coherence, substantiveness, and groundedness in data). Results showed variable overlap with the human analysis, from 0 percent in GPT to 42 percent in Claude and 58 percent in Gemini, and a low Jaccard coefficient (0.21-0.28). However, the models recovered themes omitted by humans. Gemini's output most closely resembled the human analysis, with validity scores significantly higher than GPT and Claude (p < 0.0001), and was judged as human by blinded experts. All scores strongly correlated (R > 0.78) with the quantity of text and words per theme, highlighting both the variability and potential of AI-augmented thematic analysis to mitigate human interpretative bias.
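The Jaccard coefficients reported above (0.21-0.28) measure set overlap between the themes identified by the human analysts and by each model: the size of the intersection divided by the size of the union. A minimal sketch, with hypothetical theme labels standing in for the paper's actual codes:

```python
# Jaccard overlap between two theme sets, as used to compare human- and
# LLM-generated thematic analyses. The theme labels below are illustrative
# placeholders, not codes taken from the study.
def jaccard(themes_a, themes_b):
    a, b = set(themes_a), set(themes_b)
    # Empty-union edge case: define overlap of two empty sets as 0.0.
    return len(a & b) / len(a | b) if a | b else 0.0

human_themes = {"temporal fragmentation", "identity diffusion", "chronic emptiness"}
model_themes = {"identity diffusion", "chronic emptiness", "fear of abandonment"}
print(round(jaccard(human_themes, model_themes), 2))  # 0.5
```

A coefficient in the 0.21-0.28 range therefore means the themes shared by human and model are only a small fraction of the combined theme pool.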
Related papers
- PanCanBench: A Comprehensive Benchmark for Evaluating Large Language Models in Pancreatic Oncology [48.732366302949515]
Large language models (LLMs) have achieved expert-level performance on standardized examinations, yet multiple-choice accuracy poorly reflects real-world clinical utility and safety. We developed a human-in-the-loop pipeline to create expert rubrics for de-identified patient questions. We evaluated 22 proprietary and open-source LLMs using an LLM-as-a-judge framework, measuring clinical completeness, factual accuracy, and web-search integration.
arXiv Detail & Related papers (2026-03-02T00:50:39Z) - Artificial Rigidities vs. Biological Noise: A Comparative Analysis of Multisensory Integration in AV-HuBERT and Human Observers [0.0]
This study evaluates AV-HuBERT's perceptual bio-fidelity by benchmarking it against human observers. Results reveal a striking quantitative isomorphism: AI and humans exhibited nearly identical auditory dominance rates.
arXiv Detail & Related papers (2026-01-22T11:18:16Z) - HUMANLLM: Benchmarking and Reinforcing LLM Anthropomorphism via Human Cognitive Patterns [59.17423586203706]
We present HUMANLLM, a framework treating psychological patterns as interacting causal forces. We construct 244 patterns from 12,000 academic papers and synthesize 11,359 scenarios where 2-5 patterns reinforce, conflict, or modulate each other. Our dual-level checklists evaluate both individual pattern fidelity and emergent multi-pattern dynamics, achieving strong human alignment.
arXiv Detail & Related papers (2026-01-15T08:56:53Z) - Patterns vs. Patients: Evaluating LLMs against Mental Health Professionals on Personality Disorder Diagnosis through First-Person Narratives [0.0]
We show that top-performing Gemini Pro models surpassed human professionals in overall diagnostic accuracy by 21.91 percentage points. While both models and human experts excelled at identifying BPD (F1 = 83.4 & F1 = 80.0, respectively), models severely underdiagnosed NPD (F1 = 6.7 vs. 50.0), showing a reluctance toward the value-laden term "narcissism". Our findings demonstrate that while LLMs are highly competent at interpreting complex first-person clinical data, they remain subject to critical reliability and bias issues.
arXiv Detail & Related papers (2025-12-23T12:05:01Z) - The Catastrophic Paradox of Human Cognitive Frameworks in Large Language Model Evaluation: A Comprehensive Empirical Analysis of the CHC-LLM Incompatibility [0.0]
Models achieving above-average human IQ scores simultaneously exhibit binary accuracy rates approaching zero on crystallized knowledge tasks. This disconnect appears most strongly in the crystallized intelligence domain. We propose a framework for developing native machine cognition assessments that recognize the non-human nature of artificial intelligence.
arXiv Detail & Related papers (2025-11-23T05:49:57Z) - Mapping Patient-Perceived Physician Traits from Nationwide Online Reviews with LLMs [3.364244912862208]
We present a large language model (LLM)-based pipeline that infers Big Five personality traits and five patient subjective judgments. The analysis encompasses 4.1 million patient reviews of 226,999 U.S. physicians from an initial pool of one million.
arXiv Detail & Related papers (2025-10-05T02:16:35Z) - Measuring How LLMs Internalize Human Psychological Concepts: A preliminary analysis [0.0]
We develop a framework to assess concept alignment between Large Language Models and human psychological dimensions. A GPT-4 model achieved superior classification accuracy (66.2%), significantly outperforming GPT-3.5 (55.9%) and BERT (48.1%). Our findings demonstrate that modern LLMs can approximate human psychological constructs with measurable accuracy.
arXiv Detail & Related papers (2025-06-29T01:56:56Z) - MoodAngels: A Retrieval-augmented Multi-agent Framework for Psychiatry Diagnosis [58.67342568632529]
MoodAngels is the first specialized multi-agent framework for mood disorder diagnosis. MoodSyn is an open-source dataset of 1,173 synthetic psychiatric cases.
arXiv Detail & Related papers (2025-06-04T09:18:25Z) - PhyX: Does Your Model Have the "Wits" for Physical Reasoning? [49.083544963243206]
Existing benchmarks fail to capture a crucial aspect of intelligence: physical reasoning. We introduce PhyX: the first large-scale benchmark designed to assess models' capacity for physics-grounded reasoning in visual scenarios.
arXiv Detail & Related papers (2025-05-21T18:33:50Z) - Medical Hallucinations in Foundation Models and Their Impact on Healthcare [71.15392179084428]
Hallucinations in foundation models arise from autoregressive training objectives. Top-performing models exceeded 97% accuracy when augmented with chain-of-thought prompting.
arXiv Detail & Related papers (2025-02-26T02:30:44Z) - Beyond correlation: The Impact of Human Uncertainty in Measuring the Effectiveness of Automatic Evaluation and LLM-as-a-Judge [51.93909886542317]
We show how *relying on a single aggregate correlation score* can obscure fundamental differences between human labels and those from automatic evaluation. We propose stratifying data by human label uncertainty to provide a more robust analysis of automatic evaluation performance.
arXiv Detail & Related papers (2024-10-03T03:08:29Z) - Do GPT Language Models Suffer From Split Personality Disorder? The Advent Of Substrate-Free Psychometrics [1.1172147007388977]
We provide a state of the art language model with the same personality questionnaire in nine languages.
Our results suggest both interlingual and intralingual instabilities, which indicate that current language models do not develop a consistent core personality.
This can lead to unsafe behaviour of artificial intelligence systems that are based on these foundation models.
arXiv Detail & Related papers (2024-08-14T08:53:00Z) - Using Pre-training and Interaction Modeling for ancestry-specific disease prediction in UK Biobank [69.90493129893112]
Recent genome-wide association studies (GWAS) have uncovered the genetic basis of complex traits, but show an under-representation of non-European descent individuals.
Here, we assess whether we can improve disease prediction across diverse ancestries using multiomic data.
arXiv Detail & Related papers (2024-04-26T16:39:50Z) - Automatically measuring speech fluency in people with aphasia: first achievements using read-speech data [55.84746218227712]
This study aims at assessing the relevance of a signal processing algorithm, initially developed in the field of language acquisition, for the automatic measurement of speech fluency.
arXiv Detail & Related papers (2023-08-09T07:51:40Z) - Is GPT-4 a reliable rater? Evaluating Consistency in GPT-4 Text Ratings [63.35165397320137]
This study investigates the consistency of feedback ratings generated by OpenAI's GPT-4.
The model rated responses to tasks within the Higher Education subject domain of macroeconomics in terms of their content and style.
arXiv Detail & Related papers (2023-08-03T12:47:17Z) - The Consequences of the Framing of Machine Learning Risk Prediction Models: Evaluation of Sepsis in General Wards [0.0]
We evaluate how framing affects model performance and model learning in four different approaches.
We analysed structured secondary healthcare data from 221,283 citizens from four Danish municipalities.
arXiv Detail & Related papers (2021-01-26T14:00:05Z) - World Trade Center responders in their own words: Predicting PTSD symptom trajectories with AI-based language analyses of interviews [6.700088567524812]
This study tested the ability of AI-based language assessments to predict PTSD symptom trajectories among responders.
Cross-sectionally, greater depressive language (beta=0.32; p43) and first-person singular usage (beta=0.31; p44) were associated with increased symptom severity.
Longer word usage (beta=-0.36; p7) predicted improvement.
arXiv Detail & Related papers (2020-11-12T15:57:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.