Evaluating the performance and fragility of large language models on the self-assessment for neurological surgeons
- URL: http://arxiv.org/abs/2505.23477v1
- Date: Thu, 29 May 2025 14:27:14 GMT
- Title: Evaluating the performance and fragility of large language models on the self-assessment for neurological surgeons
- Authors: Krithik Vishwanath, Anton Alyakin, Mrigayu Ghosh, Jin Vivian Lee, Daniel Alexander Alber, Karl L. Sangwon, Douglas Kondziolka, Eric Karl Oermann
- Abstract summary: The Congress of Neurological Surgeons Self-Assessment for Neurological Surgeons (CNS-SANS) questions are widely used by neurosurgical residents to prepare for written board examinations. This study aims to assess the performance of state-of-the-art LLMs on neurosurgery board-like questions and to evaluate their robustness to the inclusion of distractor statements. A comprehensive evaluation was conducted using 28 large language models. These models were tested on 2,904 neurosurgery board examination questions derived from the CNS-SANS.
- Score: 0.7587293779231332
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The Congress of Neurological Surgeons Self-Assessment for Neurological Surgeons (CNS-SANS) questions are widely used by neurosurgical residents to prepare for written board examinations. Recently, these questions have also served as benchmarks for evaluating large language models' (LLMs) neurosurgical knowledge. This study aims to assess the performance of state-of-the-art LLMs on neurosurgery board-like questions and to evaluate their robustness to the inclusion of distractor statements. A comprehensive evaluation was conducted using 28 large language models. These models were tested on 2,904 neurosurgery board examination questions derived from the CNS-SANS. Additionally, the study introduced a distraction framework to assess the fragility of these models. The framework incorporated simple, irrelevant distractor statements containing polysemous words with clinical meanings used in non-clinical contexts to determine the extent to which such distractions degrade model performance on standard medical benchmarks. Six of the 28 tested LLMs achieved board-passing outcomes, with the top-performing models scoring over 15.7% above the passing threshold. When exposed to distractions, accuracy across various model architectures was significantly reduced, by as much as 20.4%, and one model that had previously passed went on to fail. Both general-purpose and medical open-source models experienced greater performance declines than proprietary variants when subjected to the added distractors. While current LLMs demonstrate an impressive ability to answer neurosurgery board-like exam questions, their performance is markedly vulnerable to extraneous, distracting information. These findings underscore the critical need for novel mitigation strategies that bolster LLM resilience against in-text distractions, particularly for safe and effective clinical deployment.
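To make the distraction framework concrete, the sketch below shows one plausible way such an evaluation could be wired up: each CNS-SANS-style multiple-choice item is optionally augmented with an irrelevant sentence containing a polysemous clinical word used in a non-clinical sense, and accuracy with and without distractors is compared. The distractor sentences, function names, prompt format, and scoring rule are illustrative assumptions, not the paper's actual materials.

```python
# Minimal sketch of a distractor-injection evaluation (assumptions, not the
# authors' code): the caller supplies `model_answer_fn`, a function that sends
# a prompt to some LLM and returns the model's answer as text.
import random

# Hypothetical distractor sentences: polysemous words with clinical meanings
# ("stool", "cataract", "vessel") used in plainly non-clinical contexts.
DISTRACTORS = [
    "The carpenter rested the toolbox on a wooden stool in the garage.",
    "The hikers photographed the cataract where the river drops over the cliff.",
    "A small fishing vessel was anchored near the harbor entrance.",
]

def add_distractor(question_stem: str, rng: random.Random) -> str:
    """Append one irrelevant distractor statement to the question stem."""
    return f"{question_stem} {rng.choice(DISTRACTORS)}"

def evaluate(model_answer_fn, questions, distract: bool, seed: int = 0) -> float:
    """Return accuracy on multiple-choice `questions`.

    Each question is a dict with 'stem', 'choices' (list of str), and
    'answer' (index of the correct choice).
    """
    rng = random.Random(seed)
    correct = 0
    for q in questions:
        stem = add_distractor(q["stem"], rng) if distract else q["stem"]
        options = "\n".join(f"{chr(65 + i)}. {c}" for i, c in enumerate(q["choices"]))
        prompt = f"{stem}\n{options}\nAnswer with a single letter."
        predicted = model_answer_fn(prompt)  # e.g., wraps an LLM API call
        if predicted.strip().upper().startswith(chr(65 + q["answer"])):
            correct += 1
    return correct / len(questions)

# Fragility = drop in accuracy once distractors are injected, e.g.:
# baseline  = evaluate(ask_model, sans_questions, distract=False)
# perturbed = evaluate(ask_model, sans_questions, distract=True)
# print(f"Accuracy drop: {100 * (baseline - perturbed):.1f} points")
```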
Related papers
- Medical Reasoning in the Era of LLMs: A Systematic Review of Enhancement Techniques and Applications [59.721265428780946]
Large Language Models (LLMs) in medicine have enabled impressive capabilities, yet a critical gap remains in their ability to perform systematic, transparent, and verifiable reasoning. This paper provides the first systematic review of this emerging field. We propose a taxonomy of reasoning enhancement techniques, categorized into training-time strategies and test-time mechanisms.
arXiv Detail & Related papers (2025-08-01T14:41:31Z) - Naturalistic Language-related Movie-Watching fMRI Task for Detecting Neurocognitive Decline and Disorder [60.84344168388442]
Language-related functional magnetic resonance imaging (fMRI) may be a promising approach for detecting cognitive decline and early NCD. We examined the effectiveness of this task among 97 non-demented Chinese older adults from Hong Kong. The study demonstrated the potential of the naturalistic language-related fMRI task for early detection of aging-related cognitive decline and NCD.
arXiv Detail & Related papers (2025-06-10T16:58:47Z) - Quantifying the Reasoning Abilities of LLMs on Real-world Clinical Cases [48.87360916431396]
We introduce MedR-Bench, a benchmarking dataset of 1,453 structured patient cases, annotated with reasoning references. We propose a framework encompassing three critical stages: examination recommendation, diagnostic decision-making, and treatment planning, simulating the entire patient care journey. Using this benchmark, we evaluate five state-of-the-art reasoning LLMs, including DeepSeek-R1, OpenAI-o3-mini, and Gemini-2.0-Flash Thinking.
arXiv Detail & Related papers (2025-03-06T18:35:39Z) - Structured Outputs Enable General-Purpose LLMs to be Medical Experts [50.02627258858336]
Large language models (LLMs) often struggle with open-ended medical questions. We propose a novel approach utilizing structured medical reasoning. Our approach achieves the highest Factuality Score of 85.8, surpassing fine-tuned models.
arXiv Detail & Related papers (2025-03-05T05:24:55Z) - Limitations of Large Language Models in Clinical Problem-Solving Arising from Inflexible Reasoning [3.3482359447109866]
Large Language Models (LLMs) have attained human-level accuracy on medical question-answer (QA) benchmarks. Their limitations in navigating open-ended clinical scenarios have recently been shown. We present the Medical Abstraction and Reasoning Corpus (M-ARC). We find that LLMs, including current state-of-the-art o1 and Gemini models, perform poorly compared to physicians on M-ARC.
arXiv Detail & Related papers (2025-02-05T18:14:27Z) - LlaMADRS: Prompting Large Language Models for Interview-Based Depression Assessment [75.44934940580112]
This study introduces LlaMADRS, a novel framework leveraging open-source Large Language Models (LLMs) to automate depression severity assessment. We employ a zero-shot prompting strategy with carefully designed cues to guide the model in interpreting and scoring transcribed clinical interviews. Our approach, tested on 236 real-world interviews, demonstrates strong correlations with clinician assessments.
arXiv Detail & Related papers (2025-01-07T08:49:04Z) - SemioLLM: Evaluating Large Language Models for Diagnostic Reasoning from Unstructured Clinical Narratives in Epilepsy [45.2233252981348]
Large Language Models (LLMs) have been shown to encode clinical knowledge. We present SemioLLM, an evaluation framework that benchmarks 6 state-of-the-art models. We show that most LLMs are able to accurately and confidently generate probabilistic predictions of seizure onset zones in the brain.
arXiv Detail & Related papers (2024-07-03T11:02:12Z) - Pattern Recognition or Medical Knowledge? The Problem with Multiple-Choice Questions in Medicine [3.471944921180245]
Large Language Models (LLMs) demonstrate significant potential in the medical domain. They are often evaluated using multiple-choice questions (MCQs) modeled on exams like the USMLE. We created a fictional medical benchmark centered on an imaginary organ, the Glianorex, allowing us to separate memorized knowledge from reasoning ability.
arXiv Detail & Related papers (2024-06-04T15:08:56Z) - Almanac: Retrieval-Augmented Language Models for Clinical Medicine [1.5505279143287174]
We develop Almanac, a large language model framework augmented with retrieval capabilities for medical guideline and treatment recommendations.
Performance on a novel dataset of clinical scenarios evaluated by a panel of 5 board-certified and resident physicians demonstrates significant increases in factuality.
arXiv Detail & Related papers (2023-03-01T02:30:11Z) - What Do You See in this Patient? Behavioral Testing of Clinical NLP Models [69.09570726777817]
We introduce an extendable testing framework that evaluates the behavior of clinical outcome models with respect to changes in the input.
We show that model behavior varies drastically even among models fine-tuned on the same data, and that allegedly best-performing models have not always learned the most medically plausible patterns.
arXiv Detail & Related papers (2021-11-30T15:52:04Z)
This list is automatically generated from the titles and abstracts of the papers on this site.