NeuroABench: A Multimodal Evaluation Benchmark for Neurosurgical Anatomy Identification
- URL: http://arxiv.org/abs/2512.06921v1
- Date: Sun, 07 Dec 2025 17:00:25 GMT
- Title: NeuroABench: A Multimodal Evaluation Benchmark for Neurosurgical Anatomy Identification
- Authors: Ziyang Song, Zelin Zang, Xiaofan Ye, Boqiang Xu, Long Bai, Jinlin Wu, Hongliang Ren, Hongbin Liu, Jiebo Luo, Zhen Lei,
- Abstract summary: Multimodal Large Language Models (MLLMs) have shown significant potential in surgical video understanding. The Neurosurgical Anatomy Benchmark (NeuroABench) is the first multimodal benchmark explicitly created to evaluate anatomical understanding in the neurosurgical domain. NeuroABench consists of 9 hours of annotated neurosurgical videos covering 89 distinct procedures.
- Score: 56.133469598652624
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal Large Language Models (MLLMs) have shown significant potential in surgical video understanding. With improved zero-shot performance and more effective human-machine interaction, they provide a strong foundation for advancing surgical education and assistance. However, existing research and datasets primarily focus on understanding surgical procedures and workflows, while paying limited attention to the critical role of anatomical comprehension. In clinical practice, surgeons rely heavily on precise anatomical understanding to interpret, review, and learn from surgical videos. To fill this gap, we introduce the Neurosurgical Anatomy Benchmark (NeuroABench), the first multimodal benchmark explicitly created to evaluate anatomical understanding in the neurosurgical domain. NeuroABench consists of 9 hours of annotated neurosurgical videos covering 89 distinct procedures and is developed using a novel multimodal annotation pipeline with multiple review cycles. The benchmark evaluates the identification of 68 clinical anatomical structures, providing a rigorous and standardized framework for assessing model performance. Experiments on over 10 state-of-the-art MLLMs reveal significant limitations, with the best-performing model achieving only 40.87% accuracy in anatomical identification tasks. To further evaluate the benchmark, we extract a subset of the dataset and conduct a comparative test with four neurosurgical trainees. The results show that the best-performing student achieves 56% accuracy, while the lowest score is 28% and the average is 46.5%. While the best MLLM performs comparably to the lowest-scoring student, it still lags significantly behind the group's average performance. This comparison underscores both the progress of MLLMs in anatomical understanding and the substantial gap that remains in achieving human-level performance.
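The accuracy figures above come down to simple per-question scoring: each anatomical identification is either correct or not, and accuracy is the fraction of correct answers. The sketch below shows this scoring scheme; the structure names and predictions are illustrative placeholders, not drawn from NeuroABench itself.

```python
# Hypothetical sketch of scoring an anatomy-identification benchmark.
# Structure names and predictions are invented for illustration only.

def accuracy(predictions, answer_key):
    """Fraction of questions where the predicted structure matches the key."""
    if not answer_key:
        return 0.0
    correct = sum(p == a for p, a in zip(predictions, answer_key))
    return correct / len(answer_key)

answer_key = ["sylvian fissure", "optic nerve", "basilar artery", "dura mater"]
model_pred = ["sylvian fissure", "optic chiasm", "basilar artery", "dura mater"]

# One of four answers is wrong, so accuracy is 75%.
print(f"identification accuracy: {accuracy(model_pred, answer_key):.2%}")
# prints "identification accuracy: 75.00%"
```

Averaging such per-participant accuracies over the four trainees is what yields the reported 46.5% group mean.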
Related papers
- 47B Mixture-of-Experts Beats 671B Dense Models on Chinese Medical Examinations [10.072653135781207]
This paper presents a benchmark evaluation of 27 large language models (LLMs) on Chinese medical examination questions. Our analysis reveals substantial performance variations among models, with Mixtral-8x7B achieving the highest overall accuracy of 74.25%. The evaluation demonstrates significant performance gaps between medical specialties, with models generally performing better on cardiovascular and neurology questions.
arXiv Detail & Related papers (2025-11-16T06:08:41Z) - MRI-Based Brain Tumor Detection through an Explainable EfficientNetV2 and MLP-Mixer-Attention Architecture [0.0]
Brain tumors are serious health problems that require early diagnosis due to their high mortality rates. The need for automated diagnosis systems is increasing day by day. A robust and explainable Deep Learning model for the classification of brain tumors is proposed.
arXiv Detail & Related papers (2025-09-08T14:08:21Z) - From Promise to Practical Reality: Transforming Diffusion MRI Analysis with Fast Deep Learning Enhancement [35.368152968098194]
FastFOD-Net is an end-to-end deep learning framework enhancing FODs with superior performance and delivering training/inference efficiency for clinical use. This work will facilitate the more widespread adoption of, and build clinical trust in, deep learning based methods for diffusion MRI enhancement.
arXiv Detail & Related papers (2025-08-13T17:56:29Z) - Towards a general-purpose foundation model for fMRI analysis [58.06455456423138]
We introduce NeuroSTORM, a framework that learns from 4D fMRI volumes and enables efficient knowledge transfer across diverse applications. NeuroSTORM is pre-trained on 28.65 million fMRI frames (>9,000 hours) from over 50,000 subjects across multiple centers and ages 5 to 100. It outperforms existing methods across five tasks: age/gender prediction, phenotype prediction, disease diagnosis, fMRI-to-image retrieval, and task-based fMRI.
arXiv Detail & Related papers (2025-06-11T23:51:01Z) - SurgVLM: A Large Vision-Language Model and Systematic Evaluation Benchmark for Surgical Intelligence [72.10889173696928]
We propose SurgVLM, one of the first large vision-language foundation models for surgical intelligence. We construct a large-scale multimodal surgical database, SurgVLM-DB, spanning more than 16 surgical types and 18 anatomical structures. Building upon this comprehensive dataset, we propose SurgVLM, which is built upon Qwen2.5-VL and undergoes instruction tuning on 10+ surgical tasks.
arXiv Detail & Related papers (2025-06-03T07:44:41Z) - Evaluating the performance and fragility of large language models on the self-assessment for neurological surgeons [0.7587293779231332]
The Congress of Neurological Surgeons Self-Assessment for Neurological Surgeons (CNS-SANS) questions are widely used by neurosurgical residents to prepare for written board examinations. This study aims to assess the performance of state-of-the-art LLMs on neurosurgery board-like questions and to evaluate their robustness to the inclusion of distractor statements. A comprehensive evaluation was conducted using 28 large language models. These models were tested on 2,904 neurosurgery board examination questions derived from the CNS-SANS.
arXiv Detail & Related papers (2025-05-29T14:27:14Z) - Surgeons vs. Computer Vision: A comparative analysis on surgical phase recognition capabilities [65.66373425605278]
Automated Surgical Phase Recognition (SPR) uses Artificial Intelligence (AI) to segment the surgical workflow into its key events. Previous research has focused on short and linear surgical procedures and has not explored if temporal context influences experts' ability to better classify surgical phases. This research addresses these gaps, focusing on Robot-Assisted Partial Nephrectomy (RAPN) as a highly non-linear procedure.
arXiv Detail & Related papers (2025-04-26T15:37:22Z) - Deep learning approaches to surgical video segmentation and object detection: A Scoping Review [0.0]
We conducted a scoping review of studies on semantic segmentation and object detection of anatomical structures published between 2014 and 2024. The primary objective was to evaluate the state-of-the-art performance of semantic segmentation in surgical videos. The secondary objectives included examining DL models, progress toward clinical applications, and the specific challenges with segmentation of organs/tissues in surgical videos.
arXiv Detail & Related papers (2025-02-23T06:31:23Z) - Segmentation of Mental Foramen in Orthopantomographs: A Deep Learning Approach [1.9193578733126382]
This study aims to accelerate dental procedures, elevating patient care and healthcare efficiency in dentistry.
This research used Deep Learning methods to accurately detect and segment the Mental Foramen from panoramic radiograph images.
arXiv Detail & Related papers (2024-08-08T21:40:06Z) - AI Hospital: Benchmarking Large Language Models in a Multi-agent Medical Interaction Simulator [69.51568871044454]
We introduce AI Hospital, a framework simulating dynamic medical interactions between Doctor as player and NPCs.
This setup allows for realistic assessments of LLMs in clinical scenarios.
We develop the Multi-View Medical Evaluation benchmark, utilizing high-quality Chinese medical records and NPCs.
arXiv Detail & Related papers (2024-02-15T06:46:48Z) - Medulloblastoma Tumor Classification using Deep Transfer Learning with Multi-Scale EfficientNets [63.62764375279861]
We propose an end-to-end MB tumor classification and explore transfer learning with various input sizes and matching network dimensions.
Using a data set with 161 cases, we demonstrate that pre-trained EfficientNets with larger input resolutions lead to significant performance improvements.
arXiv Detail & Related papers (2021-09-10T13:07:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.