SpineBench: A Clinically Salient, Level-Aware Benchmark Powered by the SpineMed-450k Corpus
- URL: http://arxiv.org/abs/2510.03160v2
- Date: Sat, 25 Oct 2025 03:34:09 GMT
- Title: SpineBench: A Clinically Salient, Level-Aware Benchmark Powered by the SpineMed-450k Corpus
- Authors: Ming Zhao, Wenhui Dong, Yang Zhang, Xiang Zheng, Zhonghao Zhang, Zian Zhou, Yunzhi Guan, Liukun Xu, Wei Peng, Zhaoyang Gong, Zhicheng Zhang, Dachuan Li, Xiaosheng Ma, Yuli Ma, Jianing Ni, Changjiang Jiang, Lixia Tian, Qixin Chen, Kaishun Xia, Pingping Liu, Tongshun Zhang, Zhiqiang Liu, Zhongyan Bi, Chenyang Si, Tiansheng Sun, Caifeng Shan,
- Abstract summary: Spine disorders affect 619 million people globally and are a leading cause of disability.<n>We introduce SpineMed, an ecosystem co-designed with practicing spine surgeons.<n>It features SpineMed-450k, the first large-scale dataset explicitly designed for vertebral-level reasoning.
- Score: 39.664918145306366
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Spine disorders affect 619 million people globally and are a leading cause of disability, yet AI-assisted diagnosis remains limited by the lack of level-aware, multimodal datasets. Clinical decision-making for spine disorders requires sophisticated reasoning across X-ray, CT, and MRI at specific vertebral levels. However, progress has been constrained by the absence of traceable, clinically-grounded instruction data and standardized, spine-specific benchmarks. To address this, we introduce SpineMed, an ecosystem co-designed with practicing spine surgeons. It features SpineMed-450k, the first large-scale dataset explicitly designed for vertebral-level reasoning across imaging modalities with over 450,000 instruction instances, and SpineBench, a clinically-grounded evaluation framework. SpineMed-450k is curated from diverse sources, including textbooks, guidelines, open datasets, and ~1,000 de-identified hospital cases, using a clinician-in-the-loop pipeline with a two-stage LLM generation method (draft and revision) to ensure high-quality, traceable data for question-answering, multi-turn consultations, and report generation. SpineBench evaluates models on clinically salient axes, including level identification, pathology assessment, and surgical planning. Our comprehensive evaluation of several recently advanced large vision-language models (LVLMs) on SpineBench reveals systematic weaknesses in fine-grained, level-specific reasoning. In contrast, our model fine-tuned on SpineMed-450k demonstrates consistent and significant improvements across all tasks. Clinician assessments confirm the diagnostic clarity and practical utility of our model's outputs.
Related papers
- Multi-View Stenosis Classification Leveraging Transformer-Based Multiple-Instance Learning Using Real-World Clinical Data [76.89269238957593]
Coronary artery stenosis is a leading cause of cardiovascular disease, diagnosed by analyzing the coronary arteries from multiple angiography views.<n>We propose SegmentMIL, a transformer-based multi-view multiple-instance learning framework for patient-level stenosis classification.
arXiv Detail & Related papers (2026-02-02T13:07:52Z) - SpineBench: Benchmarking Multimodal LLMs for Spinal Pathology Analysis [10.36110941054643]
We introduce SpineBench, a benchmark for evaluation of Multimodal Large Language Models (MLLMs) in the spinal domain.<n>SpineBench comprises 64,878 QA pairs from 40,263 spine images, covering 11 spinal diseases through two critical clinical tasks.<n>SpineBench is built by integrating and standardizing image-label pairs from open-source spinal disease datasets.
arXiv Detail & Related papers (2025-10-14T08:19:22Z) - Simulating Viva Voce Examinations to Evaluate Clinical Reasoning in Large Language Models [51.91760712805404]
We introduce VivaBench, a benchmark for evaluating sequential clinical reasoning in large language models (LLMs)<n>Our dataset consists of 1762 physician-curated clinical vignettes structured as interactive scenarios that simulate a (oral) examination in medical training.<n>Our analysis identified several failure modes that mirror common cognitive errors in clinical practice.
arXiv Detail & Related papers (2025-10-11T16:24:35Z) - Beyond Classification Accuracy: Neural-MedBench and the Need for Deeper Reasoning Benchmarks [21.203358914772465]
Recent advances in vision-language models (VLMs) have achieved remarkable performance on standard medical benchmarks, yet their true clinical reasoning ability remains unclear.<n>We introduce Neural-MedBench, a compact yet reasoning-intensive benchmark designed to probe the limits of multimodal clinical reasoning in neurology.
arXiv Detail & Related papers (2025-09-26T12:20:01Z) - Revolutionizing Precise Low Back Pain Diagnosis via Contrastive Learning [0.3499870393443268]
Low back pain affects millions worldwide, driving the need for robust diagnostic models.<n>We present LumbarCLIP, a novel framework that leverages contrastive language-image pretraining to align lumbar spine MRI scans with corresponding radiological descriptions.
arXiv Detail & Related papers (2025-09-25T06:52:25Z) - RAD: Towards Trustworthy Retrieval-Augmented Multi-modal Clinical Diagnosis [56.373297358647655]
Retrieval-Augmented Diagnosis (RAD) is a novel framework that injects external knowledge into multimodal models directly on downstream tasks.<n>RAD operates through three key mechanisms: retrieval and refinement of disease-centered knowledge from multiple medical sources, a guideline-enhanced contrastive loss transformer, and a dual decoder.
arXiv Detail & Related papers (2025-09-24T10:36:14Z) - A Federated Learning Framework for Handling Subtype Confounding and Heterogeneity in Large-Scale Neuroimaging Diagnosis [22.017120252054625]
We propose a novel federated learning framework tailored for neuroimaging CAD systems.<n>Our approach includes a dynamic navigation module that routes samples to the most suitable local models.<n>We evaluated our framework using fMRI data from over 1300 MDD patients and 1100 healthy controls.
arXiv Detail & Related papers (2025-08-08T07:19:49Z) - Benchmarking and Explaining Deep Learning Cortical Lesion MRI Segmentation in Multiple Sclerosis [28.192924379673862]
Cortical lesions (CLs) have emerged as valuable biomarkers in multiple sclerosis (MS)<n>We propose a comprehensive benchmark of CL detection and segmentation in MRI.<n>We rely on the self-configuring nnU-Net framework, designed for medical imaging segmentation, and propose adaptations tailored to the improved CL detection.
arXiv Detail & Related papers (2025-07-16T09:56:11Z) - An Agentic System for Rare Disease Diagnosis with Traceable Reasoning [69.46279475491164]
We introduce DeepRare, the first rare disease diagnosis agentic system powered by a large language model (LLM)<n>DeepRare generates ranked diagnostic hypotheses for rare diseases, each accompanied by a transparent chain of reasoning.<n>The system demonstrates exceptional diagnostic performance among 2,919 diseases, achieving 100% accuracy for 1013 diseases.
arXiv Detail & Related papers (2025-06-25T13:42:26Z) - Towards a general-purpose foundation model for fMRI analysis [58.06455456423138]
We introduce NeuroSTORM, a framework that learns from 4D fMRI volumes and enables efficient knowledge transfer across diverse applications.<n>NeuroSTORM is pre-trained on 28.65 million fMRI frames (>9,000 hours) from over 50,000 subjects across multiple centers and ages 5 to 100.<n>It outperforms existing methods across five tasks: age/gender prediction, phenotype prediction, disease diagnosis, fMRI-to-image retrieval, and task-based fMRI.
arXiv Detail & Related papers (2025-06-11T23:51:01Z) - PathOrchestra: A Comprehensive Foundation Model for Computational Pathology with Over 100 Diverse Clinical-Grade Tasks [39.97710183184273]
We present PathOrchestra, a versatile pathology foundation model trained via self-supervised learning on a dataset comprising 300K pathological slides.<n>The model was rigorously evaluated on 112 clinical tasks using a combination of 61 private and 51 public datasets.<n>PathOrchestra demonstrated exceptional performance across 27,755 WSIs and 9,415,729 ROIs, achieving over 0.950 accuracy in 47 tasks.
arXiv Detail & Related papers (2025-03-31T17:28:02Z) - SpineOne: A One-Stage Detection Framework for Degenerative Discs and
Vertebrae [54.751251046196494]
We propose a one-stage detection framework termed SpineOne to simultaneously localize and classify degenerative discs and vertebrae from MRI slices.
SpineOne is built upon the following three key techniques: 1) a new design of the keypoint heatmap to facilitate simultaneous keypoint localization and classification; 2) the use of attention modules to better differentiate the representations between discs and vertebrae; and 3) a novel gradient-guided objective association mechanism to associate multiple learning objectives at the later training stage.
arXiv Detail & Related papers (2021-10-28T12:59:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.