SpineBench: Benchmarking Multimodal LLMs for Spinal Pathology Analysis
- URL: http://arxiv.org/abs/2510.12267v1
- Date: Tue, 14 Oct 2025 08:19:22 GMT
- Title: SpineBench: Benchmarking Multimodal LLMs for Spinal Pathology Analysis
- Authors: Chenghanyu Zhang, Zekun Li, Peipei Li, Xing Cui, Shuhan Xia, Weixiang Yan, Yiqiao Zhang, Qianyu Zhuang
- Abstract summary: We introduce SpineBench, a benchmark for evaluation of Multimodal Large Language Models (MLLMs) in the spinal domain. SpineBench comprises 64,878 QA pairs from 40,263 spine images, covering 11 spinal diseases through two critical clinical tasks. SpineBench is built by integrating and standardizing image-label pairs from open-source spinal disease datasets.
- Score: 10.36110941054643
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the increasing integration of Multimodal Large Language Models (MLLMs) into the medical field, comprehensive evaluation of their performance in various medical domains becomes critical. However, existing benchmarks primarily assess general medical tasks, inadequately capturing performance in nuanced areas like the spine, which relies heavily on visual input. To address this, we introduce SpineBench, a comprehensive Visual Question Answering (VQA) benchmark designed for fine-grained analysis and evaluation of MLLMs in the spinal domain. SpineBench comprises 64,878 QA pairs from 40,263 spine images, covering 11 spinal diseases through two critical clinical tasks: spinal disease diagnosis and spinal lesion localization, both in multiple-choice format. SpineBench is built by integrating and standardizing image-label pairs from open-source spinal disease datasets, and it samples challenging hard negative options for each VQA pair based on visual similarity (similar but not the same disease), simulating challenging real-world scenarios. We evaluate 12 leading MLLMs on SpineBench. The results reveal that these models exhibit poor performance in spinal tasks, highlighting limitations of current MLLMs in the spine domain and guiding future improvements in spinal medicine applications. SpineBench is publicly available at https://zhangchenghanyu.github.io/SpineBench.github.io/.
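The abstract describes sampling hard negative answer options by visual similarity: distractor diseases whose example images look like the query image but carry a different label. A minimal sketch of that idea is below; the function name, embedding source, and data layout are illustrative assumptions, not the authors' actual pipeline.

```python
import numpy as np

def sample_hard_negatives(embeddings, labels, query_idx, k=3):
    """Pick up to k distractor disease labels whose example images are most
    visually similar to the query image but carry a different label."""
    q = embeddings[query_idx]
    # Cosine similarity between the query image and every image in the pool.
    norms = np.linalg.norm(embeddings, axis=1) * np.linalg.norm(q)
    sims = embeddings @ q / np.clip(norms, 1e-9, None)
    order = np.argsort(-sims)  # most visually similar first
    negatives = []
    for i in order:
        if i == query_idx or labels[i] == labels[query_idx]:
            continue  # skip the query itself and same-disease images
        if labels[i] not in negatives:  # one option per distinct disease
            negatives.append(labels[i])
        if len(negatives) == k:
            break
    return negatives
```

With embeddings from any pretrained vision encoder, the returned labels would serve as the "similar but not the same disease" multiple-choice distractors the abstract describes.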
Related papers
- MMedExpert-R1: Strengthening Multimodal Medical Reasoning via Domain-Specific Adaptation and Clinical Guideline Reinforcement [63.82954136824963]
Medical Vision-Language Models excel at perception tasks but often lack the complex clinical reasoning required in real-world scenarios. We propose a novel reasoning MedVLM that addresses these challenges through domain-specific adaptation and guideline reinforcement.
arXiv Detail & Related papers (2026-01-16T02:32:07Z) - OmniBrainBench: A Comprehensive Multimodal Benchmark for Brain Imaging Analysis Across Multi-stage Clinical Tasks [41.33747208780257]
Multimodal large language models (MLLMs) are increasingly assisting in brain imaging analysis. Current brain-oriented visual question-answering (VQA) benchmarks either cover a few imaging modalities or are limited to coarse-grained pathological descriptions. We introduce OmniBrainBench, the first comprehensive multimodal VQA benchmark designed to assess the multimodal comprehension capabilities of MLLMs in brain imaging analysis.
arXiv Detail & Related papers (2025-11-02T08:11:55Z) - SpineBench: A Clinically Salient, Level-Aware Benchmark Powered by the SpineMed-450k Corpus [39.664918145306366]
Spine disorders affect 619 million people globally and are a leading cause of disability. We introduce SpineMed, an ecosystem co-designed with practicing spine surgeons. It features SpineMed-450k, the first large-scale dataset explicitly designed for vertebral-level reasoning.
arXiv Detail & Related papers (2025-10-03T16:32:02Z) - SpinBench: Perspective and Rotation as a Lens on Spatial Reasoning in VLMs [49.106901743548036]
We present SpinBench, a diagnostic benchmark for evaluating spatial reasoning in vision-language models (VLMs). Since perspective taking requires multiple cognitive capabilities, SpinBench introduces a set of fine-grained diagnostic categories. Results reveal systematic weaknesses: strong egocentric bias, poor rotational understanding, and inconsistencies under symmetrical and syntactic reformulations.
arXiv Detail & Related papers (2025-09-29T18:48:16Z) - Revolutionizing Precise Low Back Pain Diagnosis via Contrastive Learning [0.3499870393443268]
Low back pain affects millions worldwide, driving the need for robust diagnostic models. We present LumbarCLIP, a novel framework that leverages contrastive language-image pretraining to align lumbar spine MRI scans with corresponding radiological descriptions.
arXiv Detail & Related papers (2025-09-25T06:52:25Z) - MMed-RAG: Versatile Multimodal RAG System for Medical Vision Language Models [49.765466293296186]
Recent progress in Medical Large Vision-Language Models (Med-LVLMs) has opened up new possibilities for interactive diagnostic tools. Med-LVLMs often suffer from factual hallucination, which can lead to incorrect diagnoses. We propose a versatile multimodal RAG system, MMed-RAG, designed to enhance the factuality of Med-LVLMs.
arXiv Detail & Related papers (2024-10-16T23:03:27Z) - Assessing and Enhancing Large Language Models in Rare Disease Question-answering [64.32570472692187]
We introduce a rare disease question-answering (ReDis-QA) dataset to evaluate the performance of Large Language Models (LLMs) in diagnosing rare diseases.
We collected 1360 high-quality question-answer pairs within the ReDis-QA dataset, covering 205 rare diseases.
We then benchmarked several open-source LLMs, revealing that diagnosing rare diseases remains a significant challenge for these models.
Experimental results demonstrate that ReCOP can effectively improve the accuracy of LLMs on the ReDis-QA dataset by an average of 8%.
arXiv Detail & Related papers (2024-08-15T21:09:09Z) - OmniMedVQA: A New Large-Scale Comprehensive Evaluation Benchmark for Medical LVLM [48.16696073640864]
We introduce OmniMedVQA, a novel comprehensive medical Visual Question Answering (VQA) benchmark.
All images in this benchmark are sourced from authentic medical scenarios.
We have found that existing LVLMs struggle to address these medical VQA problems effectively.
arXiv Detail & Related papers (2024-02-14T13:51:56Z) - A Light-weight CNN Model for Efficient Parkinson's Disease Diagnostics [1.382077805849933]
The proposed model combines a convolutional neural network (CNN) with a long short-term memory (LSTM) network to adapt to the characteristics of the collected time-series signals.
Experimental results show that the proposed model achieves a high-quality diagnostic result over multiple evaluation metrics with much fewer parameters and operations.
arXiv Detail & Related papers (2023-02-02T09:49:07Z) - Modeling Shared Responses in Neuroimaging Studies through MultiView ICA [94.31804763196116]
Group studies involving large cohorts of subjects are important to draw general conclusions about brain functional organization.
We propose a novel MultiView Independent Component Analysis model for group studies, where data from each subject are modeled as a linear combination of shared independent sources plus noise.
We demonstrate the usefulness of our approach first on fMRI data, where our model demonstrates improved sensitivity in identifying common sources among subjects.
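The shared-response model sketched in this entry admits a compact formulation. One common way to write it (symbols here are assumptions for illustration, not taken from the abstract) is:

```latex
x_i = A_i s + n_i, \qquad i = 1, \dots, m
```

where $x_i$ is the data for subject $i$, $A_i$ is a subject-specific mixing matrix, $s$ is the vector of shared independent sources common to all $m$ subjects, and $n_i$ is subject-specific noise.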
arXiv Detail & Related papers (2020-06-11T17:29:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences.