SurgMLLMBench: A Multimodal Large Language Model Benchmark Dataset for Surgical Scene Understanding
- URL: http://arxiv.org/abs/2511.21339v1
- Date: Wed, 26 Nov 2025 12:44:51 GMT
- Title: SurgMLLMBench: A Multimodal Large Language Model Benchmark Dataset for Surgical Scene Understanding
- Authors: Tae-Min Choi, Tae Kyeong Jeong, Garam Kim, Jaemin Lee, Yeongyoon Koh, In Cheul Choi, Jae-Ho Chung, Jong Woong Park, Juyoun Park
- Abstract summary: We present SurgMLLMBench, a unified benchmark for developing and evaluating interactive multimodal large language models. It integrates pixel-level instrument segmentation masks and structured VQA annotations across laparoscopic, robot-assisted, and micro-surgical domains. A single model trained on it achieves consistent performance across domains and generalizes effectively to unseen datasets.
- Score: 8.20483591990742
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in multimodal large language models (LLMs) have highlighted their potential for medical and surgical applications. However, existing surgical datasets predominantly adopt a Visual Question Answering (VQA) format with heterogeneous taxonomies and lack support for pixel-level segmentation, limiting consistent evaluation and applicability. We present SurgMLLMBench, a unified multimodal benchmark explicitly designed for developing and evaluating interactive multimodal LLMs for surgical scene understanding, including the newly collected Micro-surgical Artificial Vascular anastomosIS (MAVIS) dataset. It integrates pixel-level instrument segmentation masks and structured VQA annotations across laparoscopic, robot-assisted, and micro-surgical domains under a unified taxonomy, enabling comprehensive evaluation beyond traditional VQA tasks and richer visual-conversational interactions. Extensive baseline experiments show that a single model trained on SurgMLLMBench achieves consistent performance across domains and generalizes effectively to unseen datasets. SurgMLLMBench will be publicly released as a robust resource to advance multimodal surgical AI research, supporting reproducible evaluation and development of interactive surgical reasoning models.
Related papers
- GroundedSurg: A Multi-Procedure Benchmark for Language-Conditioned Surgical Tool Segmentation [1.9981885081131854]
We introduce GroundedSurg, the first language-conditioned, instance-level surgical grounding benchmark. The dataset spans ophthalmic, laparoscopic, robotic, and open procedures, encompassing diverse instrument types, imaging conditions, and operative complexities.
arXiv Detail & Related papers (2026-03-01T13:49:53Z) - VLM-Guided Iterative Refinement for Surgical Image Segmentation with Foundation Models [16.299786004060863]
IR-SIS is an iterative refinement system for surgical image segmentation that accepts natural language descriptions. The system supports clinician-in-the-loop interaction through natural language feedback. Our work establishes the first language-based surgical segmentation framework with adaptive self-refinement capabilities.
arXiv Detail & Related papers (2026-02-09T22:36:36Z) - RAG-IGBench: Innovative Evaluation for RAG-based Interleaved Generation in Open-domain Question Answering [50.42577862494645]
We present RAG-IGBench, a benchmark designed to evaluate the task of Interleaved Generation based on Retrieval-Augmented Generation (RAG-IG) in open-domain question answering. RAG-IG integrates multimodal large language models (MLLMs) with retrieval mechanisms, enabling the models to access external image-text information for generating coherent multimodal content.
arXiv Detail & Related papers (2025-10-11T03:06:39Z) - SCOPE: Speech-guided COllaborative PErception Framework for Surgical Scene Segmentation [4.97436124491469]
We introduce a speech-guided collaborative perception framework that integrates the reasoning capabilities of a large language model (LLM) with the perception capabilities of open-set VFMs. A key component of this framework is a collaborative perception agent, which generates top candidates from VFM-generated segmentations. Instruments themselves serve as interactive pointers to label additional elements of the surgical scene.
arXiv Detail & Related papers (2025-09-12T23:36:52Z) - SurgVLM: A Large Vision-Language Model and Systematic Evaluation Benchmark for Surgical Intelligence [72.10889173696928]
We propose SurgVLM, one of the first large vision-language foundation models for surgical intelligence. We construct a large-scale multimodal surgical database, SurgVLM-DB, spanning more than 16 surgical types and 18 anatomical structures. Building upon this comprehensive dataset, we propose SurgVLM, which is built upon Qwen2.5-VL and undergoes instruction tuning on 10+ surgical tasks.
arXiv Detail & Related papers (2025-06-03T07:44:41Z) - EndoBench: A Comprehensive Evaluation of Multi-Modal Large Language Models for Endoscopy Analysis [62.00431604976949]
EndoBench is the first comprehensive benchmark specifically designed to assess MLLMs across the full spectrum of endoscopic practice. We benchmark 23 state-of-the-art models, including general-purpose, medical-specialized, and proprietary MLLMs. Our experiments reveal that proprietary MLLMs outperform open-source and medical-specialized models overall, but still trail human experts.
arXiv Detail & Related papers (2025-05-29T16:14:34Z) - SurgXBench: Explainable Vision-Language Model Benchmark for Surgery [4.068223793121694]
Vision-Language Models (VLMs) have brought transformative advances in reasoning across visual and textual modalities. Existing models show limited performance, highlighting the need for benchmark studies to assess their capabilities and limitations. We benchmark the zero-shot performance of several advanced VLMs on two public robotic-assisted laparoscopic datasets for instrument and action classification.
arXiv Detail & Related papers (2025-05-16T00:42:18Z) - RA-BLIP: Multimodal Adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training [55.54020926284334]
Multimodal Large Language Models (MLLMs) have recently received substantial interest, which shows their emerging potential as general-purpose models for various vision-language tasks.
Retrieval augmentation techniques have proven to be effective plugins for both LLMs and MLLMs.
In this study, we propose multimodal adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training (RA-BLIP), a novel retrieval-augmented framework for various MLLMs.
arXiv Detail & Related papers (2024-10-18T03:45:19Z) - Surgical-LVLM: Learning to Adapt Large Vision-Language Model for Grounded Visual Question Answering in Robotic Surgery [15.47190687192761]
We introduce Surgical-LVLM, a novel personalized large vision-language model tailored for complex surgical scenarios. We demonstrate the effectiveness of Surgical-LVLM on several benchmarks, including EndoVis-17-VQLA, EndoVis-18-VQLA, and a newly introduced EndoVis Conversations dataset.
arXiv Detail & Related papers (2024-03-22T08:38:27Z) - LLM-Assisted Multi-Teacher Continual Learning for Visual Question Answering in Robotic Surgery [57.358568111574314]
Patient data privacy often restricts the availability of old data when updating the model.
Prior CL studies overlooked two vital problems in the surgical domain.
This paper proposes addressing these problems with a multimodal large language model (LLM) and an adaptive weight assignment methodology.
arXiv Detail & Related papers (2024-02-26T15:35:24Z) - AI Hospital: Benchmarking Large Language Models in a Multi-agent Medical Interaction Simulator [69.51568871044454]
We introduce AI Hospital, a framework simulating dynamic medical interactions between a Doctor (as player) and NPCs.
This setup allows for realistic assessments of LLMs in clinical scenarios.
We develop the Multi-View Medical Evaluation benchmark, utilizing high-quality Chinese medical records and NPCs.
arXiv Detail & Related papers (2024-02-15T06:46:48Z) - Pixel-Wise Recognition for Holistic Surgical Scene Understanding [33.40319680006502]
This paper presents the Holistic and Multi-Granular Surgical Scene Understanding of Prostatectomies dataset. Our benchmark models surgical scene understanding as a hierarchy of complementary tasks with varying levels of granularity. To exploit our proposed benchmark, we introduce the Transformers for Actions, Phases, Steps, and Instrument Segmentation (TAPIS) model.
arXiv Detail & Related papers (2024-01-20T09:09:52Z)
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.