A generalizable foundation model for intraoperative understanding across surgical procedures
- URL: http://arxiv.org/abs/2602.13633v1
- Date: Sat, 14 Feb 2026 06:52:42 GMT
- Title: A generalizable foundation model for intraoperative understanding across surgical procedures
- Authors: Kanggil Park, Yongjun Jeon, Soyoung Lim, Seonmin Park, Jongmin Shin, Jung Yong Kim, Sehyeon An, Jinsoo Rhu, Jongman Kim, Gyu-Seong Choi, Namkee Oh, Kyu-Hwan Jung
- Abstract summary: We introduce ZEN, a generalizable foundation model for intraoperative surgical video understanding trained on more than 4 million frames from over 21 procedures. ZEN consistently outperforms existing surgical foundation models and demonstrates robust cross-procedure generalization.
- Score: 1.0412442875956527
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In minimally invasive surgery, clinical decisions depend on real-time visual interpretation, yet intraoperative perception varies substantially across surgeons and procedures. This variability limits consistent assessment, training, and the development of reliable artificial intelligence systems, as most surgical AI models are designed for narrowly defined tasks and do not generalize across procedures or institutions. Here we introduce ZEN, a generalizable foundation model for intraoperative surgical video understanding trained on more than 4 million frames from over 21 procedures using a self-supervised multi-teacher distillation framework. We curated a large and diverse dataset and systematically evaluated multiple representation learning strategies within a unified benchmark. Across 20 downstream tasks and full fine-tuning, frozen-backbone, few-shot and zero-shot settings, ZEN consistently outperforms existing surgical foundation models and demonstrates robust cross-procedure generalization. These results suggest a step toward unified representations for surgical scene understanding and support future applications in intraoperative assistance and surgical training assessment.
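The abstract names a self-supervised multi-teacher distillation framework but gives no architectural details here. As a rough illustration of that general technique, the PyTorch sketch below trains a small student encoder to match the frame embeddings of several frozen teacher encoders on unlabeled frames. The ResNet backbones, the projection heads, and the cosine-matching loss are all assumptions made for illustration, not details taken from the ZEN paper.

```python
# Hypothetical sketch of self-supervised multi-teacher distillation for video
# frames. All architecture and loss choices are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet18, resnet50

class ProjectionHead(nn.Module):
    """Maps student features into a teacher's embedding space."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 512), nn.GELU(), nn.Linear(512, out_dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.net(x), dim=-1)

class MultiTeacherDistiller(nn.Module):
    """A student encoder trained to match several frozen teacher encoders."""
    def __init__(self, num_teachers: int = 2):
        super().__init__()
        # Frozen teachers; stand-ins for whatever pretrained encoders are used.
        self.teachers = nn.ModuleList(
            [resnet50(weights=None) for _ in range(num_teachers)]
        )
        for t in self.teachers:
            t.fc = nn.Identity()          # expose 2048-d pooled features
            for p in t.parameters():
                p.requires_grad = False
        # Trainable student backbone (512-d pooled features).
        self.student = resnet18(weights=None)
        self.student.fc = nn.Identity()
        # One projection head per teacher: 512-d student -> 2048-d teacher space.
        self.heads = nn.ModuleList(
            [ProjectionHead(512, 2048) for _ in self.teachers]
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        s_feat = self.student(frames)
        loss = frames.new_zeros(())
        for teacher, head in zip(self.teachers, self.heads):
            with torch.no_grad():
                t_feat = F.normalize(teacher(frames), dim=-1)
            # Cosine distillation: pull projected student features toward the teacher's.
            loss = loss + (1 - F.cosine_similarity(head(s_feat), t_feat, dim=-1)).mean()
        return loss / len(self.teachers)

if __name__ == "__main__":
    model = MultiTeacherDistiller()
    frames = torch.randn(4, 3, 224, 224)   # a mini-batch of unlabeled video frames
    loss = model(frames)
    loss.backward()
    print(f"distillation loss: {loss.item():.4f}")
```

Projecting the student into each teacher's space, rather than training heads on both sides, avoids the trivial collapse that matching two trainable projections can produce; the frozen teachers anchor the targets.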
Related papers
- NeuroABench: A Multimodal Evaluation Benchmark for Neurosurgical Anatomy Identification [56.133469598652624]
Multimodal Large Language Models (MLLMs) have shown significant potential in surgical video understanding. The Neurosurgical Anatomy Benchmark (NeuroABench) is the first multimodal benchmark explicitly created to evaluate anatomical understanding in the neurosurgical domain. NeuroABench consists of 9 hours of annotated neurosurgical videos covering 89 distinct procedures.
arXiv Detail & Related papers (2025-12-07T17:00:25Z)
- Surg-SegFormer: A Dual Transformer-Based Model for Holistic Surgical Scene Segmentation [6.285713987996377]
We introduce Surg-SegFormer, a novel prompt-free model that outperforms current state-of-the-art techniques. By providing robust and automated surgical scene comprehension, this model significantly reduces the tutoring burden on expert surgeons.
arXiv Detail & Related papers (2025-07-06T09:04:25Z)
- Large-scale Self-supervised Video Foundation Model for Intelligent Surgery [27.418249899272155]
We introduce the first video-level surgical pre-training framework that enables joint spatio-temporal representation learning from large-scale surgical video data. We propose SurgVISTA, a reconstruction-based pre-training method that captures spatial structures and intricate temporal dynamics. In experiments, SurgVISTA consistently outperforms both natural- and surgical-domain pre-trained models.
arXiv Detail & Related papers (2025-06-03T09:42:54Z)
- EndoBench: A Comprehensive Evaluation of Multi-Modal Large Language Models for Endoscopy Analysis [62.00431604976949]
EndoBench is the first comprehensive benchmark specifically designed to assess MLLMs across the full spectrum of endoscopic practice. We benchmark 23 state-of-the-art models, including general-purpose, medical-specialized, and proprietary MLLMs. Our experiments reveal that proprietary MLLMs outperform open-source and medical-specialized models overall, but still trail human experts.
arXiv Detail & Related papers (2025-05-29T16:14:34Z)
- Surgical Foundation Model Leveraging Compression and Entropy Maximization for Image-Guided Surgical Assistance [50.486523249499115]
Real-time video understanding is critical to guide procedures in minimally invasive surgery (MIS). We propose Compress-to-Explore (C2E), a novel self-supervised framework to learn compact, informative representations from surgical videos. C2E uses entropy-maximizing decoders to compress images while preserving clinically relevant details, improving encoder performance without labeled data.
arXiv Detail & Related papers (2025-05-16T14:02:24Z)
- SurgXBench: Explainable Vision-Language Model Benchmark for Surgery [4.068223793121694]
Vision-Language Models (VLMs) have brought transformative advances in reasoning across visual and textual modalities. Existing models show limited performance, highlighting the need for benchmark studies to assess their capabilities and limitations. We benchmark the zero-shot performance of several advanced VLMs on two public robotic-assisted laparoscopic datasets for instrument and action classification.
arXiv Detail & Related papers (2025-05-16T00:42:18Z)
- Surgeons vs. Computer Vision: A comparative analysis on surgical phase recognition capabilities [65.66373425605278]
Automated Surgical Phase Recognition (SPR) uses Artificial Intelligence (AI) to segment the surgical workflow into its key events. Previous research has focused on short and linear surgical procedures and has not explored whether temporal context improves experts' ability to classify surgical phases. This research addresses these gaps, focusing on Robot-Assisted Partial Nephrectomy (RAPN) as a highly non-linear procedure.
arXiv Detail & Related papers (2025-04-26T15:37:22Z)
- Systematic Evaluation of Large Vision-Language Models for Surgical Artificial Intelligence [1.1765603103920352]
Large Vision-Language Models offer a new paradigm for AI-driven image understanding. This flexibility holds particular promise across medicine, where expert-annotated data is scarce. Here, we present a comprehensive analysis of 11 state-of-the-art VLMs across 17 key visual understanding tasks in surgical AI.
arXiv Detail & Related papers (2025-04-03T17:42:56Z)
- Quantification of Robotic Surgeries with Vision-Based Deep Learning [45.165919577877695]
We propose a unified deep learning framework, entitled Roboformer, which operates exclusively on videos recorded during surgery.
We validated our framework on four video-based datasets of two commonly-encountered types of steps within minimally-invasive robotic surgeries.
arXiv Detail & Related papers (2022-05-06T06:08:35Z)
- CholecTriplet2021: A benchmark challenge for surgical action triplet recognition [66.51610049869393]
This paper presents CholecTriplet 2021: an endoscopic vision challenge organized at MICCAI 2021 for the recognition of surgical action triplets in laparoscopic videos.
We present the challenge setup and assessment of the state-of-the-art deep learning methods proposed by the participants during the challenge.
A total of 4 baseline methods and 19 new deep learning algorithms are presented to recognize surgical action triplets directly from surgical videos, achieving mean average precision (mAP) ranging from 4.2% to 38.1% (a minimal mAP sketch follows this list).
arXiv Detail & Related papers (2022-04-10T18:51:55Z)
- Towards Unified Surgical Skill Assessment [18.601526803020885]
We propose a unified multi-path framework for automatic surgical skill assessment.
We conduct experiments on the JIGSAWS dataset of simulated surgical tasks, and a new clinical dataset of real laparoscopic surgeries.
arXiv Detail & Related papers (2021-06-02T09:06:43Z)
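For reference, the mAP figures quoted for CholecTriplet2021 above follow the usual multi-label formulation: per-class average precision, averaged over the triplet classes. A minimal scikit-learn sketch is below; the frame count, class count, and random scores are made up for illustration and are not CholecTriplet2021 data.

```python
# Minimal mAP computation for multi-label triplet recognition; sizes and
# scores are illustrative placeholders, not challenge data.
import numpy as np
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(0)
n_frames, n_triplet_classes = 200, 100          # hypothetical sizes
y_true = rng.integers(0, 2, size=(n_frames, n_triplet_classes))
y_score = rng.random((n_frames, n_triplet_classes))

# Per-class AP, averaged over classes that occur at least once.
aps = [average_precision_score(y_true[:, c], y_score[:, c])
       for c in range(n_triplet_classes) if y_true[:, c].any()]
print(f"mAP: {np.mean(aps):.3f}")
```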
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.