Surg-SegFormer: A Dual Transformer-Based Model for Holistic Surgical Scene Segmentation
- URL: http://arxiv.org/abs/2507.04304v1
- Date: Sun, 06 Jul 2025 09:04:25 GMT
- Title: Surg-SegFormer: A Dual Transformer-Based Model for Holistic Surgical Scene Segmentation
- Authors: Fatimaelzahraa Ahmed, Muraam Abdel-Ghani, Muhammad Arsalan, Mahmoud Ali, Abdulaziz Al-Ali, Shidin Balakrishnan,
- Abstract summary: We introduce Surg-SegFormer, a novel prompt-free model that outperforms current state-of-the-art techniques. By providing robust and automated surgical scene comprehension, this model significantly reduces the tutoring burden on expert surgeons.
- Score: 6.285713987996377
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Holistic surgical scene segmentation in robot-assisted surgery (RAS) enables surgical residents to identify various anatomical tissues, articulated tools, and critical structures, such as veins and vessels. Given the firm intraoperative time constraints, it is challenging for surgeons to provide detailed real-time explanations of the operative field for trainees. This challenge is compounded by the scarcity of expert surgeons relative to trainees, making the unambiguous delineation of go- and no-go zones inconvenient. Therefore, high-performance semantic segmentation models offer a solution by providing clear postoperative analyses of surgical procedures. However, recent advanced segmentation models rely on user-generated prompts, rendering them impractical for lengthy surgical videos that commonly exceed an hour. To address this challenge, we introduce Surg-SegFormer, a novel prompt-free model that outperforms current state-of-the-art techniques. Surg-SegFormer attained a mean Intersection over Union (mIoU) of 0.80 on the EndoVis2018 dataset and 0.54 on the EndoVis2017 dataset. By providing robust and automated surgical scene comprehension, this model significantly reduces the tutoring burden on expert surgeons, empowering residents to independently and effectively understand complex surgical environments.
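For context on the headline numbers, mIoU averages per-class intersection-over-union between predicted and ground-truth masks. The snippet below is a minimal, illustrative sketch of that computation in NumPy; it is not the authors' evaluation code, and benchmark-specific conventions (such as ignore labels in the EndoVis toolkits) are omitted.

```python
import numpy as np

def mean_iou(pred: np.ndarray, gt: np.ndarray, num_classes: int) -> float:
    """Mean Intersection over Union, averaged over classes present in either mask."""
    ious = []
    for c in range(num_classes):
        pred_c, gt_c = pred == c, gt == c
        union = np.logical_or(pred_c, gt_c).sum()
        if union == 0:
            continue  # class absent from both masks; excluded from the mean
        ious.append(np.logical_and(pred_c, gt_c).sum() / union)
    return float(np.mean(ious))

# Toy 4x4 masks with three classes (0 = background)
gt = np.array([[0, 0, 1, 1],
               [0, 0, 1, 1],
               [2, 2, 0, 0],
               [2, 2, 0, 0]])
pred = np.array([[0, 0, 1, 1],
                 [0, 1, 1, 1],
                 [2, 2, 0, 0],
                 [2, 0, 0, 0]])
print(round(mean_iou(pred, gt, num_classes=3), 3))  # ~0.776
```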
Related papers
- Large-scale Self-supervised Video Foundation Model for Intelligent Surgery [27.418249899272155]
We introduce the first video-level surgical pre-training framework that enables joint spatiotemporal representation learning from large-scale surgical video data. We propose SurgVISTA, a reconstruction-based pre-training method that captures spatial structures and intricate temporal dynamics. In experiments, SurgVISTA consistently outperforms both natural- and surgical-domain pre-trained models.
arXiv Detail & Related papers (2025-06-03T09:42:54Z)
- SurgVLM: A Large Vision-Language Model and Systematic Evaluation Benchmark for Surgical Intelligence [72.10889173696928]
We propose SurgVLM, one of the first large vision-language foundation models for surgical intelligence. We construct a large-scale multimodal surgical database, SurgVLM-DB, spanning more than 16 surgical types and 18 anatomical structures. Building upon this comprehensive dataset, we propose SurgVLM, which is built upon Qwen2.5-VL and undergoes instruction tuning for 10+ surgical tasks.
arXiv Detail & Related papers (2025-06-03T07:44:41Z)
- Surgical Foundation Model Leveraging Compression and Entropy Maximization for Image-Guided Surgical Assistance [50.486523249499115]
Real-time video understanding is critical to guide procedures in minimally invasive surgery (MIS). We propose Compress-to-Explore (C2E), a novel self-supervised framework to learn compact, informative representations from surgical videos. C2E uses entropy-maximizing decoders to compress images while preserving clinically relevant details, improving encoder performance without labeled data.
arXiv Detail & Related papers (2025-05-16T14:02:24Z)
- Surgeons vs. Computer Vision: A comparative analysis on surgical phase recognition capabilities [65.66373425605278]
Automated Surgical Phase Recognition (SPR) uses Artificial Intelligence (AI) to segment the surgical workflow into its key events. Previous research has focused on short and linear surgical procedures and has not explored whether temporal context influences experts' ability to better classify surgical phases. This research addresses these gaps, focusing on Robot-Assisted Partial Nephrectomy (RAPN) as a highly non-linear procedure.
arXiv Detail & Related papers (2025-04-26T15:37:22Z)
- EndoChat: Grounded Multimodal Large Language Model for Endoscopic Surgery [52.992415247012296]
We introduce EndoChat to address various dialogue paradigms and subtasks in surgical scene understanding. Our model achieves state-of-the-art performance across five dialogue paradigms and eight surgical scene understanding tasks.
arXiv Detail & Related papers (2025-01-20T09:12:06Z)
- Is Segment Anything Model 2 All You Need for Surgery Video Segmentation? A Systematic Evaluation [25.459372606957736]
In this paper, we systematically evaluate the performance of the SAM2 model on the zero-shot surgery video segmentation task. We conducted experiments under different configurations, including different prompting strategies and robustness settings.
arXiv Detail & Related papers (2024-12-31T16:20:05Z)
- OphNet: A Large-Scale Video Benchmark for Ophthalmic Surgical Workflow Understanding [26.962250661485967]
OphNet is a large-scale, expert-annotated video benchmark for ophthalmic surgical workflow understanding.
It offers a diverse collection of 2,278 surgical videos spanning 66 types of cataract, glaucoma, and corneal surgeries, with detailed annotations for 102 unique surgical phases and 150 fine-grained operations.
OphNet is about 20 times larger than the largest existing surgical workflow analysis benchmark.
arXiv Detail & Related papers (2024-06-11T17:18:11Z)
- Visual-Kinematics Graph Learning for Procedure-agnostic Instrument Tip Segmentation in Robotic Surgeries [29.201385352740555]
We propose a novel visual-kinematics graph learning framework to accurately segment the instrument tip across various surgical procedures.
Specifically, the framework encodes relational features of instrument parts from both images and kinematics.
A cross-modal contrastive loss is designed to transfer a robust geometric prior from kinematics to images for tip segmentation.
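The summary does not spell out the loss; as a hedged illustration only, a standard symmetric InfoNCE-style cross-modal objective between per-frame image and kinematics embeddings (names and formulation assumed here, not taken from the paper) could look like this:

```python
import torch
import torch.nn.functional as F

def cross_modal_contrastive_loss(img_emb: torch.Tensor,
                                 kin_emb: torch.Tensor,
                                 temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE: each image embedding is pulled toward the kinematics
    embedding of the same frame and pushed away from other frames in the batch."""
    img = F.normalize(img_emb, dim=-1)
    kin = F.normalize(kin_emb, dim=-1)
    logits = img @ kin.t() / temperature        # (B, B) cosine-similarity logits
    targets = torch.arange(img.size(0))         # matched pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Toy batch: 8 frames, 128-dim embeddings per modality
print(cross_modal_contrastive_loss(torch.randn(8, 128), torch.randn(8, 128)).item())
```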
arXiv Detail & Related papers (2023-09-02T14:52:58Z)
- CholecTriplet2021: A benchmark challenge for surgical action triplet recognition [66.51610049869393]
This paper presents CholecTriplet2021: an endoscopic vision challenge organized at MICCAI 2021 for the recognition of surgical action triplets in laparoscopic videos.
We present the challenge setup and assessment of the state-of-the-art deep learning methods proposed by the participants during the challenge.
A total of 4 baseline methods and 19 new deep learning algorithms are presented to recognize surgical action triplets directly from surgical videos, achieving mean average precision (mAP) ranging from 4.2% to 38.1%.
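For reference on how such mAP figures are typically computed, here is a minimal sketch of per-class average precision and its mean over triplet classes; it is illustrative only and not the challenge's official evaluation protocol.

```python
import numpy as np

def average_precision(scores: np.ndarray, labels: np.ndarray) -> float:
    """AP for one class: mean of the precision values at the ranks of positives."""
    order = np.argsort(-scores)       # sort frames by descending confidence
    labels = labels[order]
    precision = np.cumsum(labels) / np.arange(1, len(labels) + 1)
    num_pos = labels.sum()
    return float((precision * labels).sum() / max(num_pos, 1))

def mean_average_precision(scores: np.ndarray, labels: np.ndarray) -> float:
    """scores, labels: (num_frames, num_classes); mAP over classes with positives."""
    aps = [average_precision(scores[:, c], labels[:, c])
           for c in range(scores.shape[1]) if labels[:, c].any()]
    return float(np.mean(aps))

# Toy example: 6 frames, 3 triplet classes with per-frame confidence scores
scores = np.array([[0.9, 0.2, 0.1],
                   [0.8, 0.7, 0.4],
                   [0.3, 0.9, 0.2],
                   [0.2, 0.1, 0.8],
                   [0.7, 0.4, 0.3],
                   [0.1, 0.3, 0.9]])
labels = np.array([[1, 0, 0],
                   [1, 1, 0],
                   [0, 1, 0],
                   [0, 0, 1],
                   [1, 0, 0],
                   [0, 0, 1]])
print(round(mean_average_precision(scores, labels), 3))
```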
arXiv Detail & Related papers (2022-04-10T18:51:55Z)
- Multimodal Semantic Scene Graphs for Holistic Modeling of Surgical Procedures [70.69948035469467]
We take advantage of the latest computer vision methodologies for generating 3D graphs from camera views.
We then introduce the Multimodal Semantic Scene Graph (MSSG), which aims at providing a unified symbolic and semantic representation of surgical procedures.
arXiv Detail & Related papers (2021-06-09T14:35:44Z)
- Aggregating Long-Term Context for Learning Laparoscopic and Robot-Assisted Surgical Workflows [40.48632897750319]
We propose a new temporal network structure that leverages task-specific network representation to collect long-term sufficient statistics.
We demonstrate superior results over existing and novel state-of-the-art segmentation techniques on two laparoscopic cholecystectomy datasets.
arXiv Detail & Related papers (2020-09-01T20:29:14Z)