Surgical Video Understanding with Label Interpolation
- URL: http://arxiv.org/abs/2509.18802v1
- Date: Tue, 23 Sep 2025 08:49:07 GMT
- Title: Surgical Video Understanding with Label Interpolation
- Authors: Garam Kim, Tae Kyeong Jeong, Juyoun Park
- Abstract summary: Robot-assisted surgery (RAS) has become a critical paradigm in modern surgery, promoting patient recovery and reducing the burden on surgeons. Previous studies have predominantly focused on single-task approaches, but real surgical scenes involve complex temporal dynamics and diverse instrument interactions. We propose a novel framework that combines optical flow-based segmentation label interpolation with multi-task learning.
- Score: 3.880707330499936
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Robot-assisted surgery (RAS) has become a critical paradigm in modern surgery, promoting patient recovery and reducing the burden on surgeons through minimally invasive approaches. To fully realize its potential, however, a precise understanding of the visual data generated during surgical procedures is essential. Previous studies have predominantly focused on single-task approaches, but real surgical scenes involve complex temporal dynamics and diverse instrument interactions that limit comprehensive understanding. Moreover, the effective application of multi-task learning (MTL) requires sufficient pixel-level segmentation data, which are difficult to obtain due to the high cost and expertise required for annotation. In particular, long-term annotations such as phases and steps are available for every frame, whereas short-term annotations such as surgical instrument segmentation and action detection are provided only for key frames, resulting in a significant temporal-spatial imbalance. To address these challenges, we propose a novel framework that combines optical flow-based segmentation label interpolation with multi-task learning. Optical flow estimated from annotated key frames is used to propagate labels to adjacent unlabeled frames, thereby enriching sparse spatial supervision and balancing temporal and spatial information for training. This integration improves both the accuracy and efficiency of surgical scene understanding and, in turn, enhances the utility of RAS.
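The abstract gives only the high-level recipe, but the propagation step is easy to picture in code. Below is a minimal sketch, assuming OpenCV's Farneback dense flow and nearest-neighbor backward warping; `propagate_label` and its arguments are illustrative names, not the authors' implementation.

```python
# Hedged sketch: propagate a key-frame segmentation mask to an adjacent
# unlabeled frame via dense optical flow. Assumes BGR uint8 frames and an
# integer class-id mask; not the paper's actual implementation.
import cv2
import numpy as np

def propagate_label(key_frame, next_frame, key_mask):
    key_gray = cv2.cvtColor(key_frame, cv2.COLOR_BGR2GRAY)
    next_gray = cv2.cvtColor(next_frame, cv2.COLOR_BGR2GRAY)
    # Backward flow: for each pixel of the unlabeled frame, find where it
    # came from in the annotated key frame (standard Farneback parameters).
    flow = cv2.calcOpticalFlowFarneback(next_gray, key_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    h, w = key_mask.shape
    grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))
    map_x = (grid_x + flow[..., 0]).astype(np.float32)
    map_y = (grid_y + flow[..., 1]).astype(np.float32)
    # Nearest-neighbor sampling keeps discrete class ids intact.
    return cv2.remap(key_mask, map_x, map_y, cv2.INTER_NEAREST)
```

Repeating this frame by frame outward from each annotated key frame yields the dense pseudo-labels that rebalance the sparse short-term supervision against the frame-level long-term annotations.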
Related papers
- Multimodal Optimal Transport for Unsupervised Temporal Segmentation in Surgical Robotics [2.582839864045357]
Recognizing surgical phases and steps from video is a fundamental problem in computer-assisted interventions. Recent approaches increasingly rely on large-scale pre-training on thousands of labeled surgical videos, followed by zero-shot transfer to specific procedures. We propose Text-Augmented Action Optimal Transport (TASOT), an unsupervised method for surgical phase and step recognition.
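The summary does not spell out the transport step; as a rough illustration of how entropic optimal transport can assign frames to phase prototypes without labels, here is a small Sinkhorn sketch. The cosine cost, uniform marginals, and hyperparameters are assumptions for exposition, not TASOT's actual formulation.

```python
# Hedged sketch: entropic OT (Sinkhorn) soft-assigns T video frames to K
# phase/step prototypes with balanced marginals. Assumes L2-normalized rows.
import numpy as np

def sinkhorn_assign(frame_feats, prototypes, eps=0.1, iters=50):
    cost = 1.0 - frame_feats @ prototypes.T          # (T, K) cosine cost
    K_mat = np.exp(-cost / eps)                      # Gibbs kernel
    T, K = K_mat.shape
    r, c = np.full(T, 1.0 / T), np.full(K, 1.0 / K)  # uniform marginals
    u = np.ones(T)
    for _ in range(iters):                           # Sinkhorn iterations
        v = c / (K_mat.T @ u)
        u = r / (K_mat @ v)
    P = u[:, None] * K_mat * v[None, :]              # transport plan (T, K)
    return P.argmax(axis=1)                          # hard phase labels
```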
arXiv Detail & Related papers (2026-02-27T16:15:58Z)
- Attachment Anchors: A Novel Framework for Laparoscopic Grasping Point Prediction in Colorectal Surgery [19.147229560255]
We introduce attachment anchors, a structured representation that encodes the local geometric and mechanical relationships between tissue and its anatomical attachments in colorectal surgery. This representation reduces uncertainty in grasping point prediction by normalizing surgical scenes into a consistent local reference frame. Experiments on a dataset of 90 colorectal surgeries demonstrate that attachment anchors improve grasping point prediction compared to image-only baselines.
arXiv Detail & Related papers (2026-02-19T12:19:56Z)
- CurConMix+: A Unified Spatio-Temporal Framework for Hierarchical Surgical Workflow Understanding [1.0692208281858493]
Surgical action triplet recognition aims to understand fine surgical behaviors by modeling interactions among instruments, actions, and anatomical targets. Despite its clinical importance for workflow analysis and skill assessment, progress has been hindered by severe class imbalance, subtle visual variations, and the interdependence among triplet components. This study builds upon CurConMix, a spatial representation framework, to address these challenges.
arXiv Detail & Related papers (2026-01-18T08:42:10Z)
- EndoARSS: Adapting Spatially-Aware Foundation Model for Efficient Activity Recognition and Semantic Segmentation in Endoscopic Surgery [11.286605039002419]
Endoscopic surgery is the gold standard for robotic-assisted minimally invasive surgery. Traditional deep learning models often struggle with cross-activity interference, leading to suboptimal performance in each downstream task. We propose EndoARSS, a novel multi-task learning framework specifically designed for endoscopic surgery activity recognition and semantic segmentation.
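To make the multi-task setup concrete, here is a minimal shared-encoder, two-head model in PyTorch: one head for activity recognition, one for segmentation. The architecture is an illustrative assumption, not EndoARSS itself.

```python
# Hedged sketch of a shared-encoder multi-task model: one backbone, one
# classification head, one segmentation head. Layer sizes are illustrative.
import torch
import torch.nn as nn

class TwoTaskNet(nn.Module):
    def __init__(self, num_activities=10, num_classes=5):
        super().__init__()
        self.encoder = nn.Sequential(            # shared feature extractor
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU())
        self.cls_head = nn.Sequential(           # activity recognition head
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, num_activities))
        self.seg_head = nn.Sequential(           # semantic segmentation head
            nn.Conv2d(64, num_classes, 1),
            nn.Upsample(scale_factor=4, mode="bilinear",
                        align_corners=False))

    def forward(self, x):
        feats = self.encoder(x)                  # shared features feed both heads
        return self.cls_head(feats), self.seg_head(feats)
```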
arXiv Detail & Related papers (2025-06-07T15:18:43Z)
- ProstaTD: Bridging Surgical Triplet from Classification to Fully Supervised Detection [54.270188252068145]
ProstaTD is a large-scale dataset for surgical triplet detection developed from the technically demanding domain of robot-assisted prostatectomy. The dataset comprises 71,775 video frames and 196,490 annotated triplet instances, collected from 21 surgeries performed across multiple institutions. ProstaTD is the largest and most diverse surgical triplet dataset to date, moving the field from simple classification to full detection with precise spatial and temporal boundaries.
arXiv Detail & Related papers (2025-06-01T19:29:39Z)
- Procedure-Aware Surgical Video-language Pretraining with Hierarchical Knowledge Augmentation [51.222684687924215]
Surgical video-language pretraining faces unique challenges due to the knowledge domain gap and the scarcity of multi-modal data. We propose a hierarchical knowledge augmentation approach and a novel Procedure-Encoded Surgical Knowledge-Augmented Video-Language Pretraining framework to tackle these issues.
arXiv Detail & Related papers (2024-09-30T22:21:05Z)
- Surgical Temporal Action-aware Network with Sequence Regularization for Phase Recognition [28.52533700429284]
We propose a Surgical Temporal Action-aware Network with sequence Regularization, named STAR-Net, to recognize surgical phases more accurately from input videos.
The MS-STA module integrates visual features with spatial and temporal knowledge of surgical actions at the computational cost of 2D networks.
Our STAR-Net with MS-STA and DSR can exploit visual features of surgical actions with effective regularization, thereby leading to the superior performance of surgical phase recognition.
arXiv Detail & Related papers (2023-11-21T13:43:16Z)
- Phase-Specific Augmented Reality Guidance for Microscopic Cataract Surgery Using Long-Short Spatiotemporal Aggregation Transformer [14.568834378003707]
Phacoemulsification cataract surgery (PCS) is a routine procedure performed using a surgical microscope.
PCS guidance systems extract valuable information from surgical microscopic videos to enhance proficiency.
Existing PCS guidance systems suffer from non-phase-specific guidance, leading to redundant visual information.
We propose a novel phase-specific augmented reality (AR) guidance system, which offers tailored AR information corresponding to the recognized surgical phase.
arXiv Detail & Related papers (2023-09-11T02:56:56Z)
- GLSFormer: Gated - Long, Short Sequence Transformer for Step Recognition in Surgical Videos [57.93194315839009]
We propose a vision transformer-based approach to learn temporal features directly from sequence-level patches.
We extensively evaluate our approach on two cataract surgery video datasets, Cataract-101 and D99, and demonstrate superior performance compared to various state-of-the-art methods.
arXiv Detail & Related papers (2023-07-20T17:57:04Z)
- Federated Cycling (FedCy): Semi-supervised Federated Learning of Surgical Phases [57.90226879210227]
FedCy is a federated semi-supervised learning (FSSL) method that combines federated learning (FL) and self-supervised learning to exploit a decentralized dataset of both labeled and unlabeled videos.
We demonstrate significant performance gains over state-of-the-art FSSL methods on the task of automatic recognition of surgical phases.
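The federated half of such a method typically reduces to size-weighted averaging of client models; a minimal FedAvg-style sketch follows, with client training (supervised or self-supervised) abstracted away. Names and structure are illustrative, not FedCy's implementation.

```python
# Hedged FedAvg-style aggregation sketch: average per-client model weights,
# weighted by local dataset size. Client-side training is not shown.
import numpy as np

def fedavg(client_weights, client_sizes):
    """client_weights: list (per client) of lists of np.ndarray layers."""
    total = sum(client_sizes)
    avg = [np.zeros_like(w) for w in client_weights[0]]
    for weights, n in zip(client_weights, client_sizes):
        for layer, w in enumerate(weights):
            avg[layer] += (n / total) * w   # size-weighted contribution
    return avg
```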
arXiv Detail & Related papers (2022-03-14T17:44:53Z)
- Real-time landmark detection for precise endoscopic submucosal dissection via shape-aware relation network [51.44506007844284]
We propose a shape-aware relation network for accurate and real-time landmark detection in endoscopic submucosal dissection surgery.
We first devise an algorithm to automatically generate relation keypoint heatmaps, which intuitively represent the prior knowledge of spatial relations among landmarks.
We then develop two complementary regularization schemes to progressively incorporate the prior knowledge into the training process.
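The paper's exact heatmap construction is not given here; one plausible reading is a Gaussian target placed at the midpoint of each landmark pair, encoding their spatial relation as an auxiliary training signal. The midpoint choice and `sigma` below are assumptions for illustration, not the paper's algorithm.

```python
# Hedged sketch: one relation heatmap per landmark pair, a Gaussian at the
# pair midpoint. Midpoint placement and sigma are illustrative assumptions.
import numpy as np

def relation_heatmaps(landmarks, hw, sigma=4.0):
    """landmarks: (N, 2) array of (x, y); returns (N*(N-1)/2, H, W) maps."""
    h, w = hw
    ys, xs = np.mgrid[0:h, 0:w]
    maps = []
    for i in range(len(landmarks)):
        for j in range(i + 1, len(landmarks)):
            mx, my = (landmarks[i] + landmarks[j]) / 2.0   # pair midpoint
            maps.append(np.exp(-((xs - mx) ** 2 + (ys - my) ** 2)
                               / (2 * sigma ** 2)))
    return np.stack(maps)
```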
arXiv Detail & Related papers (2021-11-08T07:57:30Z)
- Robust Medical Instrument Segmentation Challenge 2019 [56.148440125599905]
Intraoperative tracking of laparoscopic instruments is often a prerequisite for computer and robotic-assisted interventions.
Our challenge was based on a surgical data set comprising 10,040 annotated images acquired from a total of 30 surgical procedures.
The results confirm the initial hypothesis, namely that algorithm performance degrades with an increasing domain gap.
arXiv Detail & Related papers (2020-03-23T14:35:08Z)