Related papers: Multimodal Optimal Transport for Unsupervised Temporal Segmentation in Surgical Robotics

Multimodal Optimal Transport for Unsupervised Temporal Segmentation in Surgical Robotics

URL: http://arxiv.org/abs/2602.24138v1
Date: Fri, 27 Feb 2026 16:15:58 GMT
Title: Multimodal Optimal Transport for Unsupervised Temporal Segmentation in Surgical Robotics
Authors: Omar Mohamed, Edoardo Fazzari, Ayah Al-Naji, Hamdan Alhadhrami, Khalfan Hableel, Saif Alkindi, Cesare Stefanini,
Abstract summary: Recognizing surgical phases and steps from video is a fundamental problem in computer-assisted interventions.<n>Recent approaches increasingly rely on large-scale pre-training on thousands of labeled surgical videos, followed by zero-shot transfer to specific procedures.<n>We propose Text-Augmented Action Optimal Transport (TASOT), an unsupervised method for surgical phase and step recognition.
Score: 2.582839864045357
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recognizing surgical phases and steps from video is a fundamental problem in computer-assisted interventions. Recent approaches increasingly rely on large-scale pre-training on thousands of labeled surgical videos, followed by zero-shot transfer to specific procedures. While effective, this strategy incurs substantial computational and data collection costs. In this work, we question whether such heavy pre-training is truly necessary. We propose Text-Augmented Action Segmentation Optimal Transport (TASOT), an unsupervised method for surgical phase and step recognition that extends Action Segmentation Optimal Transport (ASOT) by incorporating textual information generated directly from the videos. TASOT formulates temporal action segmentation as a multimodal optimal transport problem, where the matching cost is defined as a weighted combination of visual and text-based costs. The visual term captures frame-level appearance similarity, while the text term provides complementary semantic cues, and both are jointly regularized through a temporally consistent unbalanced Gromov-Wasserstein formulation. This design enables effective alignment between video frames and surgical actions without surgical-specific pretraining or external web-scale supervision. We evaluate TASOT on multiple benchmark surgical datasets and observe consistent and substantial improvements over existing zero-shot methods, including StrasBypass70 (+23.7), BernBypass70 (+4.5), Cholec80 (+16.5), and AutoLaparo (+19.6). These results demonstrate that fine-grained surgical understanding can be achieved by exploiting information already present in standard visual and textual representations, without resorting to increasingly complex pre-training pipelines. The code will be available at https://github.com/omar8ahmed9/TASOT.

Related papers

Token Merging via Spatiotemporal Information Mining for Surgical Video Understanding [32.4892900455388]
We propose video understanding token merging (STIM-TM) method, representing the first dedicated approach for surgical understanding tasks.<n>STIM-TM introduces a decoupled strategy that reduces token redundancy along temporal and spatial dimensions independently.<n> operating in a training-free manner, STIM-TM achieves significant efficiency with over $65$ GFLOPs reduction while preserving competitive accuracy across comprehensive surgical video tasks.
arXiv Detail & Related papers (2025-09-28T06:24:57Z)
Surgical Video Understanding with Label Interpolation [3.880707330499936]
Robot-assisted surgery (RAS) has become a critical paradigm in modern surgery, promoting patient recovery and reducing the burden on surgeons.<n>Previous studies have predominantly focused on single-task approaches, but real surgical scenes involve complex temporal dynamics and diverse instrument interactions.<n>We propose a novel framework that combines optical flow-based segmentation label with multi-task learning.
arXiv Detail & Related papers (2025-09-23T08:49:07Z)
SurgVidLM: Towards Multi-grained Surgical Video Understanding with Large Language Model [67.8359850515282]
SurgVidLM is the first video language model designed to address both full and fine-grained surgical video comprehension.<n>We show that SurgVidLM significantly outperforms state-of-the-art Vid-LLMs of comparable parameter scale in both full and fine-grained video understanding tasks.
arXiv Detail & Related papers (2025-06-22T02:16:18Z)
Efficient Frame Extraction: A Novel Approach Through Frame Similarity and Surgical Tool Tracking for Video Segmentation [0.0]
We propose a technique that can efficiently eliminate redundant frames to reduce dataset size and computation time.<n>Specifically, we compute the similarity between consecutive frames by tracking the movement of surgical tools.<n>We evaluate the effectiveness of our approach by analyzing datasets obtained through retrospective reviews of cases.
arXiv Detail & Related papers (2025-01-19T19:36:09Z)
Procedure-Aware Surgical Video-language Pretraining with Hierarchical Knowledge Augmentation [51.222684687924215]
Surgical video-language pretraining faces unique challenges due to the knowledge domain gap and the scarcity of multi-modal data.<n>We propose a hierarchical knowledge augmentation approach and a novel Procedure-Encoded Surgical Knowledge-Augmented Video-Language Pretraining framework to tackle these issues.
arXiv Detail & Related papers (2024-09-30T22:21:05Z)
Visual-Kinematics Graph Learning for Procedure-agnostic Instrument Tip Segmentation in Robotic Surgeries [29.201385352740555]
We propose a novel visual-kinematics graph learning framework to accurately segment the instrument tip given various surgical procedures. Specifically, a graph learning framework is proposed to encode relational features of instrument parts from both image and kinematics. A cross-modal contrastive loss is designed to incorporate robust geometric prior from kinematics to image for tip segmentation.
arXiv Detail & Related papers (2023-09-02T14:52:58Z)
Learning Multi-modal Representations by Watching Hundreds of Surgical Video Lectures [50.09187683845788]
Recent advancements in surgical computer vision applications have been driven by vision-only models.<n>These methods rely on manually annotated surgical videos to predict a fixed set of object categories.<n>In this work, we put forward the idea that the surgical video lectures available through open surgical e-learning platforms can provide effective vision and language supervisory signals.
arXiv Detail & Related papers (2023-07-27T22:38:12Z)
GLSFormer : Gated - Long, Short Sequence Transformer for Step Recognition in Surgical Videos [57.93194315839009]
We propose a vision transformer-based approach to learn temporal features directly from sequence-level patches. We extensively evaluate our approach on two cataract surgery video datasets, Cataract-101 and D99, and demonstrate superior performance compared to various state-of-the-art methods.
arXiv Detail & Related papers (2023-07-20T17:57:04Z)
LoViT: Long Video Transformer for Surgical Phase Recognition [59.06812739441785]
We present a two-stage method, called Long Video Transformer (LoViT) for fusing short- and long-term temporal information. Our approach outperforms state-of-the-art methods on the Cholec80 and AutoLaparo datasets consistently.
arXiv Detail & Related papers (2023-05-15T20:06:14Z)
Trans-SVNet: Accurate Phase Recognition from Surgical Videos via Hybrid Embedding Aggregation Transformer [57.18185972461453]
We introduce for the first time in surgical workflow analysis Transformer to reconsider the ignored complementary effects of spatial and temporal features for accurate phase recognition. Our framework is lightweight and processes the hybrid embeddings in parallel to achieve a high inference speed.
arXiv Detail & Related papers (2021-03-17T15:12:55Z)
Learning Motion Flows for Semi-supervised Instrument Segmentation from Robotic Surgical Video [64.44583693846751]
We study the semi-supervised instrument segmentation from robotic surgical videos with sparse annotations. By exploiting generated data pairs, our framework can recover and even enhance temporal consistency of training sequences. Results show that our method outperforms the state-of-the-art semisupervised methods by a large margin.
arXiv Detail & Related papers (2020-07-06T02:39:32Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.