Related papers: Multi-Stage Boundary-Aware Transformer Network for Action Segmentation in Untrimmed Surgical Videos

URL: http://arxiv.org/abs/2504.18756v2
Date: Tue, 10 Jun 2025 22:05:32 GMT
Title: Multi-Stage Boundary-Aware Transformer Network for Action Segmentation in Untrimmed Surgical Videos
Authors: Rezowan Shuvo, M S Mekala, Eyad Elyan
Abstract summary: Capturing and analyzing long sequences of actions in surgical settings is challenging due to the inherent variability in individual surgeon approaches. This variability complicates the identification and segmentation of distinct actions with ambiguous boundary start and end points. We propose the Multi-Stage Boundary-Aware Transformer Network (MSBATN) with hierarchical sliding window attention to improve action segmentation.
Score: 0.1053373860696675
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Understanding actions within surgical workflows is critical for evaluating post-operative outcomes and enhancing surgical training and efficiency. Capturing and analyzing long sequences of actions in surgical settings is challenging due to the inherent variability in individual surgeon approaches, which are shaped by their expertise and preferences. This variability complicates the identification and segmentation of distinct actions with ambiguous boundary start and end points. Traditional models such as MS-TCN rely on large receptive fields, which causes over-segmentation or under-segmentation, where distinct actions are incorrectly aligned. To address these challenges, we propose the Multi-Stage Boundary-Aware Transformer Network (MSBATN) with hierarchical sliding window attention to improve action segmentation. Our approach effectively manages the complexity of varying action durations and subtle transitions by accurately identifying start and end action boundaries in untrimmed surgical videos. MSBATN introduces a novel unified loss function that optimises action classification and boundary detection as interconnected tasks. Unlike conventional binary boundary detection methods, our innovative boundary weighing mechanism leverages contextual information to precisely identify action boundaries. Extensive experiments on three challenging surgical datasets demonstrate that MSBATN achieves state-of-the-art performance, with superior F1 scores at the 25% and 50% thresholds and competitive results across other metrics.
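Note on the metric: the F1 scores at 25% and 50% thresholds mentioned in the abstract are presumably the segmental F1@k scores commonly reported in action segmentation, where a predicted segment counts as a true positive if its temporal IoU with an unmatched ground-truth segment of the same class reaches the threshold. The following is a minimal sketch of that common definition, not the paper's evaluation code; the function names and the toy label sequences are illustrative assumptions.

```python
def get_segments(labels):
    """Collapse frame-wise labels into (label, start, end) segments (end exclusive)."""
    segments = []
    start = 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            segments.append((labels[start], start, i))
            start = i
    return segments


def segmental_f1(pred, gt, iou_threshold):
    """Segmental F1 at a given temporal IoU threshold (e.g. 0.25 or 0.50)."""
    pred_segs = get_segments(pred)
    gt_segs = get_segments(gt)
    matched = [False] * len(gt_segs)
    tp = fp = 0
    for label, ps, pe in pred_segs:
        # Find the same-class ground-truth segment with the highest IoU.
        best_iou, best_idx = 0.0, -1
        for j, (g_label, gs, ge) in enumerate(gt_segs):
            if g_label != label:
                continue
            inter = max(0, min(pe, ge) - max(ps, gs))
            union = max(pe, ge) - min(ps, gs)
            iou = inter / union
            if iou > best_iou:
                best_iou, best_idx = iou, j
        # Each ground-truth segment may be matched at most once.
        if best_idx >= 0 and best_iou >= iou_threshold and not matched[best_idx]:
            tp += 1
            matched[best_idx] = True
        else:
            fp += 1
    fn = matched.count(False)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy example: ground truth has two actions; the second prediction is over-segmented.
gt = [0, 0, 0, 0, 1, 1, 1, 1]
slightly_shifted = [0, 0, 0, 1, 1, 1, 1, 1]
over_segmented = [0, 0, 1, 0, 0, 1, 1, 1]

for k in (0.25, 0.50):
    print(k, segmental_f1(slightly_shifted, gt, k), segmental_f1(over_segmented, gt, k))
```

In this toy case a small boundary shift leaves the score unchanged, while the spurious short segments in the over-segmented prediction count as false positives and lower it, which is why the metric is sensitive to over-segmentation.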

Related papers

EndoARSS: Adapting Spatially-Aware Foundation Model for Efficient Activity Recognition and Semantic Segmentation in Endoscopic Surgery [11.286605039002419]
Endoscopic surgery is the gold standard for robotic-assisted minimally invasive surgery. Traditional deep learning models often struggle with cross-activity interference, leading to suboptimal performance in each downstream task. We propose EndoARSS, a novel multi-task learning framework specifically designed for endoscopy surgery activity recognition and semantic segmentation.
arXiv Detail & Related papers (2025-06-07T15:18:43Z)
MAMBO-NET: Multi-Causal Aware Modeling Backdoor-Intervention Optimization for Medical Image Segmentation Network [51.68708264694361]
Confusion factors can affect medical images, such as complex anatomical variations and imaging modality limitations. We propose a multi-causal aware modeling backdoor-intervention optimization network for medical image segmentation. Our method significantly reduces the influence of confusion factors, leading to enhanced segmentation accuracy.
arXiv Detail & Related papers (2025-05-28T01:40:10Z)
ReSurgSAM2: Referring Segment Anything in Surgical Video via Credible Long-term Tracking [15.83425997240828]
ReSurgSAM2 is a two-stage referring surgical segmentation framework. It uses cross-modal spatial-temporal Mamba to generate precise detection and segmentation results. It incorporates a diversity-driven memory mechanism that maintains a credible and diverse memory bank, ensuring consistent long-term tracking.
arXiv Detail & Related papers (2025-05-13T13:56:10Z)
WeakSurg: Weakly supervised surgical instrument segmentation using temporal equivariance and semantic continuity [14.448593791011204]
We propose a weakly supervised surgical instrument segmentation with only instrument presence labels. We take the inherent temporal attributes of surgical video into account and extend a two-stage weakly supervised segmentation paradigm. Experiments are validated on two surgical video datasets, including one cholecystectomy surgery benchmark and one real robotic left lateral segment liver surgery dataset.
arXiv Detail & Related papers (2024-03-14T16:39:11Z)
GS-EMA: Integrating Gradient Surgery Exponential Moving Average with Boundary-Aware Contrastive Learning for Enhanced Domain Generalization in Aneurysm Segmentation [41.97669338211682]
We propose a novel domain generalization strategy that employs a gradient surgery exponential moving average (GS-EMA) optimization technique and boundary-aware contrastive learning (BACL). Our approach is distinct in its ability to adapt to new, unseen domains by learning domain-invariant features, thereby improving the robustness and accuracy of aneurysm segmentation across diverse clinical datasets.
arXiv Detail & Related papers (2024-02-23T10:02:15Z)
Pixel-Wise Recognition for Holistic Surgical Scene Understanding [33.40319680006502]
This paper presents the Holistic and Multi-Granular Surgical Scene Understanding of Prostatectomies dataset. Our benchmark models surgical scene understanding as a hierarchy of complementary tasks with varying levels of granularity. To exploit our proposed benchmark, we introduce the Transformers for Actions, Phases, Steps, and Instrument (TAPIS) model.
arXiv Detail & Related papers (2024-01-20T09:09:52Z)
SAR-RARP50: Segmentation of surgical instrumentation and Action Recognition on Robot-Assisted Radical Prostatectomy Challenge [72.97934765570069]
We release the first multimodal, publicly available, in-vivo dataset for surgical action recognition and semantic instrumentation segmentation, containing 50 suturing video segments of Robotic Assisted Radical Prostatectomy (RARP). The aim of the challenge is to enable researchers to leverage the scale of the provided dataset and develop robust and highly accurate single-task action recognition and tool segmentation approaches in the surgical domain. A total of 12 teams participated in the challenge, contributing 7 action recognition methods, 9 instrument segmentation techniques, and 4 multitask approaches that integrated both action recognition and instrument segmentation.
arXiv Detail & Related papers (2023-12-31T13:32:18Z)
Taxonomy Adaptive Cross-Domain Adaptation in Medical Imaging via Optimization Trajectory Distillation [73.83178465971552]
The success of automated medical image analysis depends on large-scale and expert-annotated training sets. Unsupervised domain adaptation (UDA) has been raised as a promising approach to alleviate the burden of labeled data collection. We propose optimization trajectory distillation, a unified approach to address the two technical challenges from a new perspective.
arXiv Detail & Related papers (2023-07-27T08:58:05Z)
DIR-AS: Decoupling Individual Identification and Temporal Reasoning for Action Segmentation [84.78383981697377]
Fully supervised action segmentation works on frame-wise action recognition with dense annotations and often suffers from the over-segmentation issue. We develop a novel local-global attention mechanism with temporal pyramid dilation and temporal pyramid pooling for efficient multi-scale attention. We achieve state-of-the-art accuracy, e.g., 82.8% (+2.6%) on GTEA and 74.7% (+1.2%) on Breakfast, which demonstrates the effectiveness of our proposed method.
arXiv Detail & Related papers (2023-04-04T20:27:18Z)
CholecTriplet2021: A benchmark challenge for surgical action triplet recognition [66.51610049869393]
This paper presents CholecTriplet 2021: an endoscopic vision challenge organized at MICCAI 2021 for the recognition of surgical action triplets in laparoscopic videos. We present the challenge setup and assessment of the state-of-the-art deep learning methods proposed by the participants during the challenge. A total of 4 baseline methods and 19 new deep learning algorithms are presented to recognize surgical action triplets directly from surgical videos, achieving mean average precision (mAP) ranging from 4.2% to 38.1%.
arXiv Detail & Related papers (2022-04-10T18:51:55Z)
TraSeTR: Track-to-Segment Transformer with Contrastive Query for Instance-level Instrument Segmentation in Robotic Surgery [60.439434751619736]
We propose TraSeTR, a Track-to-Segment Transformer that exploits tracking cues to assist surgical instrument segmentation. TraSeTR jointly reasons about the instrument type, location, and identity with instance-level predictions. The effectiveness of our method is demonstrated with state-of-the-art instrument type segmentation results on three public datasets.
arXiv Detail & Related papers (2022-02-17T05:52:18Z)
MIDeepSeg: Minimally Interactive Segmentation of Unseen Objects from Medical Images Using Deep Learning [15.01235930304888]
We propose a novel deep learning-based interactive segmentation method that has high efficiency due to only requiring clicks as user inputs. Our proposed framework achieves accurate results with fewer user interactions and less time compared with state-of-the-art interactive frameworks.
arXiv Detail & Related papers (2021-04-25T14:15:17Z)
MS-TCN++: Multi-Stage Temporal Convolutional Network for Action Segmentation [87.16030562892537]
We propose a multi-stage architecture for the temporal action segmentation task. The first stage generates an initial prediction that is refined by the next ones. Our models achieve state-of-the-art results on three datasets.
arXiv Detail & Related papers (2020-06-16T14:50:47Z)
On Evaluating Weakly Supervised Action Segmentation Methods [79.42955857919497]
We focus on two aspects of the use and evaluation of weakly supervised action segmentation approaches. We train each method on the Breakfast dataset 5 times and provide average and standard deviation of the results. Our experiments show that the standard deviation over these repetitions is between 1 and 2.5% and significantly affects the comparison between different approaches.
arXiv Detail & Related papers (2020-05-19T20:30:31Z)
Robust Medical Instrument Segmentation Challenge 2019 [56.148440125599905]
Intraoperative tracking of laparoscopic instruments is often a prerequisite for computer and robotic-assisted interventions. Our challenge was based on a surgical data set comprising 10,040 annotated images acquired from a total of 30 surgical procedures. The results confirm the initial hypothesis, namely that algorithm performance degrades with an increasing domain gap.
arXiv Detail & Related papers (2020-03-23T14:35:08Z)

This list is automatically generated from the titles and abstracts of the papers in this site.