Future Slot Prediction for Unsupervised Object Discovery in Surgical Video
- URL: http://arxiv.org/abs/2507.01882v2
- Date: Tue, 08 Jul 2025 13:44:50 GMT
- Title: Future Slot Prediction for Unsupervised Object Discovery in Surgical Video
- Authors: Guiqiu Liao, Matjaz Jogan, Marcel Hussing, Edward Zhang, Eric Eaton, Daniel A. Hashimoto
- Abstract summary: Object-centric slot attention is an emerging paradigm for unsupervised learning of structured, interpretable object-centric representations. Current approaches with an adaptive slot count perform well on images, but their performance on surgical videos is low. We propose a dynamic temporal slot transformer (DTST) module that is trained both for temporal reasoning and for predicting the optimal future slot initialization.
- Score: 10.984331138780682
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Object-centric slot attention is an emerging paradigm for unsupervised learning of structured, interpretable object-centric representations (slots). This enables effective reasoning about objects and events at a low computational cost and is thus applicable to critical healthcare applications, such as real-time interpretation of surgical video. The heterogeneous scenes in real-world applications like surgery are, however, difficult to parse into a meaningful set of slots. Current approaches with an adaptive slot count perform well on images, but their performance on surgical videos is low. To address this challenge, we propose a dynamic temporal slot transformer (DTST) module that is trained both for temporal reasoning and for predicting the optimal future slot initialization. The model achieves state-of-the-art performance on multiple surgical databases, demonstrating that unsupervised object-centric methods can be applied to real-world data and become part of the common arsenal in healthcare applications.
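The abstract builds on slot attention, in which a set of randomly initialized slot vectors iteratively competes for input features via a softmax over slots, so that each slot comes to represent one object. As a rough illustration of that core iteration only, here is a minimal NumPy sketch; it omits the learned projections, LayerNorm, GRU update, and MLP of the full method, and all names and dimensions here are illustrative, not the paper's implementation:

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def slot_attention(inputs, num_slots=4, iters=3, seed=0):
    """Simplified slot-attention forward pass.

    inputs: (n_tokens, dim) array of per-patch features.
    Returns (num_slots, dim) slot vectors. Learned projections,
    LayerNorm, and the GRU slot update are omitted for brevity.
    """
    dim = inputs.shape[1]
    rng = np.random.default_rng(seed)
    slots = rng.normal(size=(num_slots, dim))  # random slot initialization
    for _ in range(iters):
        # Attention logits between every token and every slot; the
        # softmax is over slots, so slots compete for each token.
        attn = softmax(inputs @ slots.T / np.sqrt(dim), axis=1)  # (n, k)
        # Normalize per slot so each slot is a weighted mean of tokens.
        attn = attn / attn.sum(axis=0, keepdims=True)
        slots = attn.T @ inputs  # (k, dim) updated slots
    return slots

feats = np.random.default_rng(1).normal(size=(64, 8))
slots = slot_attention(feats)
print(slots.shape)  # (4, 8)
```

The proposed DTST replaces the random initialization in the first step above with slots predicted from previous frames, which is what makes the slot count and assignments temporally consistent across a video.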
Related papers
- Where It Moves, It Matters: Referring Surgical Instrument Segmentation via Motion [54.359489807885616]
SurgRef is a motion-guided framework that grounds free-form language expressions in instrument motion rather than appearance. To train and evaluate SurgRef, we present Ref-IMotion, a diverse, multi-institutional video dataset with dense temporal masks and rich motion expressions.
arXiv Detail & Related papers (2026-01-18T02:14:08Z) - Surgical Scene Segmentation using a Spike-Driven Video Transformer with Real-Time Potential [26.958261975749974]
We propose SpikeSurgSeg, the first spike-driven video Transformer framework tailored for surgical scene segmentation. SpikeSurgSeg achieves mIoU comparable to SOTA ANN-based models while reducing inference latency by at least 8x.
arXiv Detail & Related papers (2025-12-24T17:05:09Z) - Decoding the Surgical Scene: A Scoping Review of Scene Graphs in Surgery [36.192962258966105]
Scene graphs (SGs) provide structured representations crucial for decoding complex, dynamic surgical environments. This PRISMA-ScR-guided scoping review systematically maps the evolving landscape of SG research in surgery. Our analysis reveals rapid growth, yet uncovers a critical 'data divide'. SGs are maturing into an essential semantic bridge, enabling a new generation of intelligent systems to improve surgical safety, efficiency, and training.
arXiv Detail & Related papers (2025-09-25T09:25:46Z) - Surgical Foundation Model Leveraging Compression and Entropy Maximization for Image-Guided Surgical Assistance [50.486523249499115]
Real-time video understanding is critical to guide procedures in minimally invasive surgery (MIS). We propose Compress-to-Explore (C2E), a novel self-supervised framework to learn compact, informative representations from surgical videos. C2E uses entropy-maximizing decoders to compress images while preserving clinically relevant details, improving encoder performance without labeled data.
arXiv Detail & Related papers (2025-05-16T14:02:24Z) - SASVi - Segment Any Surgical Video [2.330834737588252]
We propose SASVi, a novel re-prompting mechanism based on a frame-wise Mask R-CNN Overseer model. This model automatically re-prompts the foundation model SAM2 when the scene constellation changes.
arXiv Detail & Related papers (2025-02-12T00:29:41Z) - Slot-BERT: Self-supervised Object Discovery in Surgical Video [9.224875902060083]
Slot-BERT scales object discovery seamlessly to long videos of unconstrained lengths. We evaluate Slot-BERT on real-world surgical video datasets from abdominal, cholecystectomy, and thoracic procedures.
arXiv Detail & Related papers (2025-01-21T19:59:22Z) - VISAGE: Video Synthesis using Action Graphs for Surgery [34.21344214645662]
We introduce the novel task of future video generation in laparoscopic surgery.
Our proposed method, VISAGE, leverages the power of action scene graphs to capture the sequential nature of laparoscopic procedures.
Results of our experiments demonstrate high-fidelity video generation for laparoscopic procedures.
arXiv Detail & Related papers (2024-10-23T10:28:17Z) - Procedure-Aware Surgical Video-language Pretraining with Hierarchical Knowledge Augmentation [51.222684687924215]
Surgical video-language pretraining faces unique challenges due to the knowledge domain gap and the scarcity of multi-modal data. We propose a hierarchical knowledge augmentation approach and a novel Procedure-Encoded Surgical Knowledge-Augmented Video-Language Pretraining framework to tackle these issues.
arXiv Detail & Related papers (2024-09-30T22:21:05Z) - Dynamic Scene Graph Representation for Surgical Video [37.22552586793163]
We exploit scene graphs as a more holistic, semantically meaningful and human-readable way to represent surgical videos.
We create a scene graph dataset derived from semantic segmentations of the CaDIS and CATARACTS datasets.
We demonstrate the benefits of surgical scene graphs regarding the explainability and robustness of model decisions.
arXiv Detail & Related papers (2023-09-25T21:28:14Z) - Weakly Supervised YOLO Network for Surgical Instrument Localization in Endoscopic Videos [17.304000735410145]
We propose a weakly supervised localization framework named WS-YOLO for surgical instruments.
By leveraging instrument category information as weak supervision, our WS-YOLO framework adopts an unsupervised multi-round training strategy to learn localization capability.
We validate our WS-YOLO framework on the Endoscopic Vision Challenge 2023 dataset, where it achieves remarkable performance in weakly supervised surgical instrument localization.
arXiv Detail & Related papers (2023-09-23T15:28:53Z) - Learning Multi-modal Representations by Watching Hundreds of Surgical Video Lectures [50.09187683845788]
Recent advancements in surgical computer vision applications have been driven by vision-only models. These methods rely on manually annotated surgical videos to predict a fixed set of object categories. In this work, we put forward the idea that surgical video lectures available through open surgical e-learning platforms can provide effective vision and language supervisory signals.
arXiv Detail & Related papers (2023-07-27T22:38:12Z) - Neural LerPlane Representations for Fast 4D Reconstruction of Deformable Tissues [52.886545681833596]
LerPlane is a novel method for fast and accurate reconstruction of surgical scenes under a single-viewpoint setting.
LerPlane treats surgical procedures as 4D volumes and factorizes them into explicit 2D planes of static and dynamic fields.
LerPlane shares static fields, significantly reducing the workload of dynamic tissue modeling.
arXiv Detail & Related papers (2023-05-31T14:38:35Z) - Intuitive Surgical SurgToolLoc Challenge Results: 2022-2023 [55.40111320730479]
We have challenged the surgical data science community to solve difficult machine learning problems in the context of advanced RA applications. Here we document the results of these challenges, focusing on surgical tool localization (SurgToolLoc). The publicly released dataset that accompanies these challenges is detailed in a separate paper, arXiv:2501.09209.
arXiv Detail & Related papers (2023-05-11T21:44:39Z) - E-DSSR: Efficient Dynamic Surgical Scene Reconstruction with Transformer-based Stereoscopic Depth Perception [15.927060244702686]
We present an efficient reconstruction pipeline for highly dynamic surgical scenes that runs at 28 fps.
Specifically, we design a transformer-based stereoscopic depth perception for efficient depth estimation.
We evaluate the proposed pipeline on two datasets, the public Hamlyn Centre Endoscopic Video dataset and our in-house DaVinci robotic surgery dataset.
arXiv Detail & Related papers (2021-07-01T05:57:41Z) - Multimodal Semantic Scene Graphs for Holistic Modeling of Surgical Procedures [70.69948035469467]
We take advantage of the latest computer vision methodologies for generating 3D graphs from camera views.
We then introduce the Multimodal Semantic Graph Scene (MSSG) which aims at providing unified symbolic and semantic representation of surgical procedures.
arXiv Detail & Related papers (2021-06-09T14:35:44Z) - Towards Unsupervised Learning for Instrument Segmentation in Robotic Surgery with Cycle-Consistent Adversarial Networks [54.00217496410142]
We propose an unpaired image-to-image translation where the goal is to learn the mapping between an input endoscopic image and a corresponding annotation.
Our approach allows training image segmentation models without the need to acquire expensive annotations.
We test our proposed method on the EndoVis 2017 challenge dataset and show that it is competitive with supervised segmentation methods.
arXiv Detail & Related papers (2020-07-09T01:39:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.