Multi-view Video-Pose Pretraining for Operating Room Surgical Activity Recognition
- URL: http://arxiv.org/abs/2502.13883v1
- Date: Wed, 19 Feb 2025 17:08:04 GMT
- Title: Multi-view Video-Pose Pretraining for Operating Room Surgical Activity Recognition
- Authors: Idris Hamoud, Vinkle Srivastav, Muhammad Abdullah Jamal, Didier Mutter, Omid Mohareri, Nicolas Padoy
- Abstract summary: Surgical activity recognition is a key computer vision task that detects activities or phases from multi-view camera recordings.
Existing SAR models often fail to account for fine-grained clinician movements and multi-view knowledge.
We propose a novel calibration-free multi-view multi-modal pretraining framework, Multi-view Pretraining for Video-Pose Surgical Activity Recognition (PreViPS).
- Abstract: Understanding the workflow of surgical procedures in complex operating rooms requires a deep understanding of the interactions between clinicians and their environment. Surgical activity recognition (SAR) is a key computer vision task that detects activities or phases from multi-view camera recordings. Existing SAR models often fail to account for fine-grained clinician movements and multi-view knowledge, or they require calibrated multi-view camera setups and advanced point-cloud processing to obtain better results. In this work, we propose a novel calibration-free multi-view multi-modal pretraining framework, Multi-view Pretraining for Video-Pose Surgical Activity Recognition (PreViPS), which aligns 2D pose and vision embeddings across camera views. Our model follows a CLIP-style dual-encoder architecture: one encoder processes visual features, while the other encodes human pose embeddings. To handle continuous 2D human pose coordinates, we introduce a tokenized discrete representation that converts them into discrete pose embeddings, enabling efficient integration within the dual-encoder framework. To bridge the gap between these two modalities, we propose several pretraining objectives that use cross- and in-modality geometric constraints within the embedding space, and we incorporate a masked pose token prediction strategy to enhance representation learning. Extensive experiments and ablation studies demonstrate improvements over strong baselines, while data-efficiency experiments on two distinct operating room datasets further highlight the effectiveness of our approach. We highlight the benefits of our approach for surgical activity recognition in both multi-view and single-view settings, showcasing its practical applicability in complex surgical environments. Code will be made available at: https://github.com/CAMMA-public/PreViPS.
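As a rough illustration of two ingredients described in the abstract, the sketch below quantizes continuous 2D keypoints into discrete token ids and computes a symmetric CLIP-style contrastive loss between vision and pose embeddings. This is a minimal sketch, not the paper's implementation: the bin count, embedding dimension, and temperature are illustrative assumptions, and the paper's additional geometric and masked-token objectives are omitted.

```python
import numpy as np

def tokenize_pose(keypoints, num_bins=64):
    """Quantize continuous 2D keypoints (normalized to [0, 1]) into one
    discrete token id per keypoint. A stand-in for the paper's tokenized
    pose representation; the bin count is an assumption."""
    kp = np.clip(keypoints, 0.0, 1.0 - 1e-6)
    bins = (kp * num_bins).astype(int)             # per-axis bin index
    return bins[..., 0] * num_bins + bins[..., 1]  # flatten (x, y) bins to one id

def _info_nce(logits):
    """Cross-entropy against the diagonal (matched pairs)."""
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    n = logits.shape[0]
    return -log_prob[np.arange(n), np.arange(n)].mean()

def clip_style_loss(vision_emb, pose_emb, temperature=0.07):
    """Symmetric InfoNCE loss aligning vision and pose embeddings,
    in the spirit of a CLIP-style dual-encoder objective."""
    v = vision_emb / np.linalg.norm(vision_emb, axis=1, keepdims=True)
    p = pose_emb / np.linalg.norm(pose_emb, axis=1, keepdims=True)
    logits = v @ p.T / temperature  # pairwise cosine similarities, scaled
    return 0.5 * (_info_nce(logits) + _info_nce(logits.T))
```

In a real pretraining loop, the token ids would feed an embedding layer inside the pose encoder, and the contrastive loss would be minimized jointly across camera views.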
Related papers
- Col-OLHTR: A Novel Framework for Multimodal Online Handwritten Text Recognition
Online Handwritten Text Recognition (OLHTR) has gained considerable attention for its diverse range of applications.
Current approaches usually treat OLHTR as a sequence recognition task, employing either a single trajectory or image encoder, or multi-stream encoders.
We propose a Collaborative learning-based OLHTR framework, called Col-OLHTR, that learns multimodal features during training while maintaining a single-stream inference process.
arXiv Detail & Related papers (2025-02-10T02:12:24Z) - Semantics Disentanglement and Composition for Versatile Codec toward both Human-eye Perception and Machine Vision Task [47.7670923159071]
This study introduces DISCOVER, an innovative semantics DISentanglement and COmposition VERsatile codec that simultaneously enhances human-eye perception and machine vision tasks.
The approach derives a set of labels per task through multimodal large models; grounding models are then applied for precise localization, enabling a comprehensive understanding and disentanglement of image components at the encoder side.
At the decoding stage, a comprehensive reconstruction of the image is achieved by leveraging these encoded components alongside priors from generative models, thereby optimizing performance for both human visual perception and machine-based analytical tasks.
arXiv Detail & Related papers (2024-12-24T04:32:36Z) - SurgPETL: Parameter-Efficient Image-to-Surgical-Video Transfer Learning for Surgical Phase Recognition [9.675072799670458]
The "image pre-training followed by video fine-tuning" paradigm poses significant performance bottlenecks for high-dimensional video data.
In this paper, we develop a parameter-efficient transfer learning benchmark SurgPETL for surgical phase recognition.
We conduct extensive experiments with three advanced methods based on ViTs of two distinct scales pre-trained on five large-scale natural and medical datasets.
arXiv Detail & Related papers (2024-09-30T08:33:50Z) - Self-supervised Learning via Cluster Distance Prediction for Operating Room Context Awareness [44.15562068190958]
In the Operating Room, semantic segmentation is at the core of creating robots aware of clinical surroundings.
State-of-the-art semantic segmentation and activity recognition approaches are fully supervised, which is not scalable.
We propose a new 3D self-supervised task for OR scene understanding utilizing OR scene images captured with ToF cameras.
arXiv Detail & Related papers (2024-07-07T17:17:52Z) - Neuromorphic Synergy for Video Binarization [54.195375576583864]
Bimodal objects serve as a visual form to embed information that can be easily recognized by vision systems.
Neuromorphic cameras offer new capabilities for alleviating motion blur, but it is non-trivial to first de-blur and then binarize the images in a real-time manner.
We propose an event-based binary reconstruction method that leverages the prior knowledge of the bimodal target's properties to perform inference independently in both event space and image space.
We also develop an efficient integration method to propagate this binary image to high frame rate binary video.
arXiv Detail & Related papers (2024-02-20T01:43:51Z) - DVANet: Disentangling View and Action Features for Multi-View Action Recognition
We present a novel approach to multi-view action recognition where we guide learned action representations to be separated from view-relevant information in a video.
Our model and method of training significantly outperforms all other uni-modal models on four multi-view action recognition datasets.
arXiv Detail & Related papers (2023-12-10T01:19:48Z) - Next-generation Surgical Navigation: Marker-less Multi-view 6DoF Pose Estimation of Surgical Instruments
We present a multi-camera capture setup consisting of static and head-mounted cameras.
Second, we publish a multi-view RGB-D video dataset of ex-vivo spine surgeries, captured in a surgical wet lab and a real operating theatre.
Third, we evaluate three state-of-the-art single-view and multi-view methods for the task of 6DoF pose estimation of surgical instruments.
arXiv Detail & Related papers (2023-05-05T13:42:19Z) - Not End-to-End: Explore Multi-Stage Architecture for Online Surgical Phase Recognition
We propose a new non-end-to-end training strategy for the surgical phase recognition task.
In this strategy, the refinement stage is trained separately with two proposed types of disturbed sequences.
We evaluate three different choices of refinement models to show that our analysis and solution are robust to the choices of specific multi-stage models.
arXiv Detail & Related papers (2021-07-10T11:00:38Z) - Relational Graph Learning on Visual and Kinematics Embeddings for Accurate Gesture Recognition in Robotic Surgery
We propose a novel online approach of multi-modal graph network (i.e., MRG-Net) to dynamically integrate visual and kinematics information.
The effectiveness of our method is demonstrated with state-of-the-art results on the public JIGSAWS dataset.
arXiv Detail & Related papers (2020-11-03T11:00:10Z) - A Robotic 3D Perception System for Operating Room Environment Awareness [3.830091185868436]
We describe a 3D multi-view perception system for the da Vinci surgical system to enable Operating room (OR) scene understanding and context awareness.
Based on this architecture, a multi-view 3D scene semantic segmentation algorithm is created.
Our proposed architecture has acceptable registration error ($3.3\% \pm 1.4\%$ of object-camera distance) and can robustly improve scene segmentation performance.
arXiv Detail & Related papers (2020-03-20T20:27:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.