EgoExOR: An Ego-Exo-Centric Operating Room Dataset for Surgical Activity Understanding
- URL: http://arxiv.org/abs/2505.24287v1
- Date: Fri, 30 May 2025 07:02:00 GMT
- Title: EgoExOR: An Ego-Exo-Centric Operating Room Dataset for Surgical Activity Understanding
- Authors: Ege Özsoy, Arda Mamur, Felix Tristram, Chantal Pellegrini, Magdalena Wysocki, Benjamin Busam, Nassir Navab
- Abstract summary: EgoExOR is the first operating room (OR) dataset to fuse first-person and third-person perspectives. It integrates egocentric data (RGB, gaze, hand tracking, audio) from wearable glasses, exocentric RGB and depth from RGB-D cameras, and ultrasound imagery. We evaluate the surgical scene graph generation performance of two adapted state-of-the-art models.
- Score: 43.66860935790616
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Operating rooms (ORs) demand precise coordination among surgeons, nurses, and equipment in a fast-paced, occlusion-heavy environment, necessitating advanced perception models to enhance safety and efficiency. Existing datasets either provide partial egocentric views or sparse exocentric multi-view context, but do not explore the comprehensive combination of both. We introduce EgoExOR, the first OR dataset and accompanying benchmark to fuse first-person and third-person perspectives. Spanning 94 minutes (84,553 frames at 15 FPS) of two emulated spine procedures, Ultrasound-Guided Needle Insertion and Minimally Invasive Spine Surgery, EgoExOR integrates egocentric data (RGB, gaze, hand tracking, audio) from wearable glasses, exocentric RGB and depth from RGB-D cameras, and ultrasound imagery. Its detailed scene graph annotations, covering 36 entities and 22 relations (568,235 triplets), enable robust modeling of clinical interactions, supporting tasks like action recognition and human-centric perception. We evaluate the surgical scene graph generation performance of two adapted state-of-the-art models and offer a new baseline that explicitly leverages EgoExOR's multimodal and multi-perspective signals. This new dataset and benchmark set a new foundation for OR perception, offering a rich, multimodal resource for next-generation clinical perception.
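To make the scene graph annotation format concrete, the sketch below shows one way per-frame (subject, relation, object) triplets over the dataset's 36 entities and 22 relations could be represented and queried. The entity and relation names and the dictionary layout are hypothetical placeholders, not the actual EgoExOR schema.

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical entity/relation vocabularies; the real EgoExOR label set
# defines 36 entities and 22 relations (the names here are placeholders).
ENTITIES = ["head_surgeon", "assistant", "patient", "ultrasound_machine", "needle"]
RELATIONS = ["holding", "scanning", "looking_at"]

@dataclass(frozen=True)
class Triplet:
    subject: str   # entity performing the action
    relation: str  # one of the relation classes
    obj: str       # entity acted upon

# Per-frame scene graph: frame index -> list of triplets (illustrative values).
scene_graphs = {
    0: [Triplet("head_surgeon", "holding", "needle"),
        Triplet("assistant", "scanning", "patient")],
    1: [Triplet("head_surgeon", "looking_at", "ultrasound_machine")],
}

# Example query: how often does each relation occur across the clip?
relation_counts = Counter(t.relation for frames in scene_graphs.values() for t in frames)
print(relation_counts)  # Counter({'holding': 1, 'scanning': 1, 'looking_at': 1})
```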
Related papers
- EndoVLA: Dual-Phase Vision-Language-Action Model for Autonomous Tracking in Endoscopy [26.132684811981143]
Vision-Language-Action (VLA) models integrate visual perception, language grounding, and motion planning within an end-to-end framework. EndoVLA performs three core tasks: (1) polyp tracking, (2) delineation and following of abnormal mucosal regions, and (3) adherence to circular markers during circumferential cutting.
arXiv Detail & Related papers (2025-05-21T07:35:00Z) - Towards user-centered interactive medical image segmentation in VR with an assistive AI agent [0.5578116134031106]
We propose SAMIRA, a novel conversational AI agent for medical VR that assists users with localizing, segmenting, and visualizing 3D medical concepts. The system also supports true-to-scale 3D visualization of segmented pathology to enhance patient-specific anatomical understanding. A user study demonstrated a high usability score (SUS = 90.0 ± 9.0), low overall task load, and strong support for the proposed VR system's guidance.
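For reference, the SUS figure quoted above comes from the standard 10-item System Usability Scale questionnaire; a minimal sketch of the generic scoring rule (not the study's actual materials):

```python
def sus_score(responses):
    """Compute the System Usability Scale score from ten 1-5 Likert responses.

    Odd-numbered items are positively worded (contribute response - 1),
    even-numbered items are negatively worded (contribute 5 - response);
    the sum is scaled by 2.5 to give a 0-100 score.
    """
    assert len(responses) == 10 and all(1 <= r <= 5 for r in responses)
    total = sum((r - 1) if i % 2 == 0 else (5 - r)  # i is 0-based: even index = odd item
                for i, r in enumerate(responses))
    return total * 2.5

# A single participant's made-up responses; SUS = 90.0 here by construction.
print(sus_score([5, 1, 5, 4, 5, 1, 5, 2, 5, 1]))
```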
arXiv Detail & Related papers (2025-05-12T03:47:05Z) - MM-OR: A Large Multimodal Operating Room Dataset for Semantic Understanding of High-Intensity Surgical Environments [49.45034796115852]
Operating rooms (ORs) are complex, high-stakes environments requiring precise understanding of interactions among medical staff, tools, and equipment. Current datasets fall short in scale and realism, and do not capture the multimodal nature of OR scenes, limiting multimodal OR modeling. We introduce MM-OR, a realistic and large-scale multimodal OR dataset, and the first dataset to enable multimodal scene graph generation.
arXiv Detail & Related papers (2025-03-04T13:00:52Z) - REMOTE: Real-time Ego-motion Tracking for Various Endoscopes via Multimodal Visual Feature Learning [0.7499722271664147]
A novel framework is proposed to perform real-time ego-motion tracking for endoscopes. A multimodal visual feature learning network is proposed to perform relative pose prediction. The absolute pose of the endoscope is calculated based on the relative poses.
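The final step described above, turning predicted relative poses into an absolute endoscope pose, amounts to chaining homogeneous transforms; a generic sketch under that assumption (not the paper's exact formulation):

```python
import numpy as np

def compose_absolute_poses(relative_poses, initial_pose=None):
    """Chain per-frame relative 4x4 SE(3) transforms into absolute poses.

    relative_poses[i] maps frame i to frame i+1; the absolute pose of frame k
    is the product initial_pose @ T_0 @ T_1 @ ... @ T_{k-1}.
    """
    pose = np.eye(4) if initial_pose is None else initial_pose.copy()
    absolute = [pose.copy()]
    for T in relative_poses:
        pose = pose @ T
        absolute.append(pose.copy())
    return absolute

# Toy example: two identical translations of 1 unit along the optical axis.
step = np.eye(4)
step[2, 3] = 1.0
trajectory = compose_absolute_poses([step, step])
print(trajectory[-1][:3, 3])  # -> [0. 0. 2.]
```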
arXiv Detail & Related papers (2025-01-30T03:58:41Z) - Surgical Triplet Recognition via Diffusion Model [59.50938852117371]
Surgical triplet recognition is an essential building block to enable next-generation context-aware operating rooms.
We propose Difft, a new generative framework for surgical triplet recognition employing the diffusion model.
Experiments on the CholecT45 and CholecT50 datasets show the superiority of the proposed method in achieving a new state-of-the-art performance for surgical triplet recognition.
arXiv Detail & Related papers (2024-06-19T04:43:41Z) - S^2Former-OR: Single-Stage Bi-Modal Transformer for Scene Graph Generation in OR [50.435592120607815]
Scene graph generation (SGG) of surgical procedures is crucial in enhancing holistic cognitive intelligence in the operating room (OR).
Previous works have primarily relied on multi-stage learning, where the generated semantic scene graphs depend on intermediate processes such as pose estimation and object detection.
In this study, we introduce a novel single-stage bi-modal transformer framework for SGG in the OR, termed S2Former-OR.
arXiv Detail & Related papers (2024-02-22T11:40:49Z) - Rendezvous: Attention Mechanisms for the Recognition of Surgical Action Triplets in Endoscopic Videos [12.725586100227337]
Among surgical workflow analysis tasks, action triplet recognition stands out as the only one aiming to provide truly fine-grained and comprehensive information on surgical activities.
We introduce our new model, the Rendezvous (RDV), which recognizes triplets directly from surgical videos by leveraging attention at two different levels.
Our proposed RDV model significantly improves the triplet prediction mAP by over 9% compared to the state-of-the-art methods on this dataset.
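For context on the metric, surgical triplet recognition is typically scored as multi-label classification with mean average precision over triplet classes; a minimal, generic sketch of such an mAP computation (not the official evaluation code):

```python
import numpy as np

def average_precision(scores, labels):
    """AP for one class: mean precision at each correctly retrieved positive."""
    order = np.argsort(-scores)
    labels = labels[order]
    hits = np.cumsum(labels)
    precisions = hits / (np.arange(len(labels)) + 1)
    return precisions[labels == 1].mean() if labels.sum() > 0 else np.nan

def triplet_map(score_matrix, label_matrix):
    """Macro mAP over triplet classes (columns), ignoring classes with no positives."""
    aps = [average_precision(score_matrix[:, c], label_matrix[:, c])
           for c in range(score_matrix.shape[1])]
    return np.nanmean(aps)

# Toy scores/labels for 4 frames x 3 triplet classes.
scores = np.array([[0.9, 0.2, 0.1],
                   [0.8, 0.7, 0.3],
                   [0.1, 0.6, 0.9],
                   [0.2, 0.1, 0.8]])
labels = np.array([[1, 0, 0],
                   [1, 1, 0],
                   [0, 1, 1],
                   [0, 0, 1]])
print(round(triplet_map(scores, labels), 3))  # 1.0 for this perfectly ranked toy case
```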
arXiv Detail & Related papers (2021-09-07T17:52:52Z) - Relational Graph Learning on Visual and Kinematics Embeddings for Accurate Gesture Recognition in Robotic Surgery [84.73764603474413]
We propose a novel online multi-modal graph network (MRG-Net) to dynamically integrate visual and kinematics information.
The effectiveness of our method is demonstrated with state-of-the-art results on the public JIGSAWS dataset.
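As a rough illustration of combining the two modalities, the sketch below fuses per-frame visual and kinematics embeddings with a simple concatenation head; MRG-Net's actual relational graph reasoning is more elaborate, so treat this as a baseline-style stand-in with made-up dimensions:

```python
import torch
import torch.nn as nn

class LateFusionGestureHead(nn.Module):
    """Fuse visual and kinematics embeddings and predict a gesture class.

    A plain concatenation-MLP baseline, not MRG-Net's graph module.
    """
    def __init__(self, visual_dim=512, kin_dim=64, hidden=256, num_gestures=10):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(visual_dim + kin_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_gestures),
        )

    def forward(self, visual_feat, kin_feat):
        return self.fuse(torch.cat([visual_feat, kin_feat], dim=-1))

# One batch of 8 frames with illustrative embedding sizes.
head = LateFusionGestureHead()
logits = head(torch.randn(8, 512), torch.randn(8, 64))
print(logits.shape)  # torch.Size([8, 10])
```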
arXiv Detail & Related papers (2020-11-03T11:00:10Z) - Enhanced Self-Perception in Mixed Reality: Egocentric Arm Segmentation and Database with Automatic Labelling [1.0149624140985476]
This study focuses on the egocentric segmentation of arms to improve self-perception in Augmented Virtuality.
We report results on different real egocentric hand datasets, including GTEA Gaze+, EDSH, EgoHands, Ego Youtube Hands, THU-Read, TEgO, FPAB, and Ego Gesture.
Results confirm the suitability of the EgoArm dataset for this task, achieving improvements of up to 40% with respect to the original network.
arXiv Detail & Related papers (2020-03-27T12:09:27Z) - Robust Medical Instrument Segmentation Challenge 2019 [56.148440125599905]
Intraoperative tracking of laparoscopic instruments is often a prerequisite for computer and robotic-assisted interventions.
Our challenge was based on a surgical data set comprising 10,040 annotated images acquired from a total of 30 surgical procedures.
The results confirm the initial hypothesis, namely that algorithm performance degrades with an increasing domain gap.
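To make the domain-gap comparison concrete, per-domain segmentation quality is commonly summarized with the Dice similarity coefficient; a small, generic sketch (not the challenge's official evaluation code):

```python
import numpy as np

def dice_coefficient(pred_mask, gt_mask, eps=1e-7):
    """Dice similarity coefficient between two binary instrument masks."""
    pred = pred_mask.astype(bool)
    gt = gt_mask.astype(bool)
    intersection = np.logical_and(pred, gt).sum()
    return (2.0 * intersection + eps) / (pred.sum() + gt.sum() + eps)

# Toy comparison: the same predictor scored on an "in-domain" and an
# "out-of-domain" image would typically show a Dice drop as the gap grows.
gt = np.zeros((4, 4), dtype=np.uint8)
gt[1:3, 1:3] = 1
good_pred = gt.copy()                        # perfect overlap
shifted_pred = np.roll(gt, shift=1, axis=1)  # misaligned prediction
print(dice_coefficient(good_pred, gt))       # ~1.0
print(dice_coefficient(shifted_pred, gt))    # 0.5 (half the pixels overlap)
```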
arXiv Detail & Related papers (2020-03-23T14:35:08Z)