Tri-modal Confluence with Temporal Dynamics for Scene Graph Generation in Operating Rooms
- URL: http://arxiv.org/abs/2404.09231v1
- Date: Sun, 14 Apr 2024 12:19:16 GMT
- Title: Tri-modal Confluence with Temporal Dynamics for Scene Graph Generation in Operating Rooms
- Authors: Diandian Guo, Manxi Lin, Jialun Pei, He Tang, Yueming Jin, Pheng-Ann Heng
- Abstract summary: We propose a Tri-modal (i.e., images, point clouds, and language) confluence with Temporal dynamics framework, termed TriTemp-OR.
Our model performs temporal interactions across 2D frames and 3D point clouds, including a scale-adaptive multi-view temporal interaction (ViewTemp) and a geometric-temporal point aggregation (PointTemp).
The proposed TriTemp-OR enables the aggregation of tri-modal features through relation-aware unification to predict relations so as to generate scene graphs.
- Score: 47.31847567531981
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A comprehensive understanding of surgical scenes allows for monitoring of the surgical process, reducing the occurrence of accidents and enhancing efficiency for medical professionals. Semantic modeling within operating rooms, as a scene graph generation (SGG) task, is challenging since it involves consecutive recognition of subtle surgical actions over prolonged periods. To address this challenge, we propose a Tri-modal (i.e., images, point clouds, and language) confluence with Temporal dynamics framework, termed TriTemp-OR. Diverging from previous approaches that integrated temporal information via memory graphs, our method embraces two advantages: 1) we directly exploit bi-modal temporal information from the video streaming for hierarchical feature interaction, and 2) the prior knowledge from Large Language Models (LLMs) is embedded to alleviate the class-imbalance problem in the operating theatre. Specifically, our model performs temporal interactions across 2D frames and 3D point clouds, including a scale-adaptive multi-view temporal interaction (ViewTemp) and a geometric-temporal point aggregation (PointTemp). Furthermore, we transfer knowledge from the biomedical LLM, LLaVA-Med, to deepen the comprehension of intraoperative relations. The proposed TriTemp-OR enables the aggregation of tri-modal features through relation-aware unification to predict relations so as to generate scene graphs. Experimental results on the 4D-OR benchmark demonstrate the superior performance of our model for long-term OR streaming.
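The abstract describes aggregating image, point-cloud, and language features through relation-aware unification to score relations between entities. The paper's actual architecture is not reproduced here; the following is a minimal NumPy sketch of the general idea — project each modality into a shared space, weight the two visual modalities by their agreement with the language prior, and score relation classes. All dimensions, weights, and the relation-class count are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Illustrative dimensions only; the paper does not publish these sizes here.
D_IMG, D_PTS, D_TXT, D_UNI, N_REL = 32, 16, 24, 64, 14

# Random projections stand in for learned parameters.
W_img = rng.standard_normal((D_IMG, D_UNI)) * 0.1
W_pts = rng.standard_normal((D_PTS, D_UNI)) * 0.1
W_txt = rng.standard_normal((D_TXT, D_UNI)) * 0.1
W_rel = rng.standard_normal((D_UNI, N_REL)) * 0.1

def relation_scores(f_img, f_pts, f_txt):
    """Project each modality into a shared space, gate the two visual
    modalities by their similarity to the language feature, and return
    a distribution over relation classes."""
    u_img = f_img @ W_img
    u_pts = f_pts @ W_pts
    u_txt = f_txt @ W_txt
    # Language-conditioned weights over the two visual modalities.
    gate = softmax(np.stack([u_img @ u_txt, u_pts @ u_txt]))
    unified = gate[0] * u_img + gate[1] * u_pts + u_txt
    return softmax(unified @ W_rel)

probs = relation_scores(rng.standard_normal(D_IMG),
                        rng.standard_normal(D_PTS),
                        rng.standard_normal(D_TXT))
```

The output is one probability distribution per (subject, object) pair; a scene graph is then the set of pairs whose top-scoring relation exceeds a threshold.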
Related papers
- ARN-LSTM: A Multi-Stream Attention-Based Model for Action Recognition with Temporal Dynamics [6.6713480895907855]
ARN-LSTM is a novel action recognition model designed to address the challenge of simultaneously capturing spatial motion and temporal dynamics in action sequences.
Our proposed model integrates joint, motion, and temporal information through a multi-stream fusion architecture.
arXiv Detail & Related papers (2024-11-04T03:29:51Z)
- S^2Former-OR: Single-Stage Bi-Modal Transformer for Scene Graph Generation in OR [50.435592120607815]
Scene graph generation (SGG) of surgical procedures is crucial in enhancing holistically cognitive intelligence in the operating room (OR).
Previous works have primarily relied on multi-stage learning, where the generated semantic scene graphs depend on intermediate processes with pose estimation and object detection.
In this study, we introduce a novel single-stage bi-modal transformer framework for SGG in the OR, termed S2Former-OR.
arXiv Detail & Related papers (2024-02-22T11:40:49Z)
- Mutual Information-driven Triple Interaction Network for Efficient Image Dehazing [54.168567276280505]
We propose a novel Mutual Information-driven Triple interaction Network (MITNet) for image dehazing.
The first stage, named amplitude-guided haze removal, aims to recover the amplitude spectrum of the hazy images for haze removal.
The second stage, named phase-guided structure refinement, is devoted to learning the transformation and refinement of the phase spectrum.
arXiv Detail & Related papers (2023-08-14T08:23:58Z)
- LABRAD-OR: Lightweight Memory Scene Graphs for Accurate Bimodal Reasoning in Dynamic Operating Rooms [39.11134330259464]
Holistic modeling of the operating room (OR) is a challenging but essential task.
We introduce memory scene graphs, where the scene graphs of previous time steps act as the temporal representation guiding the current prediction.
We design an end-to-end architecture that intelligently fuses the temporal information of our lightweight memory scene graphs with the visual information from point clouds and images.
arXiv Detail & Related papers (2023-03-23T14:26:16Z)
- Gait Recognition in the Wild with Multi-hop Temporal Switch [81.35245014397759]
Gait recognition in the wild is a more practical problem that has attracted the attention of the multimedia and computer vision communities.
This paper presents a novel multi-hop temporal switch method to achieve effective temporal modeling of gait patterns in real-world scenes.
arXiv Detail & Related papers (2022-09-01T10:46:09Z)
- Learning Sequence Representations by Non-local Recurrent Neural Memory [61.65105481899744]
We propose a Non-local Recurrent Neural Memory (NRNM) for supervised sequence representation learning.
Our model captures long-range dependencies and distills latent high-level features.
Our model compares favorably against other state-of-the-art methods specifically designed for each of these sequence applications.
arXiv Detail & Related papers (2022-07-20T07:26:15Z)
- ST-MTL: Spatio-Temporal Multitask Learning Model to Predict Scanpath While Tracking Instruments in Robotic Surgery [14.47768738295518]
Learning task-oriented attention while tracking instruments holds vast potential in image-guided robotic surgery.
We propose an end-to-end Multi-Task Learning (ST-MTL) model with a shared encoder and Sink-temporal decoders for the real-time surgical instrument segmentation and task-oriented saliency detection.
We tackle the problem with a novel asynchronous-temporal optimization technique by calculating independent gradients for each decoder.
Compared to state-of-the-art segmentation and saliency methods, our model outperforms them on most evaluation metrics and delivers outstanding performance in challenging scenarios.
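The "asynchronous-temporal optimization" summarized above amounts to computing each decoder's gradients independently and letting only the shared encoder accumulate both backward signals. Below is a minimal NumPy sketch of that idea on a toy linear model; the network, shapes, learning rate, and loss are illustrative assumptions, not the ST-MTL architecture.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-in: shared linear encoder, two linear decoder heads.
D_IN, D_H, D_OUT = 8, 6, 4
W_enc = rng.standard_normal((D_IN, D_H)) * 0.1
W_seg = rng.standard_normal((D_H, D_OUT)) * 0.1   # "segmentation" head
W_sal = rng.standard_normal((D_H, D_OUT)) * 0.1   # "saliency" head

x = rng.standard_normal((5, D_IN))
y_seg = rng.standard_normal((5, D_OUT))
y_sal = rng.standard_normal((5, D_OUT))

def mse(pred, target):
    return float(((pred - target) ** 2).mean())

def head_gradients(W_head, target, h):
    """Gradients of one decoder's MSE loss, computed independently of
    the other decoder (the 'independent gradients' part of the sketch)."""
    pred = h @ W_head
    err = 2.0 * (pred - target) / pred.size
    g_head = h.T @ err           # dL/dW_head
    back = err @ W_head.T        # gradient w.r.t. the encoder output h
    return g_head, back

h = x @ W_enc
loss_before = mse(h @ W_seg, y_seg) + mse(h @ W_sal, y_sal)

g_seg, back_seg = head_gradients(W_seg, y_seg, h)
g_sal, back_sal = head_gradients(W_sal, y_sal, h)

# Each decoder updates from its own gradient; the shared encoder
# accumulates both backward signals.
lr = 0.05
W_seg -= lr * g_seg
W_sal -= lr * g_sal
W_enc -= lr * (x.T @ (back_seg + back_sal))

h2 = x @ W_enc
loss_after = mse(h2 @ W_seg, y_seg) + mse(h2 @ W_sal, y_sal)
```

Decoupling the two heads' gradients lets each task keep its own optimization schedule while the encoder still benefits from both supervisory signals.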
arXiv Detail & Related papers (2021-12-10T15:20:27Z)
- Temporal Memory Relation Network for Workflow Recognition from Surgical Video [53.20825496640025]
We propose a novel end-to-end temporal memory relation network (TMNet) for relating long-range and multi-scale temporal patterns.
We have extensively validated our approach on two benchmark surgical video datasets.
arXiv Detail & Related papers (2021-03-30T13:20:26Z)
- Temporal Graph Modeling for Skeleton-based Action Recognition [25.788239844759246]
We propose a Temporal Enhanced Graph Convolutional Network (TE-GCN) to capture complex temporal dynamics.
The constructed temporal relation graph explicitly builds connections between semantically related temporal features.
Experiments are performed on two widely used large-scale datasets.
arXiv Detail & Related papers (2020-12-16T09:02:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.