Rendezvous in Time: An Attention-based Temporal Fusion approach for
Surgical Triplet Recognition
- URL: http://arxiv.org/abs/2211.16963v2
- Date: Fri, 16 Jun 2023 09:55:05 GMT
- Title: Rendezvous in Time: An Attention-based Temporal Fusion approach for
Surgical Triplet Recognition
- Authors: Saurav Sharma, Chinedu Innocent Nwoye, Didier Mutter, Nicolas Padoy
- Abstract summary: One of the recent advances in surgical AI is the recognition of surgical activities as triplets of (instrument, verb, target)
Exploiting the temporal cues from earlier frames would improve the recognition of surgical action triplets from videos.
We propose Rendezvous in Time (RiT) - a deep learning model that extends the state-of-the-art model, Rendezvous, with temporal modeling.
- Score: 5.033722555649178
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: One of the recent advances in surgical AI is the recognition of surgical
activities as triplets of (instrument, verb, target). Albeit providing detailed
information for computer-assisted intervention, current triplet recognition
approaches rely only on single-frame features. Exploiting the temporal cues
from earlier frames would improve the recognition of surgical action triplets
from videos. In this paper, we propose Rendezvous in Time (RiT) - a deep
learning model that extends the state-of-the-art model, Rendezvous, with
temporal modeling. Focusing more on the verbs, our RiT explores the
connectedness of current and past frames to learn temporal attention-based
features for enhanced triplet recognition. We validate our proposal on the
challenging surgical triplet dataset, CholecT45, demonstrating an improved
recognition of the verb and triplet along with other interactions involving the
verb such as (instrument, verb). Qualitative results show that the RiT produces
smoother predictions for most triplet instances than the state-of-the-art methods. We
present a novel attention-based approach that leverages the temporal fusion of
video frames to model the evolution of surgical actions and exploit their
benefits for surgical triplet recognition.
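As a rough illustration of the attention-based temporal fusion idea described above, the sketch below (PyTorch) fuses the current frame's features with those of a few past frames via multi-head cross-attention before classifying the triplet. The feature dimension, number of past frames, and single classifier head are illustrative assumptions, not the authors' exact RiT architecture.

```python
import torch
import torch.nn as nn

class TemporalAttentionFusion(nn.Module):
    """Minimal sketch: attend from the current frame over past frames to
    build a temporally fused feature for triplet classification.
    Dimensions and the single linear head are assumptions for illustration."""

    def __init__(self, feat_dim=512, num_heads=8, num_classes=100):
        super().__init__()
        self.attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(feat_dim)
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, curr_feat, past_feats):
        # curr_feat:  (B, D)    features of the current frame
        # past_feats: (B, T, D) features of the T preceding frames
        query = curr_feat.unsqueeze(1)                  # (B, 1, D)
        memory = torch.cat([past_feats, query], dim=1)  # (B, T+1, D), include current frame
        fused, _ = self.attn(query, memory, memory)     # cross-attention over time
        fused = self.norm(fused + query)                # residual connection + norm
        return self.classifier(fused.squeeze(1))        # (B, num_classes) triplet logits

# Usage: batch of 2 clips, 3 past frames, 512-d per-frame features
model = TemporalAttentionFusion()
curr = torch.randn(2, 512)
past = torch.randn(2, 3, 512)
logits = model(curr, past)  # shape (2, 100)
```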
Related papers
- Surgical Triplet Recognition via Diffusion Model [59.50938852117371]
Surgical triplet recognition is an essential building block to enable next-generation context-aware operating rooms.
We propose Difft, a new generative framework for surgical triplet recognition employing the diffusion model.
Experiments on the CholecT45 and CholecT50 datasets show the superiority of the proposed method in achieving a new state-of-the-art performance for surgical triplet recognition.
arXiv Detail & Related papers (2024-06-19T04:43:41Z) - Tri-modal Confluence with Temporal Dynamics for Scene Graph Generation in Operating Rooms [47.31847567531981]
We propose a Tri-modal (i.e., images, point clouds, and language) confluence with Temporal dynamics framework, termed TriTemp-OR.
Our model performs temporal interactions across 2D frames and 3D point clouds, including a scale-adaptive multi-view temporal interaction (ViewTemp) and a geometric-temporal point aggregation (PointTemp).
The proposed TriTemp-OR enables the aggregation of tri-modal features through relation-aware unification to predict relations so as to generate scene graphs.
arXiv Detail & Related papers (2024-04-14T12:19:16Z) - GLSFormer: Gated - Long, Short Sequence Transformer for Step
Recognition in Surgical Videos [57.93194315839009]
We propose a vision transformer-based approach to learn temporal features directly from sequence-level patches.
We extensively evaluate our approach on two cataract surgery video datasets, Cataract-101 and D99, and demonstrate superior performance compared to various state-of-the-art methods.
arXiv Detail & Related papers (2023-07-20T17:57:04Z) - CholecTriplet2022: Show me a tool and tell me the triplet -- an
endoscopic vision challenge for surgical action triplet detection [41.66666272822756]
This paper presents the CholecTriplet2022 challenge, which extends surgical action triplet modeling from recognition to detection.
It includes weakly-supervised bounding box localization of every visible surgical instrument (or tool) as the key actors, and the modeling of each tool-activity in the form of an ⟨instrument, verb, target⟩ triplet.
arXiv Detail & Related papers (2023-02-13T11:53:14Z) - CholecTriplet2021: A benchmark challenge for surgical action triplet
recognition [66.51610049869393]
This paper presents CholecTriplet2021: an endoscopic vision challenge organized at MICCAI 2021 for the recognition of surgical action triplets in laparoscopic videos.
We present the challenge setup and assessment of the state-of-the-art deep learning methods proposed by the participants during the challenge.
A total of 4 baseline methods and 19 new deep learning algorithms are presented to recognize surgical action triplets directly from surgical videos, achieving mean average precision (mAP) ranging from 4.2% to 38.1%.
arXiv Detail & Related papers (2022-04-10T18:51:55Z) - Efficient Modelling Across Time of Human Actions and Interactions [92.39082696657874]
We argue that current fixed-sized temporal kernels in 3D convolutional neural networks (CNNs) can be improved to better deal with temporal variations in the input.
We study how we can better handle variations between classes of actions, by enhancing their feature differences over different layers of the architecture.
The proposed approaches are evaluated on several benchmark action recognition datasets and show competitive results.
arXiv Detail & Related papers (2021-10-05T15:39:11Z) - Rendezvous: Attention Mechanisms for the Recognition of Surgical Action
Triplets in Endoscopic Videos [12.725586100227337]
Among surgical workflow analysis tasks in endoscopic videos, action triplet recognition stands out as the only one aiming to provide truly fine-grained and comprehensive information on surgical activities.
We introduce our new model, the Rendezvous (RDV), which recognizes triplets directly from surgical videos by leveraging attention at two different levels.
Our proposed RDV model significantly improves the triplet prediction mAP by over 9% compared to the state-of-the-art methods on the CholecT50 dataset.
arXiv Detail & Related papers (2021-09-07T17:52:52Z) - Relational Graph Learning on Visual and Kinematics Embeddings for
Accurate Gesture Recognition in Robotic Surgery [84.73764603474413]
We propose MRG-Net, a novel online multi-modal graph network that dynamically integrates visual and kinematics information.
The effectiveness of our method is demonstrated with state-of-the-art results on the public JIGSAWS dataset.
arXiv Detail & Related papers (2020-11-03T11:00:10Z)