Rendezvous: Attention Mechanisms for the Recognition of Surgical Action
Triplets in Endoscopic Videos
- URL: http://arxiv.org/abs/2109.03223v1
- Date: Tue, 7 Sep 2021 17:52:52 GMT
- Title: Rendezvous: Attention Mechanisms for the Recognition of Surgical Action
Triplets in Endoscopic Videos
- Authors: Chinedu Innocent Nwoye, Tong Yu, Cristians Gonzalez, Barbara Seeliger,
Pietro Mascagni, Didier Mutter, Jacques Marescaux, Nicolas Padoy
- Abstract summary: Action triplet recognition stands out as the only one aiming to provide truly fine-grained and comprehensive information on surgical activities.
We introduce our new model, the Rendezvous (RDV), which recognizes triplets directly from surgical videos by leveraging attention at two different levels.
Our proposed RDV model significantly improves the triplet prediction mAP by over 9% compared to the state-of-the-art methods on this dataset.
- Score: 12.725586100227337
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Out of all existing frameworks for surgical workflow analysis in endoscopic
videos, action triplet recognition stands out as the only one aiming to provide
truly fine-grained and comprehensive information on surgical activities. This
information, presented as <instrument, verb, target> combinations, is highly
challenging to identify accurately. Triplet components can be difficult to
recognize individually; the task requires not only recognizing all three
components simultaneously, but also correctly establishing the data
association between them. To achieve this, we
introduce our new model, the Rendezvous (RDV), which recognizes triplets
directly from surgical videos by leveraging attention at two different levels.
We first introduce a new form of spatial attention to capture individual action
triplet components in a scene, called the Class Activation Guided Attention
Mechanism (CAGAM). This technique focuses on the recognition of verbs and
targets using activations resulting from instruments. To solve the association
problem, our RDV model adds a new form of semantic attention inspired by
Transformer networks. Using multiple heads of cross- and self-attention, RDV is
able to effectively capture relationships between instruments, verbs, and
targets. We also introduce CholecT50 - a dataset of 50 endoscopic videos in
which every frame has been annotated with labels from 100 triplet classes. Our
proposed RDV model significantly improves the triplet prediction mAP by over 9%
compared to the state-of-the-art methods on this dataset.
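To make the two attention levels concrete, below is a minimal PyTorch sketch. Every module name, dimension, and the exact way instrument activations guide the verb/target features are illustrative assumptions for the general technique, not the authors' RDV implementation (see the paper for the actual architecture).

```python
# Minimal PyTorch sketch of the two attention levels described above.
# All names, dimensions, and wiring are illustrative assumptions,
# NOT the authors' RDV implementation.
import torch
import torch.nn as nn

class CAGAMSketch(nn.Module):
    """Class-activation-guided spatial attention (CAGAM-style idea):
    instrument class-activation maps act as a spatial prior that
    modulates the features used for verb/target recognition."""
    def __init__(self, in_ch: int, n_instruments: int, n_verbs: int):
        super().__init__()
        self.instrument_cam = nn.Conv2d(in_ch, n_instruments, kernel_size=1)
        self.verb_head = nn.Conv2d(in_ch, n_verbs, kernel_size=1)

    def forward(self, feats: torch.Tensor):
        # feats: (B, C, H, W) backbone feature map
        cams = self.instrument_cam(feats)                      # (B, n_instr, H, W)
        attn = torch.sigmoid(cams.sum(dim=1, keepdim=True))    # (B, 1, H, W) prior
        guided = feats * attn                                  # instrument-guided features
        verb_logits = self.verb_head(guided).amax(dim=(2, 3))  # global max-pool
        instrument_logits = cams.amax(dim=(2, 3))
        return instrument_logits, verb_logits

class TripletAssociationSketch(nn.Module):
    """Transformer-style association: learned triplet queries cross-attend
    over instrument/verb/target component embeddings."""
    def __init__(self, d_model: int = 256, n_heads: int = 8, n_triplets: int = 100):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_triplets, d_model))
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.classifier = nn.Linear(d_model, 1)

    def forward(self, component_tokens: torch.Tensor) -> torch.Tensor:
        # component_tokens: (B, N, d_model) embeddings of the I/V/T components
        b = component_tokens.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)        # (B, n_triplets, d)
        fused, _ = self.cross_attn(q, component_tokens, component_tokens)
        return self.classifier(fused).squeeze(-1)              # (B, n_triplets) logits
```

The point mirrored here is the abstract's two-level design: instrument evidence spatially conditions verb and target recognition, while attention over component embeddings resolves which instrument, verb, and target belong together in a <instrument, verb, target> triplet.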
Related papers
- Surgical Triplet Recognition via Diffusion Model [59.50938852117371]
Surgical triplet recognition is an essential building block to enable next-generation context-aware operating rooms.
We propose Difft, a new generative framework for surgical triplet recognition employing the diffusion model.
Experiments on the CholecT45 and CholecT50 datasets show the superiority of the proposed method in achieving a new state-of-the-art performance for surgical triplet recognition.
arXiv Detail & Related papers (2024-06-19T04:43:41Z)
- Surgical Action Triplet Detection by Mixed Supervised Learning of Instrument-Tissue Interactions [5.033722555649178]
Surgical action triplets describe instrument-tissue interactions as (instrument, verb, target) combinations.
This work focuses on surgical action triplet detection, which is challenging but more precise than the traditional triplet recognition task.
We propose MCIT-IG, a two-stage network, that stands for Multi-Class Instrument-aware Transformer-Interaction Graph.
arXiv Detail & Related papers (2023-07-18T18:47:48Z)
- Rendezvous in Time: An Attention-based Temporal Fusion approach for Surgical Triplet Recognition [5.033722555649178]
One of the recent advances in surgical AI is the recognition of surgical activities as triplets of (instrument, verb, target).
Exploiting the temporal cues from earlier frames would improve the recognition of surgical action triplets from videos.
We propose Rendezvous in Time (RiT) - a deep learning model that extends the state-of-the-art model, Rendezvous, with temporal modeling.
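A generic picture of such temporal fusion (an illustration under assumptions, not RiT's exact architecture): attend over the current frame plus a short buffer of earlier-frame features and pool them into one fused feature.

```python
# Hedged sketch of attention-based temporal fusion over earlier frames;
# the window size and scoring function are assumptions, not RiT's design.
import torch
import torch.nn as nn

class TemporalFusionSketch(nn.Module):
    def __init__(self, d_model: int = 256):
        super().__init__()
        self.score = nn.Linear(d_model, 1)  # per-frame attention score

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (B, T, d) -- current frame plus T-1 earlier frames
        weights = torch.softmax(self.score(frame_feats), dim=1)  # (B, T, 1)
        return (weights * frame_feats).sum(dim=1)                # (B, d) fused
```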
arXiv Detail & Related papers (2022-11-30T13:18:07Z)
- Triple-View Feature Learning for Medical Image Segmentation [9.992387025633805]
TriSegNet is a semi-supervised semantic segmentation framework.
It uses triple-view feature learning on a limited amount of labelled data and a large amount of unlabeled data.
arXiv Detail & Related papers (2022-08-12T14:41:40Z)
- Pseudo-label Guided Cross-video Pixel Contrast for Robotic Surgical Scene Segmentation with Limited Annotations [72.15956198507281]
We propose PGV-CL, a novel pseudo-label guided cross-video contrast learning method to boost scene segmentation.
We extensively evaluate our method on the public robotic surgery dataset EndoVis18 and the public cataract dataset CaDIS.
arXiv Detail & Related papers (2022-07-20T05:42:19Z)
- CholecTriplet2021: A benchmark challenge for surgical action triplet recognition [66.51610049869393]
This paper presents CholecTriplet2021: an endoscopic vision challenge organized at MICCAI 2021 for the recognition of surgical action triplets in laparoscopic videos.
We present the challenge setup and assessment of the state-of-the-art deep learning methods proposed by the participants during the challenge.
A total of 4 baseline methods and 19 new deep learning algorithms are presented to recognize surgical action triplets directly from surgical videos, achieving mean average precision (mAP) ranging from 4.2% to 38.1%.
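For reference, the mAP figures quoted throughout are mean average precision over triplet classes: average precision is computed per class, then averaged. A minimal computation with scikit-learn on made-up toy data (not challenge results):

```python
# Toy multi-label mAP computation; the labels/scores below are made up.
import numpy as np
from sklearn.metrics import average_precision_score

y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0]])                  # frames x triplet classes
y_score = np.array([[0.9, 0.2, 0.4],
                    [0.1, 0.8, 0.3],
                    [0.7, 0.6, 0.2]])           # predicted confidences

# Average precision per triplet class, then the mean over classes (mAP).
ap = [average_precision_score(y_true[:, k], y_score[:, k])
      for k in range(y_true.shape[1])]
print(f"mAP = {np.mean(ap):.3f}")
```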
arXiv Detail & Related papers (2022-04-10T18:51:55Z)
- Joint-bone Fusion Graph Convolutional Network for Semi-supervised Skeleton Action Recognition [65.78703941973183]
We propose a novel correlation-driven joint-bone fusion graph convolutional network (CD-JBF-GCN) as an encoder and use a pose prediction head as a decoder.
Specifically, the CD-JBF-GCN can explore the motion transmission between the joint stream and the bone stream.
The pose prediction based auto-encoder in the self-supervised training stage allows the network to learn motion representation from unlabeled data.
arXiv Detail & Related papers (2022-02-08T16:03:15Z)
- Relational Graph Learning on Visual and Kinematics Embeddings for Accurate Gesture Recognition in Robotic Surgery [84.73764603474413]
We propose a novel online approach of multi-modal graph network (i.e., MRG-Net) to dynamically integrate visual and kinematics information.
The effectiveness of our method is demonstrated with state-of-the-art results on the public JIGSAWS dataset.
arXiv Detail & Related papers (2020-11-03T11:00:10Z)
- Depth Guided Adaptive Meta-Fusion Network for Few-shot Video Recognition [86.31412529187243]
Few-shot video recognition aims at learning new actions with only very few labeled samples.
We propose a depth-guided Adaptive Meta-Fusion Network for few-shot video recognition, termed AMeFu-Net.
arXiv Detail & Related papers (2020-10-20T03:06:20Z)
- Recognition of Instrument-Tissue Interactions in Endoscopic Videos via Action Triplets [9.517537672430006]
We tackle the recognition of fine-grained activities, modeled as action triplets <instrument, verb, target> representing the tool activity.
We introduce a new laparoscopic dataset, CholecT40, consisting of 40 videos from the public dataset Cholec80 in which all frames have been annotated using 128 triplet classes.
arXiv Detail & Related papers (2020-07-10T14:17:10Z)