Temporal Memory Relation Network for Workflow Recognition from Surgical
Video
- URL: http://arxiv.org/abs/2103.16327v1
- Date: Tue, 30 Mar 2021 13:20:26 GMT
- Title: Temporal Memory Relation Network for Workflow Recognition from Surgical
Video
- Authors: Yueming Jin, Yonghao Long, Cheng Chen, Zixu Zhao, Qi Dou, Pheng-Ann
Heng
- Abstract summary: We propose a novel end-to-end temporal memory relation network (TMRNet) for relating long-range and multi-scale temporal patterns.
We have extensively validated our approach on two benchmark surgical video datasets.
- Score: 53.20825496640025
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automatic surgical workflow recognition is a key component for developing
context-aware computer-assisted systems in the operating theatre. Previous
works either jointly modeled the spatial features with short fixed-range
temporal information, or separately learned visual and long temporal cues. In
this paper, we propose a novel end-to-end temporal memory relation network
(TMRNet) for relating long-range and multi-scale temporal patterns to augment
the present features. We establish a long-range memory bank to serve as a
memory cell storing the rich supportive information. Through our designed
temporal variation layer, the supportive cues are further enhanced by
multi-scale temporal-only convolutions. To effectively incorporate the two
types of cues without disturbing the joint learning of spatio-temporal
features, we introduce a non-local bank operator to attentively relate the past
to the present. In this regard, our TMRNet enables the current feature to view
the long-range temporal dependency, as well as tolerate complex temporal
extents. We have extensively validated our approach on two benchmark surgical
video datasets, M2CAI challenge dataset and Cholec80 dataset. Experimental
results demonstrate the outstanding performance of our method, consistently
exceeding the state-of-the-art methods by a large margin (e.g., 67.0% vs.
78.9% Jaccard on the Cholec80 dataset).
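As a rough illustration of the two components described above, the sketch below shows (a) a temporal variation layer that enhances a stored memory bank with multi-scale temporal-only convolutions and (b) a non-local bank operator that attentively relates the present feature to the past. This is a minimal PyTorch-style sketch under assumed tensor shapes and a simplified attention form; it is not the authors' released TMRNet implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TemporalVariationLayer(nn.Module):
    """Enhance a memory bank with multi-scale temporal-only convolutions (sketch)."""

    def __init__(self, feat_dim: int, kernel_sizes=(3, 5, 7)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv1d(feat_dim, feat_dim, k, padding=k // 2) for k in kernel_sizes
        )

    def forward(self, memory_bank: torch.Tensor) -> torch.Tensor:
        # memory_bank: (B, T, C) features of past frames; convolve along T only.
        x = memory_bank.transpose(1, 2)                     # (B, C, T)
        enhanced = sum(F.relu(b(x)) for b in self.branches) / len(self.branches)
        return enhanced.transpose(1, 2)                     # (B, T, C)


class NonLocalBankOperator(nn.Module):
    """Attentively relate the present feature to the long-range memory bank (sketch)."""

    def __init__(self, feat_dim: int, key_dim: int = 256):
        super().__init__()
        self.query_proj = nn.Linear(feat_dim, key_dim)      # present feature -> query
        self.key_proj = nn.Linear(feat_dim, key_dim)        # memory entries -> keys
        self.value_proj = nn.Linear(feat_dim, feat_dim)     # memory entries -> values
        self.out_proj = nn.Linear(feat_dim, feat_dim)
        self.scale = key_dim ** -0.5

    def forward(self, present: torch.Tensor, memory_bank: torch.Tensor) -> torch.Tensor:
        # present: (B, C) spatio-temporal feature of the current clip.
        # memory_bank: (B, T, C) supportive features stored from the past.
        q = self.query_proj(present).unsqueeze(1)           # (B, 1, K)
        k = self.key_proj(memory_bank)                      # (B, T, K)
        v = self.value_proj(memory_bank)                    # (B, T, C)
        attn = F.softmax(q @ k.transpose(1, 2) * self.scale, dim=-1)  # (B, 1, T)
        supportive = (attn @ v).squeeze(1)                  # (B, C) aggregated past cues
        # Augment rather than overwrite the present feature, so the backbone's
        # joint spatio-temporal learning is not disturbed.
        return present + self.out_proj(supportive)


# Toy usage: a 30-step memory bank of 512-d features supporting one present feature.
tvl, op = TemporalVariationLayer(512), NonLocalBankOperator(512)
bank = tvl(torch.randn(2, 30, 512))
fused = op(torch.randn(2, 512), bank)                       # (2, 512)
```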
Related papers
- MeMSVD: Long-Range Temporal Structure Capturing Using Incremental SVD [27.472705540825316]
This paper is on long-term video understanding, where the goal is to recognise human actions over long temporal windows (up to minutes long).
We propose an alternative to attention-based schemes which is based on a low-rank approximation of the memory obtained using Singular Value Decomposition.
Our scheme has two advantages: (a) it reduces complexity by more than an order of magnitude, and (b) it is amenable to an efficient implementation for the calculation of the memory bases. (A minimal sketch of this low-rank memory idea appears after this list.)
arXiv Detail & Related papers (2024-06-11T12:03:57Z) - FuTH-Net: Fusing Temporal Relations and Holistic Features for Aerial
Video Classification [49.06447472006251]
We propose a novel deep neural network, termed FuTH-Net, to model not only holistic features, but also temporal relations for aerial video classification.
Our model is evaluated on two aerial video classification datasets, ERA and Drone-Action, and achieves the state-of-the-art results.
arXiv Detail & Related papers (2022-09-22T21:15:58Z) - Self-Attention Neural Bag-of-Features [103.70855797025689]
We build on the recently introduced 2D-Attention and reformulate the attention learning methodology.
We propose a joint feature-temporal attention mechanism that learns a joint 2D attention mask highlighting relevant information.
arXiv Detail & Related papers (2022-01-26T17:54:14Z) - LSTA-Net: Long short-term Spatio-Temporal Aggregation Network for
Skeleton-based Action Recognition [14.078419675904446]
LSTA-Net: a novel long short-term spatio-temporal aggregation network.
Long/short-term temporal information is not well explored in existing works.
Experiments were conducted on three public benchmark datasets.
arXiv Detail & Related papers (2021-11-01T10:53:35Z) - Efficient Global-Local Memory for Real-time Instrument Segmentation of
Robotic Surgical Video [53.14186293442669]
We identify two important clues for surgical instrument perception, including local temporal dependency from adjacent frames and global semantic correlation in long-range duration.
We propose a novel dual-memory network (DMNet) to relate both global and local temporal knowledge.
Our method largely outperforms the state-of-the-art works on segmentation accuracy while maintaining a real-time speed.
arXiv Detail & Related papers (2021-09-28T10:10:14Z) - Coarse-Fine Networks for Temporal Activity Detection in Videos [45.03545172714305]
We introduce 'Coarse-Fine Networks', a two-stream architecture which benefits from different abstractions of temporal resolution to learn better video representations for long-term motion.
We show that our method can outperform the state-of-the-art methods for action detection on public datasets with a significantly reduced compute and memory footprint.
arXiv Detail & Related papers (2021-03-01T20:48:01Z) - An Enhanced Adversarial Network with Combined Latent Features for
Spatio-Temporal Facial Affect Estimation in the Wild [1.3007851628964147]
This paper proposes a novel model that efficiently extracts both spatial and temporal features of the data by means of its enhanced temporal modelling based on latent features.
Our proposed model consists of three major networks, coined Generator, Discriminator, and Combiner, which are trained in an adversarial setting combined with curriculum learning to enable our adaptive attention modules.
arXiv Detail & Related papers (2021-02-18T04:10:12Z) - Multi-Temporal Convolutions for Human Action Recognition in Videos [83.43682368129072]
We present a novel multi-temporal convolution block that is capable of extracting features at multiple temporal resolutions.
The proposed blocks are lightweight and can be integrated into any 3D-CNN architecture.
arXiv Detail & Related papers (2020-11-08T10:40:26Z) - Sequential Recommender via Time-aware Attentive Memory Network [67.26862011527986]
We propose a temporal gating methodology to improve the attention mechanism and recurrent units.
We also propose a Multi-hop Time-aware Attentive Memory network to integrate long-term and short-term preferences.
Our approach is scalable for candidate retrieval tasks and can be viewed as a non-linear generalization of latent factorization for dot-product based Top-K recommendation.
arXiv Detail & Related papers (2020-05-18T11:29:38Z)
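For the MeMSVD entry above, here is the minimal sketch referenced from that item: a temporal memory matrix is approximated by a truncated SVD so that storage and lookup scale with the rank rather than the window length. The full SVD plus truncation below is an illustrative assumption; the paper itself maintains the decomposition incrementally.

```python
import torch


def lowrank_memory(memory: torch.Tensor, rank: int = 16) -> torch.Tensor:
    """Approximate a (T, C) temporal memory by its top-`rank` SVD components.

    Illustration only: storage drops from O(T * C) to O((T + C) * rank), which is
    the source of the order-of-magnitude complexity reduction claimed above.
    """
    U, S, Vh = torch.linalg.svd(memory, full_matrices=False)  # memory = U diag(S) Vh
    U_r, S_r, Vh_r = U[:, :rank], S[:rank], Vh[:rank, :]       # keep leading components
    return (U_r * S_r) @ Vh_r                                  # rank-`rank` reconstruction


memory = torch.randn(120, 512)             # 120 time steps of 512-d features
approx = lowrank_memory(memory, rank=16)
rel_err = torch.linalg.norm(memory - approx) / torch.linalg.norm(memory)
print(f"relative reconstruction error: {rel_err.item():.3f}")
```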
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.