SF-TMN: SlowFast Temporal Modeling Network for Surgical Phase
Recognition
- URL: http://arxiv.org/abs/2306.08859v1
- Date: Thu, 15 Jun 2023 05:04:29 GMT
- Title: SF-TMN: SlowFast Temporal Modeling Network for Surgical Phase
Recognition
- Authors: Bokai Zhang, Mohammad Hasan Sarhan, Bharti Goel, Svetlana Petculescu,
Amer Ghanem
- Abstract summary: We propose SlowFast Temporal Modeling Network (SF-TMN) for surgical phase recognition.
It can achieve frame-level full video temporal modeling and segment-level full video temporal modeling.
SF-TMN with ASFormer backbone outperforms the state-of-the-art Not End-to-End (TCN) method by 2.6% in accuracy and 7.4% in the Jaccard score.
- Score: 0.5669790037378094
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automatic surgical phase recognition is one of the key technologies to
support Video-Based Assessment (VBA) systems for surgical education. Utilizing
temporal information is crucial for surgical phase recognition, hence various
recent approaches extract frame-level features to conduct full video temporal
modeling. For better temporal modeling, we propose SlowFast Temporal Modeling
Network (SF-TMN) for surgical phase recognition that can not only achieve
frame-level full video temporal modeling but also achieve segment-level full
video temporal modeling. We employ a feature extraction network, pre-trained on
the target dataset, to extract features from video frames as the training data
for SF-TMN. The Slow Path in SF-TMN utilizes all frame features for frame
temporal modeling. The Fast Path in SF-TMN utilizes segment-level features
summarized from frame features for segment temporal modeling. The proposed
paradigm is flexible regarding the choice of temporal modeling networks. We
explore MS-TCN and ASFormer models as temporal modeling networks and experiment
with multiple combination strategies for Slow and Fast Paths. We evaluate
SF-TMN on the Cholec80 surgical phase recognition task and demonstrate that SF-TMN
can achieve state-of-the-art results on all considered metrics. SF-TMN with
ASFormer backbone outperforms the state-of-the-art Not End-to-End (TCN) method
by 2.6% in accuracy and 7.4% in the Jaccard score. We also evaluate SF-TMN on
action segmentation datasets including 50salads, GTEA, and Breakfast, and
achieve state-of-the-art results. The improvement in the results shows that
combining temporal information from both frame level and segment level by
refining outputs with temporal refinement stages is beneficial for the temporal
modeling of surgical phases.
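The two-path design described above lends itself to a compact illustration. Below is a minimal PyTorch sketch of the SlowFast idea, not the authors' implementation: it assumes single-stage dilated TCNs in place of the MS-TCN/ASFormer backbones, average pooling as the frame-to-segment summarization, linear interpolation to bring the Fast Path back to frame rate, addition as the fusion strategy, and a single refinement stage. All module names and hyperparameters are placeholders; the abstract says multiple combination strategies were explored, so addition is only one plausible choice.

```python
# Minimal sketch of the SlowFast two-path idea from the abstract,
# NOT the authors' implementation. Assumptions (not stated in the abstract):
# single-stage dilated TCNs stand in for the MS-TCN/ASFormer backbones,
# average pooling summarizes frames into segments, linear interpolation
# upsamples segment outputs, and the two paths are fused by addition.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DilatedTCNStage(nn.Module):
    """One dilated temporal-convolution stage (MS-TCN style)."""

    def __init__(self, dim: int, layers: int = 10):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.Conv1d(dim, dim, kernel_size=3, padding=2**i, dilation=2**i)
            for i in range(layers)
        )

    def forward(self, x):  # x: (B, dim, T)
        for conv in self.blocks:
            x = x + F.relu(conv(x))  # residual dilated convolution
        return x


class SFTMNSketch(nn.Module):
    def __init__(self, feat_dim: int, num_phases: int, segment_len: int = 16):
        super().__init__()
        self.segment_len = segment_len
        self.slow_path = DilatedTCNStage(feat_dim)   # models all frame features
        self.fast_path = DilatedTCNStage(feat_dim)   # models segment-level features
        self.classify = nn.Conv1d(feat_dim, num_phases, kernel_size=1)
        self.refine = DilatedTCNStage(num_phases)    # temporal refinement stage

    def forward(self, frame_feats):  # (B, feat_dim, T) from a pre-trained extractor
        T = frame_feats.size(-1)
        slow = self.slow_path(frame_feats)           # frame-level temporal modeling
        # Summarize frames into segments by average pooling, model them,
        # then upsample back to frame rate so the two paths can be fused.
        segments = F.avg_pool1d(frame_feats, self.segment_len, ceil_mode=True)
        fast = self.fast_path(segments)
        fast = F.interpolate(fast, size=T, mode="linear", align_corners=False)
        logits = self.classify(slow + fast)          # fuse and classify per frame
        return self.refine(logits)                   # refine initial predictions


# Usage: a 2-minute clip at 1 fps -> 120 frame features of dimension 2048
# (Cholec80 defines 7 surgical phases).
model = SFTMNSketch(feat_dim=2048, num_phases=7)
out = model(torch.randn(1, 2048, 120))               # (1, 7, 120) phase logits
```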
Related papers
- ConSlide: Asynchronous Hierarchical Interaction Transformer with
Breakup-Reorganize Rehearsal for Continual Whole Slide Image Analysis [24.078490055421852]
Whole slide image (WSI) analysis has become increasingly important in the medical imaging community.
In this paper, we propose the first continual learning framework for WSI analysis, named ConSlide.
arXiv Detail & Related papers (2023-08-25T11:58:25Z)
- Implicit Temporal Modeling with Learnable Alignment for Video Recognition [95.82093301212964]
We propose a novel Implicit Learnable Alignment (ILA) method, which minimizes the temporal modeling effort while achieving incredibly high performance.
ILA achieves a top-1 accuracy of 88.7% on Kinetics-400 with much fewer FLOPs compared with Swin-L and ViViT-H.
arXiv Detail & Related papers (2023-04-20T17:11:01Z)
- Video-SwinUNet: Spatio-temporal Deep Learning Framework for VFSS Instance Segmentation [10.789826145990016]
This paper presents a deep learning framework for medical video segmentation.
Our framework explicitly extracts features from neighbouring frames across the temporal dimension.
It incorporates them with a temporal feature blender, which then tokenises the high-level temporal feature to form a strong global feature encoded via a Swin Transformer.
arXiv Detail & Related papers (2023-02-22T12:09:39Z)
- ViTs for SITS: Vision Transformers for Satellite Image Time Series [52.012084080257544]
We introduce TSViT, a fully-attentional model for general Satellite Image Time Series (SITS) processing based on the Vision Transformer (ViT).
TSViT splits a SITS record into non-overlapping patches in space and time, which are tokenized and subsequently processed by a factorized temporo-spatial encoder; a minimal sketch of this factorization appears after this list.
arXiv Detail & Related papers (2023-01-12T11:33:07Z)
- Video-TransUNet: Temporally Blended Vision Transformer for CT VFSS Instance Segmentation [11.575821326313607]
We propose Video-TransUNet, a deep architecture for segmentation in medical CT videos constructed by integrating temporal feature blending into the TransUNet deep learning framework.
In particular, our approach amalgamates strong frame representation via a ResNet CNN backbone, multi-frame feature blending via a Temporal Context Module, and reconstructive capabilities for multiple targets via a UNet-based convolutional-deconvolutional architecture with multiple heads.
arXiv Detail & Related papers (2022-08-17T14:28:58Z)
- Ultra-low Latency Spiking Neural Networks with Spatio-Temporal Compression and Synaptic Convolutional Block [4.081968050250324]
Spiking neural networks (SNNs) offer spatio-temporal information processing capability, low power consumption, and high biological plausibility.
Neuromorphic datasets such as N-MNIST, CIFAR10-DVS, and DVS128 Gesture require aggregating individual events into frames with a higher temporal resolution for event stream classification.
We propose a spatio-temporal compression method to aggregate individual events into a few time steps of synaptic current, reducing training and inference latency.
arXiv Detail & Related papers (2022-03-18T15:14:13Z)
- Mutual Contrastive Learning to Disentangle Whole Slide Image Representations for Glioma Grading [10.65788461379405]
Whole slide images (WSI) provide valuable phenotypic information for histological malignancy assessment and grading of tumors.
The most commonly used WSI are derived from formalin-fixed paraffin-embedded (FFPE) and frozen sections.
Here we propose a mutual contrastive learning scheme to integrate FFPE and frozen sections and disentangle cross-modality representations for glioma grading.
arXiv Detail & Related papers (2022-03-08T11:08:44Z)
- Slow-Fast Visual Tempo Learning for Video-based Action Recognition [78.3820439082979]
Action visual tempo characterizes the dynamics and the temporal scale of an action.
Previous methods capture the visual tempo either by sampling raw videos with multiple rates, or by hierarchically sampling backbone features.
We propose a Temporal Correlation Module (TCM) that extracts action visual tempo from low-level backbone features within a single layer.
arXiv Detail & Related papers (2022-02-24T14:20:04Z)
- Multi-Scale Semantics-Guided Neural Networks for Efficient Skeleton-Based Human Action Recognition [140.18376685167857]
A simple yet effective multi-scale semantics-guided neural network is proposed for skeleton-based action recognition.
MS-SGN achieves state-of-the-art performance on the NTU60, NTU120, and SYSU datasets.
arXiv Detail & Related papers (2021-11-07T03:50:50Z)
- Efficient Global-Local Memory for Real-time Instrument Segmentation of Robotic Surgical Video [53.14186293442669]
We identify two important clues for surgical instrument perception, including local temporal dependency from adjacent frames and global semantic correlation in long-range duration.
We propose a novel dual-memory network (DMNet) to relate both global and local-temporal knowledge.
Our method largely outperforms the state-of-the-art works on segmentation accuracy while maintaining a real-time speed.
arXiv Detail & Related papers (2021-09-28T10:10:14Z)
- Learn to cycle: Time-consistent feature discovery for action recognition [83.43682368129072]
Generalizing over temporal variations is a prerequisite for effective action recognition in videos.
We introduce Squeeze and Recursion Temporal Gates (SRTG), an approach that favors temporal activations with potential variations.
We show consistent improvement when using SRTG blocks, with only a minimal increase in the number of GFLOPs.
arXiv Detail & Related papers (2020-06-15T09:36:28Z)
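As referenced in the ViTs for SITS entry above, the following is a minimal sketch of a factorized temporo-spatial encoder in the spirit of TSViT, not the authors' code: the patch size, embedding width, encoder depths, and the mean-pooling over time between the two encoders are placeholder assumptions. It only illustrates tokenizing a SITS record into non-overlapping space-time patches, attending across time per spatial location, then attending across locations.

```python
# Minimal sketch of factorized temporo-spatial encoding (TSViT-style),
# not the authors' code. Sizes, depths, and the time-pooling step are
# placeholder assumptions; temporal attention runs first, then spatial,
# matching the "temporo-spatial" order in the summary above.
import torch
import torch.nn as nn


class TemporoSpatialEncoderSketch(nn.Module):
    def __init__(self, bands=10, patch=4, dim=128, heads=4, depth=2):
        super().__init__()
        self.patch = patch
        self.embed = nn.Linear(bands * patch * patch, dim)  # tokenize patches
        make_encoder = lambda: nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True),
            depth,
        )
        self.temporal_encoder = make_encoder()  # attends across time steps
        self.spatial_encoder = make_encoder()   # attends across patch locations

    def forward(self, sits):  # sits: (B, T, bands, H, W)
        B, T, C, H, W = sits.shape
        p = self.patch
        # Split each image into non-overlapping p x p patches and embed them.
        x = sits.unfold(3, p, p).unfold(4, p, p)         # (B,T,C,H/p,W/p,p,p)
        x = x.permute(0, 1, 3, 4, 2, 5, 6).reshape(B, T, -1, C * p * p)
        tokens = self.embed(x)                           # (B, T, N, dim)
        N = tokens.size(2)
        # Temporal attention: each spatial token attends over its own time series.
        t_in = tokens.permute(0, 2, 1, 3).reshape(B * N, T, -1)
        t_out = self.temporal_encoder(t_in).mean(dim=1)  # collapse time
        # Spatial attention over the resulting per-patch tokens.
        return self.spatial_encoder(t_out.reshape(B, N, -1))  # (B, N, dim)


# Usage: 12 acquisitions of a 10-band 32x32 tile -> 64 patch tokens.
feats = TemporoSpatialEncoderSketch()(torch.randn(2, 12, 10, 32, 32))
print(feats.shape)  # torch.Size([2, 64, 128])
```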