Surgical Skill Assessment via Video Semantic Aggregation
- URL: http://arxiv.org/abs/2208.02611v1
- Date: Thu, 4 Aug 2022 12:24:01 GMT
- Title: Surgical Skill Assessment via Video Semantic Aggregation
- Authors: Zhenqiang Li, Lin Gu, Weimin Wang, Ryosuke Nakamura, and Yoichi Sato
- Abstract summary: We propose a skill assessment framework, Video Semantic Aggregation (ViSA), which discovers different semantic parts and aggregates them across spatiotemporal dimensions.
The explicit discovery of semantic parts provides an explanatory visualization that helps understand the neural network's decisions.
The experiments on two datasets show the competitiveness of ViSA compared to state-of-the-art methods.
- Score: 20.396898001950156
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Automated video-based assessment of surgical skills is a promising task in
assisting young surgical trainees, especially in resource-poor areas. Existing
works often resort to a CNN-LSTM joint framework that models long-term
relationships by LSTMs on spatially pooled short-term CNN features. However,
this practice would inevitably neglect the difference among semantic concepts
such as tools, tissues, and background in the spatial dimension, impeding the
subsequent temporal relationship modeling. In this paper, we propose a novel
skill assessment framework, Video Semantic Aggregation (ViSA), which discovers
different semantic parts and aggregates them across spatiotemporal dimensions.
The explicit discovery of semantic parts provides an explanatory visualization
that helps understand the neural network's decisions. It also enables us to
further incorporate auxiliary information such as the kinematic data to improve
representation learning and performance. The experiments on two datasets show
the competitiveness of ViSA compared to state-of-the-art methods. Source code
is available at: bit.ly/MICCAI2022ViSA.
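To make the described pipeline concrete, here is a minimal PyTorch-style sketch of a ViSA-like model: instead of globally pooling each frame, per-frame CNN features are softly assigned to a few semantic groups (e.g. tools, tissue, background), each group is summarized over time by its own LSTM, and the summaries are regressed to a skill score. The 1x1-convolution assignment, three groups, layer sizes, and regression head are illustrative assumptions, not the released implementation at bit.ly/MICCAI2022ViSA.

```python
# Illustrative sketch of a ViSA-like pipeline (assumptions throughout;
# not the authors' released code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticGroupingSkillScorer(nn.Module):
    def __init__(self, feat_dim=512, num_parts=3, hidden=256):
        super().__init__()
        # 1x1 conv softly assigns every spatial location to one of
        # `num_parts` semantic groups (e.g. tools / tissue / background).
        self.assign = nn.Conv2d(feat_dim, num_parts, kernel_size=1)
        # One LSTM per semantic part models that part's temporal evolution.
        self.lstms = nn.ModuleList(
            [nn.LSTM(feat_dim, hidden, batch_first=True) for _ in range(num_parts)]
        )
        self.head = nn.Linear(num_parts * hidden, 1)  # final skill score

    def forward(self, feats):
        # feats: (B, T, C, H, W) short-term CNN features for T frames
        B, T, C, H, W = feats.shape
        x = feats.reshape(B * T, C, H, W)
        a = F.softmax(self.assign(x).flatten(2), dim=1)   # (B*T, P, H*W) soft assignment
        v = x.flatten(2)                                   # (B*T, C, H*W)
        # Aggregate features per semantic part instead of pooling the whole frame.
        parts = torch.einsum('bph,bch->bpc', a, v) / (a.sum(-1, keepdim=True) + 1e-6)
        parts = parts.reshape(B, T, -1, C)                 # (B, T, P, C)
        # Temporal modeling per part, then concatenate the final hidden states.
        summaries = []
        for p, lstm in enumerate(self.lstms):
            _, (h_n, _) = lstm(parts[:, :, p, :].contiguous())
            summaries.append(h_n[-1])                      # (B, hidden)
        return self.head(torch.cat(summaries, dim=-1)).squeeze(-1)  # (B,)

# Example: 2 clips of 16 frames with 512-channel 7x7 feature maps.
scores = SemanticGroupingSkillScorer()(torch.randn(2, 16, 512, 7, 7))
```

The explicit per-part assignment maps are also what enables the explanatory visualizations mentioned in the abstract: they can be upsampled and overlaid on the frames to show which regions drive the score.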
Related papers
- Cross-modal Contrastive Learning with Asymmetric Co-attention Network
for Video Moment Retrieval [0.17590081165362778]
Video moment retrieval is a challenging task requiring fine-grained interactions between video and text modalities.
Recent work in image-text pretraining has demonstrated that most existing pretrained models suffer from information asymmetry due to the difference in length between visual and textual sequences.
We question whether the same problem also exists in the video-text domain with an auxiliary need to preserve both spatial and temporal information.
arXiv Detail & Related papers (2023-12-12T17:00:46Z) - Video-SwinUNet: Spatio-temporal Deep Learning Framework for VFSS
Instance Segmentation [10.789826145990016]
This paper presents a deep learning framework for medical video segmentation.
Our framework explicitly extracts features from neighbouring frames across the temporal dimension.
It incorporates them with a temporal feature blender, which then tokenises the high-level temporal feature to form a strong global feature encoded via a Swin Transformer.
arXiv Detail & Related papers (2023-02-22T12:09:39Z) - Temporally Constrained Neural Networks (TCNN): A framework for
semi-supervised video semantic segmentation [5.0754434714665715]
We present Temporally Constrained Neural Networks (TCNN), a semi-supervised framework used for video semantic segmentation of surgical videos.
In this work, we show that autoencoder networks can be used to efficiently provide both spatial and temporal supervisory signals.
We demonstrate that lower-dimensional representations of predicted masks can be leveraged to provide a consistent improvement on both of the sparsely labeled datasets evaluated.
arXiv Detail & Related papers (2021-12-27T18:06:12Z) - Efficient Modelling Across Time of Human Actions and Interactions [92.39082696657874]
We argue that current fixed-sized temporal kernels in 3D convolutional neural networks (CNNs) can be improved to better deal with temporal variations in the input.
We study how we can better handle variations between classes of actions by enhancing their feature differences over different layers of the architecture.
The proposed approaches are evaluated on several benchmark action recognition datasets and show competitive results.
arXiv Detail & Related papers (2021-10-05T15:39:11Z) - SSAN: Separable Self-Attention Network for Video Representation Learning [11.542048296046524]
We propose a separable self-attention (SSA) module, which models spatial and temporal correlations sequentially.
By adding the SSA module into a 2D CNN, we build an SSA network (SSAN) for video representation learning (a minimal sketch of this spatial-then-temporal factorization appears after this list).
Our approach outperforms state-of-the-art methods on Something-Something and Kinetics-400 datasets.
arXiv Detail & Related papers (2021-05-27T10:02:04Z) - A journey in ESN and LSTM visualisations on a language task [77.34726150561087]
We trained ESNs and LSTMs on a Cross-Situational Learning (CSL) task.
The results are of three kinds: performance comparison, internal dynamics analyses and visualization of latent space.
arXiv Detail & Related papers (2020-12-03T08:32:01Z) - Neuro-Symbolic Representations for Video Captioning: A Case for
Leveraging Inductive Biases for Vision and Language [148.0843278195794]
We propose a new model architecture for learning multi-modal neuro-symbolic representations for video captioning.
Our approach uses a dictionary learning-based method of learning relations between videos and their paired text descriptions.
arXiv Detail & Related papers (2020-11-18T20:21:19Z) - Relational Graph Learning on Visual and Kinematics Embeddings for
Accurate Gesture Recognition in Robotic Surgery [84.73764603474413]
We propose a novel online multi-modal graph network approach (MRG-Net) to dynamically integrate visual and kinematics information.
The effectiveness of our method is demonstrated with state-of-the-art results on the public JIGSAWS dataset.
arXiv Detail & Related papers (2020-11-03T11:00:10Z) - IAUnet: Global Context-Aware Feature Learning for Person
Re-Identification [106.50534744965955]
The IAU block enables the features to incorporate global spatial, temporal, and channel context.
It is lightweight, end-to-end trainable, and can be easily plugged into existing CNNs to form IAUnet.
Experiments show that IAUnet performs favorably against state-of-the-art on both image and video reID tasks.
arXiv Detail & Related papers (2020-09-02T13:07:10Z) - LRTD: Long-Range Temporal Dependency based Active Learning for Surgical
Workflow Recognition [67.86810761677403]
We propose a novel active learning method for cost-effective surgical video analysis.
Specifically, we propose a non-local recurrent convolutional network (NL-RCNet), which introduces a non-local block to capture long-range temporal dependency.
We validate our approach on a large surgical video dataset (Cholec80) by performing the surgical workflow recognition task.
arXiv Detail & Related papers (2020-04-21T09:21:22Z)
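As a point of reference for the separable self-attention design summarized in the SSAN entry above, the sketch below applies spatial attention within each frame and then temporal attention across frames at each spatial position. The token layout, embedding size, and head count are assumptions for illustration, not the paper's implementation.

```python
# Sketch of a separable (spatial-then-temporal) self-attention block in the
# spirit of SSAN; shapes and head counts are illustrative assumptions.
import torch
import torch.nn as nn

class SeparableSelfAttention(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):
        # x: (B, T, N, D) -- T frames, N spatial tokens per frame, D channels
        B, T, N, D = x.shape
        # 1) Spatial attention within each frame.
        s = x.reshape(B * T, N, D)
        q = self.norm1(s)
        s = s + self.spatial(q, q, q)[0]
        # 2) Temporal attention across frames at each spatial position.
        t = s.reshape(B, T, N, D).permute(0, 2, 1, 3).reshape(B * N, T, D)
        q = self.norm2(t)
        t = t + self.temporal(q, q, q)[0]
        return t.reshape(B, N, T, D).permute(0, 2, 1, 3)  # back to (B, T, N, D)

# Example: 2 clips, 8 frames, 7x7 = 49 tokens per frame, 256-dim features.
out = SeparableSelfAttention()(torch.randn(2, 8, 49, 256))
```

Factorizing attention this way keeps the cost at roughly O(N^2 + T^2) per token set instead of O((N*T)^2) for joint spatiotemporal attention, which is the main appeal of sequential spatial/temporal modeling.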
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.