SurgMAE: Masked Autoencoders for Long Surgical Video Analysis
- URL: http://arxiv.org/abs/2305.11451v1
- Date: Fri, 19 May 2023 06:12:50 GMT
- Title: SurgMAE: Masked Autoencoders for Long Surgical Video Analysis
- Authors: Muhammad Abdullah Jamal, Omid Mohareri
- Abstract summary: Masked autoencoders (MAE) have gained attention in the self-supervised learning paradigm for Vision Transformers (ViTs).
In this paper, we first investigate whether MAE can learn transferable representations in the surgical video domain.
We propose SurgMAE, a novel architecture with a masking strategy based on sampling high spatio-temporal tokens for MAE.
- Score: 4.866110274299399
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: There has been growing interest in using deep learning models for
processing long surgical videos, in order to automatically detect
clinical/operational activities and extract metrics that can enable workflow
efficiency tools and applications. However, training such models requires vast
amounts of labeled data, which is costly to obtain and does not scale. Recently,
self-supervised learning has been explored in the computer vision community to
reduce the burden of annotation. Masked autoencoders (MAE) have gained
attention in the self-supervised paradigm for Vision Transformers (ViTs) by
predicting randomly masked regions given the visible patches of an image or
a video clip, and have shown superior performance on benchmark datasets.
However, the application of MAE to surgical data remains unexplored. In this
paper, we first investigate whether MAE can learn transferable representations
in the surgical video domain. We propose SurgMAE, a novel architecture
with a masking strategy based on sampling high spatio-temporal tokens for MAE.
We provide an empirical study of SurgMAE on two large-scale long surgical video
datasets, and find that our method outperforms several baselines in the low-data
regime. We conduct extensive ablation studies to show the efficacy of our
approach, and also demonstrate its superior performance on UCF-101 to prove
its generalizability to non-surgical datasets as well.
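Neither the abstract nor this page includes code, but the masked-autoencoding objective described above is easy to sketch. The toy below masks a random subset of video patch tokens, encodes only the visible ones, and reconstructs the masked ones from a learned [MASK] token. It illustrates generic MAE pre-training, not SurgMAE's spatio-temporal token-sampling strategy; all sizes are invented for the example, and positional embeddings are omitted for brevity.

```python
# Minimal sketch of MAE-style pre-training on video patch tokens.
# NOT the authors' implementation: sizes, depths, and the random masking
# are illustrative stand-ins.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyVideoMAE(nn.Module):
    def __init__(self, dim=64, mask_ratio=0.75):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        enc_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        dec_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.decoder = nn.TransformerEncoder(dec_layer, num_layers=1)

    def forward(self, tokens):                       # tokens: (B, N, D)
        B, N, D = tokens.shape
        n_keep = max(1, int(N * (1.0 - self.mask_ratio)))
        perm = torch.rand(B, N, device=tokens.device).argsort(dim=1)
        keep_idx, mask_idx = perm[:, :n_keep], perm[:, n_keep:]
        take = lambda idx: torch.gather(tokens, 1, idx[..., None].expand(-1, -1, D))
        latent = self.encoder(take(keep_idx))        # encode visible tokens only
        mask_tok = self.mask_token.expand(B, N - n_keep, D)
        dec_out = self.decoder(torch.cat([latent, mask_tok], dim=1))
        pred = dec_out[:, n_keep:]                   # predictions at masked slots
        return F.mse_loss(pred, take(mask_idx))      # loss on masked tokens only

loss = ToyVideoMAE()(torch.randn(2, 196, 64))        # 2 clips, 196 tokens each
loss.backward()
```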
Related papers
- Beyond Labels: A Self-Supervised Framework with Masked Autoencoders and Random Cropping for Breast Cancer Subtype Classification [0.3374875022248865]
We learn a self-supervised embedding tailored for computer vision tasks in this domain.
We generate a large dataset from WSIs automatically.
We evaluate our model's performance on the BRACS dataset and compare it with existing benchmarks.
arXiv Detail & Related papers (2024-10-15T19:13:05Z)
- Masked Autoencoders for Microscopy are Scalable Learners of Cellular Biology [2.7280901660033643]
This work explores the scaling properties of weakly supervised classifiers and self-supervised masked autoencoders (MAEs).
Our results show that ViT-based MAEs outperform weakly supervised classifiers on a variety of tasks, achieving as much as an 11.5% relative improvement when recalling known biological relationships curated from public databases.
We develop a new channel-agnostic MAE architecture (CA-MAE) that allows images with different numbers and orderings of channels to be input at inference time.
arXiv Detail & Related papers (2024-04-16T02:42:06Z)
- Efficient Surgical Tool Recognition via HMM-Stabilized Deep Learning [25.146476653453227]
We propose an HMM-stabilized deep learning method for tool presence detection (see the sketch after this entry).
A range of experiments confirm that the proposed approaches achieve better performance with lower training and running costs.
These results suggest that popular deep learning approaches with over-complicated model structures may suffer from inefficient utilization of data.
arXiv Detail & Related papers (2024-04-07T15:27:35Z)
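The summary above does not say how the HMM is coupled to the network, so the following is only a generic illustration of the stabilization idea: treat frame-wise tool-presence probabilities as emissions of a two-state HMM whose sticky transitions penalize spurious single-frame flips, then decode with Viterbi. The transition probability p_stay is a made-up value.

```python
import numpy as np

def viterbi_smooth(frame_probs, p_stay=0.95):
    """Smooth per-frame tool-presence probabilities with a 2-state HMM.

    frame_probs: (T,) array of P(tool visible) from a frame-wise detector.
    p_stay: hypothetical probability of staying in the same state between
            frames; encodes the prior that tool presence changes rarely.
    Returns the most likely 0/1 state sequence (Viterbi decoding).
    """
    T = len(frame_probs)
    emis = np.stack([1.0 - frame_probs, frame_probs], axis=1)      # (T, 2)
    trans = np.array([[p_stay, 1 - p_stay], [1 - p_stay, p_stay]])
    log_emis, log_trans = np.log(emis + 1e-12), np.log(trans)
    score = np.log(0.5) + log_emis[0]                              # uniform prior
    back = np.zeros((T, 2), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + log_trans                          # (prev, next)
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + log_emis[t]
    states = np.zeros(T, dtype=int)
    states[-1] = score.argmax()
    for t in range(T - 1, 0, -1):                                  # backtrack
        states[t - 1] = back[t, states[t]]
    return states

# Noisy detector output: isolated flips get smoothed away.
probs = np.array([0.9, 0.8, 0.2, 0.9, 0.85, 0.1, 0.05, 0.7, 0.1])
print(viterbi_smooth(probs))   # e.g. [1 1 1 1 1 0 0 0 0]
```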
- Self-Supervised Neuron Segmentation with Multi-Agent Reinforcement Learning [53.00683059396803]
Masked image modeling (MIM) has been widely used due to its simplicity and effectiveness in recovering original information from masked images.
We propose a decision-based MIM that utilizes reinforcement learning (RL) to automatically search for the optimal image masking ratio and masking strategy (a simplified sketch follows this entry).
Our approach has a significant advantage over alternative self-supervised methods on the task of neuron segmentation.
arXiv Detail & Related papers (2023-10-06T10:40:46Z)
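The summary gives no detail on the RL formulation, so the sketch below deliberately reduces the masking-ratio search to an epsilon-greedy multi-armed bandit with a synthetic reward. In the paper's setting the reward would instead come from downstream performance (e.g., segmentation quality after pre-training), and the candidate grid here is invented for the example.

```python
import random

# Candidate masking ratios the agent can choose between (hypothetical grid).
RATIOS = [0.5, 0.6, 0.75, 0.9]

def pick_ratio(q_values, eps=0.1):
    """Epsilon-greedy action selection over masking ratios."""
    if random.random() < eps:
        return random.randrange(len(RATIOS))
    return max(range(len(RATIOS)), key=lambda a: q_values[a])

def update(q_values, counts, action, reward):
    """Incremental average: Q <- Q + (r - Q) / n."""
    counts[action] += 1
    q_values[action] += (reward - q_values[action]) / counts[action]

q, n = [0.0] * len(RATIOS), [0] * len(RATIOS)
for step in range(100):
    a = pick_ratio(q)
    # Stand-in reward: a noisy peak at 0.75, purely synthetic.
    reward = -abs(RATIOS[a] - 0.75) + random.gauss(0.0, 0.05)
    update(q, n, a, reward)
print("best masking ratio:", RATIOS[max(range(len(RATIOS)), key=lambda a: q[a])])
```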
- MAE-DFER: Efficient Masked Autoencoder for Self-supervised Dynamic Facial Expression Recognition [47.29528724322795]
MAE-DFER is a novel self-supervised method for learning dynamic facial expressions.
It uses large-scale self-supervised pre-training on abundant unlabeled data.
It consistently outperforms state-of-the-art supervised methods.
arXiv Detail & Related papers (2023-07-05T12:08:56Z)
- AutoLaparo: A New Dataset of Integrated Multi-tasks for Image-guided Surgical Automation in Laparoscopic Hysterectomy [42.20922574566824]
We present and release the first integrated dataset with multiple image-based perception tasks to facilitate learning-based automation in hysterectomy surgery.
Our AutoLaparo dataset is developed based on full-length videos of entire hysterectomy procedures.
Specifically, three different yet highly correlated tasks are formulated in the dataset, including surgical workflow recognition, laparoscope motion prediction, and instrument and key anatomy segmentation.
arXiv Detail & Related papers (2022-08-03T13:17:23Z)
- Dissecting Self-Supervised Learning Methods for Surgical Computer Vision [51.370873913181605]
Self-Supervised Learning (SSL) methods have begun to gain traction in the general computer vision community.
The effectiveness of SSL methods in more complex and impactful domains, such as medicine and surgery, remains limited and unexplored.
We present an extensive analysis of the performance of these methods on the Cholec80 dataset for two fundamental and popular tasks in surgical context understanding, phase recognition and tool presence detection.
arXiv Detail & Related papers (2022-07-01T14:17:11Z)
- One to Many: Adaptive Instrument Segmentation via Meta Learning and Dynamic Online Adaptation in Robotic Surgical Video [71.43912903508765]
MDAL is a dynamic online adaptive learning scheme for instrument segmentation in robot-assisted surgery.
It learns the general knowledge of instruments and the fast adaptation ability through the video-specific meta-learning paradigm.
It outperforms other state-of-the-art methods on two datasets.
arXiv Detail & Related papers (2021-03-24T05:02:18Z)
- Relational Graph Learning on Visual and Kinematics Embeddings for Accurate Gesture Recognition in Robotic Surgery [84.73764603474413]
We propose a novel online multi-modal graph network (MRG-Net) to dynamically integrate visual and kinematics information.
The effectiveness of our method is demonstrated with state-of-the-art results on the public JIGSAWS dataset.
arXiv Detail & Related papers (2020-11-03T11:00:10Z)
- LRTD: Long-Range Temporal Dependency based Active Learning for Surgical Workflow Recognition [67.86810761677403]
We propose a novel active learning method for cost-effective surgical video analysis.
Specifically, we propose a non-local recurrent convolutional network (NL-RCNet), which introduces a non-local block to capture long-range temporal dependency (a generic sketch follows this entry).
We validate our approach on a large surgical video dataset (Cholec80) by performing surgical workflow recognition task.
arXiv Detail & Related papers (2020-04-21T09:21:22Z)
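As a generic illustration of the non-local block mentioned above (in the spirit of Wang et al.'s non-local networks, not NL-RCNet itself), the sketch below applies global self-attention over a sequence of per-clip features so that every time step can attend to every other; all dimensions are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NonLocalTemporalBlock(nn.Module):
    """Generic non-local (self-attention) block over per-clip video features."""
    def __init__(self, dim=256, inner=128):
        super().__init__()
        self.theta = nn.Linear(dim, inner)   # query projection
        self.phi = nn.Linear(dim, inner)     # key projection
        self.g = nn.Linear(dim, inner)       # value projection
        self.out = nn.Linear(inner, dim)     # project back for the residual

    def forward(self, x):                    # x: (B, T, dim) clip features
        q, k, v = self.theta(x), self.phi(x), self.g(x)
        attn = torch.matmul(q, k.transpose(1, 2)) / q.size(-1) ** 0.5
        attn = F.softmax(attn, dim=-1)       # (B, T, T): each clip attends to all
        return x + self.out(torch.matmul(attn, v))   # residual keeps local cues

feats = torch.randn(2, 32, 256)              # 32 clips from one long video
print(NonLocalTemporalBlock()(feats).shape)  # torch.Size([2, 32, 256])
```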
This list is automatically generated from the titles and abstracts of the papers on this site.