CSMAE: Cataract Surgical Masked Autoencoder (MAE) based Pre-training
- URL: http://arxiv.org/abs/2502.08822v1
- Date: Wed, 12 Feb 2025 22:24:49 GMT
- Title: CSMAE: Cataract Surgical Masked Autoencoder (MAE) based Pre-training
- Authors: Nisarg A. Shah, Wele Gedara Chaminda Bandara, Shameema Sikder, S. Swaroop Vedula, Vishal M. Patel
- Abstract summary: We introduce a Masked Autoencoder (MAE)-based pretraining approach, specifically developed for Cataract Surgery video analysis.
Instead of randomly selecting tokens for masking, they are selected based on the spatiotemporal importance of each token.
This approach surpasses current state-of-the-art self-supervised pretraining and adapter-based learning methods by a significant margin.
- Abstract: Automated analysis of surgical videos is crucial for improving surgical training, workflow optimization, and postoperative assessment. We introduce CSMAE, a Masked Autoencoder (MAE)-based pretraining approach developed specifically for cataract surgery video analysis, in which tokens are selected for masking based on their spatiotemporal importance rather than at random. We created a large dataset of cataract surgery videos to improve the model's learning efficiency and expand its robustness in low-data regimes. Our pre-trained model can be easily adapted to specific downstream tasks via fine-tuning, serving as a robust backbone for further analysis. Through rigorous testing on a downstream step-recognition task on two cataract surgery video datasets, D99 and Cataract-101, our approach surpasses current state-of-the-art self-supervised pretraining and adapter-based transfer learning methods by a significant margin. This advancement not only demonstrates the potential of our MAE-based pretraining in the field of surgical video analysis but also sets a new benchmark for future research.
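The importance-driven masking described in the abstract can be sketched as follows. This is a minimal illustration only: the per-token scores and the mask ratio below are hypothetical stand-ins, not the paper's actual spatiotemporal scoring procedure.

```python
def importance_masking(token_scores, mask_ratio=0.75):
    """Keep the highest-scoring tokens visible and mask the rest.

    token_scores: per-token importance values (hypothetical here;
    the paper derives them from spatiotemporal cues).
    Returns (visible_indices, masked_indices), each sorted.
    """
    n = len(token_scores)
    n_visible = max(1, round(n * (1.0 - mask_ratio)))
    # Rank token indices by importance, most important first.
    order = sorted(range(n), key=lambda i: token_scores[i], reverse=True)
    visible = sorted(order[:n_visible])
    masked = sorted(order[n_visible:])
    return visible, masked

# Toy example: 8 tokens, 0.75 mask ratio -> 2 tokens stay visible.
scores = [0.9, 0.1, 0.5, 0.7, 0.2, 0.8, 0.3, 0.4]
visible, masked = importance_masking(scores, mask_ratio=0.75)
# visible -> [0, 5] (the two highest-scoring tokens)
```

In an MAE, only the visible tokens are fed to the encoder, and the decoder reconstructs the masked ones; importance-based selection biases reconstruction toward informative regions instead of uniform random patches.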
Related papers
- SurgMAE: Masked Autoencoders for Long Surgical Video Analysis [4.866110274299399]
Masked autoencoders (MAE) have attracted attention in the self-supervised paradigm for Vision Transformers (ViTs).
In this paper, we first investigate whether MAE can learn transferable representations in the surgical video domain.
We propose SurgMAE, a novel architecture with a masking strategy that samples high-information spatio-temporal tokens for MAE.
arXiv Detail & Related papers (2023-05-19T06:12:50Z)
- AutoLaparo: A New Dataset of Integrated Multi-tasks for Image-guided Surgical Automation in Laparoscopic Hysterectomy [42.20922574566824]
We present and release the first integrated dataset with multiple image-based perception tasks to facilitate learning-based automation in hysterectomy surgery.
Our AutoLaparo dataset is developed based on full-length videos of entire hysterectomy procedures.
Specifically, three different yet highly correlated tasks are formulated in the dataset, including surgical workflow recognition, laparoscope motion prediction, and instrument and key anatomy segmentation.
arXiv Detail & Related papers (2022-08-03T13:17:23Z)
- Revisiting Classifier: Transferring Vision-Language Models for Video Recognition [102.93524173258487]
Transferring knowledge from task-agnostic pre-trained deep models for downstream tasks is an important topic in computer vision research.
In this study, we focus on transferring knowledge for video classification tasks.
We utilize a well-pretrained language model to generate good semantic targets for efficient transfer learning.
arXiv Detail & Related papers (2022-07-04T10:00:47Z)
- Intelligent Masking: Deep Q-Learning for Context Encoding in Medical Image Analysis [48.02011627390706]
We develop a novel self-supervised approach that occludes targeted regions to improve the pre-training procedure.
We show that training the agent against the prediction model can significantly improve the semantic features extracted for downstream classification tasks.
arXiv Detail & Related papers (2022-03-25T19:05:06Z)
- How Transferable Are Self-supervised Features in Medical Image Classification Tasks? [0.7734726150561086]
Transfer learning has become a standard practice to mitigate the lack of labeled data in medical classification tasks.
Self-supervised pretrained models yield richer embeddings than their supervised counterparts.
Dynamic Visual Meta-Embedding (DVME) is an end-to-end transfer learning approach that fuses pretrained embeddings from multiple models.
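The fusion idea behind DVME can be sketched in its simplest form: concatenating feature vectors from several pretrained models into one meta-embedding. This is an illustrative reduction only; the model names and dimensions below are hypothetical, and DVME itself learns the fusion end-to-end rather than using plain concatenation.

```python
def fuse_embeddings(embeddings):
    """Concatenate per-model feature vectors into one meta-embedding.

    embeddings: list of feature vectors produced by different
    pretrained backbones (sources and sizes are illustrative).
    """
    fused = []
    for vec in embeddings:
        fused.extend(vec)
    return fused

e_model_a = [0.1, 0.2]        # hypothetical 2-d features from model A
e_model_b = [0.3, 0.4, 0.5]   # hypothetical 3-d features from model B
meta = fuse_embeddings([e_model_a, e_model_b])
# meta is the 5-d concatenation of both feature vectors
```

A downstream classifier trained on the fused vector can then exploit complementary signals that no single pretrained model captures alone.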
arXiv Detail & Related papers (2021-08-23T10:39:31Z)
- One to Many: Adaptive Instrument Segmentation via Meta Learning and Dynamic Online Adaptation in Robotic Surgical Video [71.43912903508765]
MDAL is a dynamic online adaptive learning scheme for instrument segmentation in robot-assisted surgery.
It learns the general knowledge of instruments and the fast adaptation ability through the video-specific meta-learning paradigm.
It outperforms other state-of-the-art methods on two datasets.
arXiv Detail & Related papers (2021-03-24T05:02:18Z)
- Recurrent and Spiking Modeling of Sparse Surgical Kinematics [0.8458020117487898]
A growing number of studies have used machine learning to analyze video and kinematic data captured from surgical robots.
In this study, we explore whether kinematic data alone can be used to differentiate surgeons of similar skill levels.
We report that it is possible to identify surgical fellows receiving near-perfect scores in the simulation exercises based on their motion characteristics alone.
arXiv Detail & Related papers (2020-05-12T15:41:45Z)
- Self-Training with Improved Regularization for Sample-Efficient Chest X-Ray Classification [80.00316465793702]
We present a deep learning framework that enables robust modeling in challenging scenarios.
Our results show that, using 85% less labeled data, we can build predictive models that match the performance of classifiers trained in a large-scale data setting.
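The self-training loop at the core of such sample-efficient pipelines can be sketched generically: a model trained on the small labeled set pseudo-labels the unlabeled pool, and confident predictions join the training set. The confidence threshold and the toy predictor below are hypothetical, not the paper's configuration.

```python
def self_train_step(labeled, unlabeled, predict, threshold=0.9):
    """One self-training round: move confidently pseudo-labeled
    samples from the unlabeled pool into the labeled pool.

    predict(x) -> (label, confidence); the 0.9 threshold is
    illustrative, not taken from the paper.
    """
    grown = list(labeled)
    remaining = []
    for x in unlabeled:
        label, conf = predict(x)
        if conf >= threshold:
            grown.append((x, label))   # accept pseudo-label
        else:
            remaining.append(x)        # keep for a later round
    return grown, remaining

# Toy "model": confident only far from a decision boundary at 0.
toy_predict = lambda x: (x > 0, 0.95 if abs(x) > 1 else 0.5)
grown, remaining = self_train_step([(2, True)], [3, 0.5, -4], toy_predict)
# 3 and -4 clear the threshold and are pseudo-labeled; 0.5 stays unlabeled
```

Repeating this step over several rounds (with regularization to limit confirmation bias) is what lets a small labeled set approach large-scale performance.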
arXiv Detail & Related papers (2020-05-03T02:36:00Z)
- LRTD: Long-Range Temporal Dependency based Active Learning for Surgical Workflow Recognition [67.86810761677403]
We propose a novel active learning method for cost-effective surgical video analysis.
Specifically, we propose a non-local recurrent convolutional network (NL-RCNet), which introduces non-local block to capture the long-range temporal dependency.
We validate our approach on a large surgical video dataset (Cholec80) by performing surgical workflow recognition task.
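The non-local operation that NL-RCNet uses to capture long-range temporal dependency can be sketched as self-attention over a feature sequence: every time step is updated as a similarity-weighted sum of all time steps. This is a bare embedded-Gaussian-style sketch with identity embeddings, not the network itself.

```python
import math

def non_local_1d(features):
    """Non-local operation over a temporal feature sequence.

    Each position's output is an attention-weighted sum of ALL
    positions, so dependencies are not limited to neighbors.
    features: list of equal-length feature vectors, one per frame.
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    out = []
    for q in features:
        # Similarity of this frame to every frame (self included).
        sims = [dot(q, k) for k in features]
        # Numerically stable softmax over similarities.
        m = max(sims)
        weights = [math.exp(s - m) for s in sims]
        z = sum(weights)
        weights = [w / z for w in weights]
        # Weighted sum of all frames' features.
        fused = [sum(w * v[d] for w, v in zip(weights, features))
                 for d in range(len(q))]
        out.append(fused)
    return out

feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
out = non_local_1d(feats)
# out[0] mixes all three frames, weighted toward the similar frames 0 and 2
```

Because every output attends to every input, a single non-local block links frames arbitrarily far apart, which is the property LRTD exploits when scoring long surgical clips for annotation.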
arXiv Detail & Related papers (2020-04-21T09:21:22Z)
- Automatic Data Augmentation via Deep Reinforcement Learning for Effective Kidney Tumor Segmentation [57.78765460295249]
We develop a novel automatic learning-based data augmentation method for medical image segmentation.
In our method, we innovatively combine the data augmentation module and the subsequent segmentation module in an end-to-end training manner with a consistent loss.
We extensively evaluated our method on CT kidney tumor segmentation which validated the promising results of our method.
arXiv Detail & Related papers (2020-02-22T14:10:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.