Scaling up self-supervised learning for improved surgical foundation models
- URL: http://arxiv.org/abs/2501.09436v1
- Date: Thu, 16 Jan 2025 10:07:44 GMT
- Title: Scaling up self-supervised learning for improved surgical foundation models
- Authors: Tim J. M. Jaspers, Ronald L. P. D. de Jong, Yiping Li, Carolus H. J. Kusters, Franciscus H. A. Bakker, Romy C. van Jaarsveld, Gino M. Kuiper, Richard van Hillegersberg, Jelle P. Ruurda, Willem M. Brinkman, Josien P. W. Pluim, Peter H. N. de With, Marcel Breeuwer, Yasmina Al Khalil, Fons van der Sommen
- Abstract summary: This study introduces SurgeNetXL, a novel surgical foundation model that sets a new benchmark in surgical computer vision. SurgeNetXL achieves consistent top-tier performance across six datasets spanning four surgical procedures and three tasks. These findings pave the way for improved generalizability and robustness in data-scarce scenarios.
- Score: 7.188884777849523
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Foundation models have revolutionized computer vision by achieving vastly superior performance across diverse tasks through large-scale pretraining on extensive datasets. However, their application in surgical computer vision has been limited. This study addresses this gap by introducing SurgeNetXL, a novel surgical foundation model that sets a new benchmark in surgical computer vision. Trained on the largest reported surgical dataset to date, comprising over 4.7 million video frames, SurgeNetXL achieves consistent top-tier performance across six datasets spanning four surgical procedures and three tasks: semantic segmentation, phase recognition, and critical view of safety (CVS) classification. Compared with the best-performing surgical foundation models, SurgeNetXL shows mean improvements of 2.4, 9.0, and 12.6 percent for semantic segmentation, phase recognition, and CVS classification, respectively. Additionally, SurgeNetXL outperforms the best-performing ImageNet-based variants by 14.4, 4.0, and 1.6 percent in the respective tasks. Beyond advancing model performance, this study provides key insights into scaling pretraining datasets, extending training durations, and optimizing model architectures specifically for surgical computer vision. These findings pave the way for improved generalizability and robustness in data-scarce scenarios, offering a comprehensive framework for future research in this domain. All models and a subset of the SurgeNetXL dataset, including over 2 million video frames, are publicly available at: https://github.com/TimJaspers0801/SurgeNet.
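To illustrate the intended downstream use, here is a minimal, hypothetical sketch of fine-tuning a self-supervised surgical encoder for CVS classification. The checkpoint filename, backbone choice, and three-criteria multi-label head are assumptions for illustration, not the repository's documented API.

```python
# Hypothetical fine-tuning sketch for CVS classification with a
# self-supervised surgical encoder. Checkpoint name and backbone are
# assumptions; see the SurgeNet repository for the released weights.
import torch
import torch.nn as nn
from torchvision.models import resnet50

encoder = resnet50(weights=None)
state = torch.load("surgenetxl_encoder.pth", map_location="cpu")  # hypothetical file
encoder.load_state_dict(state, strict=False)  # SSL checkpoints usually omit the fc head

# CVS is commonly scored as three binary criteria -> multi-label head + BCE loss.
encoder.fc = nn.Linear(encoder.fc.in_features, 3)
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.AdamW(encoder.parameters(), lr=1e-4)
```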
Related papers
- SurgVLM: A Large Vision-Language Model and Systematic Evaluation Benchmark for Surgical Intelligence [72.10889173696928]
We propose SurgVLM, one of the first large vision-language foundation models for surgical intelligence. We construct a large-scale multimodal surgical database, SurgVLM-DB, spanning more than 16 surgical types and 18 anatomical structures. Building upon this comprehensive dataset, SurgVLM is built on Qwen2.5-VL and undergoes instruction tuning for 10+ surgical tasks.
arXiv Detail & Related papers (2025-06-03T07:44:41Z)
- Attention-Enhanced U-Net for Accurate Segmentation of COVID-19 Infected Lung Regions in CT Scans [0.0]
We propose a robust methodology for automatic segmentation of infected lung regions in COVID-19 CT scans using convolutional neural networks. The approach is based on a modified U-Net architecture enhanced with attention mechanisms, data augmentation, and postprocessing techniques.
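As a concrete reference for the attention mechanism mentioned above, the following is a sketch of one common additive attention gate for U-Net skip connections; the paper's exact module design may differ.

```python
import torch
import torch.nn as nn

class AttentionGate(nn.Module):
    """Additive attention gate for U-Net skip connections (one common
    variant; the paper's exact design may differ)."""
    def __init__(self, gate_ch, skip_ch, inter_ch):
        super().__init__()
        self.w_g = nn.Conv2d(gate_ch, inter_ch, kernel_size=1)
        self.w_x = nn.Conv2d(skip_ch, inter_ch, kernel_size=1)
        self.psi = nn.Sequential(nn.Conv2d(inter_ch, 1, kernel_size=1), nn.Sigmoid())

    def forward(self, gate, skip):
        # gate: decoder features; skip: encoder features at the same resolution
        attn = self.psi(torch.relu(self.w_g(gate) + self.w_x(skip)))
        return skip * attn  # suppress irrelevant regions in the skip path
```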
arXiv Detail & Related papers (2025-05-18T08:27:12Z)
- Surgical Foundation Model Leveraging Compression and Entropy Maximization for Image-Guided Surgical Assistance [50.486523249499115]
Real-time video understanding is critical for guiding procedures in minimally invasive surgery (MIS). We propose Compress-to-Explore (C2E), a novel self-supervised framework to learn compact, informative representations from surgical videos. C2E uses entropy-maximizing decoders to compress images while preserving clinically relevant details, improving encoder performance without labeled data.
arXiv Detail & Related papers (2025-05-16T14:02:24Z)
- SMPLest-X: Ultimate Scaling for Expressive Human Pose and Shape Estimation [81.36747103102459]
Expressive human pose and shape estimation (EHPS) unifies body, hands, and face motion capture with numerous applications.
Current state-of-the-art methods focus on training innovative architectural designs on confined datasets.
We investigate the impact of scaling up EHPS towards a family of generalist foundation models.
arXiv Detail & Related papers (2025-01-16T18:59:46Z)
- An Augmentation-based Model Re-adaptation Framework for Robust Image Segmentation [0.799543372823325]
We propose an Augmentation-based Model Re-adaptation Framework (AMRF) to enhance the generalisation of segmentation models.
By observing segmentation masks from conventional models (FCN and U-Net) and a pre-trained SAM model, we determine a minimal augmentation set that optimally balances training efficiency and model performance.
Our results demonstrate that the fine-tuned FCN surpasses its baseline by 3.29% and 3.02% in cropping accuracy, and 5.27% and 4.04% in classification accuracy on two temporally continuous datasets.
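One plausible reading of the minimal-augmentation-set search is a greedy selection over candidate transforms. The sketch below assumes a user-supplied evaluate() routine that trains and validates with a given augmentation set; the candidate names and stopping threshold are illustrative.

```python
# Hypothetical greedy search for a minimal augmentation set. evaluate(augs)
# is assumed to train a model with the given augmentations and return a
# validation score (e.g., Dice); it is not defined here.
candidates = ["hflip", "rotate", "scale", "brightness", "elastic"]
selected, best = [], evaluate([])
while candidates:
    gains = {a: evaluate(selected + [a]) - best for a in candidates}
    aug, gain = max(gains.items(), key=lambda kv: kv[1])
    if gain < 0.005:       # stop once the marginal gain is negligible
        break
    selected.append(aug)
    best += gain
    candidates.remove(aug)
```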
arXiv Detail & Related papers (2024-09-14T21:01:49Z)
- Handling Geometric Domain Shifts in Semantic Segmentation of Surgical RGB and Hyperspectral Images [67.66644395272075]
We present the first analysis of state-of-the-art semantic segmentation models when faced with geometric out-of-distribution data.
We propose an augmentation technique called "Organ Transplantation" to enhance generalizability.
Our augmentation technique improves state-of-the-art (SOA) model performance by up to 67% for RGB data and 90% for HSI data, reaching the level of in-distribution performance on real OOD test data.
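In spirit, "Organ Transplantation" is a copy-paste style augmentation for segmentation data. A minimal sketch follows; the authors' implementation surely handles blending and anatomical plausibility beyond this.

```python
import torch

def organ_transplant(img_src, mask_src, img_dst, mask_dst, organ_id):
    """Copy-paste style augmentation in the spirit of "Organ Transplantation":
    move one organ's pixels (and labels) from a source image into a target
    image. Tensors: img (C,H,W), mask (H,W) with integer organ labels."""
    organ = mask_src == organ_id
    img_out, mask_out = img_dst.clone(), mask_dst.clone()
    img_out[:, organ] = img_src[:, organ]
    mask_out[organ] = organ_id
    return img_out, mask_out
```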
arXiv Detail & Related papers (2024-08-27T19:13:15Z)
- AMAES: Augmented Masked Autoencoder Pretraining on Public Brain MRI Data for 3D-Native Segmentation [2.0749231618270803]
This study investigates the impact of self-supervised pretraining of 3D semantic segmentation models on a large-scale, domain-specific dataset.
We introduce BRAINS-45K, a dataset of 44,756 brain MRI volumes from public sources, the largest such dataset publicly available.
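A minimal sketch of the masked-autoencoder pretraining idea for 3D volumes is shown below; the patch size, masking ratio, and the augmentation component are assumptions, not AMAES's exact recipe.

```python
import torch

def random_mask(volume, patch=16, ratio=0.6):
    """Minimal MAE-style masking sketch for 3D volumes (details assumed):
    zero out a random subset of non-overlapping patches; the model is then
    trained to reconstruct the masked voxels.
    volume: (C, D, H, W) with spatial dims divisible by `patch`."""
    C, D, H, W = volume.shape
    grid = (D // patch) * (H // patch) * (W // patch)
    keep = torch.rand(grid) > ratio
    mask = keep.reshape(D // patch, H // patch, W // patch)
    mask = (mask.repeat_interleave(patch, 0)
                .repeat_interleave(patch, 1)
                .repeat_interleave(patch, 2))
    return volume * mask  # masked input; (volume, mask) drive the reconstruction loss
```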
arXiv Detail & Related papers (2024-08-01T15:27:48Z)
- Exploring the Effect of Dataset Diversity in Self-Supervised Learning for Surgical Computer Vision [5.782979506525853]
The impact of surgical computer vision remains limited compared to other medical fields like pathology and radiology.
Recent advancements in self-supervised learning (SSL) have demonstrated superior performance.
This study investigates the role of dataset diversity in SSL for surgical computer vision.
arXiv Detail & Related papers (2024-07-25T09:49:04Z)
- An evaluation of pre-trained models for feature extraction in image classification [0.0]
This work aims to compare the performance of different pre-trained neural networks for feature extraction in image classification tasks.
Our results demonstrate that the best general performance across the datasets was achieved by CLIP-ViT-B and ViT-H-14, with the CLIP-ResNet50 model showing similar performance but less variability.
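The evaluation protocol behind this kind of study can be sketched as freezing a pretrained backbone and fitting a light classifier on the extracted features; the specific backbone and classifier below are illustrative, not the paper's exact setup, and `images`/`labels` are assumed to be defined.

```python
# Minimal sketch of frozen-feature evaluation: extract features with a
# pretrained backbone and fit a lightweight classifier on top.
import torch
from torchvision.models import vit_b_16, ViT_B_16_Weights
from sklearn.linear_model import LogisticRegression

weights = ViT_B_16_Weights.IMAGENET1K_V1
model = vit_b_16(weights=weights).eval()
model.heads = torch.nn.Identity()      # keep the 768-d CLS embedding
preprocess = weights.transforms()

with torch.no_grad():
    feats = model(preprocess(images))  # images: (N, 3, H, W) batch, assumed defined
clf = LogisticRegression(max_iter=1000).fit(feats.numpy(), labels)  # labels assumed
```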
arXiv Detail & Related papers (2023-10-03T13:28:14Z)
- SMPLer-X: Scaling Up Expressive Human Pose and Shape Estimation [83.18930314027254]
Expressive human pose and shape estimation (EHPS) unifies body, hands, and face motion capture with numerous applications.
In this work, we investigate scaling up EHPS towards the first generalist foundation model (dubbed SMPLer-X) with up to ViT-Huge as the backbone.
With big data and large models, SMPLer-X exhibits strong performance across diverse test benchmarks and excellent transferability even to unseen environments.
arXiv Detail & Related papers (2023-09-29T17:58:06Z)
- Interpretable 2D Vision Models for 3D Medical Images [47.75089895500738]
This study proposes a simple approach of adapting 2D networks with an intermediate feature representation for processing 3D images.
On all 3D MedMNIST benchmark datasets and two real-world datasets consisting of several hundred high-resolution CT or MRI scans, we show that our approach performs on par with existing methods.
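The general recipe can be sketched as running a 2D backbone over slices and pooling the per-slice features into one volume-level representation; the backbone, pooling choice, and channel handling below are assumptions, as the paper's intermediate representation may differ.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class Slice2DFor3D(nn.Module):
    """Sketch: apply a 2D backbone to axial slices, then average-pool the
    per-slice features into a volume-level prediction (details assumed)."""
    def __init__(self, num_classes):
        super().__init__()
        backbone = resnet18(weights=None)
        backbone.fc = nn.Identity()               # 512-d per-slice features
        self.backbone = backbone
        self.head = nn.Linear(512, num_classes)

    def forward(self, vol):                       # vol: (B, D, H, W)
        B, D, H, W = vol.shape
        slices = vol.reshape(B * D, 1, H, W).expand(-1, 3, -1, -1)
        feats = self.backbone(slices).reshape(B, D, -1)
        return self.head(feats.mean(dim=1))       # pool across slices
```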
arXiv Detail & Related papers (2023-07-13T08:27:09Z)
- STU-Net: Scalable and Transferable Medical Image Segmentation Models Empowered by Large-Scale Supervised Pre-training [43.04882328763337]
We design a series of scalable U-Net (STU-Net) models, with parameter sizes ranging from 14 million to 1.4 billion.
We train our scalable STU-Net models on a large-scale TotalSegmentator dataset and find that increasing model size brings a stronger performance gain.
We observe good performance of our pre-trained model in both direct inference and fine-tuning.
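The scaling recipe can be pictured as a family of U-Net configurations that grow width and depth together; the channel counts, block depths, and builder function below are illustrative assumptions, not STU-Net's published configurations.

```python
# Illustrative U-Net scaling family in the spirit of STU-Net; all numbers
# and build_unet() are assumptions, not the paper's exact configurations.
CONFIGS = {
    "S": {"base_channels": 32,  "blocks_per_stage": 1},   # ~10M params (assumed)
    "B": {"base_channels": 64,  "blocks_per_stage": 2},
    "L": {"base_channels": 128, "blocks_per_stage": 3},
    "H": {"base_channels": 256, "blocks_per_stage": 4},   # ~1B params (assumed)
}

def make_model(size: str):
    return build_unet(**CONFIGS[size])  # build_unet assumed to exist elsewhere
```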
arXiv Detail & Related papers (2023-04-13T17:59:13Z)
- Semantic segmentation of surgical hyperspectral images under geometric domain shifts [69.91792194237212]
We present the first analysis of state-of-the-art semantic segmentation networks in the presence of geometric out-of-distribution (OOD) data.
We also address generalizability with a dedicated augmentation technique termed "Organ Transplantation".
Our scheme improves on the SOA DSC by up to 67% (RGB) and 90% (HSI), rendering performance on real OOD test data on par with in-distribution performance.
arXiv Detail & Related papers (2023-03-20T09:50:07Z)
- Learning Customized Visual Models with Retrieval-Augmented Knowledge [104.05456849611895]
We propose REACT, a framework to acquire the relevant web knowledge to build customized visual models for target domains.
We retrieve the most relevant image-text pairs from the web-scale database as external knowledge, and propose to customize the model by training only new modularized blocks while freezing all the original weights.
The effectiveness of REACT is demonstrated via extensive experiments on classification, retrieval, detection and segmentation tasks, including zero-, few-, and full-shot settings.
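The "train only new modularized blocks" recipe reduces to a simple pattern: freeze every pretrained weight and attach small trainable modules. The adapter design below is a generic stand-in, not REACT's exact blocks.

```python
import torch.nn as nn

def add_trainable_adapters(model, dim, bottleneck=64, n_blocks=4):
    """Sketch of the freeze-then-extend recipe: freeze all pretrained
    parameters, then create small residual adapters that are the only
    trainable part (REACT's exact module design may differ)."""
    for p in model.parameters():
        p.requires_grad = False
    adapters = nn.ModuleList(
        nn.Sequential(nn.Linear(dim, bottleneck), nn.GELU(), nn.Linear(bottleneck, dim))
        for _ in range(n_blocks)
    )
    return adapters  # wire these into the forward pass as residual branches
```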
arXiv Detail & Related papers (2023-01-17T18:59:06Z)
- Dissecting Self-Supervised Learning Methods for Surgical Computer Vision [51.370873913181605]
Self-Supervised Learning (SSL) methods have begun to gain traction in the general computer vision community.
The effectiveness of SSL methods in more complex and impactful domains, such as medicine and surgery, remains limited and largely unexplored.
We present an extensive analysis of the performance of these methods on the Cholec80 dataset for two fundamental and popular tasks in surgical context understanding: phase recognition and tool presence detection.
arXiv Detail & Related papers (2022-07-01T14:17:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.