Surgical Foundation Model Leveraging Compression and Entropy Maximization for Image-Guided Surgical Assistance
- URL: http://arxiv.org/abs/2506.01980v1
- Date: Fri, 16 May 2025 14:02:24 GMT
- Title: Surgical Foundation Model Leveraging Compression and Entropy Maximization for Image-Guided Surgical Assistance
- Authors: Lianhao Yin, Ozanan Meireles, Guy Rosman, Daniela Rus
- Abstract summary: Real-time video understanding is critical to guide procedures in minimally invasive surgery (MIS). We propose Compress-to-Explore (C2E), a novel self-supervised framework to learn compact, informative representations from surgical videos. C2E uses entropy-maximizing decoders to compress images while preserving clinically relevant details, improving encoder performance without labeled data.
- Score: 50.486523249499115
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Real-time video understanding is critical to guide procedures in minimally invasive surgery (MIS). However, supervised learning approaches require large, annotated datasets, which are scarce because annotation is prohibitively labor-intensive, e.g., in medical fields. Although self-supervision can address such limitations, current self-supervised methods often fail to capture structural and physical information in a form that generalizes across tasks. We propose Compress-to-Explore (C2E), a novel self-supervised framework that leverages Kolmogorov complexity to learn compact, informative representations from surgical videos. C2E uses entropy-maximizing decoders to compress images while preserving clinically relevant details, improving encoder performance without labeled data. Trained on large-scale unlabeled surgical datasets, C2E demonstrates strong generalization across a variety of surgical ML tasks, such as workflow classification, tool-tissue interaction classification, segmentation, and diagnosis, providing improved performance as a surgical visual foundation model. As we further show in the paper, the model's internal compact representation better disentangles features from different structural parts of images. The resulting performance improvements highlight the yet untapped potential of self-supervised learning to enhance surgical AI and improve outcomes in MIS.
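As a rough illustration of the compress-to-explore idea, the sketch below pairs a small encoder-decoder with a reconstruction term and an entropy term on the latent code. This is a minimal sketch under stated assumptions: the architecture sizes, the log-determinant covariance proxy for entropy, and the weighting `lam` are illustrative choices, not the C2E implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    def __init__(self, latent_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, latent_dim),
        )

    def forward(self, x):
        return self.net(x)

class Decoder(nn.Module):
    def __init__(self, latent_dim=128):
        super().__init__()
        self.fc = nn.Linear(latent_dim, 64 * 8 * 8)
        self.net = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 3, 4, stride=2, padding=1),
        )

    def forward(self, z):
        return self.net(self.fc(z).view(-1, 64, 8, 8))

def batch_entropy_proxy(z, eps=1e-4):
    """Log-determinant of the batch covariance: a common differentiable
    stand-in (an assumption here) for the entropy of the latent code."""
    z = z - z.mean(dim=0, keepdim=True)
    cov = (z.T @ z) / (z.shape[0] - 1)
    cov = cov + eps * torch.eye(cov.shape[0], device=z.device)
    return torch.logdet(cov)

def c2e_style_loss(x, encoder, decoder, lam=0.01):
    z = encoder(x)                       # compress the image into a compact code
    recon = F.mse_loss(decoder(z), x)    # reconstruct clinically relevant detail
    entropy = batch_entropy_proxy(z)     # keep the code informative
    return recon - lam * entropy         # minimize error, maximize entropy

# usage: 64x64 crops so the decoder output matches the input resolution
x = torch.randn(16, 3, 64, 64)
print(c2e_style_loss(x, Encoder(), Decoder()))
```

In C2E itself the encoder is a full-scale backbone trained on large unlabeled surgical video collections; the sketch only conveys the shape of the objective, i.e., reconstruct faithfully while keeping the compressed code as informative as possible.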
Related papers
- CAT-SG: A Large Dynamic Scene Graph Dataset for Fine-Grained Understanding of Cataract Surgery [38.59801869721841]
This paper introduces the Cataract Surgery Scene Graph (CAT-SG) dataset, the first to provide structured annotations of tool-tissue interactions, procedural variations, and temporal variations. By incorporating detailed semantic relations, CAT-SG offers a holistic view of surgical dependencies, enabling more accurate recognition of surgical phases and techniques.
arXiv Detail & Related papers (2025-06-26T23:25:23Z)
- SurgVidLM: Towards Multi-grained Surgical Video Understanding with Large Language Model [55.13206879750197]
SurgVidLM is the first video language model designed to address both full and fine-grained surgical video comprehension. We introduce the StageFocus mechanism, a two-stage framework performing multi-grained, progressive understanding of surgical videos. Experimental results demonstrate that SurgVidLM significantly outperforms state-of-the-art Vid-LLMs in both full and fine-grained video understanding tasks.
arXiv Detail & Related papers (2025-06-22T02:16:18Z)
- Surgeons vs. Computer Vision: A comparative analysis on surgical phase recognition capabilities [65.66373425605278]
Automated Surgical Phase Recognition (SPR) uses Artificial Intelligence (AI) to segment the surgical workflow into its key events. Previous research has focused on short and linear surgical procedures and has not explored whether temporal context influences experts' ability to better classify surgical phases. This research addresses these gaps, focusing on Robot-Assisted Partial Nephrectomy (RAPN) as a highly non-linear procedure.
arXiv Detail & Related papers (2025-04-26T15:37:22Z)
- OphCLIP: Hierarchical Retrieval-Augmented Learning for Ophthalmic Surgical Video-Language Pretraining [60.75854609803651]
OphCLIP is a hierarchical retrieval-augmented vision-language pretraining framework for ophthalmic surgical workflow understanding. OphCLIP learns both fine-grained and long-term visual representations by aligning short video clips with detailed narrative descriptions and full videos with structured titles. Our OphCLIP also designs a retrieval-augmented pretraining framework to leverage the underexplored large-scale silent surgical procedure videos.
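OphCLIP's alignment of short clips with narrative descriptions builds on CLIP-style contrastive pretraining. The snippet below is a generic symmetric InfoNCE loss over paired clip and text embeddings; the embedding size, temperature, and placeholder embeddings are assumptions for illustration, not OphCLIP's hierarchical retrieval-augmented pipeline.

```python
import torch
import torch.nn.functional as F

def clip_style_alignment_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired (video clip, narration) embeddings."""
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature                 # cosine similarity matrix
    targets = torch.arange(v.shape[0], device=v.device)
    loss_v2t = F.cross_entropy(logits, targets)    # clip -> matching narration
    loss_t2v = F.cross_entropy(logits.T, targets)  # narration -> matching clip
    return 0.5 * (loss_v2t + loss_t2v)

# usage: embeddings would normally come from a video encoder and a text encoder
video_emb = torch.randn(8, 512)
text_emb = torch.randn(8, 512)
print(clip_style_alignment_loss(video_emb, text_emb))
```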
arXiv Detail & Related papers (2024-11-23T02:53:08Z)
- Efficient Surgical Tool Recognition via HMM-Stabilized Deep Learning [25.146476653453227]
We propose an HMM-stabilized deep learning method for tool presence detection.
A range of experiments confirm that the proposed approaches achieve better performance with lower training and running costs.
These results suggest that popular deep learning approaches with over-complicated model structures may suffer from inefficient utilization of data.
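The stabilization idea can be pictured as smoothing noisy per-frame detector scores with a small HMM so that predicted tool presence does not flicker from frame to frame. Below is a minimal two-state Viterbi decoder for a single tool; the transition probability and the binary-state setup are illustrative assumptions rather than the paper's actual model.

```python
import numpy as np

def viterbi_smooth(frame_probs, stay_prob=0.95):
    """frame_probs: (T,) per-frame probability that a tool is present.
    Returns the most likely absent(0)/present(1) sequence under a 2-state HMM."""
    T = len(frame_probs)
    emission = np.stack([1.0 - frame_probs, frame_probs], axis=1)  # (T, 2)
    log_emit = np.log(emission + 1e-9)
    log_trans = np.log(np.array([[stay_prob, 1 - stay_prob],
                                 [1 - stay_prob, stay_prob]]))
    log_delta = np.log(np.array([0.5, 0.5])) + log_emit[0]
    back = np.zeros((T, 2), dtype=int)
    for t in range(1, T):
        scores = log_delta[:, None] + log_trans     # (prev_state, state)
        back[t] = scores.argmax(axis=0)
        log_delta = scores.max(axis=0) + log_emit[t]
    states = np.zeros(T, dtype=int)
    states[-1] = log_delta.argmax()
    for t in range(T - 2, -1, -1):                  # backtrack best path
        states[t] = back[t + 1, states[t + 1]]
    return states

# usage: noisy per-frame scores get smoothed into stable presence intervals
scores = np.array([0.1, 0.2, 0.9, 0.3, 0.8, 0.9, 0.85, 0.2, 0.1])
print(viterbi_smooth(scores))
```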
arXiv Detail & Related papers (2024-04-07T15:27:35Z)
- Dual-scale Enhanced and Cross-generative Consistency Learning for Semi-supervised Medical Image Segmentation [49.57907601086494]
Medical image segmentation plays a crucial role in computer-aided diagnosis.
We propose a novel Dual-scale Enhanced and Cross-generative consistency learning framework for semi-supervised medical image segmentation (DEC-Seg).
arXiv Detail & Related papers (2023-12-26T12:56:31Z)
- Dynamic Scene Graph Representation for Surgical Video [37.22552586793163]
We exploit scene graphs as a more holistic, semantically meaningful and human-readable way to represent surgical videos.
We create a scene graph dataset from semantic segmentations from the CaDIS and CATARACTS datasets.
We demonstrate the benefits of surgical scene graphs regarding the explainability and robustness of model decisions.
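Conceptually, a surgical scene graph can be derived from a segmentation map by turning each labeled region into a node and connecting regions whose masks touch. The toy sketch below uses a one-pixel dilation heuristic for adjacency; this heuristic and the class names are assumptions for illustration, and the CaDIS/CATARACTS-derived dataset encodes far richer semantic relations.

```python
import numpy as np

def masks_to_scene_graph(seg, class_names, dilate=1):
    """seg: (H, W) integer label map. Returns (nodes, edges)."""
    labels = [l for l in np.unique(seg) if l != 0]           # 0 = background
    nodes = {int(l): class_names.get(int(l), f"class_{l}") for l in labels}
    edges = set()
    for a in labels:
        mask_a = (seg == a)
        grown = mask_a.copy()
        # cheap dilation by shifting the mask one pixel in each direction
        grown[:-dilate, :] |= mask_a[dilate:, :]
        grown[dilate:, :] |= mask_a[:-dilate, :]
        grown[:, :-dilate] |= mask_a[:, dilate:]
        grown[:, dilate:] |= mask_a[:, :-dilate]
        for b in labels:
            if b != a and (grown & (seg == b)).any():
                edges.add((int(min(a, b)), "adjacent_to", int(max(a, b))))
    return nodes, sorted(edges)

# usage: a toy 4x4 map where an instrument (2) touches tissue (1)
seg = np.array([[0, 1, 1, 0],
                [0, 1, 2, 2],
                [0, 0, 2, 2],
                [0, 0, 0, 0]])
print(masks_to_scene_graph(seg, {1: "tissue", 2: "instrument"}))
```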
arXiv Detail & Related papers (2023-09-25T21:28:14Z)
- SurgMAE: Masked Autoencoders for Long Surgical Video Analysis [4.866110274299399]
Masked autoencoders (MAE) have drawn attention in the self-supervised learning paradigm for Vision Transformers (ViTs).
In this paper, we first investigate whether MAE can learn transferable representations in the surgical video domain.
We propose SurgMAE, a novel architecture with a masking strategy that samples high-temporal tokens for MAE.
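MAE-style pretraining hides most patch tokens and reconstructs them from the visible remainder. The sketch below implements plain random token masking as a stand-in; SurgMAE's specific contribution, a sampling strategy that favors high-temporal-information tokens, is not reproduced here, and the tensor sizes are assumptions.

```python
import torch

def random_mask_tokens(tokens, mask_ratio=0.75):
    """tokens: (B, N, D) patch embeddings. Returns visible tokens and mask info."""
    B, N, D = tokens.shape
    n_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N, device=tokens.device)
    ids_shuffle = noise.argsort(dim=1)                 # random permutation per sample
    ids_keep = ids_shuffle[:, :n_keep]
    visible = torch.gather(tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
    mask = torch.ones(B, N, device=tokens.device)
    mask.scatter_(1, ids_keep, 0.0)                    # 1 = masked, 0 = visible
    return visible, mask, ids_keep

# usage: 8 clips x 196 tokens x 768-dim; only ~25% of tokens reach the encoder
tokens = torch.randn(8, 196, 768)
visible, mask, ids_keep = random_mask_tokens(tokens)
print(visible.shape, int(mask.sum(dim=1)[0]))   # torch.Size([8, 49, 768]) 147
```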
arXiv Detail & Related papers (2023-05-19T06:12:50Z)
- Effective semantic segmentation in Cataract Surgery: What matters most? [5.1151054398496685]
Our work proposes neural network design choices that set the state-of-the-art on a challenging public benchmark on cataract surgery, CaDIS.
Our methodology achieves strong performance across three semantic segmentation tasks with increasingly granular surgical tool class sets.
arXiv Detail & Related papers (2021-08-13T08:27:54Z)
- LRTD: Long-Range Temporal Dependency based Active Learning for Surgical Workflow Recognition [67.86810761677403]
We propose a novel active learning method for cost-effective surgical video analysis.
Specifically, we propose a non-local recurrent convolutional network (NL-RCNet), which introduces a non-local block to capture long-range temporal dependency.
We validate our approach on a large surgical video dataset (Cholec80) by performing the surgical workflow recognition task.
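A non-local block computes pairwise affinities between all time steps and aggregates features accordingly, which is what lets the network relate distant parts of a long procedure. The sketch below follows the standard non-local formulation over a sequence of clip-level features; NL-RCNet's exact configuration (recurrent backbone, active-learning selection loop) is not shown, and the dimensions are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NonLocalBlock1D(nn.Module):
    def __init__(self, channels, inner=None):
        super().__init__()
        inner = inner or channels // 2
        self.theta = nn.Conv1d(channels, inner, 1)   # query projection
        self.phi = nn.Conv1d(channels, inner, 1)     # key projection
        self.g = nn.Conv1d(channels, inner, 1)       # value projection
        self.out = nn.Conv1d(inner, channels, 1)

    def forward(self, x):                             # x: (B, C, T)
        q = self.theta(x).transpose(1, 2)             # (B, T, C')
        k = self.phi(x)                                # (B, C', T)
        v = self.g(x).transpose(1, 2)                  # (B, T, C')
        attn = F.softmax(q @ k, dim=-1)                # pairwise frame affinities
        y = (attn @ v).transpose(1, 2)                 # (B, C', T)
        return x + self.out(y)                         # residual connection

# usage: 80 clip-level feature vectors, e.g., one per sampled video segment
feats = torch.randn(2, 512, 80)
print(NonLocalBlock1D(512)(feats).shape)   # torch.Size([2, 512, 80])
```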
arXiv Detail & Related papers (2020-04-21T09:21:22Z)