EndoMamba: An Efficient Foundation Model for Endoscopic Videos
- URL: http://arxiv.org/abs/2502.19090v1
- Date: Wed, 26 Feb 2025 12:36:16 GMT
- Title: EndoMamba: An Efficient Foundation Model for Endoscopic Videos
- Authors: Qingyao Tian, Huai Liao, Xinyan Huang, Bingyu Yang, Dongdong Lei, Sebastien Ourselin, Hongbin Liu
- Abstract summary: Endoscopic video-based tasks, such as visual navigation and surgical phase recognition, play a crucial role in minimally invasive surgeries by providing real-time assistance. Recent video foundation models have shown promise, but their applications are hindered by computational inefficiencies and suboptimal performance caused by limited training data in endoscopy. To address these issues, we present EndoMamba, a foundation model designed for real-time inference while learning generalized representations.
- Score: 2.747826950754128
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Endoscopic video-based tasks, such as visual navigation and surgical phase recognition, play a crucial role in minimally invasive surgeries by providing real-time assistance. While recent video foundation models have shown promise, their applications are hindered by (1) computational inefficiencies and (2) suboptimal performance caused by limited data for pre-training in endoscopy. To address these issues, we present EndoMamba, a foundation model designed for real-time inference while learning generalized spatiotemporal representations. First, to mitigate computational inefficiencies, we propose the EndoMamba backbone, optimized for real-time inference. Inspired by recent advancements in state space models, EndoMamba integrates Bidirectional Mamba blocks for spatial modeling within individual frames and vanilla Mamba blocks for past-to-present reasoning across the temporal domain. This design enables both strong spatiotemporal modeling and efficient inference in online video streams. Second, we propose a self-supervised hierarchical pre-training paradigm to enhance EndoMamba's representation learning using endoscopic videos while incorporating general video domain knowledge. Specifically, our approach combines masked reconstruction with auxiliary supervision, leveraging low-level reconstruction to capture spatiotemporal structures and high-level alignment to transfer broader knowledge from a foundation model pretrained on general videos. Extensive experiments on four downstream tasks--classification, segmentation, surgical phase recognition, and localization--demonstrate that EndoMamba outperforms existing foundation models and task-specific methods while maintaining real-time inference speed. The source code will be released upon acceptance.
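The abstract describes two mechanisms: a backbone that pairs bidirectional Mamba blocks over patch tokens within each frame with causal (vanilla) Mamba blocks across time, and a hierarchical pre-training objective combining masked reconstruction with feature alignment to a general-video teacher. The sketch below illustrates how such a block and objective could look in PyTorch. Since the paper's code is not yet released, every module name, shape convention, and loss weighting here is an assumption inferred from the abstract, not the authors' implementation.

```python
# Illustrative sketch only: all names, shapes, and hyperparameters below are
# assumptions inferred from the abstract, not the authors' code.
# Requires the public mamba-ssm package (pip install mamba-ssm).
import torch
import torch.nn as nn
import torch.nn.functional as F
from mamba_ssm import Mamba


class BidirectionalMamba(nn.Module):
    """Spatial mixing: scan patch tokens forward and backward, then fuse."""

    def __init__(self, dim: int):
        super().__init__()
        self.fwd = Mamba(d_model=dim)
        self.bwd = Mamba(d_model=dim)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # (B, N, D)
        return self.norm(x + self.fwd(x) + self.bwd(x.flip(1)).flip(1))


class EndoMambaBlockSketch(nn.Module):
    """One block: bidirectional Mamba within frames, causal Mamba across time."""

    def __init__(self, dim: int):
        super().__init__()
        self.spatial = BidirectionalMamba(dim)
        self.temporal = Mamba(d_model=dim)  # vanilla Mamba: past-to-present only
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # (B, T, N, D)
        B, T, N, D = x.shape
        # Spatial pass: each frame's N patch tokens are mixed bidirectionally.
        x = self.spatial(x.reshape(B * T, N, D)).reshape(B, T, N, D)
        # Temporal pass: a causal scan over T per spatial location, which is
        # what permits recurrent, frame-by-frame inference on online streams.
        t = x.permute(0, 2, 1, 3).reshape(B * N, T, D)
        t = self.norm(t + self.temporal(t))
        return t.reshape(B, N, T, D).permute(0, 2, 1, 3)


def hierarchical_pretraining_loss(
    pred_patches: torch.Tensor,     # decoder output for masked patches
    target_patches: torch.Tensor,   # ground-truth pixels of masked patches
    student_feats: torch.Tensor,    # EndoMamba features, (B, D)
    teacher_feats: torch.Tensor,    # frozen general-video teacher, (B, D)
    alignment_weight: float = 1.0,  # assumed weighting, not from the paper
) -> torch.Tensor:
    """Masked reconstruction (low-level) plus alignment to a frozen teacher
    (high-level), mirroring the two-part objective the abstract describes."""
    recon = F.mse_loss(pred_patches, target_patches)
    align = 1.0 - F.cosine_similarity(student_feats, teacher_feats, dim=-1).mean()
    return recon + alignment_weight * align
```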
Related papers
- VADMamba: Exploring State Space Models for Fast Video Anomaly Detection [4.874215132369157]
The VQ-Mamba Unet (VQ-MaU) framework incorporates a Vector Quantization (VQ) layer and a Mamba-based Non-negative Visual State Space (NVSS) block.
Results validate the efficacy of the proposed VADMamba across three benchmark datasets.
arXiv Detail & Related papers (2025-03-27T05:38:12Z)
- EventMamba: Enhancing Spatio-Temporal Locality with State Space Models for Event-Based Video Reconstruction [66.84997711357101]
EventMamba is a specialized model designed for event-based video reconstruction tasks.
We show that EventMamba markedly improves speed while delivering superior visual quality compared to Transformer-based methods.
arXiv Detail & Related papers (2025-03-25T14:46:45Z)
- STNMamba: Mamba-based Spatial-Temporal Normality Learning for Video Anomaly Detection [48.997518615379995]
Video anomaly detection (VAD) has been extensively researched due to its potential for intelligent video systems.
Most existing methods based on CNNs and transformers still suffer from substantial computational burdens.
We propose a lightweight and effective Mamba-based network named STNMamba to enhance the learning of spatial-temporal normality.
arXiv Detail & Related papers (2024-12-28T08:49:23Z)
- ECMamba: Consolidating Selective State Space Model with Retinex Guidance for Efficient Multiple Exposure Correction [48.77198487543991]
We introduce a novel Mamba-based framework for Exposure Correction (ECMamba) with dual pathways, dedicated to restoring the reflectance and illumination maps, respectively.
Specifically, building on Retinex theory, we train a Retinex estimator capable of mapping inputs into two intermediary spaces.
We develop a novel 2D Selective State-space layer guided by Retinex information (Retinex-SS2D) as the core operator of ECMM.
arXiv Detail & Related papers (2024-10-28T21:02:46Z)
- Mamba-Spike: Enhancing the Mamba Architecture with a Spiking Front-End for Efficient Temporal Data Processing [4.673285689826945]
Mamba-Spike is a novel neuromorphic architecture that integrates a spiking front-end with the Mamba backbone to achieve efficient temporal data processing.
The architecture consistently outperforms state-of-the-art baselines, achieving higher accuracy, lower latency, and improved energy efficiency.
arXiv Detail & Related papers (2024-08-04T14:10:33Z)
- DiM-Gesture: Co-Speech Gesture Generation with Adaptive Layer Normalization Mamba-2 framework [2.187990941788468]
DiM-Gesture is a generative model crafted to create highly personalized 3D full-body gestures solely from raw speech audio.
The model integrates a Mamba-based fuzzy feature extractor with a non-autoregressive Adaptive Layer Normalization (AdaLN) Mamba-2 diffusion architecture.
arXiv Detail & Related papers (2024-08-01T08:22:47Z)
- SR-Mamba: Effective Surgical Phase Recognition with State Space Model [42.766718651973726]
SR-Mamba is a novel attention-free model specifically tailored to meet the challenges of surgical phase recognition.
In SR-Mamba, we leverage a bidirectional Mamba decoder to effectively model the temporal context in overlong sequences.
SR-Mamba establishes a new benchmark in surgical video analysis by demonstrating state-of-the-art performance on the Cholec80 and CATARACTS Challenge datasets.
arXiv Detail & Related papers (2024-07-11T09:34:31Z)
- Vivim: a Video Vision Mamba for Medical Video Segmentation [52.11785024350253]
This paper presents a Video Vision Mamba-based framework, dubbed Vivim, for medical video segmentation tasks.
Our Vivim can effectively compress the long-term representation into sequences at varying scales.
Experiments on thyroid segmentation, breast lesion segmentation in ultrasound videos, and polyp segmentation in colonoscopy videos demonstrate the effectiveness and efficiency of our Vivim.
arXiv Detail & Related papers (2024-01-25T13:27:03Z)
- Skeleton2vec: A Self-supervised Learning Framework with Contextualized Target Representations for Skeleton Sequence [56.092059713922744]
We show that using high-level contextualized features as prediction targets can achieve superior performance.
Specifically, we propose Skeleton2vec, a simple and efficient self-supervised 3D action representation learning framework.
Our proposed Skeleton2vec outperforms previous methods and achieves state-of-the-art results.
arXiv Detail & Related papers (2024-01-01T12:08:35Z)
- RT-GAN: Recurrent Temporal GAN for Adding Lightweight Temporal Consistency to Frame-Based Domain Translation Approaches [3.7873597471903944]
We present a lightweight solution with a tunable temporal parameter, RT-GAN, for adding temporal consistency to individual frame-based approaches.
We demonstrate the effectiveness of our approach on two challenging use cases in colonoscopy.
arXiv Detail & Related papers (2023-10-02T03:13:26Z)
- Leaping Into Memories: Space-Time Deep Feature Synthesis [93.10032043225362]
We propose LEAPS, an architecture-independent method for synthesizing videos from the internal representations of video models.
We quantitatively and qualitatively evaluate the applicability of LEAPS by inverting a range of convolutional and attention-based architectures pre-trained on Kinetics-400.
arXiv Detail & Related papers (2023-03-17T12:55:22Z)