Vivim: a Video Vision Mamba for Medical Video Object Segmentation
- URL: http://arxiv.org/abs/2401.14168v3
- Date: Tue, 12 Mar 2024 14:45:49 GMT
- Title: Vivim: a Video Vision Mamba for Medical Video Object Segmentation
- Authors: Yijun Yang, Zhaohu Xing, Chunwang Huang, Lei Zhu
- Abstract summary: This paper presents a generic Video Vision Mamba-based framework, dubbed Vivim, for medical video object segmentation tasks.
Our Vivim can effectively compress the long-term spatiotemporal representation into sequences at varying scales by our designed Temporal Mamba Block.
We also introduce a boundary-aware constraint to enhance the discriminative ability of Vivim on ambiguous lesions in medical images.
- Score: 12.408219091543295
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Traditional convolutional neural networks have a limited receptive field
while transformer-based networks are mediocre in constructing long-term
dependency from the perspective of computational complexity. This
bottleneck poses a significant challenge when processing long sequences in
video analysis tasks. Very recently, state space models (SSMs) with
efficient hardware-aware designs, exemplified by Mamba, have exhibited impressive
achievements in long sequence modeling, which facilitates the development of
deep neural networks on many vision tasks. To better capture available dynamic
cues in video frames, this paper presents a generic Video Vision Mamba-based
framework, dubbed \textbf{Vivim}, for medical video object segmentation
tasks. Our Vivim can effectively compress the long-term spatiotemporal
representation into sequences at varying scales by our designed Temporal Mamba
Block. We also introduce a boundary-aware constraint to enhance the
discriminative ability of Vivim on ambiguous lesions in medical images.
Extensive experiments on thyroid segmentation in ultrasound videos and polyp
segmentation in colonoscopy videos demonstrate the effectiveness and efficiency
of our Vivim, superior to existing methods. The code is available at:
https://github.com/scott-yjyang/Vivim.
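The linear-time recurrence that underlies Mamba-style SSMs can be illustrated with a minimal NumPy sketch. This is a toy diagonal state space scan under assumed shapes, not Vivim's actual Temporal Mamba Block; all names and dimensions here are illustrative.

```python
import numpy as np

def ssm_scan(u, A, B, C):
    """Minimal state space scan: h_t = A*h_{t-1} + B u_t, y_t = C h_t.

    u: (T, d_in) input sequence; A: (d_state,) diagonal transition;
    B: (d_state, d_in); C: (d_out, d_state). A single pass over the T
    steps, so cost grows linearly with sequence length, unlike the
    quadratic cost of full self-attention.
    """
    h = np.zeros(A.shape[0])
    ys = []
    for t in range(u.shape[0]):
        h = A * h + B @ u[t]   # state update (diagonal A keeps this cheap)
        ys.append(C @ h)       # readout
    return np.stack(ys)        # (T, d_out)

rng = np.random.default_rng(0)
T, d_in, d_state, d_out = 16, 4, 8, 2
y = ssm_scan(rng.normal(size=(T, d_in)),
             np.full(d_state, 0.9),                  # stable decay < 1
             rng.normal(size=(d_state, d_in)) * 0.1,
             rng.normal(size=(d_out, d_state)) * 0.1)
print(y.shape)  # (16, 2)
```

Selective SSMs like Mamba additionally make A, B, and C input-dependent, but the linear-in-T scan structure shown here is the source of the efficiency claim.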
Related papers
- Long Context Transfer from Language to Vision [74.78422371545716]
Video sequences offer valuable temporal information, but existing large multimodal models (LMMs) fall short in understanding extremely long videos.
In this paper, we approach this problem from the perspective of the language model.
By simply extrapolating the context length of the language backbone, we enable LMMs to comprehend orders of magnitude more visual tokens without any video training.
arXiv Detail & Related papers (2024-06-24T17:58:06Z) - VM-UNET-V2 Rethinking Vision Mamba UNet for Medical Image Segmentation [8.278068663433261]
We propose Vision Mamba-UNetV2, inspired by the Mamba architecture, to capture contextual information in images.
VM-UNetV2 exhibits competitive performance in medical image segmentation tasks.
We conduct comprehensive experiments on the ISIC17, ISIC18, CVC-300, CVC-ClinicDB, Kvasir, CVC-ColonDB and ETIS-LaribPolypDB public datasets.
arXiv Detail & Related papers (2024-03-14T08:12:39Z) - VideoMamba: State Space Model for Efficient Video Understanding [46.17083617091239]
VideoMamba overcomes the limitations of existing 3D convolution neural networks and video transformers.
Its linear-complexity operator enables efficient long-term modeling.
VideoMamba sets a new benchmark for video understanding.
arXiv Detail & Related papers (2024-03-11T17:59:34Z) - Weak-Mamba-UNet: Visual Mamba Makes CNN and ViT Work Better for
Scribble-based Medical Image Segmentation [13.748446415530937]
This paper introduces Weak-Mamba-UNet, an innovative weakly-supervised learning (WSL) framework for medical image segmentation.
The WSL strategy incorporates three distinct architectures sharing the same symmetrical encoder-decoder design: a CNN-based UNet for detailed local feature extraction, a Swin Transformer-based SwinUNet for comprehensive global context understanding, and a VMamba-based Mamba-UNet for efficient long-range dependency modeling.
The effectiveness of Weak-Mamba-UNet is validated on a publicly available MRI cardiac segmentation dataset with processed annotations, where it surpasses the performance of a similar WSL framework.
arXiv Detail & Related papers (2024-02-16T18:43:39Z) - Mamba-UNet: UNet-Like Pure Visual Mamba for Medical Image Segmentation [21.1787366866505]
We propose Mamba-UNet, a novel architecture that combines U-Net's strengths in medical image segmentation with Mamba's capability for efficient long-sequence modeling.
Mamba-UNet adopts a pure Visual Mamba (VMamba)-based encoder-decoder structure, infused with skip connections to preserve spatial information across different scales of the network.
arXiv Detail & Related papers (2024-02-07T18:33:04Z) - U-Mamba: Enhancing Long-range Dependency for Biomedical Image
Segmentation [10.083902382768406]
We introduce U-Mamba, a general-purpose network for biomedical image segmentation.
Inspired by State Space Sequence Models (SSMs), a new family of deep sequence models, we design a hybrid CNN-SSM block.
We conduct experiments on four diverse tasks, including 3D abdominal organ segmentation in CT and MR images, instrument segmentation in endoscopy images, and cell segmentation in microscopy images.
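A hybrid CNN-SSM block of this general flavor can be sketched as follows: a local depthwise convolution for fine-grained features, followed by a linear-time state space scan for long-range mixing. This is a toy NumPy illustration under assumed shapes, not U-Mamba's actual block.

```python
import numpy as np

def depthwise_conv1d(x, k):
    """Per-channel 1D cross-correlation with 'same' padding.
    x: (T, C) token sequence; k: (K, C) one kernel per channel."""
    K = k.shape[0]
    pad = K // 2
    xp = np.pad(x, ((pad, K - 1 - pad), (0, 0)))
    return np.stack([np.convolve(xp[:, c], k[::-1, c], mode="valid")
                     for c in range(x.shape[1])], axis=1)

def hybrid_block(x, k, A, B, C):
    """Hypothetical CNN-SSM hybrid: local conv features, then a
    linear-time diagonal state-space scan over the sequence."""
    z = np.maximum(depthwise_conv1d(x, k), 0.0)   # local features + ReLU
    h = np.zeros(A.shape[0])
    ys = []
    for t in range(z.shape[0]):                   # sequential SSM scan
        h = A * h + B @ z[t]
        ys.append(C @ h)
    return np.stack(ys)                           # (T, d_out)

rng = np.random.default_rng(1)
x = rng.normal(size=(32, 6))                      # 32 tokens, 6 channels
k = rng.normal(size=(3, 6)) * 0.1                 # 3-tap per-channel kernel
A = np.full(16, 0.95)                             # stable diagonal transition
B = rng.normal(size=(16, 6)) * 0.1
C = rng.normal(size=(4, 16)) * 0.1
y = hybrid_block(x, k, A, B, C)
print(y.shape)  # (32, 4)
```

The design intuition is complementary receptive fields: the convolution captures neighborhood detail cheaply, while the scan propagates information across the whole sequence in a single linear pass.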
arXiv Detail & Related papers (2024-01-09T18:53:20Z) - Siamese Masked Autoencoders [76.35448665609998]
We present Siamese Masked Autoencoders (SiamMAE) for learning visual correspondence from videos.
SiamMAE operates on pairs of randomly sampled video frames and asymmetrically masks them.
It outperforms state-of-the-art self-supervised methods on video object segmentation, pose keypoint propagation, and semantic part propagation tasks.
arXiv Detail & Related papers (2023-05-23T17:59:46Z) - MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form
Video Question Answering [73.61182342844639]
We introduce a new model named Multi-modal Iterative Spatial-temporal Transformer (MIST) to better adapt pre-trained models for long-form VideoQA.
MIST decomposes traditional dense spatial-temporal self-attention into cascaded segment and region selection modules.
Visual concepts at different granularities are then processed efficiently through an attention module.
arXiv Detail & Related papers (2022-12-19T15:05:40Z) - Generating Masks from Boxes by Mining Spatio-Temporal Consistencies in
Videos [159.02703673838639]
We introduce a method for generating segmentation masks from per-frame bounding box annotations in videos.
We use our resulting accurate masks for weakly supervised training of video object segmentation (VOS) networks.
The additional data provides substantially better generalization performance leading to state-of-the-art results in both the VOS and more challenging tracking domain.
arXiv Detail & Related papers (2021-01-06T18:56:24Z) - Coherent Loss: A Generic Framework for Stable Video Segmentation [103.78087255807482]
We investigate how a jittering artifact degrades the visual quality of video segmentation results.
We propose a Coherent Loss with a generic framework to enhance the performance of a neural network against jittering artifacts.
arXiv Detail & Related papers (2020-10-25T10:48:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.