Vivim: a Video Vision Mamba for Medical Video Segmentation
- URL: http://arxiv.org/abs/2401.14168v4
- Date: Thu, 1 Aug 2024 15:56:43 GMT
- Title: Vivim: a Video Vision Mamba for Medical Video Segmentation
- Authors: Yijun Yang, Zhaohu Xing, Lequan Yu, Chunwang Huang, Huazhu Fu, Lei Zhu
- Abstract summary: This paper presents a Video Vision Mamba-based framework, dubbed Vivim, for medical video segmentation tasks.
Our Vivim can effectively compress the long-term representation into sequences at varying scales.
Experiments on thyroid segmentation, breast lesion segmentation in ultrasound videos, and polyp segmentation in colonoscopy videos demonstrate the effectiveness and efficiency of our Vivim.
- Score: 52.11785024350253
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Medical video segmentation has gained increasing attention in clinical practice because adjacent video frames provide redundant dynamic references. However, traditional convolutional neural networks have a limited receptive field, and transformer-based networks are too computationally expensive to model long-term dependencies efficiently. This bottleneck poses a significant challenge when processing long sequences in medical video analysis tasks on devices with limited memory. Recently, state space models (SSMs), exemplified by Mamba, have achieved impressive results in efficient long-sequence modeling, significantly expanding the effective receptive field of deep neural networks on many vision tasks. Unfortunately, vanilla SSMs fail to simultaneously capture causal temporal cues and preserve non-causal spatial information. To this end, this paper presents a Video Vision Mamba-based framework, dubbed Vivim, for medical video segmentation tasks. Our Vivim can effectively compress the long-term spatiotemporal representation into sequences at varying scales with our designed Temporal Mamba Block. We also introduce an improved boundary-aware affine constraint across frames to enhance the discriminative ability of Vivim on ambiguous lesions. Extensive experiments on thyroid segmentation, breast lesion segmentation in ultrasound videos, and polyp segmentation in colonoscopy videos demonstrate the effectiveness and efficiency of Vivim, which is superior to existing methods. The code is available at: https://github.com/scott-yjyang/Vivim. The dataset will be released once the paper is accepted.
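To make the core mechanism concrete, here is a minimal sketch of how a Temporal Mamba Block might flatten spatiotemporal features into one long token sequence for a selective SSM. It assumes the `mamba_ssm` package; the class name, scan order, and residual layout are illustrative guesses, not the authors' exact implementation.

```python
# Sketch: flatten spatiotemporal features into one long sequence so a Mamba
# SSM can mix them in linear time. Layer names and scan order are assumptions.
import torch
import torch.nn as nn
from mamba_ssm import Mamba  # pip install mamba-ssm (requires CUDA)

class TemporalMambaBlock(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.mamba = Mamba(d_model=dim)  # selective SSM over the token sequence

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, C, H, W) video features at one scale
        b, t, c, h, w = x.shape
        # Flatten frames and pixels into one token sequence of length T*H*W.
        seq = x.permute(0, 1, 3, 4, 2).reshape(b, t * h * w, c)
        seq = seq + self.mamba(self.norm(seq))  # residual SSM mixing
        return seq.reshape(b, t, h, w, c).permute(0, 1, 4, 2, 3)

feats = torch.randn(1, 8, 64, 16, 16).cuda()   # 8 frames of 16x16 feature maps
out = TemporalMambaBlock(64).cuda()(feats)     # same shape, long-range mixed
```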
Related papers
- LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding [65.46303012350207]
LongVU is an adaptive compression mechanism that reduces the number of video tokens while preserving visual details of long videos.
We leverage DINOv2 features to remove redundant frames that exhibit high similarity.
We perform spatial token reduction across frames based on their temporal dependencies.
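As a rough illustration of the frame-pruning step described above, the sketch below greedily drops frames whose pooled DINOv2 embeddings are nearly identical to the last kept frame. The cosine threshold and greedy scan are assumptions for illustration, not LongVU's exact rule.

```python
# Sketch: drop frames whose DINOv2 features closely match the last kept frame.
import torch
import torch.nn.functional as F

def prune_redundant_frames(feats: torch.Tensor, thresh: float = 0.95) -> list[int]:
    """feats: (T, D), one pooled DINOv2 feature per frame; returns kept indices."""
    kept = [0]
    for t in range(1, feats.shape[0]):
        sim = F.cosine_similarity(feats[t], feats[kept[-1]], dim=0)
        if sim < thresh:          # keep only frames that add new visual content
            kept.append(t)
    return kept

frame_feats = torch.randn(32, 768)   # e.g. 32 frames of DINOv2 CLS embeddings
print(prune_redundant_frames(frame_feats))
```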
arXiv Detail & Related papers (2024-10-22T21:21:37Z) - PhysMamba: Efficient Remote Physiological Measurement with SlowFast Temporal Difference Mamba [20.435381963248787]
Previous deep learning based remote physiological measurement (rPPG) methods are primarily based on CNNs and Transformers.
We propose PhysMamba, a Mamba-based framework, to efficiently represent long-range physiological dependencies from facial videos.
Extensive experiments are conducted on three benchmark datasets to demonstrate the superiority and efficiency of PhysMamba.
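One way to picture the temporal-difference idea in PhysMamba's name: feed normalized differences of consecutive facial frames, which amplify subtle blood-volume changes, into the sequence model. The normalization below is a common rPPG preprocessing trick and an assumption here, not necessarily the authors' exact formulation.

```python
# Sketch: scale-invariant temporal differences as input to a temporal branch.
import torch

def normalized_frame_diff(frames: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """frames: (T, C, H, W) -> (T-1, C, H, W) normalized differences."""
    prev, curr = frames[:-1], frames[1:]
    return (curr - prev) / (curr + prev + eps)  # emphasizes subtle pulsatile changes

clip = torch.rand(64, 3, 128, 128)      # 64 facial frames
diffs = normalized_frame_diff(clip)     # input to the Mamba temporal branch
```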
arXiv Detail & Related papers (2024-09-18T14:48:50Z) - VideoLLM-MoD: Efficient Video-Language Streaming with Mixture-of-Depths Vision Computation [66.00245701441547]
We introduce a novel approach that reduces vision compute by letting redundant vision tokens "skip" layers rather than by decreasing the number of vision tokens.
Our method, VideoLLM-MoD, is inspired by mixture-of-depths LLMs and addresses the challenge of numerous vision tokens in long-term or streaming video.
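A minimal sketch of mixture-of-depths routing as described: a learned router scores each vision token, only the top-k enter the expensive layer, and the rest skip it through the residual path. The router, capacity ratio, and the plain Transformer layer are illustrative stand-ins, not VideoLLM-MoD's actual components.

```python
# Sketch: top-k tokens pass through the layer; the rest skip it unchanged.
import torch
import torch.nn as nn

class MoDLayer(nn.Module):
    def __init__(self, dim: int, capacity: float = 0.25):
        super().__init__()
        self.router = nn.Linear(dim, 1)   # per-token "worth processing" score
        self.block = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.capacity = capacity

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, n, d = x.shape
        k = max(1, int(n * self.capacity))
        scores = self.router(x).squeeze(-1)        # (B, N)
        top = scores.topk(k, dim=1).indices        # tokens that enter the layer
        out = x.clone()
        for i in range(b):                         # gather, process, scatter back
            sel = top[i]
            out[i, sel] = self.block(x[i : i + 1, sel])[0]
        return out                                 # unselected tokens skip the layer

tokens = torch.randn(2, 256, 512)        # many vision tokens
print(MoDLayer(512)(tokens).shape)       # torch.Size([2, 256, 512])
```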
arXiv Detail & Related papers (2024-08-29T17:21:58Z) - VM-UNET-V2 Rethinking Vision Mamba UNet for Medical Image Segmentation [8.278068663433261]
We propose Vision Mamba-UNetV2 (VM-UNetV2), inspired by the Mamba architecture, to capture contextual information in images.
VM-UNetV2 exhibits competitive performance in medical image segmentation tasks.
We conduct comprehensive experiments on the ISIC17, ISIC18, CVC-300, CVC-ClinicDB, Kvasir, CVC-ColonDB and ETIS-LaribPolypDB public datasets.
arXiv Detail & Related papers (2024-03-14T08:12:39Z) - U-Mamba: Enhancing Long-range Dependency for Biomedical Image Segmentation [10.083902382768406]
We introduce U-Mamba, a general-purpose network for biomedical image segmentation.
Inspired by State Space Sequence Models (SSMs), a new family of deep sequence models, we design a hybrid CNN-SSM block.
We conduct experiments on four diverse tasks, including 3D abdominal organ segmentation in CT and MR images, instrument segmentation in endoscopy images, and cell segmentation in microscopy images.
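A minimal sketch of what such a hybrid CNN-SSM block could look like: convolutions for local features, then a Mamba layer over the flattened feature map for long-range dependency. It assumes the `mamba_ssm` package; the exact layer layout is a guess, not U-Mamba's published block.

```python
# Sketch: local conv features followed by linear-time global mixing via Mamba.
import torch
import torch.nn as nn
from mamba_ssm import Mamba

class ConvSSMBlock(nn.Module):
    def __init__(self, ch: int):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.InstanceNorm2d(ch), nn.LeakyReLU(),
        )
        self.norm = nn.LayerNorm(ch)
        self.mamba = Mamba(d_model=ch)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.conv(x)                           # local texture
        b, c, h, w = x.shape
        seq = x.flatten(2).transpose(1, 2)         # (B, H*W, C) token sequence
        seq = seq + self.mamba(self.norm(seq))     # global long-range mixing
        return seq.transpose(1, 2).reshape(b, c, h, w)
```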
arXiv Detail & Related papers (2024-01-09T18:53:20Z) - Temporally Constrained Neural Networks (TCNN): A framework for semi-supervised video semantic segmentation [5.0754434714665715]
We present Temporally Constrained Neural Networks (TCNN), a semi-supervised framework for semantic segmentation of surgical videos.
In this work, we show that autoencoder networks can be used to efficiently provide both spatial and temporal supervisory signals.
We demonstrate that lower-dimensional representations of predicted masks can be leveraged to provide a consistent improvement on sparsely labeled datasets.
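One hedged reading of how an autoencoder can supply a temporal supervisory signal: encode predicted masks into a low-dimensional latent and penalize abrupt changes between consecutive unlabeled frames. The toy encoder and smoothness loss below are assumptions for illustration, not TCNN's actual architecture.

```python
# Sketch: regularize predicted masks toward smoothness in a mask-latent space.
import torch
import torch.nn as nn

mask_encoder = nn.Sequential(                 # stand-in for a pretrained mask encoder
    nn.Conv2d(1, 16, 4, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 4, stride=2, padding=1), nn.ReLU(),
    nn.Flatten(), nn.LazyLinear(64),
)

def temporal_consistency_loss(pred_masks: torch.Tensor) -> torch.Tensor:
    """pred_masks: (T, 1, H, W) soft masks for consecutive unlabeled frames."""
    z = mask_encoder(pred_masks)              # (T, 64) latent codes
    return (z[1:] - z[:-1]).pow(2).mean()     # penalize abrupt latent changes

loss = temporal_consistency_loss(torch.sigmoid(torch.randn(8, 1, 64, 64)))
```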
arXiv Detail & Related papers (2021-12-27T18:06:12Z) - FetReg: Placental Vessel Segmentation and Registration in Fetoscopy Challenge Dataset [57.30136148318641]
Fetoscopy laser photocoagulation is a widely used procedure for the treatment of Twin-to-Twin Transfusion Syndrome (TTTS).
The limited field of view and poor visibility during the procedure may lead to increased procedural time and incomplete ablation, resulting in persistent TTTS.
Computer-assisted intervention may help overcome these challenges by expanding the fetoscopic field of view through video mosaicking and providing better visualization of the vessel network.
We present a large-scale multi-centre dataset for the development of generalized and robust semantic segmentation and video mosaicking algorithms for the fetal environment with a focus on creating drift-free mosaics from long duration fetoscopy videos.
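For intuition, video mosaicking of the kind the challenge targets typically chains pairwise homographies between consecutive frames. The sketch below is a textbook OpenCV recipe, not the challenge's reference pipeline.

```python
# Sketch: estimate a pairwise homography; accumulate these over frames to
# place each frame into a common mosaic canvas.
import cv2
import numpy as np

def pairwise_homography(prev: np.ndarray, curr: np.ndarray) -> np.ndarray:
    """prev, curr: uint8 grayscale frames; returns a 3x3 homography."""
    orb = cv2.ORB_create(1000)
    k1, d1 = orb.detectAndCompute(prev, None)
    k2, d2 = orb.detectAndCompute(curr, None)
    matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(d1, d2)
    src = np.float32([k1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([k2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)  # robust to outliers
    return H
```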
arXiv Detail & Related papers (2021-06-10T17:14:27Z) - Generating Masks from Boxes by Mining Spatio-Temporal Consistencies in Videos [159.02703673838639]
We introduce a method for generating segmentation masks from per-frame bounding box annotations in videos.
We use our resulting accurate masks for weakly supervised training of video object segmentation (VOS) networks.
The additional data provides substantially better generalization, leading to state-of-the-art results in both the VOS and the more challenging tracking domains.
arXiv Detail & Related papers (2021-01-06T18:56:24Z) - Multi-frame Feature Aggregation for Real-time Instrument Segmentation in Endoscopic Video [11.100734994959419]
We propose a novel Multi-frame Feature Aggregation (MFFA) module to aggregate video frame features temporally and spatially.
We also develop a method that can randomly synthesize a surgical frame sequence from a single labeled frame to assist network training.
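A rough sketch of temporal feature aggregation in this spirit: attend from the current frame's tokens to those of past frames so instrument features stay coherent over time. The cross-attention layout is an illustrative stand-in for the paper's MFFA module.

```python
# Sketch: cross-attention from current-frame tokens to past-frame tokens.
import torch
import torch.nn as nn

class FrameAggregator(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, curr: torch.Tensor, past: torch.Tensor) -> torch.Tensor:
        # curr: (B, N, C) tokens of the current frame; past: (B, T*N, C)
        agg, _ = self.attn(curr, past, past)    # query current, key/value past
        return curr + agg                       # temporally enriched features
```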
arXiv Detail & Related papers (2020-11-17T16:27:27Z) - A Prospective Study on Sequence-Driven Temporal Sampling and Ego-Motion Compensation for Action Recognition in the EPIC-Kitchens Dataset [68.8204255655161]
Action recognition is one of the most challenging research fields in computer vision.
Sequences recorded under ego-motion have become especially relevant.
The proposed method copes with this by estimating the ego-motion, i.e., the camera motion, and compensating for it.
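A standard way to estimate such camera motion, sketched below: track sparse corners between frames and fit a global affine transform that can then be compensated before recognition. This is a common OpenCV recipe, not the paper's implementation.

```python
# Sketch: sparse optical flow plus a robust global affine fit as ego-motion.
import cv2
import numpy as np

def estimate_camera_motion(prev_gray: np.ndarray, curr_gray: np.ndarray) -> np.ndarray:
    pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=200, qualityLevel=0.01,
                                  minDistance=10)
    nxt, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, curr_gray, pts, None)
    good = status.ravel() == 1
    M, _ = cv2.estimateAffinePartial2D(pts[good], nxt[good], method=cv2.RANSAC)
    return M  # 2x3 affine: the camera's ego-motion between the two frames
```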
arXiv Detail & Related papers (2020-08-26T14:44:45Z)