Vivim: a Video Vision Mamba for Medical Video Object Segmentation
- URL: http://arxiv.org/abs/2401.14168v3
- Date: Tue, 12 Mar 2024 14:45:49 GMT
- Title: Vivim: a Video Vision Mamba for Medical Video Object Segmentation
- Authors: Yijun Yang, Zhaohu Xing, Chunwang Huang, Lei Zhu
- Abstract summary: This paper presents a generic Video Vision Mamba-based framework, dubbed Vivim, for medical video object segmentation tasks.
Our Vivim can effectively compress the long-term spatiotemporal representation into sequences at varying scales by our designed Temporal Mamba Block.
We also introduce a boundary-aware constraint to enhance the discriminative ability of Vivim on ambiguous lesions in medical images.
- Score: 12.408219091543295
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Traditional convolutional neural networks have a limited receptive field
while transformer-based networks are mediocre in constructing long-term
dependency from the perspective of computational complexity. This
bottleneck poses a significant challenge when processing long sequences in
video analysis tasks. Very recently, state space models (SSMs) with
efficient hardware-aware designs, exemplified by Mamba, have exhibited impressive
achievements in long sequence modeling, which facilitates the development of
deep neural networks on many vision tasks. To better capture available dynamic
cues in video frames, this paper presents a generic Video Vision Mamba-based
framework, dubbed \textbf{Vivim}, for medical video object segmentation
tasks. Our Vivim can effectively compress the long-term spatiotemporal
representation into sequences at varying scales by our designed Temporal Mamba
Block. We also introduce a boundary-aware constraint to enhance the
discriminative ability of Vivim on ambiguous lesions in medical images.
Extensive experiments on thyroid segmentation in ultrasound videos and polyp
segmentation in colonoscopy videos demonstrate the effectiveness and efficiency
of our Vivim, superior to existing methods. The code is available at:
https://github.com/scott-yjyang/Vivim.
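The linear-time recurrence that underlies Mamba-style SSMs can be illustrated with a minimal NumPy sketch. This is a toy diagonal state space scan under assumed shapes, not Vivim's actual Temporal Mamba Block; all names and dimensions here are illustrative.

```python
import numpy as np

def ssm_scan(u, A, B, C):
    """Minimal state space scan: h_t = A*h_{t-1} + B u_t, y_t = C h_t.

    u: (T, d_in) input sequence; A: (d_state,) diagonal transition;
    B: (d_state, d_in); C: (d_out, d_state). A single pass over the T
    steps, so cost grows linearly with sequence length, unlike the
    quadratic cost of full self-attention.
    """
    h = np.zeros(A.shape[0])
    ys = []
    for t in range(u.shape[0]):
        h = A * h + B @ u[t]   # state update (diagonal A keeps this cheap)
        ys.append(C @ h)       # readout
    return np.stack(ys)        # (T, d_out)

rng = np.random.default_rng(0)
T, d_in, d_state, d_out = 16, 4, 8, 2
y = ssm_scan(rng.normal(size=(T, d_in)),
             np.full(d_state, 0.9),                  # stable decay < 1
             rng.normal(size=(d_state, d_in)) * 0.1,
             rng.normal(size=(d_out, d_state)) * 0.1)
print(y.shape)  # (16, 2)
```

Selective SSMs like Mamba additionally make A, B, and C input-dependent, but the linear-in-T scan structure shown here is the source of the efficiency claim.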
Related papers
- Long Context Transfer from Language to Vision [74.78422371545716]
Video sequences offer valuable temporal information, but existing large multimodal models (LMMs) fall short in understanding extremely long videos.
In this paper, we approach this problem from the perspective of the language model.
By simply extrapolating the context length of the language backbone, we enable LMMs to comprehend orders of magnitude more visual tokens without any video training.
arXiv Detail & Related papers (2024-06-24T17:58:06Z) - VM-UNET-V2 Rethinking Vision Mamba UNet for Medical Image Segmentation [8.278068663433261]
We propose Vision Mamba-UNetV2, inspired by the Mamba architecture, to capture contextual information in images.
VM-UNetV2 exhibits competitive performance in medical image segmentation tasks.
We conduct comprehensive experiments on the ISIC17, ISIC18, CVC-300, CVC-ClinicDB, Kvasir, CVC-ColonDB and ETIS-LaribPolypDB public datasets.
arXiv Detail & Related papers (2024-03-14T08:12:39Z) - VideoMamba: State Space Model for Efficient Video Understanding [46.17083617091239]
VideoMamba overcomes the limitations of existing 3D convolution neural networks and video transformers.
Its linear-complexity operator enables efficient long-term modeling.
VideoMamba sets a new benchmark for video understanding.
arXiv Detail & Related papers (2024-03-11T17:59:34Z) - Weak-Mamba-UNet: Visual Mamba Makes CNN and ViT Work Better for
Scribble-based Medical Image Segmentation [13.748446415530937]
This paper introduces Weak-Mamba-UNet, an innovative weakly-supervised learning (WSL) framework for medical image segmentation.
The WSL strategy incorporates three distinct architectures sharing the same symmetrical encoder-decoder design: a CNN-based UNet for detailed local feature extraction, a Swin Transformer-based SwinUNet for comprehensive global context understanding, and a VMamba-based Mamba-UNet for efficient long-range dependency modeling.
The effectiveness of Weak-Mamba-UNet is validated on a publicly available MRI cardiac segmentation dataset with processed annotations, where it surpasses the performance of a similar WSL framework.
arXiv Detail & Related papers (2024-02-16T18:43:39Z) - Mamba-UNet: UNet-Like Pure Visual Mamba for Medical Image Segmentation [21.1787366866505]
We propose Mamba-UNet, a novel architecture that combines U-Net's strengths in medical image segmentation with Mamba's capability for efficient long-sequence modeling.
Mamba-UNet adopts a pure Visual Mamba (VMamba)-based encoder-decoder structure, infused with skip connections to preserve spatial information across different scales of the network.
arXiv Detail & Related papers (2024-02-07T18:33:04Z) - U-Mamba: Enhancing Long-range Dependency for Biomedical Image
Segmentation [10.083902382768406]
We introduce U-Mamba, a general-purpose network for biomedical image segmentation.
Inspired by State Space Sequence Models (SSMs), a new family of deep sequence models, we design a hybrid CNN-SSM block.
We conduct experiments on four diverse tasks, including 3D abdominal organ segmentation in CT and MR images, instrument segmentation in endoscopy images, and cell segmentation in microscopy images.
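A hybrid CNN-SSM block of this general flavor can be sketched as follows: a local depthwise convolution for fine-grained features, followed by a linear-time state space scan for long-range mixing. This is a toy NumPy illustration under assumed shapes, not U-Mamba's actual block.

```python
import numpy as np

def depthwise_conv1d(x, k):
    """Per-channel 1D cross-correlation with 'same' padding.
    x: (T, C) token sequence; k: (K, C) one kernel per channel."""
    K = k.shape[0]
    pad = K // 2
    xp = np.pad(x, ((pad, K - 1 - pad), (0, 0)))
    return np.stack([np.convolve(xp[:, c], k[::-1, c], mode="valid")
                     for c in range(x.shape[1])], axis=1)

def hybrid_block(x, k, A, B, C):
    """Hypothetical CNN-SSM hybrid: local conv features, then a
    linear-time diagonal state-space scan over the sequence."""
    z = np.maximum(depthwise_conv1d(x, k), 0.0)   # local features + ReLU
    h = np.zeros(A.shape[0])
    ys = []
    for t in range(z.shape[0]):                   # sequential SSM scan
        h = A * h + B @ z[t]
        ys.append(C @ h)
    return np.stack(ys)                           # (T, d_out)

rng = np.random.default_rng(1)
x = rng.normal(size=(32, 6))                      # 32 tokens, 6 channels
k = rng.normal(size=(3, 6)) * 0.1                 # 3-tap per-channel kernel
A = np.full(16, 0.95)                             # stable diagonal transition
B = rng.normal(size=(16, 6)) * 0.1
C = rng.normal(size=(4, 16)) * 0.1
y = hybrid_block(x, k, A, B, C)
print(y.shape)  # (32, 4)
```

The design intuition is complementary receptive fields: the convolution captures neighborhood detail cheaply, while the scan propagates information across the whole sequence in a single linear pass.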
arXiv Detail & Related papers (2024-01-09T18:53:20Z) - Siamese Masked Autoencoders [76.35448665609998]
We present Siamese Masked Autoencoders (SiamMAE) for learning visual correspondence from videos.
SiamMAE operates on pairs of randomly sampled video frames and asymmetrically masks them.
It outperforms state-of-the-art self-supervised methods on video object segmentation, pose keypoint propagation, and semantic part propagation tasks.
arXiv Detail & Related papers (2023-05-23T17:59:46Z) - MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form
Video Question Answering [73.61182342844639]
We introduce a new model named Multi-modal Iterative Spatial-temporal Transformer (MIST) to better adapt pre-trained models for long-form VideoQA.
MIST decomposes traditional dense spatial-temporal self-attention into cascaded segment and region selection modules.
Visual concepts at different granularities are then processed efficiently through an attention module.
arXiv Detail & Related papers (2022-12-19T15:05:40Z) - Generating Masks from Boxes by Mining Spatio-Temporal Consistencies in
Videos [159.02703673838639]
We introduce a method for generating segmentation masks from per-frame bounding box annotations in videos.
We use our resulting accurate masks for weakly supervised training of video object segmentation (VOS) networks.
The additional data provides substantially better generalization performance leading to state-of-the-art results in both the VOS and more challenging tracking domain.
arXiv Detail & Related papers (2021-01-06T18:56:24Z) - Coherent Loss: A Generic Framework for Stable Video Segmentation [103.78087255807482]
We investigate how a jittering artifact degrades the visual quality of video segmentation results.
We propose a Coherent Loss with a generic framework to enhance the performance of a neural network against jittering artifacts.
arXiv Detail & Related papers (2020-10-25T10:48:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.