SwinMM: Masked Multi-view with Swin Transformers for 3D Medical Image
Segmentation
- URL: http://arxiv.org/abs/2307.12591v1
- Date: Mon, 24 Jul 2023 08:06:46 GMT
- Title: SwinMM: Masked Multi-view with Swin Transformers for 3D Medical Image
Segmentation
- Authors: Yiqing Wang, Zihan Li, Jieru Mei, Zihao Wei, Li Liu, Chen Wang,
Shengtian Sang, Alan Yuille, Cihang Xie, Yuyin Zhou
- Abstract summary: We present Masked Multi-view with Swin Transformers (SwinMM), a novel multi-view pipeline for medical image analysis.
In the pre-training phase, we deploy a masked multi-view encoder devised to concurrently train masked multi-view observations.
A new task capitalizes on the consistency between predictions from various perspectives, enabling the extraction of hidden multi-view information.
- Score: 32.092182889440814
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advancements in large-scale Vision Transformers have made significant
strides in improving pre-trained models for medical image segmentation.
However, these methods face a notable challenge in acquiring a substantial
amount of pre-training data, particularly within the medical field. To address
this limitation, we present Masked Multi-view with Swin Transformers (SwinMM),
a novel multi-view pipeline for enabling accurate and data-efficient
self-supervised medical image analysis. Our strategy harnesses the potential of
multi-view information by incorporating two principal components. In the
pre-training phase, we deploy a masked multi-view encoder devised to
concurrently train masked multi-view observations through a range of diverse
proxy tasks. These tasks span image reconstruction, rotation, contrastive
learning, and a novel task that employs a mutual learning paradigm. This new
task capitalizes on the consistency between predictions from various
perspectives, enabling the extraction of hidden multi-view information from 3D
medical data. In the fine-tuning stage, a cross-view decoder is developed to
aggregate the multi-view information through a cross-attention block. Compared
with the previous state-of-the-art self-supervised learning method Swin UNETR,
SwinMM demonstrates a notable advantage on several medical image segmentation
tasks. It allows for a smooth integration of multi-view information,
significantly boosting both the accuracy and data-efficiency of the model. Code
and models are available at https://github.com/UCSC-VLAA/SwinMM/.
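Note (illustrative, not from the paper): reading only the abstract above, the mutual-learning proxy task can be pictured as a pairwise agreement term between predictions produced from different views of the same volume, added to the reconstruction, rotation, and contrastive losses. The Python sketch below shows that reading; the function names, the symmetric KL formulation, and the loss weights are assumptions for illustration rather than the authors' implementation (the actual code is in the repository linked above).

import torch
import torch.nn.functional as F

def mutual_consistency_loss(logits_a, logits_b, temperature=1.0):
    # Symmetric KL divergence between voxel-wise class distributions predicted
    # from two views of the same volume. Both tensors are assumed to already be
    # re-aligned to a common orientation and to have shape (B, C, D, H, W).
    log_p_a = F.log_softmax(logits_a / temperature, dim=1)
    log_p_b = F.log_softmax(logits_b / temperature, dim=1)
    kl_ab = F.kl_div(log_p_a, log_p_b.exp(), reduction="batchmean")
    kl_ba = F.kl_div(log_p_b, log_p_a.exp(), reduction="batchmean")
    return 0.5 * (kl_ab + kl_ba)

def pretraining_objective(view_logits, recon_loss, rotation_loss, contrastive_loss,
                          consistency_weight=1.0):
    # Combines the proxy-task losses named in the abstract with the consistency
    # term averaged over all pairs of views. The equal weighting is a placeholder,
    # not a value taken from the paper.
    consistency = torch.zeros(())
    num_pairs = 0
    for i in range(len(view_logits)):
        for j in range(i + 1, len(view_logits)):
            consistency = consistency + mutual_consistency_loss(view_logits[i], view_logits[j])
            num_pairs += 1
    consistency = consistency / max(num_pairs, 1)
    return recon_loss + rotation_loss + contrastive_loss + consistency_weight * consistency

The cross-view decoder used at fine-tuning time, which fuses per-view features through a cross-attention block before producing the final segmentation, is not sketched here.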
Related papers
- MOSMOS: Multi-organ segmentation facilitated by medical report supervision [10.396987980136602]
We propose a novel pre-training & fine-tuning framework for Multi-Organ Supervision (MOS).
Specifically, we first introduce global contrastive learning to align medical image-report pairs in the pre-training stage.
To remedy the discrepancy, we further leverage multi-label recognition to implicitly learn the semantic correspondence between image pixels and organ tags.
arXiv Detail & Related papers (2024-09-04T03:46:17Z)
- MiM: Mask in Mask Self-Supervised Pre-Training for 3D Medical Image Analysis [9.227314308722047]
Mask AutoEncoder (MAE) for feature pre-training can unleash the potential of ViT on various medical vision tasks.
We propose a novel Mask in Mask (MiM) pre-training framework for 3D medical images.
arXiv Detail & Related papers (2024-04-24T01:14:33Z)
- Enhancing Weakly Supervised 3D Medical Image Segmentation through Probabilistic-aware Learning [52.249748801637196]
3D medical image segmentation is a challenging task with crucial implications for disease diagnosis and treatment planning.
Recent advances in deep learning have significantly enhanced fully supervised medical image segmentation.
We propose a novel probabilistic-aware weakly supervised learning pipeline, specifically designed for 3D medical imaging.
arXiv Detail & Related papers (2024-03-05T00:46:53Z)
- MV-Swin-T: Mammogram Classification with Multi-view Swin Transformer [0.257133335028485]
We propose an innovative multi-view network based on transformers to address challenges in mammographic image classification.
Our approach introduces a novel shifted window-based dynamic attention block, facilitating the effective integration of multi-view information.
arXiv Detail & Related papers (2024-02-26T04:41:04Z)
- Joint Depth Prediction and Semantic Segmentation with Multi-View SAM [59.99496827912684]
We propose a Multi-View Stereo (MVS) technique for depth prediction that benefits from the rich semantic features of the Segment Anything Model (SAM).
This enhanced depth prediction, in turn, serves as a prompt to our Transformer-based semantic segmentation decoder.
arXiv Detail & Related papers (2023-10-31T20:15:40Z)
- Disruptive Autoencoders: Leveraging Low-level features for 3D Medical Image Pre-training [51.16994853817024]
This work focuses on designing an effective pre-training framework for 3D radiology images.
We introduce Disruptive Autoencoders, a pre-training framework that attempts to reconstruct the original image from disruptions created by a combination of local masking and low-level perturbations.
The proposed pre-training framework is tested across multiple downstream tasks and achieves state-of-the-art performance.
arXiv Detail & Related papers (2023-07-31T17:59:42Z)
- LVM-Med: Learning Large-Scale Self-Supervised Vision Models for Medical Imaging via Second-order Graph Matching [59.01894976615714]
We introduce LVM-Med, the first family of deep networks trained on large-scale medical datasets.
We have collected approximately 1.3 million medical images from 55 publicly available datasets.
LVM-Med empirically outperforms a number of state-of-the-art supervised, self-supervised, and foundation models.
arXiv Detail & Related papers (2023-06-20T22:21:34Z)
- MvCo-DoT: Multi-View Contrastive Domain Transfer Network for Medical Report Generation [42.804058630251305]
We propose the first multi-view medical report generation model, called MvCo-DoT.
MvCo-DoT first proposes a multi-view contrastive learning (MvCo) strategy to help the deep reinforcement learning-based model utilize the consistency of multi-view inputs.
Extensive experiments on the IU X-Ray public dataset show that MvCo-DoT outperforms the SOTA medical report generation baselines in all metrics.
arXiv Detail & Related papers (2023-04-15T03:42:26Z)
- Multi-Modal Masked Autoencoders for Medical Vision-and-Language Pre-Training [62.215025958347105]
We propose a self-supervised learning paradigm with multi-modal masked autoencoders.
We learn cross-modal domain knowledge by reconstructing missing pixels and tokens from randomly masked images and texts.
arXiv Detail & Related papers (2022-09-15T07:26:43Z)
- Multimodal Masked Autoencoders Learn Transferable Representations [127.35955819874063]
We propose a simple and scalable network architecture, the Multimodal Masked Autoencoder (M3AE).
M3AE learns a unified encoder for both vision and language data via masked token prediction.
We provide an empirical study of M3AE trained on a large-scale image-text dataset, and find that M3AE is able to learn generalizable representations that transfer well to downstream tasks.
arXiv Detail & Related papers (2022-05-27T19:09:42Z)
- UNetFormer: A Unified Vision Transformer Model and Pre-Training Framework for 3D Medical Image Segmentation [14.873473285148853]
We introduce a unified framework consisting of two architectures, dubbed UNetFormer, with a 3D Swin Transformer-based encoder and Convolutional Neural Network (CNN)- and transformer-based decoders.
In the proposed model, the encoder is linked to the decoder via skip connections at five different resolutions with deep supervision.
We present a methodology for self-supervised pre-training of the encoder backbone via learning to predict randomly masked tokens.
arXiv Detail & Related papers (2022-04-01T17:38:39Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.