Multi-scale Transformer Network with Edge-aware Pre-training for
Cross-Modality MR Image Synthesis
- URL: http://arxiv.org/abs/2212.01108v3
- Date: Sun, 18 Jun 2023 14:14:17 GMT
- Title: Multi-scale Transformer Network with Edge-aware Pre-training for
Cross-Modality MR Image Synthesis
- Authors: Yonghao Li, Tao Zhou, Kelei He, Yi Zhou, Dinggang Shen
- Abstract summary: Cross-modality magnetic resonance (MR) image synthesis can be used to generate missing modalities from given ones.
Existing (supervised learning) methods often require a large number of paired multi-modal data to train an effective synthesis model.
We propose a Multi-scale Transformer Network (MT-Net) with edge-aware pre-training for cross-modality MR image synthesis.
- Score: 52.41439725865149
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Cross-modality magnetic resonance (MR) image synthesis can be used to
generate missing modalities from given ones. Existing (supervised learning)
methods often require a large number of paired multi-modal data to train an
effective synthesis model. However, it is often challenging to obtain
sufficient paired data for supervised training. In practice, we often have only a
small amount of paired data but a large amount of unpaired data. To take
advantage of both paired and unpaired data, in this paper, we propose a
Multi-scale Transformer Network (MT-Net) with edge-aware pre-training for
cross-modality MR image synthesis. Specifically, an Edge-preserving Masked
AutoEncoder (Edge-MAE) is first pre-trained in a self-supervised manner to
simultaneously perform 1) image imputation for randomly masked patches in each
image and 2) whole edge map estimation, which effectively learns both
contextual and structural information. Besides, a novel patch-wise loss is
proposed to enhance the performance of Edge-MAE by treating different masked
patches differently according to the difficulties of their respective
imputations. Based on this proposed pre-training, in the subsequent fine-tuning
stage, a Dual-scale Selective Fusion (DSF) module is designed (in our MT-Net)
to synthesize missing-modality images by integrating multi-scale features
extracted from the encoder of the pre-trained Edge-MAE. Further, this
pre-trained encoder is also employed to extract high-level features from the
synthesized image and corresponding ground-truth image, which are required to
be similar (consistent) in the training. Experimental results show that our
MT-Net achieves comparable performance to that of the competing methods even using
$70\%$ of all available paired data. Our code will be publicly available at
https://github.com/lyhkevin/MT-Net.
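For a concrete picture of the pre-training objective, a minimal PyTorch-style sketch of an Edge-MAE-like loss is given below. The model interface, the use of Sobel gradients as the edge-map target, and the error-proportional patch weighting are illustrative assumptions, not the authors' released implementation (see the GitHub link above for that).
```python
# Minimal sketch (not the authors' code): an Edge-MAE-style pre-training loss that combines
# masked-patch imputation with whole-edge-map estimation and a patch-wise re-weighting.
import torch
import torch.nn.functional as F

def sobel_edges(img):
    """Approximate edge-map target with Sobel gradients; img: (B, 1, H, W)."""
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]], device=img.device)
    ky = kx.t()
    gx = F.conv2d(img, kx.view(1, 1, 3, 3), padding=1)
    gy = F.conv2d(img, ky.view(1, 1, 3, 3), padding=1)
    return torch.sqrt(gx ** 2 + gy ** 2 + 1e-6)

def edge_mae_loss(model, img, mask_ratio=0.75, patch=16, edge_weight=1.0):
    """img: (B, 1, H, W) single-channel MR slice; `model` is an assumed ViT-style network
    returning (imputed patches, predicted edge map) given the image and a patch mask."""
    B, _, H, W = img.shape
    num_patches = (H // patch) * (W // patch)
    mask = torch.rand(B, num_patches, device=img.device) < mask_ratio   # True = masked patch

    pred_patches, pred_edges = model(img, mask)      # (B, N, patch*patch), (B, 1, H, W)

    # Ground-truth patches for the imputation term.
    target = F.unfold(img, kernel_size=patch, stride=patch).transpose(1, 2)  # (B, N, patch*patch)

    # Patch-wise loss: per-patch MSE, re-weighted so harder (higher-error) patches count more.
    per_patch = ((pred_patches - target) ** 2).mean(dim=-1)                  # (B, N)
    weights = (per_patch / (per_patch.mean(dim=1, keepdim=True) + 1e-6)).detach()
    imputation_loss = (weights * per_patch * mask).sum() / mask.sum().clamp(min=1)

    # Whole-edge-map estimation term.
    edge_loss = F.l1_loss(pred_edges, sobel_edges(img))

    return imputation_loss + edge_weight * edge_loss
```
In the fine-tuning stage described above, the same pre-trained encoder would additionally be reused to extract high-level features from the synthesized and ground-truth images for the consistency constraint.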
Related papers
- Large Language Models for Multimodal Deformable Image Registration [50.91473745610945]
We propose a novel coarse-to-fine MDIR framework, LLM-Morph, for aligning deep features from medical images of different modalities.
Specifically, we first utilize a CNN encoder to extract deep visual features from cross-modal image pairs, then we use the first adapter to adjust these tokens, and use LoRA in pre-trained LLMs to fine-tune their weights.
Third, for token alignment, we utilize four other adapters to transform the LLM-encoded tokens into multi-scale visual features, generating multi-scale deformation fields and facilitating the coarse-to-fine MDIR task.
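Of the components named here, LoRA is the most standard; a generic sketch of a LoRA-wrapped linear layer (illustrative only, not the LLM-Morph code; rank, scaling, and dimensions are assumed) looks like this:
```python
# Generic LoRA wrapper (illustrative, not the LLM-Morph implementation): a frozen
# pre-trained linear layer plus a trainable low-rank update scale * (B @ A).
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():        # keep the pre-trained weights frozen
            p.requires_grad = False
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        # Frozen path + trainable low-rank path.
        return self.base(x) + self.scale * (x @ self.lora_a.t() @ self.lora_b.t())

# Example: wrap one projection of a pre-trained transformer block.
layer = LoRALinear(nn.Linear(768, 768))
tokens = torch.randn(2, 196, 768)               # e.g., adapter-adjusted visual tokens
out = layer(tokens)
```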
arXiv Detail & Related papers (2024-08-20T09:58:30Z)
- A Two-Stage Progressive Pre-training using Multi-Modal Contrastive Masked Autoencoders [5.069884983892437]
We propose a new progressive pre-training method for image understanding tasks which leverages RGB-D datasets.
In the first stage, we pre-train the model using contrastive learning to learn cross-modal representations.
In the second stage, we further pre-train the model using masked autoencoding and denoising/noise prediction.
Our approach is scalable, robust, and suitable for pre-training on RGB-D datasets.
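As a rough illustration of the two stages as summarized above, the following sketch pairs a symmetric InfoNCE loss for stage one with a masked-reconstruction-plus-denoising loss for stage two; encoders, heads, and hyper-parameters are placeholders rather than the paper's implementation.
```python
# Illustrative two-stage recipe (not the paper's code): stage 1 aligns paired RGB and depth
# embeddings contrastively, stage 2 continues with masked reconstruction plus noise prediction.
import torch
import torch.nn.functional as F

def info_nce(z_rgb, z_depth, temperature=0.07):
    """Symmetric contrastive loss between paired RGB and depth embeddings (B, D)."""
    z_rgb = F.normalize(z_rgb, dim=-1)
    z_depth = F.normalize(z_depth, dim=-1)
    logits = z_rgb @ z_depth.t() / temperature           # (B, B), diagonal entries are positives
    labels = torch.arange(z_rgb.size(0), device=z_rgb.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

def stage2_loss(pred_patches, target_patches, mask, pred_noise, true_noise):
    """Masked-autoencoding loss on masked patches plus a denoising / noise-prediction term.
    pred_patches/target_patches: (B, N, D); mask: (B, N) bool, True = masked."""
    recon = ((pred_patches - target_patches) ** 2).mean(dim=-1)
    recon = (recon * mask).sum() / mask.sum().clamp(min=1)
    return recon + F.mse_loss(pred_noise, true_noise)
```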
arXiv Detail & Related papers (2024-08-05T05:33:59Z)
- E2TIMT: Efficient and Effective Modal Adapter for Text Image Machine Translation [40.62692548291319]
Text image machine translation (TIMT) aims to translate texts embedded in images from one source language to another target language.
Existing methods, both two-stage cascade and one-stage end-to-end architectures, suffer from different issues.
We propose an end-to-end TIMT model that fully makes use of the knowledge in existing OCR and MT datasets.
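The summary only names a modal adapter bridging OCR and MT knowledge, so the following is a hypothetical sketch of such an adapter; dimensions, depth, and the projection design are assumptions, not E2TIMT's architecture.
```python
# Hypothetical modal adapter (not the E2TIMT implementation): projects features from a
# pre-trained OCR image encoder into the embedding space expected by an MT decoder.
import torch
import torch.nn as nn

class ModalAdapter(nn.Module):
    def __init__(self, ocr_dim=512, mt_dim=768, num_layers=2, num_heads=8):
        super().__init__()
        self.proj = nn.Linear(ocr_dim, mt_dim)
        layer = nn.TransformerEncoderLayer(d_model=mt_dim, nhead=num_heads, batch_first=True)
        self.refine = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, ocr_features):
        # ocr_features: (B, T, ocr_dim) sequence from the OCR encoder.
        return self.refine(self.proj(ocr_features))      # (B, T, mt_dim), fed to the MT decoder

adapter = ModalAdapter()
bridge = adapter(torch.randn(2, 50, 512))
```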
arXiv Detail & Related papers (2023-05-09T04:25:52Z)
- MDTv2: Masked Diffusion Transformer is a Strong Image Synthesizer [158.06850125920923]
Diffusion probabilistic models (DPMs) often lack the contextual reasoning ability to learn the relations among object parts in an image.
We propose a mask latent modeling scheme to explicitly enhance the DPMs' ability to learn contextual relations among object semantic parts in an image.
Experimental results show that MDTv2 achieves superior image synthesis performance, e.g., a new SOTA FID score of 1.58 on the ImageNet dataset, and has more than 10x faster learning speed than the previous SOTA DiT.
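A heavily simplified sketch of mask latent modeling during diffusion training follows; the denoiser interface, noise schedule, and the way masked tokens are hidden are placeholder assumptions rather than MDTv2's actual design.
```python
# Simplified illustration (not the MDTv2 code): a fraction of the noisy latent tokens is
# hidden from the denoiser, which must still predict the target for every token, forcing
# it to reason about context. `denoiser` and `noise_schedule` are placeholder callables.
import torch
import torch.nn.functional as F

def masked_diffusion_step(denoiser, z0, timesteps, noise_schedule, mask_ratio=0.3):
    """z0: clean latent tokens (B, N, D); noise_schedule(t) -> per-sample alpha_bar (B,)."""
    noise = torch.randn_like(z0)
    alpha_bar = noise_schedule(timesteps).view(-1, 1, 1)            # (B, 1, 1)
    z_t = alpha_bar.sqrt() * z0 + (1 - alpha_bar).sqrt() * noise    # standard forward process

    # Randomly mask latent tokens; the denoiser only sees the visible ones.
    keep = torch.rand(z0.shape[:2], device=z0.device) > mask_ratio  # (B, N), True = visible
    pred_noise = denoiser(z_t * keep.unsqueeze(-1), timesteps)      # predict noise for all tokens

    # The loss is computed on every token, masked ones included.
    return F.mse_loss(pred_noise, noise)
```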
arXiv Detail & Related papers (2023-03-25T07:47:21Z)
- CoMAE: Single Model Hybrid Pre-training on Small-Scale RGB-D Datasets [50.6643933702394]
We present a single-model self-supervised hybrid pre-training framework for RGB and depth modalities, termed CoMAE.
Our CoMAE presents a curriculum learning strategy to unify the two popular self-supervised representation learning algorithms: contrastive learning and masked image modeling.
arXiv Detail & Related papers (2023-02-13T07:09:45Z)
- FastMIM: Expediting Masked Image Modeling Pre-training for Vision [65.47756720190155]
FastMIM is a framework for pre-training vision backbones with low-resolution input images.
It reconstructs Histograms of Oriented Gradients (HOG) features instead of the original RGB values of the input images.
It can achieve 83.8%/84.1% top-1 accuracy on ImageNet-1K with ViT-B/Swin-B as backbones.
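A small sketch of the HOG-target idea as described is given below (illustrative, not the FastMIM code; patch size, HOG parameters, and the masking convention are assumptions).
```python
# Illustrative sketch (not the FastMIM code): build HOG regression targets for the patches of
# a low-resolution input and score the reconstruction with MSE on masked positions only.
import numpy as np
import torch
from skimage.feature import hog

def hog_targets(image, patch=16):
    """image: (H, W) grayscale numpy array; returns (num_patches, hog_dim) targets."""
    H, W = image.shape
    feats = []
    for i in range(0, H, patch):
        for j in range(0, W, patch):
            f = hog(image[i:i + patch, j:j + patch],
                    orientations=9, pixels_per_cell=(8, 8), cells_per_block=(1, 1))
            feats.append(f)
    return torch.tensor(np.stack(feats), dtype=torch.float32)

def masked_hog_loss(pred_hog, target_hog, mask):
    """pred_hog/target_hog: (B, N, hog_dim); mask: (B, N) bool, True = masked patch."""
    per_patch = ((pred_hog - target_hog) ** 2).mean(dim=-1)
    return (per_patch * mask).sum() / mask.sum().clamp(min=1)
```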
arXiv Detail & Related papers (2022-12-13T14:09:32Z)
- Multimodal Masked Autoencoders Learn Transferable Representations [127.35955819874063]
We propose a simple and scalable network architecture, the Multimodal Masked Autoencoder (M3AE).
M3AE learns a unified encoder for both vision and language data via masked token prediction.
We provide an empirical study of M3AE trained on a large-scale image-text dataset, and find that M3AE is able to learn generalizable representations that transfer well to downstream tasks.
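A toy sketch of the unified-encoder idea, with image patches and text tokens embedded into one sequence, masked, and predicted by a single shared encoder, is shown below; all dimensions and modules are placeholders, not M3AE's configuration.
```python
# Simplified illustration (not the M3AE code): one encoder over a joint image-text sequence,
# with per-modality heads that predict the masked entries.
import torch
import torch.nn as nn

class TinyM3AE(nn.Module):
    def __init__(self, vocab=30522, patch_dim=768, d=256):
        super().__init__()
        self.img_embed = nn.Linear(patch_dim, d)
        self.txt_embed = nn.Embedding(vocab, d)
        enc_layer = nn.TransformerEncoderLayer(d_model=d, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=4)
        self.img_head = nn.Linear(d, patch_dim)    # reconstruct masked image patches
        self.txt_head = nn.Linear(d, vocab)        # predict masked text tokens

    def forward(self, patches, token_ids, mask):
        # patches: (B, Np, patch_dim), token_ids: (B, Nt), mask: (B, Np+Nt) bool, True = masked.
        x = torch.cat([self.img_embed(patches), self.txt_embed(token_ids)], dim=1)
        x = x * (~mask).unsqueeze(-1)               # zero out masked positions (a simplification)
        h = self.encoder(x)
        n_img = patches.size(1)
        return self.img_head(h[:, :n_img]), self.txt_head(h[:, n_img:])
```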
arXiv Detail & Related papers (2022-05-27T19:09:42Z)
- MultiMAE: Multi-modal Multi-task Masked Autoencoders [2.6763498831034043]
We propose a pre-training strategy called Multi-modal Multi-task Masked Autoencoders (MultiMAE).
We show this pre-training strategy leads to a flexible, simple, and efficient framework with improved transfer results to downstream tasks.
arXiv Detail & Related papers (2022-04-04T17:50:41Z)
- Rethinking Coarse-to-Fine Approach in Single Image Deblurring [19.195704769925925]
We present a fast and accurate deblurring network design using a multi-input multi-output U-net.
The proposed network outperforms the state-of-the-art methods in terms of both accuracy and computational complexity.
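A common way to supervise a multi-input multi-output U-Net is a multi-scale reconstruction loss; the sketch below is a generic version of that idea, not the paper's exact formulation.
```python
# Illustrative multi-scale loss: each deblurred output is compared against the sharp
# ground truth downsampled to the matching resolution.
import torch
import torch.nn.functional as F

def multi_scale_l1(outputs, sharp):
    """outputs: list of (B, 3, h, w) predictions at different scales; sharp: (B, 3, H, W)."""
    loss = 0.0
    for pred in outputs:
        target = F.interpolate(sharp, size=pred.shape[-2:], mode='bilinear', align_corners=False)
        loss = loss + F.l1_loss(pred, target)
    return loss
```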
arXiv Detail & Related papers (2021-08-11T06:37:01Z)