TFormer: A throughout fusion transformer for multi-modal skin lesion
diagnosis
- URL: http://arxiv.org/abs/2211.11393v1
- Date: Mon, 21 Nov 2022 12:07:05 GMT
- Title: TFormer: A throughout fusion transformer for multi-modal skin lesion
diagnosis
- Authors: Yilan Zhang, Fengying Xie, Jianqi Chen, Jie Liu
- Abstract summary: We introduce a pure transformer-based method, which we refer to as the "Throughout Fusion Transformer (TFormer)", for sufficient information integration in MSLD.
We then carefully design a stack of dual-branch hierarchical multi-modal transformer (HMT) blocks to fuse information across different image modalities in a stage-by-stage way.
Our TFormer outperforms other state-of-the-art methods.
- Score: 6.899641625551976
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multi-modal skin lesion diagnosis (MSLD) has achieved remarkable
success with modern computer-aided diagnosis technology based on deep
convolutions. However, information aggregation across modalities in MSLD
remains challenging due to severely misaligned spatial resolutions
(dermoscopic image vs. clinical image) and heterogeneous data (dermoscopic
image vs. patients' meta-data). Limited by the intrinsically local receptive
fields of convolutions, most recent MSLD pipelines struggle to capture
representative features in shallow layers, so fusion across modalities is
usually performed at the end of the pipeline, even at the last layer, leading
to insufficient information aggregation. To tackle this issue, we introduce a
pure transformer-based method, which we refer to as the ``Throughout Fusion
Transformer (TFormer)'', for sufficient information integration in MSLD.
Unlike the existing convolution-based approaches, the proposed network uses a
transformer as the feature extraction backbone, yielding more representative
shallow features. We then carefully
design a stack of dual-branch hierarchical multi-modal transformer (HMT) blocks
to fuse information across different image modalities in a stage-by-stage way.
With the aggregated information of image modalities, a multi-modal transformer
post-fusion (MTP) block is designed to integrate features across image and
non-image data. This strategy of first fusing the image modalities and then
integrating the heterogeneous meta-data enables us to better divide and
conquer the two major challenges while ensuring that inter-modality dynamics
are effectively modeled. Experiments conducted on the public Derm7pt dataset
validate the superiority of the proposed method. Our TFormer outperforms other
state-of-the-art methods. Ablation experiments also suggest the effectiveness
of our designs.
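The fusion order described in the abstract (stage-by-stage fusion of the two image modalities, followed by post-fusion with the non-image meta-data) can be illustrated with a minimal PyTorch sketch. The module names, dimensions, and the cross-attention layout below are illustrative assumptions, not the authors' HMT/MTP implementation.

```python
# Minimal sketch of the two-step fusion strategy from the abstract:
# stage-wise fusion of two image modalities, then post-fusion with meta-data.
# All module names and shapes are illustrative assumptions.
import torch
import torch.nn as nn


class CrossModalBlock(nn.Module):
    """Dual-branch block: each branch attends to the other modality's tokens."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn_d = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_c = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_d = nn.LayerNorm(dim)
        self.norm_c = nn.LayerNorm(dim)

    def forward(self, derm, clin):
        # Dermoscopic tokens query clinical tokens, and vice versa.
        d, _ = self.attn_d(self.norm_d(derm), clin, clin)
        c, _ = self.attn_c(self.norm_c(clin), derm, derm)
        return derm + d, clin + c


class PostFusion(nn.Module):
    """Fuse pooled image features with non-image meta-data via self-attention."""
    def __init__(self, dim: int, meta_dim: int, num_classes: int):
        super().__init__()
        self.meta_proj = nn.Linear(meta_dim, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, derm_feat, clin_feat, meta):
        tokens = torch.stack(
            [derm_feat, clin_feat, self.meta_proj(meta)], dim=1)  # (B, 3, dim)
        fused = self.encoder(tokens).mean(dim=1)
        return self.head(fused)


class TwoStreamFusionNet(nn.Module):
    def __init__(self, dim=256, meta_dim=20, num_classes=5, stages=3):
        super().__init__()
        # Stand-ins for the transformer backbones that tokenize each image.
        self.embed_d = nn.Linear(dim, dim)
        self.embed_c = nn.Linear(dim, dim)
        self.stages = nn.ModuleList(CrossModalBlock(dim) for _ in range(stages))
        self.post = PostFusion(dim, meta_dim, num_classes)

    def forward(self, derm_tokens, clin_tokens, meta):
        derm = self.embed_d(derm_tokens)
        clin = self.embed_c(clin_tokens)
        for blk in self.stages:          # stage-by-stage image fusion
            derm, clin = blk(derm, clin)
        return self.post(derm.mean(1), clin.mean(1), meta)


if __name__ == "__main__":
    net = TwoStreamFusionNet()
    derm = torch.randn(2, 196, 256)   # tokenized dermoscopic image
    clin = torch.randn(2, 196, 256)   # tokenized clinical image
    meta = torch.randn(2, 20)         # patient meta-data vector
    print(net(derm, clin, meta).shape)  # torch.Size([2, 5])
```

In this sketch the pooled features of both image branches and the projected meta-data are treated as three tokens fed to a single transformer layer, mirroring the abstract's ordering: image modalities are fused first, and the heterogeneous non-image data is integrated afterwards.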
Related papers
- Unsupervised Modality Adaptation with Text-to-Image Diffusion Models for Semantic Segmentation [54.96563068182733]
We propose Modality Adaptation with text-to-image Diffusion Models (MADM) for semantic segmentation task.
MADM utilizes text-to-image diffusion models pre-trained on extensive image-text pairs to enhance the model's cross-modality capabilities.
We show that MADM achieves state-of-the-art adaptation performance across various modality tasks, including images to depth, infrared, and event modalities.
arXiv Detail & Related papers (2024-10-29T03:49:40Z)
- Fusion from Decomposition: A Self-Supervised Approach for Image Fusion and Beyond [74.96466744512992]
The essence of image fusion is to integrate complementary information from source images.
DeFusion++ produces versatile fused representations that can enhance the quality of image fusion and the effectiveness of downstream high-level vision tasks.
arXiv Detail & Related papers (2024-10-16T06:28:49Z)
- Simultaneous Tri-Modal Medical Image Fusion and Super-Resolution using Conditional Diffusion Model [2.507050016527729]
Tri-modal medical image fusion can provide a more comprehensive view of the disease's shape, location, and biological activity.
Due to the limitations of imaging equipment and considerations for patient safety, the quality of medical images is usually limited.
There is an urgent need for a technology that can both enhance image resolution and integrate multi-modal information.
arXiv Detail & Related papers (2024-04-26T12:13:41Z)
- AdaFuse: Adaptive Medical Image Fusion Based on Spatial-Frequential Cross Attention [6.910879180358217]
We propose AdaFuse, in which multimodal image information is fused adaptively through a frequency-guided attention mechanism.
The proposed method outperforms state-of-the-art methods in terms of both visual quality and quantitative metrics.
arXiv Detail & Related papers (2023-10-09T07:10:30Z)
- Equivariant Multi-Modality Image Fusion [124.11300001864579]
We propose the Equivariant Multi-Modality imAge fusion paradigm for end-to-end self-supervised learning.
Our approach is rooted in the prior knowledge that natural imaging responses are equivariant to certain transformations.
Experiments confirm that EMMA yields high-quality fusion results for infrared-visible and medical images.
arXiv Detail & Related papers (2023-05-19T05:50:24Z)
- DDFM: Denoising Diffusion Model for Multi-Modality Image Fusion [144.9653045465908]
We propose a novel fusion algorithm based on the denoising diffusion probabilistic model (DDPM).
Our approach yields promising fusion results in infrared-visible image fusion and medical image fusion.
arXiv Detail & Related papers (2023-03-13T04:06:42Z)
- TranSiam: Fusing Multimodal Visual Features Using Transformer for Medical Image Segmentation [4.777011444412729]
We propose a segmentation method suitable for multimodal medical images that can capture global information.
TranSiam is a 2D dual path network that extracts features of different modalities.
On the BraTS 2019 and BraTS 2020 multimodal datasets, TranSiam achieves a significant improvement in accuracy over other popular methods.
arXiv Detail & Related papers (2022-04-26T09:39:10Z)
- TransFusion: Multi-view Divergent Fusion for Medical Image Segmentation with Transformers [8.139069987207494]
We present TransFusion, a Transformer-based architecture to merge divergent multi-view imaging information using convolutional layers and powerful attention mechanisms.
In particular, the Divergent Fusion Attention (DiFA) module is proposed for rich cross-view context modeling and semantic dependency mining.
arXiv Detail & Related papers (2022-03-21T04:02:54Z)
- TransAttUnet: Multi-level Attention-guided U-Net with Transformer for Medical Image Segmentation [33.45471457058221]
This paper proposes a novel Transformer based medical image semantic segmentation framework called TransAttUnet.
In particular, we establish additional multi-scale skip connections between decoder blocks to aggregate the different semantic-scale upsampling features.
Our method consistently outperforms the state-of-the-art baselines.
arXiv Detail & Related papers (2021-07-12T09:17:06Z)
- Robust Multimodal Brain Tumor Segmentation via Feature Disentanglement and Gated Fusion [71.87627318863612]
We propose a novel multimodal segmentation framework which is robust to the absence of imaging modalities.
Our network uses feature disentanglement to decompose the input modalities into the modality-specific appearance code.
We validate our method on the important yet challenging multimodal brain tumor segmentation task with the BRATS challenge dataset.
arXiv Detail & Related papers (2020-02-22T14:32:04Z)
- Hi-Net: Hybrid-fusion Network for Multi-modal MR Image Synthesis [143.55901940771568]
We propose a novel Hybrid-fusion Network (Hi-Net) for multi-modal MR image synthesis.
In our Hi-Net, a modality-specific network is utilized to learn representations for each individual modality.
A multi-modal synthesis network is designed to densely combine the latent representation with hierarchical features from each modality.
arXiv Detail & Related papers (2020-02-11T08:26:42Z)