MMoT: Mixture-of-Modality-Tokens Transformer for Composed Multimodal
Conditional Image Synthesis
- URL: http://arxiv.org/abs/2305.05992v1
- Date: Wed, 10 May 2023 09:00:04 GMT
- Title: MMoT: Mixture-of-Modality-Tokens Transformer for Composed Multimodal
Conditional Image Synthesis
- Authors: Jianbin Zheng, Daqing Liu, Chaoyue Wang, Minghui Hu, Zuopeng Yang,
Changxing Ding, Dacheng Tao
- Abstract summary: We propose to generate images conditioned on the compositions of multimodal control signals.
We introduce a Mixture-of-Modality-Tokens Transformer (MMoT) that adaptively fuses fine-grained multimodal control signals.
- Score: 73.08923361242925
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Existing multimodal conditional image synthesis (MCIS) methods generate
images conditioned on arbitrary combinations of modalities, but they require all
of those conditions to conform to one another exactly, which hinders synthesis
controllability and leaves the potential of cross-modal control under-exploited. To this end, we
propose to generate images conditioned on the compositions of multimodal
control signals, where modalities are imperfectly complementary, i.e., composed
multimodal conditional image synthesis (CMCIS). Specifically, we observe two
challenging issues of the proposed CMCIS task, i.e., the modality coordination
problem and the modality imbalance problem. To tackle these issues, we
introduce a Mixture-of-Modality-Tokens Transformer (MMoT) that adaptively fuses
fine-grained multimodal control signals, a multimodal balanced training loss to
stabilize the optimization of each modality, and a multimodal sampling guidance
to balance the strength of each modality control signal. Comprehensive
experimental results demonstrate that MMoT achieves superior performance on
both unimodal conditional image synthesis (UCIS) and MCIS tasks with
high-quality and faithful image synthesis on complex multimodal conditions. The
project website is available at https://jabir-zheng.github.io/MMoT.
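The abstract names a multimodal sampling guidance but does not define it. The sketch below shows one plausible reading, assuming that MMoT, as a token-based transformer, applies a classifier-free-guidance-style correction in logit space with a separate weight per modality; the function name and the `weights` values are illustrative assumptions, not the paper's API:

```python
import numpy as np

def multimodal_guidance(logits_uncond, logits_per_modality, weights):
    """Hypothetical multimodal sampling guidance: start from the
    unconditional token prediction and add one separately weighted
    guidance term per modality, so the strength of each control
    signal can be balanced at sampling time."""
    out = logits_uncond.copy()
    for logits_m, w in zip(logits_per_modality, weights):
        out += w * (logits_m - logits_uncond)  # per-modality guidance term
    return out

# Toy usage: vocabulary of 5 image tokens, two control modalities
# (say, a segmentation map and a sketch), with the sketch weighted higher.
rng = np.random.default_rng(0)
lu = rng.normal(size=5)
lc = [rng.normal(size=5), rng.normal(size=5)]
print(multimodal_guidance(lu, lc, weights=[0.5, 1.5]))
```

Under this reading, setting a modality's weight to zero drops its control signal entirely, which is one way a single model could serve both the UCIS and MCIS settings mentioned above.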
Related papers
- Unified Brain MR-Ultrasound Synthesis using Multi-Modal Hierarchical Representations [34.821129614819604]
We introduce MHVAE, a deep hierarchical variational auto-encoder (VAE) that synthesizes missing images from various modalities.
Extending multi-modal VAEs with a hierarchical latent structure, we introduce a probabilistic formulation for fusing multi-modal images in a common latent representation.
Our model outperformed multi-modal VAEs, conditional GANs, and the current state-of-the-art unified method (ResViT) at synthesizing missing images.
arXiv Detail & Related papers (2023-09-15T20:21:03Z)
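The MHVAE summary above says modalities are fused in a common latent representation but not how. A standard mechanism in multi-modal VAEs is a product of Gaussian experts, sketched here as an assumption rather than the paper's actual formulation:

```python
import numpy as np

def poe_fuse(mus, logvars):
    """Product-of-Gaussian-experts fusion: combine per-modality
    posteriors N(mu_m, var_m) into one joint Gaussian posterior.
    Precisions add; the mean is precision-weighted."""
    precisions = [np.exp(-lv) for lv in logvars]
    prec = sum(precisions) + 1.0          # +1 for a standard-normal prior expert
    mu = sum(p * m for p, m in zip(precisions, mus)) / prec
    return mu, -np.log(prec)              # fused mean and log-variance

# Two "modalities" (e.g. MR and ultrasound encodings) of a 4-dim latent.
mus = [np.zeros(4), np.ones(4)]
logvars = [np.zeros(4), np.zeros(4)]
mu, logvar = poe_fuse(mus, logvars)
print(mu, np.exp(logvar))   # mean pulled between experts, variance shrinks
```

A convenient property of this fusion rule is that any subset of modalities can be dropped at test time: the product is simply taken over whichever experts are present, which matches the missing-image setting the entry describes.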
- A Study of Syntactic Multi-Modality in Non-Autoregressive Machine Translation [144.55713938260828]
It is difficult for non-autoregressive translation models to capture the multi-modal distribution of target translations.
We decompose this syntactic multi-modality into short- and long-range forms and evaluate several recent NAT algorithms with advanced loss functions.
We design a new loss function to better handle the complicated syntactic multi-modality in real-world datasets.
arXiv Detail & Related papers (2022-07-09T06:48:10Z)
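As a concrete illustration of the multi-modality problem the study above examines (the example is ours, not the paper's): when two reference translations are equally valid, a position-wise factorized model assigns just as much probability to invalid splices of the two as to either real translation.

```python
from itertools import product

valid = {("vielen", "dank"), ("danke", "schoen")}
marg0 = {"vielen": 0.5, "danke": 0.5}   # position-wise marginals when both
marg1 = {"dank": 0.5, "schoen": 0.5}    # references are equally likely

for w0, w1 in product(marg0, marg1):
    p = marg0[w0] * marg1[w1]           # factorized (position-independent) prob
    print(f"{w0} {w1}: p={p:.2f}  valid={(w0, w1) in valid}")
# The invalid splices "vielen schoen" and "danke dank" get the same
# probability (0.25) as the real translations: mode mixing.
```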
- A Novel Unified Conditional Score-based Generative Framework for Multi-modal Medical Image Completion [54.512440195060584]
We propose the Unified Multi-Modal Conditional Score-based Generative Model (UMM-CSGM) to take advantage of score-based generative models (SGMs).
UMM-CSGM employs a novel multi-in multi-out Conditional Score Network (mm-CSN) to learn a comprehensive set of cross-modal conditional distributions.
Experiments on the BraTS19 dataset show that UMM-CSGM more reliably synthesizes the heterogeneous enhancement and irregular areas of tumor-induced lesions.
arXiv Detail & Related papers (2022-07-07T16:57:21Z)
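The entry above does not spell out how the conditional score network is trained. A minimal denoising-score-matching step, with the observed modalities concatenated to the noisy target as conditioning input, is sketched below; the single linear layer stands in for the paper's mm-CSN, and all shapes are toy assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def score_net(x_noisy, cond, sigma, W):
    """Stand-in conditional score network: one linear layer over the
    noisy target concatenated with the conditioning modalities.
    (The paper's mm-CSN is a deep network; this is only a placeholder.)"""
    inp = np.concatenate([x_noisy, cond, [sigma]])
    return W @ inp

def dsm_loss(x0, cond, sigma, W):
    """One denoising-score-matching term: perturb the target modality,
    then regress the score of the perturbation kernel,
    grad_x log N(x | x0, sigma^2 I) = -(x - x0) / sigma^2."""
    noise = rng.normal(size=x0.shape)
    x_noisy = x0 + sigma * noise
    target = -noise / sigma                  # = -(x_noisy - x0) / sigma^2
    pred = score_net(x_noisy, cond, sigma, W)
    return np.mean((pred - target) ** 2) * sigma ** 2   # sigma^2 weighting

d = 16                                       # flattened image size (toy)
x0 = rng.normal(size=d)                      # missing modality to synthesize
cond = rng.normal(size=2 * d)                # two observed modalities
W = rng.normal(size=(d, 3 * d + 1)) * 0.01
print(dsm_loss(x0, cond, sigma=0.5, W=W))
```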
- One Model to Synthesize Them All: Multi-contrast Multi-scale Transformer for Missing Data Imputation [3.9207133968068684]
We formulate missing data imputation as a sequence-to-sequence learning problem.
We propose a multi-contrast multi-scale Transformer (MMT) which can take any subset of input contrasts and synthesize those that are missing.
MMT is inherently interpretable as it allows us to understand the importance of each input contrast in different regions.
arXiv Detail & Related papers (2022-04-28T18:49:27Z)
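What lets a single model like MMT "take any subset of input contrasts" is, plausibly, training over random input/output splits. The collation sketch below shows that idea; the contrast names and the helper are illustrative assumptions, not the paper's code:

```python
import numpy as np

rng = np.random.default_rng(0)
CONTRASTS = ["T1", "T2", "FLAIR", "T1Gd"]    # MR contrasts (names illustrative)

def make_training_example(volumes):
    """Random subset-to-subset split, so one model sees every
    input/output combination: sample a non-empty proper subset of
    contrasts as inputs and treat the rest as synthesis targets."""
    k = rng.integers(1, len(CONTRASTS))      # how many inputs this example
    inputs = list(rng.choice(CONTRASTS, size=k, replace=False))
    targets = [c for c in CONTRASTS if c not in inputs]
    return ({c: volumes[c] for c in inputs},
            {c: volumes[c] for c in targets})

volumes = {c: rng.normal(size=(8, 8)) for c in CONTRASTS}
xs, ys = make_training_example(volumes)
print("inputs:", sorted(xs), "-> targets:", sorted(ys))
```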
- UFC-BERT: Unifying Multi-Modal Controls for Conditional Image Synthesis [65.34414353024599]
Conditional image synthesis aims to create an image according to some multi-modal guidance.
We propose a new two-stage architecture, UFC-BERT, to unify any number of multi-modal controls.
arXiv Detail & Related papers (2021-05-29T04:42:07Z)
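A minimal sketch of the two-stage control unification described for UFC-BERT, assuming (the details here are ours, not the paper's) that each control is first discretized into tokens and a BERT-style transformer then fills in masked image-token slots:

```python
# Stage one is assumed to turn the image and each control signal into
# discrete token sequences (e.g. via a VQ-style tokenizer). Stage two
# feeds the concatenation of all control tokens plus masked image
# tokens to one transformer that predicts the masked positions.
MASK = -1

def build_stage2_input(control_tokens, num_image_tokens):
    """Unify any number of multi-modal controls by concatenation:
    [controls...] + [MASK]*num_image_tokens; the transformer then
    fills in the masked image-token positions."""
    seq = []
    for tokens in control_tokens.values():
        seq.extend(tokens)                 # each control already tokenized
    seq.extend([MASK] * num_image_tokens)  # slots the model must predict
    return seq

seq = build_stage2_input(
    {"text": [11, 42, 7], "sketch": [3, 3, 9, 1]}, num_image_tokens=5)
print(seq)   # [11, 42, 7, 3, 3, 9, 1, -1, -1, -1, -1, -1]
```

Concatenation makes the number of controls a property of the input sequence rather than of the architecture, which is how "any number of multi-modal controls" can share one model.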
- Multimodal Image Synthesis with Conditional Implicit Maximum Likelihood Estimation [54.17177006826262]
We develop a new generic conditional image synthesis method based on Implicit Maximum Likelihood Estimation (IMLE).
We demonstrate improved multimodal image synthesis performance on two tasks, single image super-resolution and image synthesis from scene layouts.
arXiv Detail & Related papers (2020-04-07T03:06:55Z)
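The IMLE objective behind the entry above is well documented: for each data point, draw several latent samples and pull only the nearest generated sample toward it, so every data point is covered by some sample and modes are not dropped. A toy unconditional version with a linear generator (a conditional variant would also feed the conditioning input to the generator):

```python
import numpy as np

rng = np.random.default_rng(0)

def imle_step(x, theta, m=8, lr=1e-2):
    """One toy IMLE step: sample m latents, find the generated sample
    nearest to the data point x, and take a gradient step that pulls
    only that nearest sample toward x."""
    zs = rng.normal(size=(m, theta.shape[1]))
    gens = zs @ theta.T                      # toy linear generator
    dists = np.sum((gens - x) ** 2, axis=1)
    j = int(np.argmin(dists))                # nearest generated sample
    grad = 2 * np.outer(gens[j] - x, zs[j])  # d/dtheta of ||theta z_j - x||^2
    return theta - lr * grad

theta = rng.normal(size=(2, 4)) * 0.1        # maps 4-dim latent -> 2-dim "image"
x = np.array([1.0, -1.0])
for _ in range(200):
    theta = imle_step(x, theta)
zs = rng.normal(size=(64, 4))
d = float(np.min(np.sum((zs @ theta.T - x) ** 2, axis=1)))
print(f"nearest generated sample distance: {d:.3f}")  # shrinks with training
```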
- Hi-Net: Hybrid-fusion Network for Multi-modal MR Image Synthesis [143.55901940771568]
We propose a novel Hybrid-fusion Network (Hi-Net) for multi-modal MR image synthesis.
In our Hi-Net, a modality-specific network is utilized to learn representations for each individual modality.
A multi-modal synthesis network is designed to densely combine the latent representation with hierarchical features from each modality.
arXiv Detail & Related papers (2020-02-11T08:26:42Z)
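A minimal sketch of the Hi-Net design as summarized above: one modality-specific encoder per input modality, and a fusion network that combines the resulting latents. The single-layer encoders and one-level fusion are simplifying assumptions; the paper describes deep networks with hierarchical, dense fusion:

```python
import numpy as np

rng = np.random.default_rng(0)

def encoder(x, W):
    """Stand-in modality-specific network (one linear layer + ReLU);
    Hi-Net uses a deep encoder per modality."""
    return np.maximum(0.0, W @ x)

def hybrid_fuse(latents, Wf):
    """Toy fusion: concatenate the modality-specific latents and mix
    them with a fusion layer. Hi-Net fuses densely across hierarchy
    levels; this shows only the single-level idea."""
    return np.maximum(0.0, Wf @ np.concatenate(latents))

d, h = 16, 8
x_t1, x_t2 = rng.normal(size=d), rng.normal(size=d)     # two MR modalities
W1, W2 = rng.normal(size=(h, d)), rng.normal(size=(h, d))
Wf = rng.normal(size=(h, 2 * h))
fused = hybrid_fuse([encoder(x_t1, W1), encoder(x_t2, W2)], Wf)
print(fused.shape)   # (8,): fed to a decoder to synthesize the target modality
```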