UFC-BERT: Unifying Multi-Modal Controls for Conditional Image Synthesis
- URL: http://arxiv.org/abs/2105.14211v1
- Date: Sat, 29 May 2021 04:42:07 GMT
- Title: UFC-BERT: Unifying Multi-Modal Controls for Conditional Image Synthesis
- Authors: Zhu Zhang, Jianxin Ma, Chang Zhou, Rui Men, Zhikang Li, Ming Ding, Jie
Tang, Jingren Zhou, and Hongxia Yang
- Abstract summary: Conditional image synthesis aims to create an image according to some multi-modal guidance.
We propose a new two-stage architecture, UFC-BERT, to unify any number of multi-modal controls.
- Score: 65.34414353024599
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Conditional image synthesis aims to create an image according to some
multi-modal guidance in the form of textual descriptions, reference images,
and image blocks to preserve, as well as their combinations. In this paper,
instead of investigating these control signals separately, we propose a new
two-stage architecture, UFC-BERT, to unify any number of multi-modal controls.
In UFC-BERT, both the diverse control signals and the synthesized image are
uniformly represented as a sequence of discrete tokens to be processed by a
Transformer. Unlike existing two-stage autoregressive approaches such
as DALL-E and VQGAN, UFC-BERT adopts non-autoregressive generation (NAR) at the
second stage to enhance the holistic consistency of the synthesized image, to
support preserving specified image blocks, and to improve the synthesis speed.
Further, we design a progressive algorithm that iteratively improves the
non-autoregressively generated image, with the help of two estimators developed
for evaluating the compliance with the controls and evaluating the fidelity of
the synthesized image, respectively. Extensive experiments on a newly collected
large-scale clothing dataset M2C-Fashion and a facial dataset Multi-Modal
CelebA-HQ verify that UFC-BERT can synthesize high-fidelity images that comply
with flexible multi-modal controls.
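The second stage described above lends itself to a brief illustration. Below is a minimal sketch of a mask-and-repredict decoding loop of the kind the abstract outlines; `nar_model`, `compliance_est`, `fidelity_est`, and `mask_id` are hypothetical stand-ins (none of these names come from the paper), and this is not the authors' released implementation:

```python
import torch

def progressive_nar_decode(nar_model, compliance_est, fidelity_est,
                           control_tokens, init_tokens, preserve_mask,
                           mask_id, num_iters=10):
    """Mask-and-repredict decoding in the spirit of UFC-BERT's second stage.

    nar_model(control, tokens)      -> (B, L, V) logits for every position
    compliance_est(control, tokens) -> (B,) control-compliance score
    fidelity_est(tokens)            -> (B,) image-fidelity score
    preserve_mask marks user-specified image blocks that must be kept.
    All call signatures are illustrative placeholders.
    """
    tokens = init_tokens.clone()          # (B, L) discrete VQ token indices
    best_tokens, best_score = tokens, None
    L = tokens.size(1)
    for step in range(num_iters):
        logits = nar_model(control_tokens, tokens)
        conf, proposal = torch.softmax(logits, dim=-1).max(dim=-1)  # (B, L)
        # Refresh every position except the preserved blocks.
        tokens = torch.where(preserve_mask, tokens, proposal)
        # Score the candidate with both estimators and keep the best one.
        score = compliance_est(control_tokens, tokens) + fidelity_est(tokens)
        if best_score is None or bool((score > best_score).all()):
            best_tokens, best_score = tokens.clone(), score
        # Re-mask a shrinking fraction of the least-confident positions so
        # the next pass can revise them with fuller bidirectional context.
        n_remask = int(L * (1.0 - (step + 1) / num_iters))
        if n_remask == 0:
            break
        conf = conf.masked_fill(preserve_mask, float("inf"))
        _, idx = conf.topk(n_remask, dim=-1, largest=False)
        tokens = tokens.scatter(1, idx, mask_id)
    return best_tokens
```

The shrinking re-mask schedule is what makes the refinement progressive: early passes revise large, low-confidence regions, later passes make increasingly local corrections, and the two estimators select the best candidate seen along the way.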
Related papers
- AnyControl: Create Your Artwork with Versatile Control on Text-to-Image Generation [24.07613591217345]
Linguistic control enables effective content creation but struggles with fine-grained control over image generation.
AnyControl develops a novel Multi-Control framework that extracts a unified multi-modal embedding to guide the generation process.
This approach enables a holistic understanding of user inputs, and produces high-quality, faithful results under versatile control signals.
arXiv Detail & Related papers (2024-06-27T07:40:59Z)
- STAR: Scale-wise Text-to-image generation via Auto-Regressive representations [40.66170627483643]
We present STAR, a text-to-image model that employs a scale-wise auto-regressive paradigm.
We show that STAR surpasses existing benchmarks in terms of fidelity, image-text consistency, and aesthetic quality.
arXiv Detail & Related papers (2024-06-16T03:45:45Z)
- Unified Frequency-Assisted Transformer Framework for Detecting and Grounding Multi-Modal Manipulation [109.1912721224697]
We present the Unified Frequency-Assisted transFormer framework, named UFAFormer, to address the DGM4 problem.
By leveraging the discrete wavelet transform, we decompose images into several frequency sub-bands, capturing rich face forgery artifacts.
Our proposed frequency encoder, incorporating intra-band and inter-band self-attentions, explicitly aggregates forgery features within and across diverse sub-bands.
arXiv Detail & Related papers (2023-09-18T11:06:42Z)
- Bi-Modality Medical Image Synthesis Using Semi-Supervised Sequential Generative Adversarial Networks [35.358653509217994]
We propose a bi-modality medical image synthesis approach based on a sequential generative adversarial network (GAN) and semi-supervised learning.
Our approach consists of two generative modules that synthesize images of the two modalities in sequential order.
Visual and quantitative results demonstrate the superiority of our method to the state-of-the-art methods.
arXiv Detail & Related papers (2023-08-27T10:39:33Z)
- MMoT: Mixture-of-Modality-Tokens Transformer for Composed Multimodal Conditional Image Synthesis [73.08923361242925]
We propose to generate images conditioned on the compositions of multimodal control signals.
We introduce a Mixture-of-Modality-Tokens Transformer (MMoT) that adaptively fuses fine-grained multimodal control signals.
arXiv Detail & Related papers (2023-05-10T09:00:04Z)
- IMAGINE: Image Synthesis by Image-Guided Model Inversion [79.4691654458141]
We introduce an inversion-based method, denoted as IMAge-Guided model INvErsion (IMAGINE), to generate high-quality and diverse images.
We leverage the knowledge of image semantics from a pre-trained classifier to achieve plausible generations.
IMAGINE enables the synthesis procedure to simultaneously 1) enforce semantic specificity constraints during the synthesis, 2) produce realistic images without generator training, and 3) give users intuitive control over the generation process.
arXiv Detail & Related papers (2021-04-13T02:00:24Z)
- TSIT: A Simple and Versatile Framework for Image-to-Image Translation [103.92203013154403]
We introduce a simple and versatile framework for image-to-image translation.
We provide a carefully designed two-stream generative model with newly proposed feature transformations.
This allows multi-scale semantic structure information and style representation to be effectively captured and fused by the network.
A systematic study compares the proposed method with several state-of-the-art task-specific baselines, verifying its effectiveness in both perceptual quality and quantitative evaluations.
arXiv Detail & Related papers (2020-07-23T15:34:06Z)
- Multimodal Image Synthesis with Conditional Implicit Maximum Likelihood Estimation [54.17177006826262]
We develop a new generic conditional image synthesis method based on Implicit Maximum Likelihood Estimation (IMLE).
We demonstrate improved multimodal image synthesis performance on two tasks, single-image super-resolution and image synthesis from scene layouts. A minimal sketch of the IMLE idea appears after this list.
arXiv Detail & Related papers (2020-04-07T03:06:55Z)
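As a companion to the IMLE entry above, here is a minimal sketch of one conditional-IMLE training step, assuming a hypothetical generator `G` with a `latent_dim` attribute (both assumptions, not the paper's code): for each target, several latents are sampled, and only the nearest generated sample is pulled toward the target, which encourages mode coverage without a discriminator.

```python
import torch

def imle_step(G, cond, target, optimizer, num_samples=8):
    """One conditional-IMLE training step (an illustrative sketch, not the
    paper's code). G(cond, z) is a hypothetical generator; for each target
    we draw several latents, keep the generated sample nearest to the
    target, and pull only that winner closer.
    """
    B = target.size(0)
    # Hypothetical: assume G exposes its latent dimensionality.
    zs = torch.randn(num_samples, B, G.latent_dim)
    with torch.no_grad():
        # Squared L2 distance of every candidate sample to its target.
        d = torch.stack(
            [((G(cond, z) - target) ** 2).flatten(1).sum(-1) for z in zs]
        )                                   # (num_samples, B)
        nearest = d.argmin(dim=0)           # winning latent per example
    z_star = zs[nearest, torch.arange(B)]   # (B, latent_dim)
    loss = ((G(cond, z_star) - target) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```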