Pretraining is All You Need for Image-to-Image Translation
- URL: http://arxiv.org/abs/2205.12952v1
- Date: Wed, 25 May 2022 17:58:26 GMT
- Title: Pretraining is All You Need for Image-to-Image Translation
- Authors: Tengfei Wang, Ting Zhang, Bo Zhang, Hao Ouyang, Dong Chen, Qifeng
Chen, Fang Wen
- Abstract summary: We propose to use pretraining to boost general image-to-image translation.
We show that the proposed pretraining-based image-to-image translation (PITI) is capable of synthesizing images of unprecedented realism and faithfulness.
- Score: 59.43151345732397
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose to use pretraining to boost general image-to-image translation.
Prior image-to-image translation methods usually need dedicated architectural
design and train individual translation models from scratch, and they struggle
to generate high-quality images of complex scenes, especially when paired
training data are scarce. In this paper, we regard each image-to-image translation
problem as a downstream task and introduce a simple and generic framework that
adapts a pretrained diffusion model to accommodate various kinds of
image-to-image translation. We also propose adversarial training to enhance
texture synthesis during diffusion model training, in conjunction with
normalized guidance sampling to improve generation quality. We present
extensive empirical comparisons across various tasks on challenging benchmarks
such as ADE20K, COCO-Stuff, and DIODE, showing that the proposed
pretraining-based image-to-image translation (PITI) is capable of synthesizing images of
unprecedented realism and faithfulness.
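The abstract mentions normalized guidance sampling without spelling out the procedure. As a rough, hedged illustration of the sampling side, the PyTorch sketch below applies classifier-free guidance to a conditional noise predictor and then rescales the guided prediction to the norm of the conditional one; the `TinyDenoiser` stand-in, the rescale-to-conditional-norm rule, and all names here are assumptions for illustration, not the authors' exact formulation.

```python
import torch
import torch.nn as nn

class TinyDenoiser(nn.Module):
    """Stand-in noise predictor; PITI adapts a large pretrained diffusion U-Net instead."""
    def __init__(self, channels=3):
        super().__init__()
        self.net = nn.Conv2d(channels * 2, channels, 3, padding=1)

    def forward(self, x_t, t, cond):
        if cond is None:                       # classifier-free "null" condition
            cond = torch.zeros_like(x_t)
        return self.net(torch.cat([x_t, cond], dim=1))

def normalized_guided_eps(model, x_t, t, cond, scale=3.0, eps=1e-8):
    """Classifier-free guidance followed by a per-sample norm-rescaling step (assumed form)."""
    e_cond = model(x_t, t, cond)               # condition-aware prediction
    e_uncond = model(x_t, t, None)             # condition-dropped prediction
    e_guided = e_uncond + scale * (e_cond - e_uncond)
    # Rescale the guided prediction to the norm of the conditional prediction,
    # which keeps large guidance scales from distorting the noise statistics.
    shape = (-1,) + (1,) * (x_t.dim() - 1)
    n_cond = e_cond.flatten(1).norm(dim=1).view(shape)
    n_guided = e_guided.flatten(1).norm(dim=1).view(shape)
    return e_guided * n_cond / (n_guided + eps)

model = TinyDenoiser()
x_t = torch.randn(2, 3, 64, 64)                # noisy samples at step t
cond = torch.randn(2, 3, 64, 64)               # e.g. an encoded segmentation map
t = torch.full((2,), 500)
print(normalized_guided_eps(model, x_t, t, cond).shape)   # torch.Size([2, 3, 64, 64])
```

The guided prediction would then be plugged into an ordinary DDPM/DDIM update; only the rescaling step distinguishes this sketch from plain classifier-free guidance.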
Related papers
- SUR-adapter: Enhancing Text-to-Image Pre-trained Diffusion Models with
Large Language Models [56.88192537044364]
We propose a simple-yet-effective parameter-efficient fine-tuning approach called the Semantic Understanding and Reasoning adapter (SUR-adapter) for pre-trained diffusion models.
Our approach makes text-to-image diffusion models easier to use and improves the user experience.
arXiv Detail & Related papers (2023-05-09T05:48:38Z) - Design Booster: A Text-Guided Diffusion Model for Image Translation with
Spatial Layout Preservation [12.365230063278625]
We propose a new approach for flexible image translation by learning a layout-aware image condition together with a text condition.
Our method co-encodes images and text into a new domain during the training phase.
Experimental comparisons with state-of-the-art methods demonstrate that our model performs best in both style image translation and semantic image translation.
arXiv Detail & Related papers (2023-02-05T02:47:13Z) - Photorealistic Text-to-Image Diffusion Models with Deep Language
Understanding [53.170767750244366]
Imagen is a text-to-image diffusion model with an unprecedented degree of photorealism and a deep level of language understanding.
To assess text-to-image models in greater depth, we introduce DrawBench, a comprehensive and challenging benchmark for text-to-image models.
arXiv Detail & Related papers (2022-05-23T17:42:53Z) - Unsupervised Image-to-Image Translation with Generative Prior [103.54337984566877]
Unsupervised image-to-image translation aims to learn the translation between two visual domains without paired data.
We present a novel framework, Generative Prior-guided UNsupervised Image-to-image Translation (GP-UNIT), to improve the overall quality and applicability of the translation algorithm.
arXiv Detail & Related papers (2022-04-07T17:59:23Z) - Deep Translation Prior: Test-time Training for Photorealistic Style
Transfer [36.82737412912885]
Recent techniques for photorealistic style transfer with deep convolutional neural networks (CNNs) generally require intensive training on large-scale datasets.
We propose a novel framework, dubbed Deep Translation Prior (DTP), to accomplish photorealistic style transfer through test-time training on a given input image pair with untrained networks.
arXiv Detail & Related papers (2021-12-12T04:54:27Z) - LAFITE: Towards Language-Free Training for Text-to-Image Generation [83.2935513540494]
We present the first method for training text-to-image generation models without any text data.
Our method leverages the well-aligned multi-modal semantic space of the powerful pre-trained CLIP model.
We obtain state-of-the-art results in the standard text-to-image generation tasks.
arXiv Detail & Related papers (2021-11-27T01:54:45Z) - StEP: Style-based Encoder Pre-training for Multi-modal Image Synthesis [68.3787368024951]
We propose a novel approach for multi-modal Image-to-image (I2I) translation.
We learn a latent embedding, jointly with the generator, that models the variability of the output domain.
Specifically, we pre-train a generic style encoder using a novel proxy task to learn an embedding of images, from arbitrary domains, into a low-dimensional style latent space.
arXiv Detail & Related papers (2021-04-14T19:58:24Z)