Pretraining is All You Need for Image-to-Image Translation
- URL: http://arxiv.org/abs/2205.12952v1
- Date: Wed, 25 May 2022 17:58:26 GMT
- Title: Pretraining is All You Need for Image-to-Image Translation
- Authors: Tengfei Wang, Ting Zhang, Bo Zhang, Hao Ouyang, Dong Chen, Qifeng
Chen, Fang Wen
- Abstract summary: We propose to use pretraining to boost general image-to-image translation.
We show that the proposed pretraining-based image-to-image translation (PITI) is capable of synthesizing images of unprecedented realism and faithfulness.
- Score: 59.43151345732397
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose to use pretraining to boost general image-to-image translation.
Prior image-to-image translation methods usually need dedicated architectural
design and train individual translation models from scratch, and they struggle to
generate complex scenes with high quality, especially when paired training data
are scarce. In this paper, we regard each image-to-image translation
problem as a downstream task and introduce a simple and generic framework that
adapts a pretrained diffusion model to accommodate various kinds of
image-to-image translation. We also propose adversarial training to enhance
texture synthesis during diffusion model training, in conjunction with
normalized guidance sampling to improve generation quality. We present
extensive empirical comparison across various tasks on challenging benchmarks
such as ADE20K, COCO-Stuff, and DIODE, showing that the proposed pretraining-based
image-to-image translation (PITI) is capable of synthesizing images of
unprecedented realism and faithfulness.
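The abstract describes normalized guidance sampling only at a high level. As a rough, hedged illustration, the Python sketch below rescales a standard classifier-free-guided noise prediction so it keeps the norm of the conditional prediction; the function name, tensor shapes, and the exact normalization rule are assumptions for illustration, not the paper's specification.

```python
import torch

def normalized_cfg(eps_cond: torch.Tensor,
                   eps_uncond: torch.Tensor,
                   guidance_scale: float) -> torch.Tensor:
    """Hypothetical norm-rescaled classifier-free guidance (a sketch).

    Assumes eps_cond / eps_uncond are the denoiser's noise predictions
    with and without the conditioning input, shaped [B, C, H, W].
    """
    # standard classifier-free guidance combination
    eps_guided = eps_uncond + guidance_scale * (eps_cond - eps_uncond)

    # rescale each sample so the guided prediction keeps the norm of the
    # conditional prediction, counteracting the norm inflation that large
    # guidance scales tend to cause
    cond_norm = eps_cond.flatten(1).norm(dim=1)
    guided_norm = eps_guided.flatten(1).norm(dim=1).clamp_min(1e-8)
    scale = (cond_norm / guided_norm).view(-1, 1, 1, 1)
    return eps_guided * scale
```

At sampling time, the returned prediction would replace the raw guided noise estimate in the usual DDPM/DDIM update step.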
Related papers
- Ensuring Consistency for In-Image Translation [47.1986912570945]
The in-image machine translation task involves translating text embedded within images, with the translated results presented in image format.
We propose the need to uphold two types of consistency in this task: translation consistency and image generation consistency.
We introduce a novel two-stage framework named HCIIT: text-image translation with a multimodal multilingual large language model in the first stage, followed by image backfilling with a diffusion model in the second stage.
arXiv Detail & Related papers (2024-12-24T03:50:03Z)
- Design Booster: A Text-Guided Diffusion Model for Image Translation with Spatial Layout Preservation [12.365230063278625]
We propose a new approach for flexible image translation by learning a layout-aware image condition together with a text condition.
Our method co-encodes images and text into a new domain during the training phase.
Experimental comparisons with state-of-the-art methods demonstrate that our model performs best in both style image translation and semantic image translation.
arXiv Detail & Related papers (2023-02-05T02:47:13Z)
- Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding [53.170767750244366]
Imagen is a text-to-image diffusion model with an unprecedented degree of photorealism and a deep level of language understanding.
To assess text-to-image models in greater depth, we introduce DrawBench, a comprehensive and challenging benchmark for text-to-image models.
arXiv Detail & Related papers (2022-05-23T17:42:53Z)
- Unsupervised Image-to-Image Translation with Generative Prior [103.54337984566877]
Unsupervised image-to-image translation aims to learn the translation between two visual domains without paired data.
We present a novel framework, Generative Prior-guided UNsupervised Image-to-image Translation (GP-UNIT), to improve the overall quality and applicability of the translation algorithm.
arXiv Detail & Related papers (2022-04-07T17:59:23Z)
- Deep Translation Prior: Test-time Training for Photorealistic Style Transfer [36.82737412912885]
Recent techniques to solve photorealistic style transfer within deep convolutional neural networks (CNNs) generally require intensive training from large-scale datasets.
We propose a novel framework, dubbed Deep Translation Prior (DTP), to accomplish photorealistic style transfer through test-time training on a given input image pair with untrained networks.
arXiv Detail & Related papers (2021-12-12T04:54:27Z)
- LAFITE: Towards Language-Free Training for Text-to-Image Generation [83.2935513540494]
We propose the first method to train text-to-image generation models without any text data.
Our method leverages the well-aligned multi-modal semantic space of the powerful pre-trained CLIP model.
We obtain state-of-the-art results in the standard text-to-image generation tasks.
arXiv Detail & Related papers (2021-11-27T01:54:45Z)
- StEP: Style-based Encoder Pre-training for Multi-modal Image Synthesis [68.3787368024951]
We propose a novel approach for multi-modal image-to-image (I2I) translation.
We learn a latent embedding, jointly with the generator, that models the variability of the output domain.
Specifically, we pre-train a generic style encoder using a novel proxy task to learn an embedding of images, from arbitrary domains, into a low-dimensional style latent space.
arXiv Detail & Related papers (2021-04-14T19:58:24Z)