StEP: Style-based Encoder Pre-training for Multi-modal Image Synthesis
- URL: http://arxiv.org/abs/2104.07098v1
- Date: Wed, 14 Apr 2021 19:58:24 GMT
- Title: StEP: Style-based Encoder Pre-training for Multi-modal Image Synthesis
- Authors: Moustafa Meshry, Yixuan Ren, Larry S Davis, Abhinav Shrivastava
- Abstract summary: We propose a novel approach for multi-modal Image-to-image (I2I) translation.
We learn a latent embedding, jointly with the generator, that models the variability of the output domain.
Specifically, we pre-train a generic style encoder using a novel proxy task to learn an embedding of images, from arbitrary domains, into a low-dimensional style latent space.
- Score: 68.3787368024951
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a novel approach for multi-modal Image-to-image (I2I) translation.
To tackle the one-to-many relationship between input and output domains,
previous works use complex training objectives to learn a latent embedding,
jointly with the generator, that models the variability of the output domain.
In contrast, we directly model the style variability of images, independent of
the image synthesis task. Specifically, we pre-train a generic style encoder
using a novel proxy task to learn an embedding of images, from arbitrary
domains, into a low-dimensional style latent space. The learned latent space
introduces several advantages over traditional approaches to
multi-modal I2I translation. First, it is not dependent on the target dataset,
and generalizes well across multiple domains. Second, it learns a more powerful
and expressive latent space, which improves the fidelity of style capture and
transfer. The proposed style pre-training also simplifies the training
objective and speeds up the training significantly. Furthermore, we provide a
detailed study of the contribution of different loss terms to the task of
multi-modal I2I translation, and propose a simple alternative to VAEs to enable
sampling from unconstrained latent spaces. Finally, we achieve state-of-the-art
results on six challenging benchmarks with a simple training objective that
includes only a GAN loss and a reconstruction loss.
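The abstract describes the final training setup only at a high level: a generic style encoder is pre-trained on a proxy task and then kept fixed, the generator is trained with just a GAN loss and a reconstruction loss, and a simple (unspecified) alternative to VAEs allows sampling from the unconstrained style space. The PyTorch sketch below is a minimal, hedged illustration of that pipeline; the encoder architecture, the triplet-style stand-in for the proxy task, the conditional generator/discriminator interfaces, the non-saturating GAN loss variant, the weight lambda_rec, and the Gaussian fit used for sampling are all assumptions made for illustration, not details taken from the paper.

```python
# Minimal sketch of the pipeline described in the abstract.
# All architecture, loss-variant, and hyperparameter choices below are
# assumptions made for illustration; they are not taken from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F


class StyleEncoder(nn.Module):
    """Maps an image to a low-dimensional style code (8-D here, an assumption)."""
    def __init__(self, style_dim=8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, 4, 2, 1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 4, 2, 1), nn.ReLU(inplace=True),
            nn.Conv2d(128, 256, 4, 2, 1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(256, style_dim)

    def forward(self, x):
        return self.fc(self.net(x).flatten(1))


def proxy_pretrain_step(E, opt_E, anchor, positive, negative, margin=0.2):
    """Style-encoder pre-training step. The abstract only calls this a 'novel
    proxy task'; a triplet objective over style-similar / style-dissimilar
    images is used here purely as a stand-in example."""
    loss = F.triplet_margin_loss(E(anchor), E(positive), E(negative), margin=margin)
    opt_E.zero_grad(); loss.backward(); opt_E.step()
    return loss.item()


def generator_step(G, D, E_style, x_in, y_real, opt_G, lambda_rec=10.0):
    """Generator update using only a GAN loss and a reconstruction loss,
    with style codes taken from the frozen, pre-trained encoder."""
    with torch.no_grad():                            # encoder stays frozen
        z_style = E_style(y_real)
    y_fake = G(x_in, z_style)                        # G and D are assumed conditional modules
    loss_gan = F.softplus(-D(x_in, y_fake)).mean()   # non-saturating GAN loss (assumed variant)
    loss_rec = F.l1_loss(y_fake, y_real)             # reconstruct the ground-truth output
    loss = loss_gan + lambda_rec * loss_rec
    opt_G.zero_grad(); loss.backward(); opt_G.step()
    return loss.item()


@torch.no_grad()
def fit_style_prior(E_style, images):
    """The abstract mentions a 'simple alternative to VAEs' for sampling the
    unconstrained style space without giving details; fitting a diagonal
    Gaussian to training-set style codes is shown here only as one possibility."""
    codes = E_style(images)
    mu, sigma = codes.mean(0), codes.std(0)
    return lambda n: mu + sigma * torch.randn(n, mu.numel())
```

At test time, multi-modal outputs would come from G(x_in, z) with z either encoded from a reference style image or drawn from the fitted prior, which is consistent with the sampling role the abstract assigns to its VAE alternative.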
Related papers
- I2I-Galip: Unsupervised Medical Image Translation Using Generative Adversarial CLIP [30.506544165999564]
Unpaired image-to-image translation is a challenging task due to the absence of paired examples.
We propose a new image-to-image translation framework named Image-to-Image-Generative-Adversarial-CLIP (I2I-Galip).
arXiv Detail & Related papers (2024-09-19T01:44:50Z) - SCONE-GAN: Semantic Contrastive learning-based Generative Adversarial
Network for an end-to-end image translation [18.93434486338439]
SCONE-GAN is shown to be effective for learning to generate realistic and diverse scenery images.
For more realistic and diverse image generation, we introduce a style reference image.
We validate the proposed algorithm for image-to-image translation and stylizing outdoor images.
arXiv Detail & Related papers (2023-11-07T10:29:16Z) - Break-A-Scene: Extracting Multiple Concepts from a Single Image [80.47666266017207]
We introduce the task of textual scene decomposition.
We propose augmenting the input image with masks that indicate the presence of target concepts.
We then present a novel two-phase customization process.
arXiv Detail & Related papers (2023-05-25T17:59:04Z) - Pretraining is All You Need for Image-to-Image Translation [59.43151345732397]
We propose to use pretraining to boost general image-to-image translation.
We show that the proposed pretraining-based image-to-image translation (PITI) is capable of synthesizing images of unprecedented realism and faithfulness.
arXiv Detail & Related papers (2022-05-25T17:58:26Z) - Multimodal Knowledge Alignment with Reinforcement Learning [103.68816413817372]
ESPER extends language-only zero-shot models to unseen multimodal tasks, like image and audio captioning.
Our key novelty is to use reinforcement learning to align multimodal inputs to language model generations without direct supervision.
Experiments demonstrate that ESPER outperforms baselines and prior work on a variety of zero-shot tasks.
arXiv Detail & Related papers (2022-05-25T10:12:17Z) - Unsupervised Multi-Modal Medical Image Registration via
Discriminator-Free Image-to-Image Translation [4.43142018105102]
We propose a novel translation-based unsupervised deformable image registration approach to convert the multi-modal registration problem to a mono-modal one.
Our approach incorporates a discriminator-free translation network to facilitate the training of the registration network and a patchwise contrastive loss to encourage the translation network to preserve object shapes.
arXiv Detail & Related papers (2022-04-28T17:18:21Z) - Unsupervised Image-to-Image Translation with Generative Prior [103.54337984566877]
Unsupervised image-to-image translation aims to learn the translation between two visual domains without paired data.
We present a novel framework, Generative Prior-guided UNsupervised Image-to-image Translation (GP-UNIT), to improve the overall quality and applicability of the translation algorithm.
arXiv Detail & Related papers (2022-04-07T17:59:23Z) - UFO: A UniFied TransfOrmer for Vision-Language Representation Learning [54.82482779792115]
We propose a single UniFied transfOrmer (UFO) capable of processing either unimodal inputs (e.g., image or language) or multimodal inputs (e.g., the concatenation of the image and the question) for vision-language (VL) representation learning.
Existing approaches typically design an individual network for each modality and/or a specific fusion network for multimodal tasks.
arXiv Detail & Related papers (2021-11-19T03:23:10Z) - One-Shot Generative Domain Adaptation [39.17324951275831]
This work aims at transferring a Generative Adversarial Network (GAN) pre-trained on one image domain to a new domain, using as few as a single target image as reference.
arXiv Detail & Related papers (2021-11-18T18:55:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.