FlexIT: Towards Flexible Semantic Image Translation
- URL: http://arxiv.org/abs/2203.04705v1
- Date: Wed, 9 Mar 2022 13:34:38 GMT
- Title: FlexIT: Towards Flexible Semantic Image Translation
- Authors: Guillaume Couairon and Asya Grechka and Jakob Verbeek and Holger
Schwenk and Matthieu Cord
- Abstract summary: We propose FlexIT, a novel method which can take any input image and a user-defined text instruction for editing.
First, FlexIT combines the input image and text into a single target point in the CLIP multimodal embedding space.
We iteratively transform the input image toward the target point, ensuring coherence and quality with a variety of novel regularization terms.
- Score: 59.09398209706869
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Deep generative models, like GANs, have considerably improved the state of
the art in image synthesis, and are able to generate near photo-realistic
images in structured domains such as human faces. Based on this success, recent
work on image editing proceeds by projecting images to the GAN latent space and
manipulating the latent vector. However, these approaches are limited in that
only images from a narrow domain can be transformed, and with only a limited
number of editing operations. We propose FlexIT, a novel method which can take
any input image and a user-defined text instruction for editing. Our method
achieves flexible and natural editing, pushing the limits of semantic image
translation. First, FlexIT combines the input image and text into a single
target point in the CLIP multimodal embedding space. Via the latent space of an
auto-encoder, we iteratively transform the input image toward the target point,
ensuring coherence and quality with a variety of novel regularization terms. We
propose an evaluation protocol for semantic image translation, and thoroughly
evaluate our method on ImageNet. Code will be made publicly available.
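  For concreteness, the core loop described in the abstract can be sketched as CLIP-guided optimization in an autoencoder latent space. The snippet below is a minimal illustration, not the authors' implementation: the toy autoencoder stands in for the pretrained VQGAN used by FlexIT, the equal-weight combination of image and text embeddings and the single L2 latent regularizer are illustrative assumptions, and the real method uses several additional regularization terms.

```python
# Minimal sketch of CLIP-guided latent optimization in the spirit of FlexIT.
# Assumptions: ToyAutoencoder stands in for the pretrained VQGAN, the 0.5/0.5
# embedding mix and the single L2 regularizer are illustrative, and CLIP's
# channel normalization is omitted for brevity.
import torch
import torch.nn.functional as F
import clip  # https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/32", device=device)
clip_model = clip_model.float()  # keep fp32 so gradients flow cleanly

class ToyAutoencoder(torch.nn.Module):
    """Stand-in for the pretrained image autoencoder (a VQGAN in the paper)."""
    def __init__(self):
        super().__init__()
        self.enc = torch.nn.Conv2d(3, 8, 3, stride=2, padding=1)
        self.dec = torch.nn.ConvTranspose2d(8, 3, 4, stride=2, padding=1)
    def encode(self, x):
        return self.enc(x)
    def decode(self, z):
        return torch.sigmoid(self.dec(z))

def to_clip_input(x):
    # Resize to CLIP's expected resolution; channel normalization omitted.
    return F.interpolate(x, size=(224, 224), mode="bilinear", align_corners=False)

def edit(image, text, steps=100, lr=0.05, lam_reg=0.3):
    """image: float tensor of shape (1, 3, H, W) in [0, 1] on `device`."""
    ae = ToyAutoencoder().to(device)
    # 1) Combine image and text into a single target point in CLIP space.
    with torch.no_grad():
        img_emb = F.normalize(clip_model.encode_image(to_clip_input(image)), dim=-1)
        txt_emb = F.normalize(clip_model.encode_text(clip.tokenize([text]).to(device)), dim=-1)
        target = F.normalize(0.5 * img_emb + 0.5 * txt_emb, dim=-1)
    # 2) Iteratively move the latent so the decoded image approaches the target,
    #    while a regularizer keeps it close to the source latent.
    z = ae.encode(image).detach().clone().requires_grad_(True)
    z0 = z.detach().clone()
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        out = ae.decode(z)
        emb = F.normalize(clip_model.encode_image(to_clip_input(out)), dim=-1)
        clip_loss = 1.0 - (emb * target).sum(dim=-1).mean()  # cosine distance to target
        reg = F.mse_loss(z, z0)                              # stay close to the input
        loss = clip_loss + lam_reg * reg
        opt.zero_grad(); loss.backward(); opt.step()
    return ae.decode(z).detach()
```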
Related papers
- Zero-shot Text-driven Physically Interpretable Face Editing [29.32334174584623]
This paper proposes a novel and physically interpretable method for face editing based on arbitrary text prompts.
Our method can generate physically interpretable face editing results with high identity consistency and image quality.
arXiv Detail & Related papers (2023-08-11T07:20:24Z)
- CLIP-Guided StyleGAN Inversion for Text-Driven Real Image Editing [22.40686064568406]
We present CLIPInverter, a new text-driven image editing approach that is able to efficiently and reliably perform multi-attribute changes.
Our method outperforms competing approaches in terms of manipulation accuracy and photo-realism on various domains including human faces, cats, and birds.
arXiv Detail & Related papers (2023-07-17T11:29:48Z)
- Towards Arbitrary Text-driven Image Manipulation via Space Alignment [49.3370305074319]
We propose a new Text-driven image Manipulation framework via Space Alignment (TMSA)
TMSA aims to align the same semantic regions in CLIP and StyleGAN spaces.
The framework can support arbitrary image editing modes without additional cost.
arXiv Detail & Related papers (2023-01-25T16:20:01Z)
- LDEdit: Towards Generalized Text Guided Image Manipulation via Latent Diffusion Models [12.06277444740134]
Generic image manipulation using a single model with flexible text inputs is highly desirable.
Recent work addresses this task by guiding generative models trained on generic images using pretrained vision-language encoders.
We propose an optimization-free method for the task of generic image manipulation from text prompts.
arXiv Detail & Related papers (2022-10-05T13:26:15Z)
- ManiTrans: Entity-Level Text-Guided Image Manipulation via Token-wise Semantic Alignment and Generation [97.36550187238177]
We study a novel task on text-guided image manipulation on the entity level in the real world.
The task imposes three basic requirements: (1) edit the entity consistently with the text description, (2) preserve the text-irrelevant regions, and (3) merge the manipulated entity into the image naturally.
Our framework incorporates a semantic alignment module to locate the image regions to be manipulated, and a semantic loss to help align vision and language.
arXiv Detail & Related papers (2022-04-09T09:01:19Z)
- Towards Open-World Text-Guided Face Image Generation and Manipulation [52.83401421019309]
We propose a unified framework for both face image generation and manipulation.
Our method supports open-world scenarios, including both image and text, without any re-training, fine-tuning, or post-processing.
arXiv Detail & Related papers (2021-04-18T16:56:07Z)
- StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery [71.1862388442953]
We develop a text-based interface for StyleGAN image manipulation.
We first introduce an optimization scheme that utilizes a CLIP-based loss to modify an input latent vector in response to a user-provided text prompt.
Next, we describe a latent mapper that infers a text-guided latent manipulation step for a given input image, allowing faster and more stable text-based manipulation.
arXiv Detail & Related papers (2021-03-31T17:51:25Z)
- In-Domain GAN Inversion for Real Image Editing [56.924323432048304]
A common practice for editing a real image with a trained GAN generator is to first invert the image back to a latent code.
Existing inversion methods typically focus on reconstructing the target image by pixel values yet fail to land the inverted code in the semantic domain of the original latent space.
We propose an in-domain GAN inversion approach, which faithfully reconstructs the input image and ensures the inverted code is semantically meaningful for editing; a generic latent-optimization sketch of this inversion step follows this list.
arXiv Detail & Related papers (2020-03-31T18:20:18Z)
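To complement the In-Domain GAN Inversion entry above, here is a generic latent-optimization sketch of GAN inversion, the step such editing pipelines rely on. It is not that paper's method, which additionally constrains the inverted code to remain semantically meaningful in the original latent space: the toy generator is a stand-in for a pretrained GAN, and real implementations typically start from a pretrained StyleGAN and add a perceptual term (e.g. LPIPS) on top of the pixel loss shown here.

```python
# Generic latent-optimization sketch of GAN inversion (illustrative only).
# Assumptions: ToyGenerator stands in for a pretrained GAN generator, and the
# pixel-wise MSE is the only loss; real pipelines add perceptual/semantic terms.
import torch
import torch.nn.functional as F

class ToyGenerator(torch.nn.Module):
    """Stand-in for a pretrained GAN generator mapping latents to images."""
    def __init__(self, latent_dim=128, image_size=64):
        super().__init__()
        self.fc = torch.nn.Linear(latent_dim, 3 * image_size * image_size)
        self.image_size = image_size
    def forward(self, z):
        x = self.fc(z).view(-1, 3, self.image_size, self.image_size)
        return torch.sigmoid(x)

def invert(generator, target, latent_dim=128, steps=500, lr=0.1):
    """Optimize a latent code whose generated image reconstructs `target`."""
    z = torch.randn(target.shape[0], latent_dim, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        recon = generator(z)
        loss = F.mse_loss(recon, target)  # pixel reconstruction loss only
        opt.zero_grad(); loss.backward(); opt.step()
    return z.detach()

if __name__ == "__main__":
    G = ToyGenerator()
    x = torch.rand(1, 3, 64, 64)           # pretend "real" image
    z_inv = invert(G, x)
    print(F.mse_loss(G(z_inv), x).item())  # reconstruction error after inversion
```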