One Model to Edit Them All: Free-Form Text-Driven Image Manipulation
with Semantic Modulations
- URL: http://arxiv.org/abs/2210.07883v2
- Date: Mon, 17 Oct 2022 06:11:48 GMT
- Title: One Model to Edit Them All: Free-Form Text-Driven Image Manipulation
with Semantic Modulations
- Authors: Yiming Zhu and Hongyu Liu and Yibing Song and Ziyang Yuan and Xintong
Han and Chun Yuan and Qifeng Chen and Jue Wang
- Abstract summary: Free-Form CLIP aims to establish an automatic latent mapping so that one manipulation model handles free-form text prompts.
For one type of image (e.g., `human portrait'), one FFCLIP model can be learned to handle free-form text prompts.
Both visual and numerical results show that FFCLIP effectively produces semantically accurate and visually realistic images.
- Score: 75.81725681546071
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Free-form text prompts allow users to describe their intentions during image
manipulation conveniently. Based on the visual latent space of StyleGAN[21] and
text embedding space of CLIP[34], studies focus on how to map these two latent
spaces for text-driven attribute manipulations. Currently, the latent mapping
between these two spaces is empirically designed and confines each manipulation
model to handling only one fixed text prompt. In this paper, we
propose a method named Free-Form CLIP (FFCLIP), aiming to establish an
automatic latent mapping so that one manipulation model handles free-form text
prompts. Our FFCLIP has a cross-modality semantic modulation module containing
semantic alignment and injection. The semantic alignment performs the automatic
latent mapping via linear transformations with a cross attention mechanism.
After alignment, we inject semantics from text prompt embeddings to the
StyleGAN latent space. For one type of image (e.g., `human portrait'), one
FFCLIP model can be learned to handle free-form text prompts. Meanwhile, we
observe that although each training text prompt only contains a single semantic
meaning, FFCLIP can leverage text prompts with multiple semantic meanings for
image manipulation. In the experiments, we evaluate FFCLIP on three types of
images (i.e., `human portraits', `cars', and `churches'). Both visual and
numerical results show that FFCLIP effectively produces semantically accurate
and visually realistic images. Project page:
https://github.com/KumapowerLIU/FFCLIP.
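To make the cross-modality semantic modulation concrete, below is a minimal PyTorch sketch of the semantic alignment and injection idea described in the abstract. It assumes StyleGAN W+ latent codes of shape (batch, 18, 512) and 512-dimensional CLIP text embeddings; all module and variable names are illustrative assumptions, not the authors' implementation (see the project page for the official code).

```python
# Minimal sketch (not the official FFCLIP code): cross-attention aligns a
# CLIP text embedding with the layers of a StyleGAN W+ latent code, then the
# aligned text semantics are injected as an additive offset on the latent.
import torch
import torch.nn as nn


class SemanticModulation(nn.Module):
    def __init__(self, latent_dim: int = 512, text_dim: int = 512):
        super().__init__()
        self.to_q = nn.Linear(text_dim, latent_dim)      # query from the text prompt
        self.to_k = nn.Linear(latent_dim, latent_dim)    # keys from the W+ layers
        self.to_v = nn.Linear(text_dim, latent_dim)      # value carrying text semantics
        self.inject = nn.Linear(latent_dim, latent_dim)  # projection before injection

    def forward(self, w_plus: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # w_plus: (B, 18, 512) StyleGAN W+ code; text_emb: (B, 512) CLIP text feature
        q = self.to_q(text_emb).unsqueeze(1)             # (B, 1, 512)
        k = self.to_k(w_plus)                            # (B, 18, 512)
        v = self.to_v(text_emb).unsqueeze(1)             # (B, 1, 512)
        # Attention weights decide how strongly each W+ layer receives the semantics.
        attn = torch.softmax(q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5, dim=-1)
        delta = attn.transpose(-2, -1) @ v               # (B, 18, 512)
        return w_plus + self.inject(delta)               # edited latent code


# Usage: w_edit = SemanticModulation()(w_plus, clip_text_feat); decoding w_edit
# with a pretrained StyleGAN generator yields the manipulated image.
```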
Related papers
- DeltaSpace: A Semantic-aligned Feature Space for Flexible Text-guided
Image Editing [22.354236929932476]
Text-guided image editing faces significant challenges in training and inference flexibility.
We propose a novel framework called DeltaEdit, which maps the CLIP visual feature differences to the latent space directions of a generative model.
Experiments validate the effectiveness and versatility of DeltaEdit with different generative models.
arXiv Detail & Related papers (2023-10-12T15:43:12Z)
- Entity-Level Text-Guided Image Manipulation [70.81648416508867]
We study a novel task of text-guided image manipulation at the entity level in the real world (eL-TGIM).
We propose an elegant framework, dubbed SeMani, which performs Semantic Manipulation of real-world images.
In the semantic alignment phase, SeMani incorporates a semantic alignment module to locate the entity-relevant region to be manipulated.
In the image manipulation phase, SeMani adopts a generative model to synthesize new images conditioned on the entity-irrelevant regions and target text descriptions.
arXiv Detail & Related papers (2023-02-22T13:56:23Z)
- Towards Arbitrary Text-driven Image Manipulation via Space Alignment [49.3370305074319]
We propose a new Text-driven image Manipulation framework via Space Alignment (TMSA).
TMSA aims to align the same semantic regions in CLIP and StyleGAN spaces.
The framework can support arbitrary image editing modes without additional cost.
arXiv Detail & Related papers (2023-01-25T16:20:01Z)
- ManiTrans: Entity-Level Text-Guided Image Manipulation via Token-wise Semantic Alignment and Generation [97.36550187238177]
We study a novel task of text-guided image manipulation at the entity level in the real world.
The task imposes three basic requirements: (1) edit the entity consistently with the text descriptions, (2) preserve the text-irrelevant regions, and (3) merge the manipulated entity into the image naturally.
Our framework incorporates a semantic alignment module to locate the image regions to be manipulated, and a semantic loss to help align the relationship between vision and language.
arXiv Detail & Related papers (2022-04-09T09:01:19Z)
- FlexIT: Towards Flexible Semantic Image Translation [59.09398209706869]
We propose FlexIT, a novel method which can take any input image and a user-defined text instruction for editing.
First, FlexIT combines the input image and text into a single target point in the CLIP multimodal embedding space.
We iteratively transform the input image toward the target point, ensuring coherence and quality with a variety of novel regularization terms.
arXiv Detail & Related papers (2022-03-09T13:34:38Z)
- CLIP-NeRF: Text-and-Image Driven Manipulation of Neural Radiance Fields [33.43993665841577]
We present CLIP-NeRF, a multi-modal 3D object manipulation method for neural radiance fields (NeRF).
We propose a unified framework that allows manipulating NeRF in a user-friendly way, using either a short text prompt or an exemplar image.
We evaluate our approach by extensive experiments on a variety of text prompts and exemplar images.
arXiv Detail & Related papers (2021-12-09T18:59:55Z)
- StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery [71.1862388442953]
We develop a text-based interface for StyleGAN image manipulation.
We first introduce an optimization scheme that utilizes a CLIP-based loss to modify an input latent vector in response to a user-provided text prompt.
Next, we describe a latent mapper that infers a text-guided latent manipulation step for a given input image, allowing faster and more stable text-based manipulation.
arXiv Detail & Related papers (2021-03-31T17:51:25Z)
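For contrast with FFCLIP's single learned mapping, the per-prompt optimization scheme described in the StyleCLIP summary above can be sketched as follows: a StyleGAN latent is updated by gradient descent on a CLIP-based loss, with an L2 term keeping the edit close to the source latent. The `generator` and `clip_model` arguments are assumed pretrained models, and CLIP image preprocessing and the identity-preservation loss used in the paper are omitted; this is only an illustrative sketch, not the official code.

```python
# Illustrative sketch of CLIP-loss latent optimization (not the official StyleCLIP code).
import torch
import torch.nn.functional as F
import clip  # OpenAI CLIP package, assumed to be installed


def optimize_latent(generator, clip_model, w_init, prompt, steps=200, lr=0.01, lam=0.005):
    device = w_init.device
    with torch.no_grad():
        text_feat = clip_model.encode_text(clip.tokenize([prompt]).to(device)).float()
        text_feat = F.normalize(text_feat, dim=-1)

    w = w_init.clone().requires_grad_(True)
    opt = torch.optim.Adam([w], lr=lr)
    for _ in range(steps):
        img = generator(w)                                   # assumed to map the latent to an image
        img = F.interpolate(img, size=224, mode="bilinear")  # resize for the CLIP encoder
        img_feat = F.normalize(clip_model.encode_image(img).float(), dim=-1)
        clip_loss = 1.0 - (img_feat * text_feat).sum(dim=-1).mean()  # cosine distance to the prompt
        reg = lam * ((w - w_init) ** 2).sum()                # stay close to the source latent
        loss = clip_loss + reg
        opt.zero_grad()
        loss.backward()
        opt.step()
    return w.detach()
```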