Towards Arbitrary Text-driven Image Manipulation via Space Alignment
- URL: http://arxiv.org/abs/2301.10670v3
- Date: Thu, 21 Sep 2023 03:14:18 GMT
- Title: Towards Arbitrary Text-driven Image Manipulation via Space Alignment
- Authors: Yunpeng Bai, Zihan Zhong, Chao Dong, Weichen Zhang, Guowei Xu, Chun Yuan
- Abstract summary: We propose a new Text-driven image Manipulation framework via Space Alignment (TMSA).
TMSA aims to align the same semantic regions in CLIP and StyleGAN spaces.
The framework can support arbitrary image editing modes without additional cost.
- Score: 49.3370305074319
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The recent GAN inversion methods have been able to successfully invert the
real image input to the corresponding editable latent code in StyleGAN. By
combining them with the vision-language model CLIP, several text-driven image
manipulation methods have been proposed. However, these methods incur extra cost,
as optimization must be performed for each image or each new attribute editing mode. To
achieve a more efficient editing method, we propose a new Text-driven image
Manipulation framework via Space Alignment (TMSA). The Space Alignment module
aims to align the same semantic regions in CLIP and StyleGAN spaces. Then, the
text input can be mapped directly into the StyleGAN space and used to find the
semantic shift corresponding to the text description. The framework supports
arbitrary image editing modes without additional cost. Our work provides the
user with an interface to control the attributes of a given image according to
text input and get the result in real time. Extensive experiments demonstrate
our superior performance over prior works.
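To make the idea concrete, below is a minimal, illustrative sketch of a TMSA-style edit, not the authors' released code: a small alignment MLP maps a CLIP text embedding to a shift in StyleGAN's W+ space, and that shift is added to the inverted latent of the input image. The module names, dimensions, and the stand-in tensors for the CLIP embedding and the inverted latent are assumptions for illustration only.

```python
# Minimal, illustrative sketch of a TMSA-style edit (hypothetical, not the authors' code).
# Assumed setup: a 512-d CLIP text embedding, a StyleGAN2 W+ latent of shape
# (18, 512), and a small alignment MLP that maps the CLIP-space input to a
# StyleGAN-space shift.
import torch
import torch.nn as nn

CLIP_DIM, W_DIM, NUM_LAYERS = 512, 512, 18  # typical StyleGAN2 W+ dimensions


class SpaceAlignment(nn.Module):
    """Maps a CLIP text embedding to a per-layer shift in StyleGAN W+ space."""

    def __init__(self):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(CLIP_DIM, 1024), nn.LeakyReLU(0.2),
            nn.Linear(1024, NUM_LAYERS * W_DIM),
        )

    def forward(self, text_embedding: torch.Tensor) -> torch.Tensor:
        shift = self.mlp(text_embedding)            # (B, 18 * 512)
        return shift.view(-1, NUM_LAYERS, W_DIM)    # (B, 18, 512)


def edit_latent(w_plus: torch.Tensor, text_embedding: torch.Tensor,
                align: SpaceAlignment, strength: float = 1.0) -> torch.Tensor:
    """Apply a text-driven semantic shift directly in W+, with no per-image optimization."""
    return w_plus + strength * align(text_embedding)


# Usage with stand-in tensors; a real pipeline would obtain `text_embedding`
# from a CLIP text encoder and `w_plus` from a GAN-inversion encoder.
align = SpaceAlignment()
w_plus = torch.randn(1, NUM_LAYERS, W_DIM)     # inverted latent of a real image
text_embedding = torch.randn(1, CLIP_DIM)      # CLIP embedding of an edit prompt
edited_w = edit_latent(w_plus, text_embedding, align, strength=0.8)
print(edited_w.shape)  # torch.Size([1, 18, 512]), ready for the StyleGAN synthesis network
```

Because the alignment network is trained once, a new text prompt at inference time costs only a single forward pass, which is what allows arbitrary attribute edits in real time without per-image or per-attribute optimization.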
Related papers
- Text-Driven Image Editing via Learnable Regions [74.45313434129005]
We introduce a method for region-based image editing driven by textual prompts, without the need for user-provided masks or sketches.
We show that this simple approach enables flexible editing that is compatible with current image generation models.
Experiments demonstrate the competitive performance of our method in manipulating images with high fidelity and realism that correspond to the provided language descriptions.
arXiv Detail & Related papers (2023-11-28T02:27:31Z) - Robust Text-driven Image Editing Method that Adaptively Explores
Directions in Latent Spaces of StyleGAN and CLIP [10.187432367590201]
A pioneering work in text-driven image editing, StyleCLIP, finds an edit direction in the CLIP space and then edits the image by mapping the direction to the StyleGAN space.
At the same time, inputs other than the original image and the text instruction are difficult to tune appropriately for image editing.
We propose a method to construct the edit direction adaptively in the StyleGAN and CLIP spaces with SVM.
arXiv Detail & Related papers (2023-04-03T13:30:48Z) - One Model to Edit Them All: Free-Form Text-Driven Image Manipulation
with Semantic Modulations [75.81725681546071]
Free-Form CLIP aims to establish an automatic latent mapping so that one manipulation model handles free-form text prompts.
For one type of image (e.g., 'human portrait'), one FFCLIP model can be learned to handle free-form text prompts.
Both visual and numerical results show that FFCLIP effectively produces semantically accurate and visually realistic images.
arXiv Detail & Related papers (2022-10-14T15:06:05Z) - LDEdit: Towards Generalized Text Guided Image Manipulation via Latent
Diffusion Models [12.06277444740134]
Generic image manipulation using a single model with flexible text inputs is highly desirable.
Recent work addresses this task by guiding generative models trained on generic images using pretrained vision-language encoders.
We propose an optimization-free method for the task of generic image manipulation from text prompts.
arXiv Detail & Related papers (2022-10-05T13:26:15Z) - ManiTrans: Entity-Level Text-Guided Image Manipulation via Token-wise
Semantic Alignment and Generation [97.36550187238177]
We study a novel task on text-guided image manipulation on the entity level in the real world.
The task imposes three basic requirements, (1) to edit the entity consistent with the text descriptions, (2) to preserve the text-irrelevant regions, and (3) to merge the manipulated entity into the image naturally.
Our framework incorporates a semantic alignment module to locate the image regions to be manipulated, and a semantic loss to help align the relationship between vision and language.
arXiv Detail & Related papers (2022-04-09T09:01:19Z) - FlexIT: Towards Flexible Semantic Image Translation [59.09398209706869]
We propose FlexIT, a novel method which can take any input image and a user-defined text instruction for editing.
First, FlexIT combines the input image and text into a single target point in the CLIP multimodal embedding space.
We iteratively transform the input image toward the target point, ensuring coherence and quality with a variety of novel regularization terms.
arXiv Detail & Related papers (2022-03-09T13:34:38Z) - StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery [71.1862388442953]
We develop a text-based interface for StyleGAN image manipulation.
We first introduce an optimization scheme that utilizes a CLIP-based loss to modify an input latent vector in response to a user-provided text prompt (a rough sketch of this kind of CLIP-guided latent optimization appears after this list).
Next, we describe a latent mapper that infers a text-guided latent manipulation step for a given input image, allowing faster and more stable text-based manipulation.
arXiv Detail & Related papers (2021-03-31T17:51:25Z)