Entity-Level Text-Guided Image Manipulation
- URL: http://arxiv.org/abs/2302.11383v1
- Date: Wed, 22 Feb 2023 13:56:23 GMT
- Title: Entity-Level Text-Guided Image Manipulation
- Authors: Yikai Wang, Jianan Wang, Guansong Lu, Hang Xu, Zhenguo Li, Wei Zhang,
and Yanwei Fu
- Abstract summary: We study a novel task on text-guided image manipulation on the entity level in the real world (eL-TGIM)
We propose an elegant framework, dubbed as SeMani, forming the Semantic Manipulation of real-world images.
In the semantic alignment phase, SeMani incorporates a semantic alignment module to locate the entity-relevant region to be manipulated.
In the image manipulation phase, SeMani adopts a generative model to synthesize new images conditioned on the entity-irrelevant regions and target text descriptions.
- Score: 70.81648416508867
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing text-guided image manipulation methods aim to modify the appearance
of the image or to edit a few objects in a virtual or simple scenario, which is
far from practical applications. In this work, we study a novel task on
text-guided image manipulation on the entity level in the real world (eL-TGIM).
The task imposes three basic requirements, (1) to edit the entity consistent
with the text descriptions, (2) to preserve the entity-irrelevant regions, and
(3) to merge the manipulated entity into the image naturally. To this end, we
propose an elegant framework, dubbed as SeMani, forming the Semantic
Manipulation of real-world images that can not only edit the appearance of
entities but also generate new entities corresponding to the text guidance. To
solve eL-TGIM, SeMani decomposes the task into two phases: the semantic
alignment phase and the image manipulation phase. In the semantic alignment
phase, SeMani incorporates a semantic alignment module to locate the
entity-relevant region to be manipulated. In the image manipulation phase,
SeMani adopts a generative model to synthesize new images conditioned on the
entity-irrelevant regions and target text descriptions. We discuss and propose
two popular generation processes that can be utilized in SeMani, the discrete
auto-regressive generation with transformers and the continuous denoising
generation with diffusion models, yielding SeMani-Trans and SeMani-Diff,
respectively. We conduct extensive experiments on the real datasets CUB,
Oxford, and COCO datasets to verify that SeMani can distinguish the
entity-relevant and -irrelevant regions and achieve more precise and flexible
manipulation in a zero-shot manner compared with baseline methods. Our codes
and models will be released at https://github.com/Yikai-Wang/SeMani.
Related papers
- Latent Space Disentanglement in Diffusion Transformers Enables Precise Zero-shot Semantic Editing [4.948910649137149]
Diffusion Transformers (DiTs) have recently achieved remarkable success in text-guided image generation.
We show how multimodal information collectively forms this joint space and how they guide the semantics of the synthesized images.
We propose a simple yet effective Encode-Identify-Manipulate (EIM) framework for zero-shot fine-grained image editing.
arXiv Detail & Related papers (2024-11-12T21:34:30Z) - Latent Space Disentanglement in Diffusion Transformers Enables Zero-shot Fine-grained Semantic Editing [4.948910649137149]
Diffusion Transformers (DiTs) have achieved remarkable success in diverse and high-quality text-to-image(T2I) generation.
We investigate how text and image latents individually and jointly contribute to the semantics of generated images.
We propose a simple and effective Extract-Manipulate-Sample framework for zero-shot fine-grained image editing.
arXiv Detail & Related papers (2024-08-23T19:00:52Z) - Exploring Stroke-Level Modifications for Scene Text Editing [86.33216648792964]
Scene text editing (STE) aims to replace text with the desired one while preserving background and styles of the original text.
Previous methods of editing the whole image have to learn different translation rules of background and text regions simultaneously.
We propose a novel network by MOdifying Scene Text image at strokE Level (MOSTEL)
arXiv Detail & Related papers (2022-12-05T02:10:59Z) - Interactive Image Manipulation with Complex Text Instructions [14.329411711887115]
We propose a novel image manipulation method that interactively edits an image using complex text instructions.
It allows users to not only improve the accuracy of image manipulation but also achieve complex tasks such as enlarging, dwindling, or removing objects.
Extensive experiments on the Caltech-UCSD Birds-200-2011 (CUB) dataset and Microsoft Common Objects in Context (MS COCO) datasets demonstrate our proposed method can enable interactive, flexible, and accurate image manipulation in real-time.
arXiv Detail & Related papers (2022-11-25T08:05:52Z) - ManiTrans: Entity-Level Text-Guided Image Manipulation via Token-wise
Semantic Alignment and Generation [97.36550187238177]
We study a novel task on text-guided image manipulation on the entity level in the real world.
The task imposes three basic requirements, (1) to edit the entity consistent with the text descriptions, (2) to preserve the text-irrelevant regions, and (3) to merge the manipulated entity into the image naturally.
Our framework incorporates a semantic alignment module to locate the image regions to be manipulated, and a semantic loss to help align the relationship between the vision and language.
arXiv Detail & Related papers (2022-04-09T09:01:19Z) - FlexIT: Towards Flexible Semantic Image Translation [59.09398209706869]
We propose FlexIT, a novel method which can take any input image and a user-defined text instruction for editing.
First, FlexIT combines the input image and text into a single target point in the CLIP multimodal embedding space.
We iteratively transform the input image toward the target point, ensuring coherence and quality with a variety of novel regularization terms.
arXiv Detail & Related papers (2022-03-09T13:34:38Z) - TediGAN: Text-Guided Diverse Face Image Generation and Manipulation [52.83401421019309]
TediGAN is a framework for multi-modal image generation and manipulation with textual descriptions.
StyleGAN inversion module maps real images to the latent space of a well-trained StyleGAN.
visual-linguistic similarity learns the text-image matching by mapping the image and text into a common embedding space.
instance-level optimization is for identity preservation in manipulation.
arXiv Detail & Related papers (2020-12-06T16:20:19Z) - SESAME: Semantic Editing of Scenes by Adding, Manipulating or Erasing
Objects [127.7627687126465]
SESAME is a novel generator-discriminator pair for Semantic Editing of Scenes by Adding, Manipulating or Erasing objects.
In our setup, the user provides the semantic labels of the areas to be edited and the generator synthesizes the corresponding pixels.
We evaluate our model on a diverse set of datasets and report state-of-the-art performance on two tasks.
arXiv Detail & Related papers (2020-04-10T10:19:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.