ManiTrans: Entity-Level Text-Guided Image Manipulation via Token-wise
Semantic Alignment and Generation
- URL: http://arxiv.org/abs/2204.04428v1
- Date: Sat, 9 Apr 2022 09:01:19 GMT
- Title: ManiTrans: Entity-Level Text-Guided Image Manipulation via Token-wise
Semantic Alignment and Generation
- Authors: Jianan Wang, Guansong Lu, Hang Xu, Zhenguo Li, Chunjing Xu and Yanwei
Fu
- Abstract summary: We study a novel task of text-guided image manipulation at the entity level in the real world.
The task imposes three basic requirements: (1) edit the entity consistently with the text description, (2) preserve the text-irrelevant regions, and (3) merge the manipulated entity into the image naturally.
Our framework incorporates a semantic alignment module to locate the image regions to be manipulated and a semantic loss to help align vision and language.
- Score: 97.36550187238177
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing text-guided image manipulation methods aim to modify the appearance
of the image or to edit a few objects in a virtual or simple scenario, which is
far from practical application. In this work, we study a novel task of
text-guided image manipulation at the entity level in the real world. The task
imposes three basic requirements: (1) to edit the entity consistently with the
text description, (2) to preserve the text-irrelevant regions, and (3) to
merge the manipulated entity into the image naturally. To this end, we propose
a new transformer-based framework built on the two-stage image synthesis
method, namely ManiTrans, which can not only edit the appearance of
entities but also generate new entities corresponding to the text guidance. Our
framework incorporates a semantic alignment module to locate the image regions
to be manipulated, and a semantic loss to help align vision and language. We
conduct extensive experiments on the real-world CUB, Oxford, and COCO datasets
to verify that our method can distinguish the relevant from the irrelevant
regions and achieve more precise and flexible manipulation than baseline
methods. The project homepage is https://jawang19.github.io/manitrans.
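The semantic alignment idea can be illustrated with a small sketch (this is not the authors' code; the token grid size, embedding dimension, and cosine-similarity threshold below are illustrative assumptions): each quantized image token is compared with every text token, and image tokens whose best match exceeds a threshold are treated as the entity-relevant region to be re-generated, while the rest are preserved.

```python
# Illustrative sketch of token-wise semantic alignment (not the authors' code).
# Assumptions: the image is already quantized into a 16x16 grid of token
# embeddings, the text is encoded into per-token embeddings of the same
# dimension, and a simple cosine-similarity threshold decides which image
# tokens are "entity-relevant".
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def entity_relevance_mask(image_tokens, text_tokens, threshold=0.3):
    """image_tokens: (H*W, D) quantized image-token embeddings;
    text_tokens: (T, D) word embeddings.
    Returns a boolean mask of shape (H*W,) marking tokens to regenerate."""
    img = l2_normalize(image_tokens)   # (N, D)
    txt = l2_normalize(text_tokens)    # (T, D)
    sim = img @ txt.T                  # (N, T) cosine similarities
    relevance = sim.max(axis=1)        # best-matching text token per image token
    return relevance > threshold       # True = manipulate, False = preserve

# Toy usage: a 16x16 token grid with 256-d embeddings and a 5-word description.
rng = np.random.default_rng(0)
image_tokens = rng.normal(size=(16 * 16, 256))
text_tokens = rng.normal(size=(5, 256))
mask = entity_relevance_mask(image_tokens, text_tokens)
print(f"{mask.sum()} of {mask.size} image tokens would be re-generated")
```

In the two-stage pipeline described in the abstract, the masked tokens would then be re-synthesized conditioned on the text and the preserved tokens; the sketch only shows the localization step.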
Related papers
- Entity-Level Text-Guided Image Manipulation [70.81648416508867]
We study a novel task of text-guided image manipulation at the entity level in the real world (eL-TGIM).
We propose an elegant framework, dubbed SeMani, for Semantic Manipulation of real-world images.
In the semantic alignment phase, SeMani incorporates a semantic alignment module to locate the entity-relevant region to be manipulated.
In the image manipulation phase, SeMani adopts a generative model to synthesize new images conditioned on the entity-irrelevant regions and target text descriptions.
arXiv Detail & Related papers (2023-02-22T13:56:23Z)
- Towards Arbitrary Text-driven Image Manipulation via Space Alignment [49.3370305074319]
We propose a new Text-driven image Manipulation framework via Space Alignment (TMSA).
TMSA aims to align the same semantic regions in CLIP and StyleGAN spaces.
The framework can support arbitrary image editing mode without additional cost.
arXiv Detail & Related papers (2023-01-25T16:20:01Z)
- Exploring Stroke-Level Modifications for Scene Text Editing [86.33216648792964]
Scene text editing (STE) aims to replace the text in an image with the desired text while preserving the background and the style of the original text.
Previous methods that edit the whole image have to learn different translation rules for background and text regions simultaneously.
We propose a novel network for MOdifying Scene Text images at the strokE Level (MOSTEL).
arXiv Detail & Related papers (2022-12-05T02:10:59Z)
- Interactive Image Manipulation with Complex Text Instructions [14.329411711887115]
We propose a novel image manipulation method that interactively edits an image using complex text instructions.
It allows users not only to improve the accuracy of image manipulation but also to achieve complex tasks such as enlarging, shrinking, or removing objects.
Extensive experiments on the Caltech-UCSD Birds-200-2011 (CUB) and Microsoft Common Objects in Context (MS COCO) datasets demonstrate that our proposed method enables interactive, flexible, and accurate image manipulation in real time.
arXiv Detail & Related papers (2022-11-25T08:05:52Z)
- FlexIT: Towards Flexible Semantic Image Translation [59.09398209706869]
We propose FlexIT, a novel method which can take any input image and a user-defined text instruction for editing.
First, FlexIT combines the input image and text into a single target point in the CLIP multimodal embedding space.
We iteratively transform the input image toward the target point, ensuring coherence and quality with a variety of novel regularization terms.
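The optimize-toward-a-target-embedding idea can be sketched as follows (this is not the FlexIT implementation: the toy linear encoders stand in for CLIP's image and text encoders, a single L2 term stands in for the paper's regularization terms, and the blend weight and learning rate are illustrative assumptions):

```python
# Minimal sketch of moving an image toward a joint image/text target embedding.
# Not the FlexIT implementation: ToyEncoder stands in for CLIP-style encoders,
# and one L2 term stands in for the paper's regularizers.
import torch
import torch.nn.functional as F

class ToyEncoder(torch.nn.Module):
    """Placeholder for an encoder mapping inputs to a shared embedding space."""
    def __init__(self, in_dim, emb_dim=64):
        super().__init__()
        self.proj = torch.nn.Linear(in_dim, emb_dim)
    def forward(self, x):
        return F.normalize(self.proj(x), dim=-1)

image_enc = ToyEncoder(in_dim=3 * 32 * 32)   # flattened 32x32 RGB "image"
text_enc = ToyEncoder(in_dim=128)            # a 128-d "text instruction" vector

x0 = torch.randn(1, 3 * 32 * 32)             # input image (flattened)
t = torch.randn(1, 128)                      # text instruction features

# Build the target as a normalized blend of the two embeddings
# (the 0.4 / 0.6 blend weight is an assumption for illustration).
with torch.no_grad():
    target = F.normalize(0.4 * image_enc(x0) + 0.6 * text_enc(t), dim=-1)

x = x0.clone().requires_grad_(True)
opt = torch.optim.Adam([x], lr=0.05)
for step in range(200):
    opt.zero_grad()
    loss = (1 - F.cosine_similarity(image_enc(x), target).mean()  # move toward target
            + 0.1 * F.mse_loss(x, x0))                            # stay close to the input
    loss.backward()
    opt.step()
print(f"final loss: {loss.item():.4f}")
```

The sketch only conveys how a single target point is built from the image and text and then approached iteratively; the variety of regularization terms mentioned in the abstract is reduced here to one proximity penalty.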
arXiv Detail & Related papers (2022-03-09T13:34:38Z)
- Text as Neural Operator: Image Manipulation by Text Instruction [68.53181621741632]
In this paper, we study a setting that allows users to edit an image with multiple objects using complex text instructions to add, remove, or change the objects.
The inputs of the task are multimodal, including (1) a reference image and (2) a natural-language instruction that describes the desired modifications to the image.
We show that the proposed model performs favorably against recent strong baselines on three public datasets.
arXiv Detail & Related papers (2020-08-11T07:07:10Z)