Interactive Image Manipulation with Complex Text Instructions
- URL: http://arxiv.org/abs/2211.15352v1
- Date: Fri, 25 Nov 2022 08:05:52 GMT
- Title: Interactive Image Manipulation with Complex Text Instructions
- Authors: Ryugo Morita, Zhiqiang Zhang, Man M. Ho, Jinjia Zhou
- Abstract summary: We propose a novel image manipulation method that interactively edits an image using complex text instructions.
It allows users not only to improve the accuracy of image manipulation but also to perform complex tasks such as enlarging, shrinking, or removing objects.
Extensive experiments on the Caltech-UCSD Birds-200-2011 (CUB) and Microsoft Common Objects in Context (MS COCO) datasets demonstrate that the proposed method enables interactive, flexible, and accurate image manipulation in real time.
- Score: 14.329411711887115
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, text-guided image manipulation has received increasing attention in
the research field of multimedia processing and computer vision due to its high
flexibility and controllability. Its goal is to semantically manipulate parts
of an input reference image according to the text descriptions. However, most
existing works suffer from the following problems: (1) text-irrelevant content
is not always preserved and may be changed at random, (2) the performance of
image manipulation still needs further improvement, and (3) only descriptive
attributes can be manipulated. To solve these problems, we propose a novel
image manipulation method that interactively edits an image using complex text
instructions. It allows users not only to improve the accuracy of image
manipulation but also to perform complex tasks such as enlarging, shrinking, or
removing objects and replacing the background of the input image.
To make these tasks possible, we apply three strategies. First, the given image
is divided into text-relevant content and text-irrelevant content; only the
text-relevant content is manipulated, while the text-irrelevant content is
preserved. Second, a super-resolution method is used to enlarge the
manipulation region, further improving operability and helping manipulate the
object itself. Third, a user interface is introduced for interactively editing
the segmentation map, so that the generated image can be re-modified according
to the user's wishes. Extensive experiments on the Caltech-UCSD Birds-200-2011
(CUB) and Microsoft Common Objects in Context (MS COCO) datasets demonstrate
that the proposed method enables interactive, flexible, and accurate image
manipulation in real time. Through qualitative and quantitative evaluations, we
show that the proposed model outperforms other state-of-the-art methods.
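
The first strategy, separating text-relevant from text-irrelevant content, amounts to editing only a masked region and compositing the result back over the untouched pixels. Below is a minimal sketch of that idea; the `segment` and `generate` calls in the usage comment are hypothetical stand-ins for the paper's segmentation and generation modules, not its actual API.

```python
import numpy as np

def composite_edit(image: np.ndarray, mask: np.ndarray,
                   edited: np.ndarray) -> np.ndarray:
    """Blend an edited region into the reference image.

    image:  HxWx3 float array, the reference image
    mask:   HxW float array in [0, 1]; 1 where the text instruction applies
    edited: HxWx3 float array produced by a text-conditioned generator
    """
    m = mask[..., None]                 # broadcast the mask over RGB channels
    return m * edited + (1.0 - m) * image

# Hypothetical usage (segment/generate stand in for the paper's modules):
#   mask   = segment(image, "the bird")           # text-relevant region
#   edited = generate(image, mask, "a red bird")  # manipulate only that region
#   out    = composite_edit(image, mask, edited)  # text-irrelevant pixels kept
```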
Related papers
- DragText: Rethinking Text Embedding in Point-based Image Editing [3.1923251959845214]
We show that during the progressive editing of an input image in a diffusion model, the text embedding remains constant.
We propose DragText, which optimizes the text embedding in conjunction with the dragging process so that it stays paired with the modified image embedding.
arXiv Detail & Related papers (2024-07-25T07:57:55Z)
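
DragText's core idea, keeping the text embedding in step with the progressively edited image, can be illustrated with a small optimization loop. The cosine-similarity objective below is an assumed placeholder chosen to make the sketch concrete, not the paper's actual loss.

```python
import torch
import torch.nn.functional as F

def update_text_embedding(text_emb: torch.Tensor, image_emb: torch.Tensor,
                          steps: int = 10, lr: float = 1e-2) -> torch.Tensor:
    """Illustrative loop: nudge the text embedding toward the embedding of
    the edited image instead of leaving it constant during dragging."""
    text_emb = text_emb.clone().requires_grad_(True)
    opt = torch.optim.Adam([text_emb], lr=lr)
    for _ in range(steps):
        # Assumed objective: maximize cosine similarity to the image embedding.
        loss = 1.0 - F.cosine_similarity(text_emb, image_emb.detach(),
                                         dim=-1).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return text_emb.detach()
```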
- Text-Driven Image Editing via Learnable Regions [74.45313434129005]
We introduce a method for region-based image editing driven by textual prompts, without the need for user-provided masks or sketches.
We show that this simple approach enables flexible editing that is compatible with current image generation models.
Experiments demonstrate the competitive performance of our method in manipulating images with high fidelity and realism that correspond to the provided language descriptions.
arXiv Detail & Related papers (2023-11-28T02:27:31Z)
- TextCLIP: Text-Guided Face Image Generation and Manipulation Without Adversarial Training [5.239585892767183]
We propose TextCLIP, a unified framework for text-guided image generation and manipulation without adversarial training.
Our proposed method outperforms existing state-of-the-art methods, both on text-guided generation tasks and manipulation tasks.
arXiv Detail & Related papers (2023-09-21T09:34:20Z)
- Entity-Level Text-Guided Image Manipulation [70.81648416508867]
We study a novel task: text-guided image manipulation at the entity level in the real world (eL-TGIM).
We propose an elegant framework, dubbed SeMani, for the Semantic Manipulation of real-world images.
In the semantic alignment phase, SeMani incorporates a semantic alignment module to locate the entity-relevant region to be manipulated.
In the image manipulation phase, SeMani adopts a generative model to synthesize new images conditioned on the entity-irrelevant regions and target text descriptions.
arXiv Detail & Related papers (2023-02-22T13:56:23Z)
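
SeMani's two phases map naturally onto a short pipeline. In this sketch, `align_entity` and `synthesize` are caller-supplied placeholders for the paper's semantic alignment module and generative model; they are assumptions, not the published interface.

```python
from typing import Callable
import numpy as np

def semani_edit(image: np.ndarray, text: str,
                align_entity: Callable[[np.ndarray, str], np.ndarray],
                synthesize: Callable[[np.ndarray, np.ndarray, str],
                                     np.ndarray]) -> np.ndarray:
    """Two-phase entity-level editing as summarized above.

    Phase 1 (semantic alignment): locate the entity-relevant region.
    Phase 2 (image manipulation): synthesize a new image conditioned on the
    entity-irrelevant regions and the target text description.
    """
    entity_mask = align_entity(image, text)              # HxW mask in [0, 1]
    background = image * (1.0 - entity_mask[..., None])  # entity-irrelevant part
    return synthesize(background, entity_mask, text)
```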
- ManiTrans: Entity-Level Text-Guided Image Manipulation via Token-wise Semantic Alignment and Generation [97.36550187238177]
We study a novel task on text-guided image manipulation on the entity level in the real world.
The task imposes three basic requirements, (1) to edit the entity consistent with the text descriptions, (2) to preserve the text-irrelevant regions, and (3) to merge the manipulated entity into the image naturally.
Our framework incorporates a semantic alignment module to locate the image regions to be manipulated, and a semantic loss to help align vision and language.
arXiv Detail & Related papers (2022-04-09T09:01:19Z)
- Make-A-Scene: Scene-Based Text-to-Image Generation with Human Priors [58.71128866226768]
Recent text-to-image generation methods have incrementally improved the generated image fidelity and text relevancy.
We propose a novel text-to-image method that addresses these gaps by enabling a simple control mechanism complementary to text in the form of a scene.
Our model achieves state-of-the-art FID and human evaluation results, unlocking the ability to generate high-fidelity images at a resolution of 512x512 pixels.
arXiv Detail & Related papers (2022-03-24T15:44:50Z)
- StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery [71.1862388442953]
We develop a text-based interface for StyleGAN image manipulation.
We first introduce an optimization scheme that utilizes a CLIP-based loss to modify an input latent vector in response to a user-provided text prompt.
Next, we describe a latent mapper that infers a text-guided latent manipulation step for a given input image, allowing faster and more stable text-based manipulation.
arXiv Detail & Related papers (2021-03-31T17:51:25Z)
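
StyleCLIP's optimization scheme, driving a latent vector with a CLIP-based loss, can be sketched as follows. The generator `G` is a placeholder for any differentiable StyleGAN-like model, and the image preprocessing is simplified relative to CLIP's official pipeline, so treat this as a sketch under those assumptions.

```python
import torch
import torch.nn.functional as F
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git

def optimize_latent(G, w_init: torch.Tensor, prompt: str,
                    steps: int = 200, lr: float = 0.05,
                    device: str = "cpu") -> torch.Tensor:
    """Gradient-descend a latent vector so that G's output matches the text
    prompt under CLIP, in the spirit of StyleCLIP's optimization scheme."""
    model, _ = clip.load("ViT-B/32", device=device)
    for p in model.parameters():           # CLIP stays frozen; only w is updated
        p.requires_grad_(False)
    with torch.no_grad():
        text_feat = model.encode_text(clip.tokenize([prompt]).to(device))
    w = w_init.clone().to(device).requires_grad_(True)
    opt = torch.optim.Adam([w], lr=lr)
    for _ in range(steps):
        img = G(w)  # assumed to return a 1x3x224x224 image in [-1, 1]
        # Simplified preprocessing; official CLIP also normalizes per channel.
        img_feat = model.encode_image((img + 1.0) / 2.0)
        loss = 1.0 - F.cosine_similarity(img_feat, text_feat, dim=-1).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return w.detach()
```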
- Text as Neural Operator: Image Manipulation by Text Instruction [68.53181621741632]
In this paper, we study a setting that allows users to edit an image with multiple objects using complex text instructions to add, remove, or change the objects.
The inputs of the task are multimodal: (1) a reference image and (2) a natural-language instruction describing the desired modifications to the image.
We show that the proposed model performs favorably against recent strong baselines on three public datasets.
arXiv Detail & Related papers (2020-08-11T07:07:10Z)