Target-Free Text-guided Image Manipulation
- URL: http://arxiv.org/abs/2211.14544v1
- Date: Sat, 26 Nov 2022 11:45:30 GMT
- Title: Target-Free Text-guided Image Manipulation
- Authors: Wan-Cyuan Fan, Cheng-Fu Yang, Chiao-An Yang, Yu-Chiang Frank Wang
- Abstract summary: We propose a Cyclic-Manipulation GAN (cManiGAN) that learns where and how to edit the image regions of interest.
Specifically, the image editor in cManiGAN learns to identify and complete the input image.
A cross-modal interpreter and a reasoner are deployed to verify the semantic correctness of the output image.
- Score: 30.3884508895415
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We tackle the problem of target-free text-guided image manipulation, which
requires one to modify the input reference image based on the given text
instruction, while no ground truth target image is observed during training. To
address this challenging task, we propose a Cyclic-Manipulation GAN (cManiGAN)
in this paper, which learns where and how to edit the image regions of
interest. Specifically, the image editor in cManiGAN learns to identify and
complete the input image, while a cross-modal interpreter and a reasoner are
deployed to verify the semantic correctness of the output image based on the
input instruction. While the former utilizes factual/counterfactual description
learning for authenticating the image semantics, the latter predicts the "undo"
instruction and provides pixel-level supervision for the training of cManiGAN.
With such operational cycle-consistency, our cManiGAN can be trained in the
above weakly supervised setting. We conduct extensive experiments on the
CLEVR and COCO datasets, which verify the effectiveness and generalizability
of our proposed method. Project page:
https://sites.google.com/view/wancyuanfan/projects/cmanigan.
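Because no ground-truth target image exists, the training signal comes from the operational cycle the abstract describes: edit with the instruction, predict the "undo" instruction, edit back, and penalize pixel-level differences from the original input. Below is a minimal, hypothetical PyTorch sketch of that cycle; the module architectures, embedding sizes, and toy layers are placeholders rather than the authors' implementation, and the interpreter's factual/counterfactual loss is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Editor(nn.Module):
    """Toy stand-in for the cManiGAN image editor (the real editor
    localizes the regions of interest and inpaints them)."""
    def __init__(self, txt_dim=32):
        super().__init__()
        self.film = nn.Linear(txt_dim, 3)          # instruction -> channel shift
        self.net = nn.Conv2d(3, 3, 3, padding=1)   # toy editing network
    def forward(self, img, instr):
        shift = self.film(instr)[:, :, None, None]
        return torch.tanh(self.net(img + shift))

class Reasoner(nn.Module):
    """Predicts the 'undo' instruction embedding from the (input, output) pair."""
    def __init__(self, txt_dim=32):
        super().__init__()
        self.pool = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.head = nn.Linear(6, txt_dim)          # 3 + 3 pooled channels
    def forward(self, img_in, img_out):
        return self.head(torch.cat([self.pool(img_in), self.pool(img_out)], dim=1))

editor, reasoner = Editor(), Reasoner()
opt = torch.optim.Adam([*editor.parameters(), *reasoner.parameters()], lr=1e-4)

img = torch.rand(4, 3, 64, 64)           # reference images (no target images exist)
instr = torch.randn(4, 32)               # embedded "do" instructions

edited = editor(img, instr)              # forward edit
undo = reasoner(img, edited)             # predict the "undo" instruction
restored = editor(edited, undo)          # apply the undo edit
cycle_loss = F.l1_loss(restored, img)    # pixel-level cycle-consistency supervision
opt.zero_grad(); cycle_loss.backward(); opt.step()
```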
Related papers
- Decoder Pre-Training with only Text for Scene Text Recognition [54.93037783663204]
Scene text recognition (STR) pre-training methods have achieved remarkable progress, primarily relying on synthetic datasets.
We introduce a novel method named Decoder Pre-training with only text for STR (DPTR).
DPTR treats text embeddings produced by the CLIP text encoder as pseudo visual embeddings and uses them to pre-train the decoder.
arXiv Detail & Related papers (2024-08-11T06:36:42Z)
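A hedged sketch of the DPTR idea from the entry above: CLIP text embeddings stand in for visual features, so a recognition decoder can be pre-trained without any images. The model name, decoder shape, and fixed-length character queries are illustrative assumptions, not the paper's exact setup.

```python
import torch
import torch.nn as nn
from transformers import CLIPTokenizer, CLIPTextModel

tok = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
text_enc = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32").eval()

# A small decoder that cross-attends to the pseudo "visual" tokens.
layer = nn.TransformerDecoderLayer(d_model=512, nhead=8, batch_first=True)
decoder = nn.TransformerDecoder(layer, num_layers=2)
char_head = nn.Linear(512, tok.vocab_size)

words = ["street", "coffee", "exit"]
batch = tok(words, padding=True, return_tensors="pt")
with torch.no_grad():
    # CLIP *text* embeddings used as pseudo visual embeddings.
    pseudo_visual = text_enc(**batch).last_hidden_state      # (B, L, 512)

queries = torch.zeros(len(words), 8, 512)  # learnable character queries in practice
logits = char_head(decoder(queries, pseudo_visual))          # (B, 8, vocab)
# Pre-training would apply cross-entropy between logits and the characters of `words`.
```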
- Pre-trained Text-to-Image Diffusion Models Are Versatile Representation Learners for Control [73.6361029556484]
Embodied AI agents require a fine-grained understanding of the physical world mediated through visual and language inputs.
We consider pre-trained text-to-image diffusion models, which are explicitly optimized to generate images from text prompts.
We show that Stable Control Representations enable learning policies that exhibit state-of-the-art performance on OVMM, a difficult open-vocabulary navigation benchmark.
arXiv Detail & Related papers (2024-05-09T15:39:54Z)
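The entry above uses activations of a frozen text-to-image diffusion model as state representations for policies. A rough, hypothetical sketch of such feature extraction with the `diffusers` library follows; the checkpoint, the mid-block tap point, the timestep, and the pooling are all assumptions rather than the paper's exact recipe.

```python
import torch
from diffusers import AutoencoderKL, UNet2DConditionModel
from transformers import CLIPTokenizer, CLIPTextModel

repo = "CompVis/stable-diffusion-v1-4"   # illustrative checkpoint choice
vae = AutoencoderKL.from_pretrained(repo, subfolder="vae").eval()
unet = UNet2DConditionModel.from_pretrained(repo, subfolder="unet").eval()
tok = CLIPTokenizer.from_pretrained(repo, subfolder="tokenizer")
text_enc = CLIPTextModel.from_pretrained(repo, subfolder="text_encoder").eval()

feats = {}
unet.mid_block.register_forward_hook(lambda m, i, o: feats.update(mid=o))

@torch.no_grad()
def represent(image, prompt, t=100):
    """image: (B, 3, 512, 512) in [-1, 1]; returns a pooled feature vector."""
    latents = vae.encode(image).latent_dist.sample() * 0.18215
    emb = text_enc(**tok([prompt], padding=True, return_tensors="pt")).last_hidden_state
    unet(latents, torch.tensor([t]), encoder_hidden_states=emb)  # one denoising pass
    return feats["mid"].mean(dim=(2, 3))   # (B, C) representation for the policy head

state = represent(torch.randn(1, 3, 512, 512), "pick up the red block")
```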
- TextCLIP: Text-Guided Face Image Generation And Manipulation Without Adversarial Training [5.239585892767183]
We propose TextCLIP, a unified framework for text-guided image generation and manipulation without adversarial training.
Our proposed method outperforms existing state-of-the-art methods, both on text-guided generation tasks and manipulation tasks.
arXiv Detail & Related papers (2023-09-21T09:34:20Z)
- Towards Generic Image Manipulation Detection with Weakly-Supervised Self-Consistency Learning [49.43362803584032]
We propose weakly-supervised image manipulation detection.
Such a setting can leverage more training images and has the potential to adapt quickly to new manipulation techniques.
Two consistency properties are learned: multi-source consistency (MSC) and inter-patch consistency (IPC).
arXiv Detail & Related papers (2023-09-03T19:19:56Z)
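The entry above names its two consistency properties without details. One plausible, hypothetical reading of inter-patch consistency (IPC): patch embeddings of an untouched image should agree with each other, while a manipulated image breaks that agreement under only image-level labels. The loss form and margin below are assumptions, not the paper's formulation.

```python
import torch
import torch.nn.functional as F

def ipc_loss(patch_emb, is_authentic, fake_target=0.2):
    """patch_emb: (B, P, D) patch embeddings; is_authentic: (B,) image-level labels.
    Weak supervision: only image-level authenticity is known."""
    z = F.normalize(patch_emb, dim=-1)
    sim = torch.einsum("bpd,bqd->bpq", z, z)       # pairwise cosine similarities
    consistency = sim.mean(dim=(1, 2))             # per-image consistency score
    target = torch.where(is_authentic.bool(),
                         torch.ones_like(consistency),
                         torch.full_like(consistency, fake_target))
    return F.mse_loss(consistency, target)

loss = ipc_loss(torch.randn(4, 16, 128), torch.tensor([1, 0, 1, 0]))
```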
- CgT-GAN: CLIP-guided Text GAN for Image Captioning [48.276753091051035]
We propose CLIP-guided text GAN (CgT-GAN) to enable the model to "see" the real visual modality.
We use adversarial training to teach CgT-GAN to mimic the phrases of an external text corpus.
CgT-GAN outperforms state-of-the-art methods significantly across all metrics.
arXiv Detail & Related papers (2023-08-23T10:25:37Z)
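A hedged sketch of how the two signals in the CgT-GAN entry might combine: a CLIP image-caption similarity lets the captioner "see" the visual modality, while a text discriminator scores how corpus-like a generated phrase is. The weighting scheme and discriminator interface are hypothetical.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def caption_reward(image, caption, discriminator, alpha=0.5):
    """Reward for a sampled caption (e.g., for REINFORCE-style updates)."""
    inputs = proc(text=[caption], images=image, return_tensors="pt", padding=True)
    out = clip(**inputs)
    clip_score = torch.cosine_similarity(out.image_embeds, out.text_embeds).item()
    corpus_score = discriminator(caption)  # hypothetical: corpus-likeness in [0, 1]
    return alpha * clip_score + (1 - alpha) * corpus_score

img = Image.new("RGB", (224, 224))
r = caption_reward(img, "a dog runs on the grass", discriminator=lambda c: 0.5)
```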
- ImageBrush: Learning Visual In-Context Instructions for Exemplar-Based Image Manipulation [49.07254928141495]
We propose a novel manipulation methodology, dubbed ImageBrush, that learns visual instructions for more accurate image editing.
Our key idea is to employ a pair of transformation images as visual instructions, which precisely captures human intention.
Our model exhibits robust generalization capabilities on various downstream tasks such as pose transfer, image translation and video inpainting.
arXiv Detail & Related papers (2023-08-02T01:57:11Z)
- iEdit: Localised Text-guided Image Editing with Weak Supervision [53.082196061014734]
We propose a novel learning method for text-guided image editing.
It generates images conditioned on a source image and a textual edit prompt.
It shows favourable results against its counterparts in terms of image fidelity and CLIP alignment score, and qualitatively for editing both generated and real images.
arXiv Detail & Related papers (2023-05-10T07:39:14Z)
- CLIP-PAE: Projection-Augmentation Embedding to Extract Relevant Features for a Disentangled, Interpretable, and Controllable Text-Guided Face Manipulation [4.078926358349661]
Contrastive Language-Image Pre-Training (CLIP) bridges images and text by embedding them into a joint latent space.
Due to the discrepancy between image and text embeddings in the joint space, using text embeddings as the optimization target often introduces undesired artifacts in the resulting images.
We introduce CLIP Projection-Augmentation Embedding (PAE) as an optimization target to improve the performance of text-guided image manipulation.
arXiv Detail & Related papers (2022-10-08T05:12:25Z)
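A hedged sketch of a projection-augmentation embedding as described in the entry above: project the CLIP text target onto a subspace spanned by embeddings of relevant corpus texts, discarding components that cause artifacts. The QR-based basis and the augmentation coefficient are assumptions, not the paper's exact recipe.

```python
import torch

def pae_target(text_emb, corpus_embs, alpha=1.0):
    """text_emb: (D,) CLIP embedding of the edit text;
    corpus_embs: (N, D) embeddings of relevant attribute texts."""
    # Orthonormal basis of the corpus subspace via (reduced) QR decomposition.
    q, _ = torch.linalg.qr(corpus_embs.T)         # (D, N) orthonormal columns
    proj = q @ (q.T @ text_emb)                   # projection onto the subspace
    return text_emb + alpha * (proj - text_emb)   # move target toward the subspace

target = pae_target(torch.randn(512), torch.randn(8, 512))
```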
- StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery [71.1862388442953]
We develop a text-based interface for StyleGAN image manipulation.
We first introduce an optimization scheme that utilizes a CLIP-based loss to modify an input latent vector in response to a user-provided text prompt.
Next, we describe a latent mapper that infers a text-guided latent manipulation step for a given input image, allowing faster and more stable text-based manipulation.
arXiv Detail & Related papers (2021-03-31T17:51:25Z)
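The optimization scheme in the StyleCLIP entry is concrete enough to sketch: freeze the generator and CLIP, then optimize the latent code under a CLIP loss plus a proximity regularizer. The generator interface, learning rate, and weights below are placeholders, and the identity-preservation loss of the actual paper is omitted.

```python
import torch
import torch.nn.functional as F

def clip_guided_edit(G, encode_image, w0, text_feat, steps=50, lr=0.05, lam=0.01):
    """G: latent (1, 512) -> image tensor; encode_image: image -> CLIP embedding;
    w0: source latent; text_feat: CLIP embedding of the user's text prompt."""
    w = w0.clone().requires_grad_(True)
    opt = torch.optim.Adam([w], lr=lr)
    for _ in range(steps):
        img = G(w)                                  # synthesize from current latent
        clip_loss = 1 - F.cosine_similarity(encode_image(img), text_feat).mean()
        l2_reg = lam * (w - w0).pow(2).sum()        # stay close to the source latent
        loss = clip_loss + l2_reg
        opt.zero_grad(); loss.backward(); opt.step()
    return w.detach()                               # edited latent vector
```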
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.