ImageBrush: Learning Visual In-Context Instructions for Exemplar-Based
Image Manipulation
- URL: http://arxiv.org/abs/2308.00906v1
- Date: Wed, 2 Aug 2023 01:57:11 GMT
- Title: ImageBrush: Learning Visual In-Context Instructions for Exemplar-Based
Image Manipulation
- Authors: Yasheng Sun, Yifan Yang, Houwen Peng, Yifei Shen, Yuqing Yang, Han Hu,
Lili Qiu and Hideki Koike
- Abstract summary: We propose a novel manipulation methodology, dubbed ImageBrush, that learns visual instructions for more accurate image editing.
Our key idea is to employ a pair of transformation images as visual instructions, which precisely captures human intention.
Our model exhibits robust generalization capabilities on various downstream tasks such as pose transfer, image translation and video inpainting.
- Score: 49.07254928141495
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While language-guided image manipulation has made remarkable progress, the
challenge of instructing the manipulation process so that it faithfully reflects
human intentions persists. An accurate and comprehensive description of a
manipulation task using natural language is laborious and sometimes even
impossible, primarily due to the inherent uncertainty and ambiguity present in
linguistic expressions. Is it feasible to accomplish image manipulation without
resorting to external cross-modal language information? If this possibility
exists, the inherent modality gap would be effortlessly eliminated. In this
paper, we propose a novel manipulation methodology, dubbed ImageBrush, that
learns visual instructions for more accurate image editing. Our key idea is to
employ a pair of transformation images as visual instructions, which not only
precisely captures human intention but also facilitates accessibility in
real-world scenarios. Capturing visual instructions is particularly challenging
because it involves extracting the underlying intentions solely from visual
demonstrations and then applying this operation to a new image. To address this
challenge, we formulate visual instruction learning as a diffusion-based
inpainting problem, where the contextual information is fully exploited through
an iterative process of generation. A visual prompting encoder is carefully
devised to enhance the model's capacity in uncovering human intent behind the
visual instructions. Extensive experiments show that our method generates
engaging manipulation results conforming to the transformations entailed in
demonstrations. Moreover, our model exhibits robust generalization capabilities
on various downstream tasks such as pose transfer, image translation and video
inpainting.
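The abstract frames editing as a diffusion-based inpainting problem over a visual instruction built from the exemplar pair and the new image. Below is a minimal sketch of that formulation, assuming same-sized image tensors and placeholder `prompt_encoder` and `model.inpaint` interfaces (these names are illustrative, not the released ImageBrush API):

```python
# Minimal sketch: tile the exemplar pair and the query into a 2x2 canvas and
# fill the empty quadrant with diffusion-based inpainting, conditioned on a
# visual prompting encoder. Interfaces are assumptions, not ImageBrush's API.
import torch

def build_visual_instruction(example_src, example_edit, query):
    """All inputs are assumed to be (B, C, H, W) tensors of the same size."""
    b, c, h, w = query.shape
    canvas = torch.zeros(b, c, 2 * h, 2 * w)
    canvas[:, :, :h, :w] = example_src   # top-left: exemplar before editing
    canvas[:, :, :h, w:] = example_edit  # top-right: exemplar after editing
    canvas[:, :, h:, :w] = query         # bottom-left: new image to manipulate
    mask = torch.zeros(b, 1, 2 * h, 2 * w)
    mask[:, :, h:, w:] = 1.0             # inpaint only the missing quadrant
    return canvas, mask

def edit_with_visual_instruction(model, prompt_encoder, example_src, example_edit, query):
    canvas, mask = build_visual_instruction(example_src, example_edit, query)
    context = prompt_encoder(canvas)                       # visual prompting encoder (assumed)
    result = model.inpaint(canvas, mask, context=context)  # iterative diffusion denoising
    h, w = query.shape[-2:]
    return result[:, :, h:, w:]                            # edited version of the query
```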
Related papers
- PixWizard: Versatile Image-to-Image Visual Assistant with Open-Language Instructions [66.92809850624118]
PixWizard is an image-to-image visual assistant designed for image generation, manipulation, and translation based on free-form language instructions.
We unify a variety of vision tasks into a single image-text-to-image generation framework and curate an Omni Pixel-to-Pixel Instruction-Tuning dataset.
Our experiments demonstrate that PixWizard not only shows impressive generative and understanding abilities for images with diverse resolutions but also exhibits promising generalization to unseen tasks and human instructions.
arXiv Detail & Related papers (2024-09-23T17:59:46Z) - Text Guided Image Editing with Automatic Concept Locating and Forgetting [27.70615803908037]
We propose a novel method called Locate and Forget (LaF) to locate potential target concepts in the image for modification.
Our method demonstrates its superiority over the baselines in text-guided image editing tasks, both qualitatively and quantitatively.
arXiv Detail & Related papers (2024-05-30T05:36:32Z) - Pre-trained Text-to-Image Diffusion Models Are Versatile Representation Learners for Control [73.6361029556484]
Embodied AI agents require a fine-grained understanding of the physical world mediated through visual and language inputs.
We consider pre-trained text-to-image diffusion models, which are explicitly optimized to generate images from text prompts.
We show that Stable Control Representations enable learning policies that exhibit state-of-the-art performance on OVMM, a difficult open-vocabulary navigation benchmark.
arXiv Detail & Related papers (2024-05-09T15:39:54Z) - Self-Explainable Affordance Learning with Embodied Caption [63.88435741872204]
We introduce Self-Explainable Affordance learning (SEA) with embodied caption.
SEA enables robots to articulate their intentions and bridge the gap between explainable vision-language caption and visual affordance learning.
We propose a novel model to effectively combine affordance grounding with self-explanation in a simple but efficient manner.
arXiv Detail & Related papers (2024-04-08T15:22:38Z) - ChatFace: Chat-Guided Real Face Editing via Diffusion Latent Space
Manipulation [22.724306705927095]
We propose a novel approach that conducts text-driven image editing in the semantic latent space of a diffusion model.
By aligning the temporal features of the diffusion model with the semantic condition during the generative process, we introduce a stable manipulation strategy.
We develop an interactive system named ChatFace, which leverages the zero-shot reasoning ability of large language models to perform efficient manipulations.
arXiv Detail & Related papers (2023-05-24T05:28:37Z) - Target-Free Text-guided Image Manipulation [30.3884508895415]
We propose a Cyclic-Manipulation GAN (cManiGAN) to determine where and how to edit the image regions of interest.
Specifically, the image editor in cManiGAN learns to identify and complete the input image.
A cross-modal interpreter and reasoner are deployed to verify the semantic correctness of the output image.
arXiv Detail & Related papers (2022-11-26T11:45:30Z) - Remember What You have drawn: Semantic Image Manipulation with Memory [84.74585786082388]
We propose a memory-based Image Manipulation Network (MIM-Net) to generate realistic and text-conformed manipulated images.
To learn a robust memory, we propose a novel randomized memory training loss.
Experiments on four popular datasets show that our method outperforms existing ones.
arXiv Detail & Related papers (2021-07-27T03:41:59Z) - StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery [71.1862388442953]
We develop a text-based interface for StyleGAN image manipulation.
We first introduce an optimization scheme that utilizes a CLIP-based loss to modify an input latent vector in response to a user-provided text prompt.
Next, we describe a latent mapper that infers a text-guided latent manipulation step for a given input image, allowing faster and more stable text-based manipulation.
arXiv Detail & Related papers (2021-03-31T17:51:25Z)
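The StyleCLIP entry above describes an optimization scheme that modifies a latent vector under a CLIP-based loss in response to a text prompt. The sketch below illustrates that general technique, assuming generic generator (`G`) and CLIP (`clip_model`, `clip_tokenize`) interfaces; it is not the authors' released implementation.

```python
# Minimal sketch of CLIP-guided latent optimization: a generator latent is
# optimized so the synthesized image matches a text prompt under a CLIP-based
# cosine loss. G, clip_model, and clip_tokenize are assumed interfaces.
import torch

def optimize_latent(G, clip_model, clip_tokenize, w_init, prompt,
                    steps=200, lr=0.1, lambda_reg=0.008):
    w = w_init.clone().requires_grad_(True)        # latent code to optimize
    with torch.no_grad():
        text_feat = clip_model.encode_text(clip_tokenize([prompt]))
        text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)

    opt = torch.optim.Adam([w], lr=lr)
    for _ in range(steps):
        img = G(w)                                 # synthesize an image from the latent
        img_feat = clip_model.encode_image(img)    # assumes CLIP-compatible preprocessing
        img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
        clip_loss = 1.0 - (img_feat * text_feat).sum(dim=-1).mean()  # cosine distance
        reg = lambda_reg * ((w - w_init) ** 2).sum()                 # stay near the start latent
        loss = clip_loss + reg
        opt.zero_grad()
        loss.backward()
        opt.step()
    return w.detach()
```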
This list is automatically generated from the titles and abstracts of the papers on this site.