Learning to Follow Object-Centric Image Editing Instructions Faithfully
- URL: http://arxiv.org/abs/2310.19145v1
- Date: Sun, 29 Oct 2023 20:39:11 GMT
- Title: Learning to Follow Object-Centric Image Editing Instructions Faithfully
- Authors: Tuhin Chakrabarty, Kanishk Singh, Arkadiy Saakyan, Smaranda Muresan
- Abstract summary: Current approaches focusing on image editing with natural language instructions rely on automatically generated paired data.
We significantly improve the quality of the paired data and enhance the supervision signal.
Our model is capable of performing fine-grained object-centric edits better than state-of-the-art baselines.
- Score: 26.69032113274608
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Natural language instructions are a powerful interface for editing the
outputs of text-to-image diffusion models. However, several challenges need to
be addressed: 1) underspecification (the need to model the implicit meaning of
instructions), 2) grounding (the need to localize where the edit has to be
performed), and 3) faithfulness (the need to preserve the elements of the image not
affected by the edit instruction). Current approaches focusing on image editing
with natural language instructions rely on automatically generated paired data,
which, as shown in our investigation, is noisy and sometimes nonsensical,
exacerbating the above issues. Building on recent advances in segmentation,
Chain-of-Thought prompting, and visual question answering, we significantly
improve the quality of the paired data. In addition, we enhance the supervision
signal by highlighting parts of the image that need to be changed by the
instruction. The model fine-tuned on the improved data is capable of performing
fine-grained object-centric edits better than state-of-the-art baselines,
mitigating the problems outlined above, as shown by automatic and human
evaluations. Moreover, our model is capable of generalizing to domains unseen
during training, such as visual metaphors.
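As a rough illustration of the enhanced supervision signal, here is a minimal sketch assuming a mask-weighted denoising objective; the function name, tensor shapes, and the edit_weight value are illustrative assumptions, not the authors' released code.

```python
import torch.nn.functional as F

def mask_weighted_diffusion_loss(noise_pred, noise_target, edit_mask, edit_weight=5.0):
    """Denoising loss that upweights the regions an instruction targets.

    noise_pred, noise_target: (B, C, H, W) predicted / ground-truth noise.
    edit_mask: (B, 1, H, W) binary mask, 1 where the edit should happen
        (e.g. derived from a segmentation model). Hypothetical interface.
    edit_weight: how much more the edited region counts (illustrative value).
    """
    per_pixel = F.mse_loss(noise_pred, noise_target, reduction="none")
    # Outside the mask, the plain loss encourages faithfulness (preserving
    # unaffected content); inside, the extra weight pushes the model to
    # actually carry out the instructed change.
    weights = 1.0 + (edit_weight - 1.0) * edit_mask
    return (weights * per_pixel).mean()
```

Under this sketch, grounding comes from the mask and faithfulness from the unweighted term on the rest of the image.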
Related papers
- Learning Action and Reasoning-Centric Image Editing from Videos and Simulations [45.637947364341436]
The AURORA dataset is a collection of high-quality training data, human-annotated and curated from videos and simulation engines.
We evaluate an AURORA-finetuned model on a new expert-curated benchmark covering 8 diverse editing tasks.
Our model significantly outperforms previous editing models as judged by human raters.
arXiv Detail & Related papers (2024-07-03T19:36:33Z)
- Text Guided Image Editing with Automatic Concept Locating and Forgetting [27.70615803908037]
We propose a novel method called Locate and Forget (LaF) to locate potential target concepts in the image for modification.
Compared to the baselines, our method demonstrates its superiority in text-guided image editing tasks both qualitatively and quantitatively.
arXiv Detail & Related papers (2024-05-30T05:36:32Z)
- ReasonPix2Pix: Instruction Reasoning Dataset for Advanced Image Editing [77.12834553200632]
We introduce ReasonPix2Pix, a comprehensive reasoning-attentive instruction editing dataset.
The dataset is characterized by 1) reasoning instruction, 2) more realistic images from fine-grained categories, and 3) increased variances between input and edited images.
When fine-tuned with our dataset under supervised conditions, the model demonstrates superior performance in instructional editing tasks, independent of whether the tasks require reasoning or not.
arXiv Detail & Related papers (2024-05-18T06:03:42Z)
- DM-Align: Leveraging the Power of Natural Language Instructions to Make Changes to Images [55.546024767130994]
We propose a novel model to enhance the text-based control of an image editor by explicitly reasoning about which parts of the image to alter or preserve.
It relies on word alignments between a description of the original source image and the instruction expressing the needed updates, together with the input image itself.
It is evaluated on a subset of the Bison dataset and a self-defined dataset dubbed Dream.
arXiv Detail & Related papers (2024-04-27T22:45:47Z)
- Contrastive Prompts Improve Disentanglement in Text-to-Image Diffusion Models [68.47333676663312]
We show a simple modification of classifier-free guidance can help disentangle image factors in text-to-image models.
The key idea of our method, Contrastive Guidance, is to characterize an intended factor with two prompts that differ in minimal tokens.
We illustrate its benefits in three scenarios: (1) to guide domain-specific diffusion models trained on an object class, (2) to gain continuous, rig-like controls for text-to-image generation, and (3) to improve the performance of zero-shot image editors.
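A minimal sketch of such a contrastive guidance rule, assuming a diffusers-style UNet call; the embedding names, the added contrast term, and both scale values are assumptions for illustration, not the paper's exact formulation.

```python
import torch

@torch.no_grad()
def contrastive_guidance_eps(unet, x_t, t, emb_pos, emb_neg, emb_uncond,
                             cfg_scale=7.5, contrast_scale=3.0):
    """Noise estimate combining classifier-free and contrastive guidance.

    emb_pos / emb_neg: text embeddings of two prompts differing in minimal
    tokens (e.g. "a smiling dog" vs. "a dog"), so their difference isolates
    the intended factor. Hypothetical helper, not the released implementation.
    """
    eps_uncond = unet(x_t, t, encoder_hidden_states=emb_uncond).sample
    eps_pos = unet(x_t, t, encoder_hidden_states=emb_pos).sample
    eps_neg = unet(x_t, t, encoder_hidden_states=emb_neg).sample
    # Standard classifier-free guidance toward the positive prompt, plus a
    # push along the direction separating the two minimally different
    # prompts; that direction is the disentangled factor being controlled.
    return (eps_uncond
            + cfg_scale * (eps_pos - eps_uncond)
            + contrast_scale * (eps_pos - eps_neg))
```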
arXiv Detail & Related papers (2024-02-21T03:01:17Z)
- AdapEdit: Spatio-Temporal Guided Adaptive Editing Algorithm for Text-Based Continuity-Sensitive Image Editing [24.9487669818162]
We propose AdapEdit, a spatio-temporal guided adaptive editing algorithm for continuity-sensitive image editing.
Our approach has a significant advantage in preserving model priors and requires no model training, fine-tuning, extra data, or optimization.
We present our results over a wide variety of raw images and editing instructions, demonstrating competitive performance and showing it significantly outperforms the previous approaches.
arXiv Detail & Related papers (2023-12-13T09:45:58Z)
- StyleDiffusion: Prompt-Embedding Inversion for Text-Based Editing [86.92711729969488]
We exploit the capacities of pretrained diffusion models for the editing of images.
Existing methods either finetune the model or invert the image in the latent space of the pretrained model.
They suffer from two problems: unsatisfying results for selected regions, and unexpected changes in nonselected regions.
arXiv Detail & Related papers (2023-03-28T00:16:45Z)
- Zero-shot Image-to-Image Translation [57.46189236379433]
We propose pix2pix-zero, an image-to-image translation method that can preserve the original image without manual prompting.
We propose cross-attention guidance, which aims to retain the cross-attention maps of the input image throughout the diffusion process.
Our method does not need additional training for these edits and can directly use the existing text-to-image diffusion model.
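The cross-attention guidance step can be pictured as a gradient nudge on the noisy latent; here is a minimal sketch under assumed names (the single-step update, the guidance_lr value, and the map-caching interface are illustrative, not the pix2pix-zero release):

```python
import torch
import torch.nn.functional as F

def cross_attention_guidance_step(x_t, attn_maps_edit, attn_maps_ref, guidance_lr=0.1):
    """Pull the editing pass's cross-attention maps toward reference maps.

    x_t: noisy latent with requires_grad=True; attn_maps_edit must have been
    computed from this x_t so gradients can flow back to it.
    attn_maps_ref: maps cached while reconstructing the input image.
    """
    loss = sum(F.mse_loss(a, b.detach())
               for a, b in zip(attn_maps_edit, attn_maps_ref))
    grad = torch.autograd.grad(loss, x_t)[0]
    # Moving x_t down this gradient keeps the edited image's attention
    # structure (and hence layout) close to the input image's.
    return x_t.detach() - guidance_lr * grad
```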
arXiv Detail & Related papers (2023-02-06T18:59:51Z)
- Text as Neural Operator: Image Manipulation by Text Instruction [68.53181621741632]
In this paper, we study a setting that allows users to edit an image with multiple objects using complex text instructions to add, remove, or change the objects.
The inputs of the task are multimodal, including (1) a reference image and (2) an instruction in natural language that describes desired modifications to the image.
We show that the proposed model performs favorably against recent strong baselines on three public datasets.
arXiv Detail & Related papers (2020-08-11T07:07:10Z)