Related papers: Paint by Inpaint: Learning to Add Image Objects by Removing Them First

Paint by Inpaint: Learning to Add Image Objects by Removing Them First

URL: http://arxiv.org/abs/2404.18212v1
Date: Sun, 28 Apr 2024 15:07:53 GMT
Title: Paint by Inpaint: Learning to Add Image Objects by Removing Them First
Authors: Navve Wasserman, Noam Rotstein, Roy Ganz, Ron Kimmel,
Abstract summary: We train a diffusion model to inverse the inpainting process, effectively adding objects into images. We provide detailed descriptions of the removed objects and a Large Language Model to convert these descriptions into diverse, natural-language instructions.
Score: 8.399234415641319
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Image editing has advanced significantly with the introduction of text-conditioned diffusion models. Despite this progress, seamlessly adding objects to images based on textual instructions without requiring user-provided input masks remains a challenge. We address this by leveraging the insight that removing objects (Inpaint) is significantly simpler than its inverse process of adding them (Paint), attributed to the utilization of segmentation mask datasets alongside inpainting models that inpaint within these masks. Capitalizing on this realization, by implementing an automated and extensive pipeline, we curate a filtered large-scale image dataset containing pairs of images and their corresponding object-removed versions. Using these pairs, we train a diffusion model to inverse the inpainting process, effectively adding objects into images. Unlike other editing datasets, ours features natural target images instead of synthetic ones; moreover, it maintains consistency between source and target by construction. Additionally, we utilize a large Vision-Language Model to provide detailed descriptions of the removed objects and a Large Language Model to convert these descriptions into diverse, natural-language instructions. We show that the trained model surpasses existing ones both qualitatively and quantitatively, and release the large-scale dataset alongside the trained models for the community.

Related papers

BrushEdit: All-In-One Image Inpainting and Editing [79.55816192146762]
BrushEdit is a novel inpainting-based instruction-guided image editing paradigm. We devise a system enabling free-form instruction editing by integrating MLLMs and a dual-branch image inpainting model. Our framework effectively combines MLLMs and inpainting models, achieving superior performance across seven metrics.
arXiv Detail & Related papers (2024-12-13T17:58:06Z)
Improving Text-guided Object Inpainting with Semantic Pre-inpainting [95.17396565347936]
We decompose the typical single-stage object inpainting into two cascaded processes: semantic pre-inpainting and high-fieldity object generation. To achieve this, we cascade a Transformer-based semantic inpainter and an object inpainting diffusion model, leading to a novel CAscaded Transformer-Diffusion framework.
arXiv Detail & Related papers (2024-09-12T17:55:37Z)
EraseDraw: Learning to Draw Step-by-Step via Erasing Objects from Images [24.55843674256795]
Prior works often fail by making global changes to the image, inserting objects in unrealistic spatial locations, and generating inaccurate lighting details. We observe that while state-of-the-art models perform poorly on object insertion, they can remove objects and erase the background in natural images very well. We show compelling results on diverse insertion prompts and images across various domains.
arXiv Detail & Related papers (2024-08-31T18:37:48Z)
Diffree: Text-Guided Shape Free Object Inpainting with Diffusion Model [81.96954332787655]
We introduce Diffree, a Text-to-Image (T2I) model that facilitates text-guided object addition with only text control. In experiments, Diffree adds new objects with a high success rate while maintaining background consistency, spatial, and object relevance and quality.
arXiv Detail & Related papers (2024-07-24T03:58:58Z)
DiffUHaul: A Training-Free Method for Object Dragging in Images [78.93531472479202]
We propose a training-free method, dubbed DiffUHaul, for the object dragging task. We first apply attention masking in each denoising step to make the generation more disentangled across different objects. In the early denoising steps, we interpolate the attention features between source and target images to smoothly fuse new layouts with the original appearance.
arXiv Detail & Related papers (2024-06-03T17:59:53Z)
VIP: Versatile Image Outpainting Empowered by Multimodal Large Language Model [76.02314305164595]
This work presents a novel image outpainting framework that is capable of customizing the results according to the requirement of users. We take advantage of a Multimodal Large Language Model (MLLM) that automatically extracts and organizes the corresponding textual descriptions of the masked and unmasked part of a given image. In addition, a special Cross-Attention module, namely Center-Total-Surrounding (CTS), is elaborately designed to enhance further the the interaction between specific space regions of the image and corresponding parts of the text prompts.
arXiv Detail & Related papers (2024-06-03T07:14:19Z)
Outline-Guided Object Inpainting with Diffusion Models [11.391452115311798]
Instance segmentation datasets play a crucial role in training accurate and robust computer vision models. We show how this issue can be mitigated by starting with small annotated instance segmentation datasets and augmenting them to obtain a sizeable annotated dataset. We generate new images using a diffusion-based inpainting model to fill out the masked area with a desired object class by guiding the diffusion through the object outline.
arXiv Detail & Related papers (2024-02-26T09:21:17Z)
Towards Language-Driven Video Inpainting via Multimodal Large Language Models [116.22805434658567]
We introduce a new task -- language-driven video inpainting. It uses natural language instructions to guide the inpainting process. We present the Remove Objects from Videos by Instructions dataset.
arXiv Detail & Related papers (2024-01-18T18:59:13Z)
Towards Enhanced Image Inpainting: Mitigating Unwanted Object Insertion and Preserving Color Consistency [78.0488707697235]
Post-processing approach dubbed ASUKA (Aligned Stable inpainting with UnKnown Areas prior) to improve inpainting models.<n>Masked Auto-Encoder (MAE) for reconstruction-based priors mitigates object hallucination.<n> specialized VAE decoder that treats latent-to-image decoding as a local task.
arXiv Detail & Related papers (2023-12-08T05:08:06Z)
Unlocking Spatial Comprehension in Text-to-Image Diffusion Models [33.99474729408903]
CompFuser is an image generation pipeline that enhances spatial comprehension and attribute assignment in text-to-image generative models. Our pipeline enables the interpretation of instructions defining spatial relationships between objects in a scene.
arXiv Detail & Related papers (2023-11-28T19:00:02Z)
Magicremover: Tuning-free Text-guided Image inpainting with Diffusion Models [24.690863845885367]
We propose MagicRemover, a tuning-free method that leverages the powerful diffusion models for text-guided image inpainting. We introduce an attention guidance strategy to constrain the sampling process of diffusion models, enabling the erasing of instructed areas and the restoration of occluded content.
arXiv Detail & Related papers (2023-10-04T14:34:11Z)
Inst-Inpaint: Instructing to Remove Objects with Diffusion Models [18.30057229657246]
In this work, we are interested in an image inpainting algorithm that estimates which object to be removed based on natural language input and removes it, simultaneously. We present a novel inpainting framework, Inst-Inpaint, that can remove objects from images based on the instructions given as text prompts.
arXiv Detail & Related papers (2023-04-06T17:29:50Z)
DiffEdit: Diffusion-based semantic image editing with mask guidance [64.555930158319]
DiffEdit is a method to take advantage of text-conditioned diffusion models for the task of semantic image editing. Our main contribution is able to automatically generate a mask highlighting regions of the input image that need to be edited.
arXiv Detail & Related papers (2022-10-20T17:16:37Z)
Perceptual Artifacts Localization for Inpainting [60.5659086595901]
We propose a new learning task of automatic segmentation of inpainting perceptual artifacts. We train advanced segmentation networks on a dataset to reliably localize inpainting artifacts within inpainted images. We also propose a new evaluation metric called Perceptual Artifact Ratio (PAR), which is the ratio of objectionable inpainted regions to the entire inpainted area.
arXiv Detail & Related papers (2022-08-05T18:50:51Z)
LayoutBERT: Masked Language Layout Model for Object Insertion [3.4806267677524896]
We propose layoutBERT for the object insertion task. It uses a novel self-supervised masked language model objective and bidirectional multi-head self-attention. We provide both qualitative and quantitative evaluations on datasets from diverse domains.
arXiv Detail & Related papers (2022-04-30T21:35:38Z)

This list is automatically generated from the titles and abstracts of the papers in this site.