LayoutBERT: Masked Language Layout Model for Object Insertion
- URL: http://arxiv.org/abs/2205.00347v1
- Date: Sat, 30 Apr 2022 21:35:38 GMT
- Title: LayoutBERT: Masked Language Layout Model for Object Insertion
- Authors: Kerem Turgutlu, Sanat Sharma and Jayant Kumar
- Abstract summary: We propose LayoutBERT for the object insertion task.
It uses a novel self-supervised masked language model objective and bidirectional multi-head self-attention.
We provide both qualitative and quantitative evaluations on datasets from diverse domains.
- Score: 3.4806267677524896
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Image compositing is one of the most fundamental steps in creative workflows.
It involves taking objects/parts of several images to create a new image,
called a composite. Currently, this process is done manually by creating
accurate masks of objects to be inserted and carefully blending them with the
target scene or images, usually with the help of tools such as Photoshop or
GIMP. While there have been several works on the automatic selection of
objects for creating masks, placing an object within an image at the correct
position and scale, and in harmony with the scene, remains difficult and
underexplored. Automatic object insertion in images or designs is challenging
because it requires an understanding of scene geometry and of the color
harmony between objects. We propose LayoutBERT for the object insertion task.
It uses a novel self-supervised masked language model objective and
bidirectional multi-head self-attention. It outperforms previous layout-based
likelihood models and shows favorable properties in terms of model capacity. We
demonstrate the effectiveness of our approach for object insertion in the image
compositing setting and other settings like documents and design templates. We
further demonstrate the usefulness of the learned representations for
layout-based retrieval tasks. We provide both qualitative and quantitative
evaluations on datasets from diverse domains such as COCO, PubLayNet, and two
new datasets which we call Image Layouts and Template Layouts. Image Layouts,
which consists of 5.8 million images with layout annotations, is to our
knowledge the largest image layout dataset. We also share ablation study
results on the effects of dataset size, model size, and class sample size for
this task.
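The abstract describes the mechanism but ships no code; the sketch below shows one way a BERT-style masked objective over serialized layout tokens with a bidirectional transformer encoder can look. The five-token-per-element serialization, vocabulary sizes, and hyperparameters are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a BERT-style masked layout model (an assumed design, not
# the authors' code): each layout element is serialized into five discrete
# tokens (class, quantized x, y, w, h); random tokens are masked, and a
# bidirectional transformer encoder is trained to recover them.
import torch
import torch.nn as nn

NUM_BINS = 32        # coordinate quantization bins (assumption)
NUM_CLASSES = 80     # e.g. COCO object classes
VOCAB = NUM_CLASSES + NUM_BINS + 2          # + [PAD], [MASK]
PAD, MASK = VOCAB - 2, VOCAB - 1

class MaskedLayoutModel(nn.Module):
    def __init__(self, d_model=256, nhead=8, depth=6, max_len=128):
        super().__init__()
        self.tok = nn.Embedding(VOCAB, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead,
                                           dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)   # bidirectional
        self.head = nn.Linear(d_model, VOCAB)

    def forward(self, tokens):              # tokens: (B, L) int64
        pos = torch.arange(tokens.size(1), device=tokens.device)
        h = self.encoder(self.tok(tokens) + self.pos(pos))
        return self.head(h)                 # (B, L, VOCAB) logits

def mask_tokens(tokens, p=0.15):
    """BERT-style corruption: hide a random 15% of the layout tokens."""
    masked = tokens.clone()
    hit = (torch.rand(tokens.shape) < p) & (tokens != PAD)
    masked[hit] = MASK
    return masked, hit

model = MaskedLayoutModel()
tokens = torch.randint(0, NUM_CLASSES + NUM_BINS, (4, 40))   # toy batch
inp, hit = mask_tokens(tokens)
logits = model(inp)
loss = nn.functional.cross_entropy(logits[hit], tokens[hit])
loss.backward()
```

At insertion time, one would append the new element's class token plus four [MASK] coordinate tokens and decode position and scale from the predicted token distributions.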
Related papers
- GroundingBooth: Grounding Text-to-Image Customization [17.185571339157075]
We introduce GroundingBooth, a framework that achieves zero-shot instance-level spatial grounding on both foreground subjects and background objects.
Our proposed text-image grounding module and masked cross-attention layer allow us to generate personalized images with both accurate layout alignment and identity preservation.
arXiv Detail & Related papers (2024-09-13T03:40:58Z)
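The GroundingBooth summary above names a masked cross-attention layer without spelling it out; below is a hedged sketch of one common formulation, in which each image-patch query may only attend to tokens whose grounding box covers it. Shapes, the uniform-attention fallback, and all names are assumptions rather than the paper's code.

```python
# Illustrative sketch of layout-masked cross-attention (an assumption about
# how GroundingBooth-style grounding can be realized, not the paper's code):
# each image-patch query may only attend to text/subject tokens whose
# grounding box covers the patch's location.
import torch
import torch.nn.functional as F

def masked_cross_attention(q, k, v, boxes, hw):
    """q: (B, HW, D) patch queries; k, v: (B, T, D) token keys/values;
    boxes: (B, T, 4) normalized (x0, y0, x1, y1) per token; hw: (H, W)."""
    B, HW, D = q.shape
    H, W = hw
    ys = (torch.arange(H).float() + 0.5) / H            # patch centers
    xs = (torch.arange(W).float() + 0.5) / W
    yy, xx = torch.meshgrid(ys, xs, indexing="ij")
    cx, cy = xx.reshape(-1), yy.reshape(-1)             # (HW,)

    inside = ((cx[None, :, None] >= boxes[:, None, :, 0]) &
              (cy[None, :, None] >= boxes[:, None, :, 1]) &
              (cx[None, :, None] <= boxes[:, None, :, 2]) &
              (cy[None, :, None] <= boxes[:, None, :, 3]))  # (B, HW, T)

    attn = q @ k.transpose(1, 2) / D ** 0.5             # (B, HW, T)
    attn = attn.masked_fill(~inside, float("-inf"))
    # Fallback: a patch covered by no box attends uniformly instead of NaN-ing.
    no_box = ~inside.any(-1, keepdim=True)
    attn = torch.where(no_box, torch.zeros_like(attn), attn)
    return F.softmax(attn, dim=-1) @ v                  # (B, HW, D)

# Toy call: three tokens grounded to three boxes on an 8x8 patch grid.
boxes = torch.tensor([[[0.0, 0.0, 0.5, 0.5],
                       [0.5, 0.0, 1.0, 1.0],
                       [0.0, 0.5, 1.0, 1.0]]])
out = masked_cross_attention(torch.randn(1, 64, 32), torch.randn(1, 3, 32),
                             torch.randn(1, 3, 32), boxes, (8, 8))
```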
- EraseDraw: Learning to Insert Objects by Erasing Them from Images [24.55843674256795]
Prior works often fail by making global changes to the image, inserting objects in unrealistic spatial locations, and generating inaccurate lighting details.
We observe that while state-of-the-art models perform poorly on object insertion, they can remove objects and erase the background in natural images very well.
We show compelling results on diverse insertion prompts and images across various domains.
arXiv Detail & Related papers (2024-08-31T18:37:48Z)
- DiffUHaul: A Training-Free Method for Object Dragging in Images [78.93531472479202]
We propose a training-free method, dubbed DiffUHaul, for the object dragging task.
We first apply attention masking in each denoising step to make the generation more disentangled across different objects.
In the early denoising steps, we interpolate the attention features between source and target images to smoothly fuse new layouts with the original appearance.
arXiv Detail & Related papers (2024-06-03T17:59:53Z)
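DiffUHaul's summary names two ingredients: per-step attention masking and interpolating attention features between source and target during the early denoising steps. The toy schedule below illustrates the interpolation half; the linear decay and cutoff fraction are assumptions, not the paper's exact schedule.

```python
# Toy sketch of DiffUHaul-style attention-feature blending (the linear decay
# and cutoff fraction are assumptions, not the paper's exact schedule): early
# denoising steps lean on the source image's attention features so the moved
# object keeps its appearance; later steps follow the new layout alone.
import torch

def blend_weight(step, num_steps, early_frac=0.4):
    """1.0 at step 0, decaying linearly to 0.0 at `early_frac * num_steps`."""
    cutoff = early_frac * num_steps
    return max(0.0, 1.0 - step / cutoff)

def interpolate_attention(src_feats, tgt_feats, step, num_steps):
    """Blend source/target attention features for one denoising step."""
    w = blend_weight(step, num_steps)
    return w * src_feats + (1.0 - w) * tgt_feats

num_steps = 50
src = torch.randn(1, 8, 64, 64)   # attention features from the source pass
tgt = torch.randn(1, 8, 64, 64)   # attention features under the new layout
for t in range(num_steps):
    fused = interpolate_attention(src, tgt, t, num_steps)
    # ...`fused` would be injected into the denoiser's attention at step t...
```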
- Paint by Inpaint: Learning to Add Image Objects by Removing Them First [8.399234415641319]
We train a diffusion model to invert the inpainting process, effectively adding objects into images.
We provide detailed descriptions of the removed objects and use a Large Language Model to convert these descriptions into diverse, natural-language instructions.
arXiv Detail & Related papers (2024-04-28T15:07:53Z)
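Paint by Inpaint's supervision runs object removal backwards: inpaint an object away, then train on the pair (object-free image, instruction) mapping back to the original image. A schematic of that data pipeline follows; the callables `inpaint` and `llm` are assumed stand-ins for the removal stage and the instruction-rewriting language model.

```python
# Schematic of a Paint-by-Inpaint style data pipeline (structure inferred
# from the summary; `inpaint` and `llm` are assumed stand-ins, not the
# paper's components). Training runs removal backwards: the model learns to
# map (object-free image, instruction) -> original image with the object.
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class InsertionExample:
    source: Any       # image with the object inpainted away
    target: Any       # original image, object present
    instruction: str  # e.g. "add a brown dog on the grass"

def build_example(image: Any, mask: Any, description: str,
                  inpaint: Callable, llm: Callable) -> InsertionExample:
    """inpaint(image, mask) removes the masked object; llm(description)
    rewrites the detailed object description as a short add-instruction."""
    erased = inpaint(image, mask)     # object removed, scene preserved
    instruction = llm(description)    # "a brown dog ..." -> "add a brown dog ..."
    return InsertionExample(source=erased, target=image, instruction=instruction)
```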
- Customizing Text-to-Image Diffusion with Camera Viewpoint Control [53.621518249820745]
We introduce a new task: enabling explicit control of camera viewpoint for model customization.
This allows us to modify object properties amongst various background scenes via text prompts.
We propose to condition the 2D diffusion process on rendered, view-dependent features of the new object.
arXiv Detail & Related papers (2024-04-18T16:59:51Z)
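The one mechanism the summary above states is conditioning the 2D diffusion process on rendered, view-dependent features of the new object. One simple realization, concatenating rendered feature maps onto the denoiser's input channels, is sketched below; the concatenation strategy and the toy network are assumptions, not the paper's architecture.

```python
# Minimal sketch of conditioning a denoiser on rendered, view-dependent
# features (channel concatenation is an assumed conditioning strategy; the
# toy network stands in for a UNet): features rendered from the requested
# camera pose are stacked onto the noisy latent at every denoising step.
import torch
import torch.nn as nn

class ViewConditionedDenoiser(nn.Module):
    def __init__(self, latent_ch=4, feat_ch=16, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(latent_ch + feat_ch, hidden, 3, padding=1),
            nn.SiLU(),
            nn.Conv2d(hidden, latent_ch, 3, padding=1),
        )

    def forward(self, noisy_latent, view_feats):
        # view_feats: (B, feat_ch, H, W), rendered for the target camera pose
        return self.net(torch.cat([noisy_latent, view_feats], dim=1))

denoiser = ViewConditionedDenoiser()
latent = torch.randn(2, 4, 32, 32)
feats = torch.randn(2, 16, 32, 32)   # assumed renderer output for the pose
eps = denoiser(latent, feats)        # predicted noise, shape (2, 4, 32, 32)
```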
- Outline-Guided Object Inpainting with Diffusion Models [11.391452115311798]
Instance segmentation datasets play a crucial role in training accurate and robust computer vision models.
We show how this issue can be mitigated by starting with small annotated instance segmentation datasets and augmenting them to obtain a sizeable annotated dataset.
We generate new images using a diffusion-based inpainting model to fill out the masked area with a desired object class by guiding the diffusion through the object outline.
arXiv Detail & Related papers (2024-02-26T09:21:17Z)
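The outline-guided augmentation step can be approximated with an off-the-shelf inpainting pipeline: rasterize the object outline into the inpainting mask and let a text prompt name the desired class. The checkpoint, prompt, and polygon below are illustrative; the paper's exact outline-guidance mechanism may differ.

```python
# Hedged sketch of outline-guided inpainting for dataset augmentation with an
# off-the-shelf diffusers pipeline (checkpoint, prompt, and polygon are
# illustrative; the paper's exact outline-guidance mechanism may differ).
from PIL import Image, ImageDraw
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting")   # assumed checkpoint

def outline_to_mask(outline_xy, size):
    """Rasterize a polygon outline into a binary inpainting mask."""
    mask = Image.new("L", size, 0)
    ImageDraw.Draw(mask).polygon(outline_xy, fill=255)
    return mask

image = Image.open("scene.jpg").convert("RGB").resize((512, 512))
outline = [(180, 200), (320, 190), (340, 380), (170, 390)]   # toy outline
mask = outline_to_mask(outline, image.size)

# The outline does double duty: it bounds where the object is painted, and
# the same polygon becomes the new instance's segmentation annotation.
result = pipe(prompt="a cardboard parcel", image=image, mask_image=mask).images[0]
result.save("augmented.png")
```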
- High-Quality Entity Segmentation [110.55724145851725]
CropFormer is designed to tackle the intractability of instance-level segmentation on high-resolution images.
It improves mask prediction by fusing the full image with high-resolution crops that supply finer-grained detail.
With CropFormer, we achieve a significant AP gain of $1.9$ on the challenging entity segmentation task.
arXiv Detail & Related papers (2022-11-10T18:58:22Z)
- Scrape, Cut, Paste and Learn: Automated Dataset Generation Applied to Parcel Logistics [58.720142291102135]
We present a fully automated pipeline to generate a synthetic dataset for instance segmentation in four steps.
We first scrape images for the objects of interest from popular image search engines.
We compare three methods for image selection: object-agnostic pre-processing, manual image selection, and CNN-based image selection.
arXiv Detail & Related papers (2022-10-18T12:49:04Z)
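The "cut, paste" step of the pipeline above is straightforward to sketch: paste an alpha-matted cut-out onto a background and reuse its alpha channel as the ground-truth instance mask. File names, the scale range, and the placement policy below are toy assumptions.

```python
# Toy sketch of the "cut, paste" step for synthetic instance-segmentation
# data (file names, scale range, and placement policy are assumptions; the
# cut-out is assumed smaller than the background): the pasted object's alpha
# channel doubles as its ground-truth instance mask.
import random
from PIL import Image

def paste_object(background, cutout, scale_range=(0.3, 0.8)):
    bg = background.copy()
    s = random.uniform(*scale_range)
    obj = cutout.resize((int(cutout.width * s), int(cutout.height * s)))
    x = random.randint(0, bg.width - obj.width)
    y = random.randint(0, bg.height - obj.height)
    bg.paste(obj, (x, y), obj)                 # RGBA alpha acts as paste mask
    mask = Image.new("L", bg.size, 0)          # instance mask on empty canvas
    mask.paste(obj.split()[-1], (x, y))
    return bg, mask

background = Image.open("warehouse.jpg").convert("RGB")
cutout = Image.open("parcel.png").convert("RGBA")   # scraped, cut-out object
composite, instance_mask = paste_object(background, cutout)
composite.save("synthetic.png")
instance_mask.save("synthetic_mask.png")
```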
- Scene Graph to Image Generation with Contextualized Object Layout Refinement [92.85331019618332]
We propose a novel method to generate images from scene graphs.
Our approach improves layout coverage by almost 20 points and reduces object overlap to negligible amounts.
arXiv Detail & Related papers (2020-09-23T06:27:54Z)
- Object-Centric Image Generation from Layouts [93.10217725729468]
We develop a layout-to-image-generation method to generate complex scenes with multiple objects.
Our method learns representations of the spatial relationships between objects in the scene, which lead to our model's improved layout-fidelity.
We introduce SceneFID, an object-centric adaptation of the popular Fréchet Inception Distance metric, which is better suited for multi-object images.
arXiv Detail & Related papers (2020-03-16T21:40:09Z)
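SceneFID is described only as an object-centric adaptation of FID. The natural reading, running a standard FID over object crops taken at layout boxes instead of over whole images, is sketched below; the crop-based formulation is an assumption, and `fid_score` stands in for any existing FID implementation.

```python
# Sketch of an object-centric FID in the spirit of SceneFID (the crop-based
# formulation is an assumption; `fid_score` stands in for any standard FID
# implementation): crop every object at its layout box from both real and
# generated images, then run ordinary FID over the two sets of crops.
def crop_objects(images, layouts):
    """images: PIL images; layouts: per-image lists of (x0, y0, x1, y1)."""
    crops = []
    for img, boxes in zip(images, layouts):
        for box in boxes:
            crops.append(img.crop(box).resize((299, 299)))  # Inception size
    return crops

def scene_fid(real_images, fake_images, layouts, fid_score):
    """fid_score(set_a, set_b) -> float, e.g. from a standard FID package."""
    real_crops = crop_objects(real_images, layouts)
    fake_crops = crop_objects(fake_images, layouts)
    return fid_score(real_crops, fake_crops)
```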