Object-Centric Image Generation from Layouts
- URL: http://arxiv.org/abs/2003.07449v2
- Date: Thu, 3 Dec 2020 16:02:11 GMT
- Title: Object-Centric Image Generation from Layouts
- Authors: Tristan Sylvain, Pengchuan Zhang, Yoshua Bengio, R Devon Hjelm, Shikhar Sharma
- Abstract summary: We develop a layout-to-image-generation method to generate complex scenes with multiple objects.
Our method learns representations of the spatial relationships between objects in the scene, which lead to our model's improved layout-fidelity.
We introduce SceneFID, an object-centric adaptation of the popular Fréchet Inception Distance metric that is better suited for multi-object images.
- Score: 93.10217725729468
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite recent impressive results on single-object and single-domain image
generation, the generation of complex scenes with multiple objects remains
challenging. In this paper, we start with the idea that a model must be able to
understand individual objects and relationships between objects in order to
generate complex scenes well. Our layout-to-image-generation method, which we
call Object-Centric Generative Adversarial Network (or OC-GAN), relies on a
novel Scene-Graph Similarity Module (SGSM). The SGSM learns representations of
the spatial relationships between objects in the scene, which lead to our
model's improved layout-fidelity. We also propose changes to the conditioning
mechanism of the generator that enhance its object instance-awareness. Apart
from improving image quality, our contributions mitigate two failure modes in
previous approaches: (1) spurious objects being generated without corresponding
bounding boxes in the layout, and (2) overlapping bounding boxes in the layout
leading to merged objects in images. Extensive quantitative evaluation and
ablation studies demonstrate the impact of our contributions, with our model
outperforming previous state-of-the-art approaches on both the COCO-Stuff and
Visual Genome datasets. Finally, we address an important limitation of
evaluation metrics used in previous works by introducing SceneFID -- an
object-centric adaptation of the popular Fréchet Inception Distance metric
that is better suited for multi-object images.
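The abstract does not spell out the SGSM's internals, but its input is presumably a scene graph of pairwise spatial relations derived from the layout. Below is a minimal, illustrative Python sketch of that derivation step; the relation vocabulary, the center-offset rule, and all function names are assumptions for illustration, not the paper's specification.

```python
# Illustrative sketch: derive pairwise spatial relations from a layout,
# the kind of scene graph a module like the SGSM could consume.
# The relation vocabulary and center-offset rule are assumptions.

def box_center(box):
    """Center of an (x0, y0, x1, y1) box in image coordinates (y grows down)."""
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2.0, (y0 + y1) / 2.0)

def spatial_relation(box_a, box_b):
    """Relation of box_a with respect to box_b, from center offsets."""
    (ax, ay), (bx, by) = box_center(box_a), box_center(box_b)
    dx, dy = bx - ax, by - ay
    if abs(dx) >= abs(dy):
        return "left of" if dx > 0 else "right of"
    return "above" if dy > 0 else "below"

def layout_to_scene_graph(labels, boxes):
    """Build (subject, relation, object) triples for every ordered pair."""
    triples = []
    for i, (label_a, box_a) in enumerate(zip(labels, boxes)):
        for j, (label_b, box_b) in enumerate(zip(labels, boxes)):
            if i != j:
                triples.append((label_a, spatial_relation(box_a, box_b), label_b))
    return triples

# A person whose box sits to the left of a dog's box:
print(layout_to_scene_graph(["person", "dog"],
                            [(0, 0, 40, 100), (60, 40, 120, 100)]))
# [('person', 'left of', 'dog'), ('dog', 'right of', 'person')]
```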
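The abstract defines SceneFID only as an object-centric adaptation of FID; a natural reading is to compute the standard Fréchet distance over per-object crops taken at the layout's bounding boxes rather than over whole images. The sketch below follows that reading; `inception_features` stands in for an externally supplied feature extractor and is an assumption, as are all helper names.

```python
# Plausible sketch of SceneFID: standard FID computed over per-object
# crops (taken at layout boxes) instead of whole images. The helper
# names and the crop-based reading are assumptions from the abstract.
import numpy as np
from scipy import linalg

def crop_objects(image, boxes):
    """Crop each (x0, y0, x1, y1) layout box from an HxWxC image array."""
    return [image[y0:y1, x0:x1] for (x0, y0, x1, y1) in boxes]

def frechet_distance(feats_a, feats_b):
    """FID between feature sets: ||mu_a - mu_b||^2 + Tr(Ca + Cb - 2(Ca Cb)^1/2)."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    covmean = linalg.sqrtm(cov_a @ cov_b)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # discard tiny imaginary parts from sqrtm
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a + cov_b - 2.0 * covmean))

def scene_fid(real_images, fake_images, layouts, inception_features):
    """Pool object crops from all scenes, then compare crop features.

    `inception_features` is an assumed, externally supplied callable that
    resizes crops and returns an (N, D) array of Inception activations.
    """
    real_crops, fake_crops = [], []
    for real, fake, boxes in zip(real_images, fake_images, layouts):
        real_crops += crop_objects(real, boxes)
        fake_crops += crop_objects(fake, boxes)
    return frechet_distance(inception_features(real_crops),
                            inception_features(fake_crops))
```

Under this formulation every object crop contributes a sample, so a scene's many small objects count as much as its large ones, unlike image-level FID.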
Related papers
- SINGAPO: Single Image Controlled Generation of Articulated Parts in Objects [20.978091381109294]
We propose a method to generate articulated objects from a single image.
Our method generates an articulated object that is visually consistent with the input image.
Our experiments show that our method outperforms the state-of-the-art in articulated object creation.
arXiv Detail & Related papers (2024-10-21T20:41:32Z)
- Resolving Multi-Condition Confusion for Finetuning-Free Personalized Image Generation [10.416673784744281]
We propose a weighted-merge method that merges features from multiple reference images into their corresponding objects.
Our method outperforms the state of the art on the Concept101 and DreamBooth datasets for multi-object personalized image generation.
arXiv Detail & Related papers (2024-09-26T15:04:13Z)
- Multi-object Video Generation from Single Frame Layouts [84.55806837855846]
We propose a video generative framework capable of synthesizing global scenes with local objects.
Our framework is a non-trivial adaptation of image generation methods and is new to this field.
Our model has been evaluated on two widely-used video recognition benchmarks.
arXiv Detail & Related papers (2023-05-06T09:07:01Z)
- Taming Encoder for Zero Fine-tuning Image Customization with Text-to-Image Diffusion Models [55.04969603431266]
This paper proposes a method for generating images of customized objects specified by users.
The method is based on a general framework that bypasses the lengthy optimization required by previous approaches.
We demonstrate through experiments that our proposed method is able to synthesize images with compelling output quality, appearance diversity, and object fidelity.
arXiv Detail & Related papers (2023-04-05T17:59:32Z)
- MegaPose: 6D Pose Estimation of Novel Objects via Render & Compare [84.80956484848505]
MegaPose is a method to estimate the 6D pose of novel objects, that is, objects unseen during training.
First, we present a 6D pose refiner based on a render-and-compare strategy that can be applied to novel objects.
Second, we introduce a novel approach for coarse pose estimation which leverages a network trained to classify whether the pose error between a synthetic rendering and an observed image of the same object can be corrected by the refiner.
arXiv Detail & Related papers (2022-12-13T19:30:03Z)
- Salient Objects in Clutter [130.63976772770368]
This paper identifies and addresses a serious design bias of existing salient object detection (SOD) datasets.
This design bias has led to a saturation in performance for state-of-the-art SOD models when evaluated on existing datasets.
We propose a new high-quality dataset and update the previous saliency benchmark.
arXiv Detail & Related papers (2021-05-07T03:49:26Z)
- CoSformer: Detecting Co-Salient Object with Transformers [2.3148470932285665]
Co-Salient Object Detection (CoSOD) aims at simulating the human visual system to discover the common and salient objects from a group of relevant images.
We propose the Co-Salient Object Detection Transformer (CoSformer) network to capture both salient and common visual patterns from multiple images.
arXiv Detail & Related papers (2021-04-30T02:39:12Z)
- Exploiting Relationship for Complex-scene Image Generation [43.022978211274065]
This work explores relationship-aware complex-scene image generation, where multiple objects are inter-related as a scene graph.
We propose three major updates in the generation framework. First, reasonable spatial layouts are inferred by jointly considering the semantics and relationships among objects.
Second, since the relations between objects significantly influence an object's appearance, we design a relation-guided generator to generate objects reflecting their relationships.
Third, a novel scene graph discriminator is proposed to guarantee the consistency between the generated image and the input scene graph.
arXiv Detail & Related papers (2021-04-01T09:21:39Z)
- Context-Aware Layout to Image Generation with Enhanced Object Appearance [123.62597976732948]
A layout-to-image (L2I) generation model aims to generate a complicated image containing multiple objects (things) against a natural background (stuff).
Existing L2I models have made great progress, but object-to-object and object-to-stuff relations are often broken.
We argue that these failures are caused by the lack of context-aware object and stuff feature encoding in their generators, and of location-sensitive appearance representation in their discriminators.
arXiv Detail & Related papers (2021-03-22T14:43:25Z)