Compositional Transformers for Scene Generation
- URL: http://arxiv.org/abs/2111.08960v1
- Date: Wed, 17 Nov 2021 08:11:42 GMT
- Title: Compositional Transformers for Scene Generation
- Authors: Drew A. Hudson and C. Lawrence Zitnick
- Abstract summary: We introduce the GANformer2 model, an iterative object-oriented transformer, explored for the task of generative modeling.
We show it achieves state-of-the-art performance in terms of visual quality, diversity and consistency.
Further experiments demonstrate the model's disentanglement and provide a deeper insight into its generative process.
- Score: 13.633811200719627
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce the GANformer2 model, an iterative object-oriented transformer,
explored for the task of generative modeling. The network incorporates strong
and explicit structural priors, to reflect the compositional nature of visual
scenes, and synthesizes images through a sequential process. It operates in two
stages: a fast and lightweight planning phase, where we draft a high-level
scene layout, followed by an attention-based execution phase, where the layout
is refined, evolving into a rich and detailed picture. Our model moves
away from conventional black-box GAN architectures that feature a flat and
monolithic latent space towards a transparent design that encourages
efficiency, controllability and interpretability. We demonstrate GANformer2's
strengths and qualities through a careful evaluation over a range of datasets,
from multi-object CLEVR scenes to the challenging COCO images, showing it
successfully achieves state-of-the-art performance in terms of visual quality,
diversity and consistency. Further experiments demonstrate the model's
disentanglement and provide a deeper insight into its generative process, as it
proceeds step-by-step from a rough initial sketch, to a detailed layout that
accounts for objects' depths and dependencies, and up to the final
high-resolution depiction of vibrant and intricate real-world scenes. See
https://github.com/dorarad/gansformer for model implementation.
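As a rough illustration of the plan-then-execute pipeline the abstract describes, the sketch below pairs a lightweight layout planner with an attention-based refinement stage in which image positions attend to per-object latents. All module names, shapes, and layer choices are assumptions made for exposition, not the paper's architecture; see https://github.com/dorarad/gansformer for the actual implementation.

```python
# Minimal sketch of a two-stage, plan-then-execute generator.
# Everything here is illustrative, not the GANformer2 code.
import torch
import torch.nn as nn

class LayoutPlanner(nn.Module):
    """Stage 1: cheaply draft a coarse spatial layout from per-object latents."""
    def __init__(self, latent_dim=128, layout_size=16):
        super().__init__()
        self.layout_size = layout_size
        self.to_layout = nn.Linear(latent_dim, layout_size * layout_size)

    def forward(self, object_latents):                    # (B, K, D)
        B, K, _ = object_latents.shape
        maps = self.to_layout(object_latents)             # (B, K, H*W)
        maps = maps.view(B, K, self.layout_size, self.layout_size)
        return maps.softmax(dim=1)                        # soft object assignment per pixel

class ExecutionStage(nn.Module):
    """Stage 2: refine the layout into an image; pixels attend to object latents."""
    def __init__(self, latent_dim=128, channels=64):
        super().__init__()
        self.to_feat = nn.Conv2d(1, channels, 3, padding=1)
        self.latent_proj = nn.Linear(latent_dim, channels)
        self.attn = nn.MultiheadAttention(channels, num_heads=4, batch_first=True)
        self.to_rgb = nn.Conv2d(channels, 3, 3, padding=1)

    def forward(self, layout, object_latents):
        B, K, H, W = layout.shape
        feat = self.to_feat(layout.sum(dim=1, keepdim=True))  # (B, C, H, W)
        tokens = feat.flatten(2).transpose(1, 2)              # (B, H*W, C)
        latents = self.latent_proj(object_latents)            # (B, K, C)
        refined, _ = self.attn(tokens, latents, latents)      # inject object detail
        refined = refined.transpose(1, 2).reshape(B, -1, H, W)
        return torch.tanh(self.to_rgb(refined))               # (B, 3, H, W)

B, K, D = 2, 8, 128
z = torch.randn(B, K, D)                            # one latent per object
img = ExecutionStage(D)(LayoutPlanner(D)(z), z)     # (2, 3, 16, 16) coarse image
```

In this toy version the planner produces a per-object assignment map (the "where"), and the execution stage fills in appearance by attending from every spatial position back to the object latents, mirroring the layout-then-refinement split described above.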
Related papers
- Architect: Generating Vivid and Interactive 3D Scenes with Hierarchical 2D Inpainting [47.014044892025346]
Architect is a generative framework that creates complex and realistic 3D embodied environments leveraging diffusion-based 2D image inpainting.
The pipeline is further extended into a hierarchical, iterative inpainting process that continually generates placements for large furniture and small objects to enrich the scene.
arXiv Detail & Related papers (2024-11-14T22:15:48Z) - Dynamic Scene Understanding through Object-Centric Voxelization and Neural Rendering [57.895846642868904]
We present a 3D generative model named DynaVol-S for dynamic scenes that enables object-centric learning.
An object-centric voxelization infers per-object occupancy probabilities at individual spatial locations.
Our approach integrates 2D semantic features to create 3D semantic grids, representing the scene through multiple disentangled voxel grids.
arXiv Detail & Related papers (2024-07-30T15:33:58Z) - Evolutive Rendering Models [91.99498492855187]
We present evolutive rendering models, a methodology in which rendering models evolve and adapt dynamically throughout the rendering process.
In particular, we present a comprehensive learning framework that enables the optimization of three principal rendering elements.
A detailed analysis of gradient characteristics is performed to facilitate stable, goal-oriented evolution of these elements.
arXiv Detail & Related papers (2024-05-27T17:40:00Z) - CroCo v2: Improved Cross-view Completion Pre-training for Stereo Matching and Optical Flow [22.161967080759993]
Self-supervised pre-training methods have not yet delivered on dense geometric vision tasks such as stereo matching or optical flow.
We build on the recent cross-view completion framework, a variation of masked image modeling that leverages a second view from the same scene.
We show for the first time that state-of-the-art results on stereo matching and optical flow can be reached without using any classical task-specific techniques.
arXiv Detail & Related papers (2022-11-18T18:18:53Z) - Single Stage Virtual Try-on via Deformable Attention Flows [51.70606454288168]
Virtual try-on aims to generate a photo-realistic fitting result given an in-shop garment and a reference person image.
We develop a novel Deformable Attention Flow (DAFlow) which applies the deformable attention scheme to multi-flow estimation.
Our proposed method achieves state-of-the-art performance both qualitatively and quantitatively.
arXiv Detail & Related papers (2022-07-19T10:01:31Z) - DynaST: Dynamic Sparse Transformer for Exemplar-Guided Image Generation [56.514462874501675]
We propose a dynamic sparse attention based Transformer model to achieve fine-level matching with favorable efficiency.
The heart of our approach is a novel dynamic-attention unit that adapts to the variation in the optimal number of tokens each position should attend to.
Experiments on three applications, pose-guided person image generation, edge-based face synthesis, and undistorted image style transfer, demonstrate that DynaST achieves superior performance in local details.
arXiv Detail & Related papers (2022-07-13T11:12:03Z) - Modeling Image Composition for Complex Scene Generation [77.10533862854706]
We present a method that achieves state-of-the-art results on layout-to-image generation tasks.
After compressing RGB images into patch tokens, we propose the Transformer with Focal Attention (TwFA) to explore object-to-object, object-to-patch, and patch-to-patch dependencies.
arXiv Detail & Related papers (2022-06-02T08:34:25Z) - Cross-View Panorama Image Synthesis [68.35351563852335]
We propose PanoGAN, a novel adversarial feedback GAN framework.
PanoGAN enables high-quality panorama image generation with more convincing details than state-of-the-art approaches.
arXiv Detail & Related papers (2022-03-22T15:59:44Z) - Generative Adversarial Transformers [13.633811200719627]
We introduce the GANsformer, a novel and efficient type of transformer, and explore it for the task of visual generative modeling.
The network employs a bipartite structure that enables long-range interactions across the image while maintaining linear computational efficiency (see the sketch after this list).
We show it achieves state-of-the-art results in terms of image quality and diversity, while enjoying fast learning and better data-efficiency.
arXiv Detail & Related papers (2021-03-01T18:54:04Z)
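The bipartite structure mentioned in the GANsformer entry above can be illustrated with a toy latent-to-image attention layer: with N image tokens and K latents, each attention direction costs O(N·K) per layer rather than the O(N²) of full self-attention over the image, which is why the cost grows linearly in image size. The layer below is a minimal sketch under assumed shapes and names, not the authors' code.

```python
# Toy bipartite (duplex) attention: image tokens and a small set of latents
# attend to each other instead of image tokens attending to themselves.
# Names and dimensions are assumptions for illustration only.
import torch
import torch.nn as nn

class BipartiteAttention(nn.Module):
    def __init__(self, dim=64, heads=4):
        super().__init__()
        # Latents -> image: every pixel queries the K latents.
        self.img_from_latents = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Image -> latents: latents aggregate evidence from the pixels.
        self.latents_from_img = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, img_tokens, latents):      # (B, N, D), (B, K, D)
        img_tokens = img_tokens + self.img_from_latents(img_tokens, latents, latents)[0]
        latents = latents + self.latents_from_img(latents, img_tokens, img_tokens)[0]
        return img_tokens, latents

B, N, K, D = 2, 32 * 32, 16, 64          # 1024 image tokens, 16 latents
img, z = torch.randn(B, N, D), torch.randn(B, K, D)
img, z = BipartiteAttention(D)(img, z)   # attention cost grows linearly in N
```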
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences arising from its use.