Compositional Transformers for Scene Generation
- URL: http://arxiv.org/abs/2111.08960v1
- Date: Wed, 17 Nov 2021 08:11:42 GMT
- Title: Compositional Transformers for Scene Generation
- Authors: Drew A. Hudson and C. Lawrence Zitnick
- Abstract summary: We introduce the GANformer2 model, an iterative object-oriented transformer, explored for the task of generative modeling.
We show it achieves state-of-the-art performance in terms of visual quality, diversity and consistency.
Further experiments demonstrate the model's disentanglement and provide a deeper insight into its generative process.
- Score: 13.633811200719627
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce the GANformer2 model, an iterative object-oriented transformer,
explored for the task of generative modeling. The network incorporates strong
and explicit structural priors, to reflect the compositional nature of visual
scenes, and synthesizes images through a sequential process. It operates in two
stages: a fast and lightweight planning phase, where we draft a high-level
scene layout, followed by an attention-based execution phase, where the layout
is refined, evolving into a rich and detailed picture. Our model moves
away from conventional black-box GAN architectures that feature a flat and
monolithic latent space towards a transparent design that encourages
efficiency, controllability and interpretability. We demonstrate GANformer2's
strengths and qualities through a careful evaluation over a range of datasets,
from multi-object CLEVR scenes to the challenging COCO images, showing it
successfully achieves state-of-the-art performance in terms of visual quality,
diversity and consistency. Further experiments demonstrate the model's
disentanglement and provide a deeper insight into its generative process, as it
proceeds step-by-step from a rough initial sketch, to a detailed layout that
accounts for objects' depths and dependencies, and up to the final
high-resolution depiction of vibrant and intricate real-world scenes. See
https://github.com/dorarad/gansformer for model implementation.
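As a rough illustration of the plan-then-execute pipeline the abstract describes, the sketch below pairs a lightweight layout planner with an attention-based refinement stage in which image positions attend to per-object latents. All module names, shapes, and layer choices are assumptions made for exposition, not the paper's architecture; see https://github.com/dorarad/gansformer for the actual implementation.

```python
# Minimal sketch of a two-stage, plan-then-execute generator.
# Everything here is illustrative, not the GANformer2 code.
import torch
import torch.nn as nn

class LayoutPlanner(nn.Module):
    """Stage 1: cheaply draft a coarse spatial layout from per-object latents."""
    def __init__(self, latent_dim=128, layout_size=16):
        super().__init__()
        self.layout_size = layout_size
        self.to_layout = nn.Linear(latent_dim, layout_size * layout_size)

    def forward(self, object_latents):                    # (B, K, D)
        B, K, _ = object_latents.shape
        maps = self.to_layout(object_latents)             # (B, K, H*W)
        maps = maps.view(B, K, self.layout_size, self.layout_size)
        return maps.softmax(dim=1)                        # soft object assignment per pixel

class ExecutionStage(nn.Module):
    """Stage 2: refine the layout into an image; pixels attend to object latents."""
    def __init__(self, latent_dim=128, channels=64):
        super().__init__()
        self.to_feat = nn.Conv2d(1, channels, 3, padding=1)
        self.latent_proj = nn.Linear(latent_dim, channels)
        self.attn = nn.MultiheadAttention(channels, num_heads=4, batch_first=True)
        self.to_rgb = nn.Conv2d(channels, 3, 3, padding=1)

    def forward(self, layout, object_latents):
        B, K, H, W = layout.shape
        feat = self.to_feat(layout.sum(dim=1, keepdim=True))  # (B, C, H, W)
        tokens = feat.flatten(2).transpose(1, 2)              # (B, H*W, C)
        latents = self.latent_proj(object_latents)            # (B, K, C)
        refined, _ = self.attn(tokens, latents, latents)      # inject object detail
        refined = refined.transpose(1, 2).reshape(B, -1, H, W)
        return torch.tanh(self.to_rgb(refined))               # (B, 3, H, W)

B, K, D = 2, 8, 128
z = torch.randn(B, K, D)                            # one latent per object
img = ExecutionStage(D)(LayoutPlanner(D)(z), z)     # (2, 3, 16, 16) coarse image
```

In this toy version the planner produces a per-object assignment map (the "where"), and the execution stage fills in appearance by attending from every spatial position back to the object latents, mirroring the layout-then-refinement split described above.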
Related papers
- Architect: Generating Vivid and Interactive 3D Scenes with Hierarchical 2D Inpainting [47.014044892025346]
Architect is a generative framework that creates complex and realistic 3D embodied environments leveraging diffusion-based 2D image inpainting.
The pipeline is further extended into a hierarchical, iterative inpainting process that continually generates placements for large furniture and small objects to enrich the scene.
arXiv Detail & Related papers (2024-11-14T22:15:48Z) - Dynamic Scene Understanding through Object-Centric Voxelization and Neural Rendering [57.895846642868904]
We present a 3D generative model named DynaVol-S for dynamic scenes that enables object-centric learning.
An object-centric voxelization infers per-object occupancy probabilities at individual spatial locations.
Our approach integrates 2D semantic features to create 3D semantic grids, representing the scene through multiple disentangled voxel grids.
arXiv Detail & Related papers (2024-07-30T15:33:58Z) - Evolutive Rendering Models [91.99498492855187]
We present evolutive rendering models, a methodology in which rendering models evolve and adapt dynamically throughout the rendering process.
In particular, we present a comprehensive learning framework that enables the optimization of three principal rendering elements.
A detailed analysis of gradient characteristics is performed to facilitate stable, goal-oriented evolution of these elements.
arXiv Detail & Related papers (2024-05-27T17:40:00Z) - CroCo v2: Improved Cross-view Completion Pre-training for Stereo Matching and Optical Flow [22.161967080759993]
Self-supervised pre-training methods have not yet delivered on dense geometric vision tasks such as stereo matching or optical flow.
We build on the recent cross-view completion framework, a variation of masked image modeling that leverages a second view from the same scene.
We show for the first time that state-of-the-art results on stereo matching and optical flow can be reached without using any classical task-specific techniques.
arXiv Detail & Related papers (2022-11-18T18:18:53Z) - Single Stage Virtual Try-on via Deformable Attention Flows [51.70606454288168]
Virtual try-on aims to generate a photo-realistic fitting result given an in-shop garment and a reference person image.
We develop a novel Deformable Attention Flow (DAFlow) which applies the deformable attention scheme to multi-flow estimation.
Our proposed method achieves state-of-the-art performance both qualitatively and quantitatively.
arXiv Detail & Related papers (2022-07-19T10:01:31Z) - DynaST: Dynamic Sparse Transformer for Exemplar-Guided Image Generation [56.514462874501675]
We propose a dynamic sparse attention based Transformer model to achieve fine-level matching with favorable efficiency.
The heart of our approach is a novel dynamic-attention unit that adapts to the variation in the optimal number of tokens each position should attend to.
Experiments on three applications, pose-guided person image generation, edge-based face synthesis, and undistorted image style transfer, demonstrate that DynaST achieves superior performance in local details.
arXiv Detail & Related papers (2022-07-13T11:12:03Z) - Modeling Image Composition for Complex Scene Generation [77.10533862854706]
We present a method that achieves state-of-the-art results on layout-to-image generation tasks.
After compressing RGB images into patch tokens, we propose the Transformer with Focal Attention (TwFA) to explore object-to-object, object-to-patch, and patch-to-patch dependencies.
arXiv Detail & Related papers (2022-06-02T08:34:25Z) - Cross-View Panorama Image Synthesis [68.35351563852335]
We propose PanoGAN, a novel adversarial feedback GAN framework.
PanoGAN enables high-quality panorama image generation with more convincing details than state-of-the-art approaches.
arXiv Detail & Related papers (2022-03-22T15:59:44Z) - Generative Adversarial Transformers [13.633811200719627]
We introduce the GANsformer, a novel and efficient type of transformer, and explore it for the task of visual generative modeling.
The network employs a bipartite structure that enables long-range interactions across the image while maintaining linear computational efficiency (see the sketch after this list).
We show it achieves state-of-the-art results in terms of image quality and diversity, while enjoying fast learning and better data-efficiency.
arXiv Detail & Related papers (2021-03-01T18:54:04Z)
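The bipartite structure mentioned in the GANsformer entry above can be illustrated with a toy latent-to-image attention layer: with N image tokens and K latents, each attention direction costs O(N·K) per layer rather than the O(N²) of full self-attention over the image, which is why the cost grows linearly in image size. The layer below is a minimal sketch under assumed shapes and names, not the authors' code.

```python
# Toy bipartite (duplex) attention: image tokens and a small set of latents
# attend to each other instead of image tokens attending to themselves.
# Names and dimensions are assumptions for illustration only.
import torch
import torch.nn as nn

class BipartiteAttention(nn.Module):
    def __init__(self, dim=64, heads=4):
        super().__init__()
        # Latents -> image: every pixel queries the K latents.
        self.img_from_latents = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Image -> latents: latents aggregate evidence from the pixels.
        self.latents_from_img = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, img_tokens, latents):      # (B, N, D), (B, K, D)
        img_tokens = img_tokens + self.img_from_latents(img_tokens, latents, latents)[0]
        latents = latents + self.latents_from_img(latents, img_tokens, img_tokens)[0]
        return img_tokens, latents

B, N, K, D = 2, 32 * 32, 16, 64          # 1024 image tokens, 16 latents
img, z = torch.randn(B, N, D), torch.randn(B, K, D)
img, z = BipartiteAttention(D)(img, z)   # attention cost grows linearly in N
```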
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences arising from its use.