Compositional Scene Understanding through Inverse Generative Modeling
- URL: http://arxiv.org/abs/2505.21780v4
- Date: Mon, 23 Jun 2025 19:26:04 GMT
- Title: Compositional Scene Understanding through Inverse Generative Modeling
- Authors: Yanbo Wang, Justin Dauwels, Yilun Du
- Abstract summary: We explore how generative models can be used to understand the properties of a scene given a natural image. We build a visual generative model compositionally from smaller models over pieces of a scene. We illustrate how this enables us to infer global scene factors, likewise enabling robust generalization to new scenes.
- Score: 38.312556839792386
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Generative models have demonstrated remarkable abilities in generating high-fidelity visual content. In this work, we explore how generative models can further be used not only to synthesize visual content but also to understand the properties of a scene given a natural image. We formulate scene understanding as an inverse generative modeling problem, where we seek to find conditional parameters of a visual generative model that best fit a given natural image. To enable this procedure to infer scene structure from images substantially different from those seen during training, we further propose to build this visual generative model compositionally from smaller models over pieces of a scene. We illustrate how this procedure enables us to infer the set of objects in a scene, enabling robust generalization to new test scenes with an increased number of objects of new shapes. We further illustrate how this enables us to infer global scene factors, likewise enabling robust generalization to new scenes. Finally, we illustrate how this approach can be directly applied to existing pretrained text-to-image generative models for zero-shot multi-object perception. Code and visualizations are at https://energy-based-model.github.io/compositional-inference.
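To make the formulation above concrete, the following is a minimal, hypothetical sketch of inverse generative modeling with a frozen conditional diffusion model: the conditioning parameters that best explain an observed image are found by minimizing the model's denoising loss on that image. The `denoiser` interface, the 64-dimensional conditioning vector, and the toy cosine noise schedule are illustrative assumptions, not the paper's actual implementation.

```python
import torch

def infer_scene_parameters(denoiser, image, num_steps=500, lr=1e-2):
    """Fit conditioning parameters of a frozen generative model to a single image.

    Assumes `denoiser(x_t, t, cond)` is a pretrained conditional diffusion model
    that predicts the noise added at timestep `t`; `cond` is a learnable embedding
    standing in for the scene description we want to infer.
    """
    cond = torch.zeros(1, 64, requires_grad=True)   # hypothetical conditioning vector
    optimizer = torch.optim.Adam([cond], lr=lr)

    for _ in range(num_steps):
        t = torch.randint(0, 1000, (1,))                        # random diffusion timestep
        noise = torch.randn_like(image)
        alpha_bar = torch.cos(t / 1000.0 * torch.pi / 2) ** 2   # toy cosine noise schedule
        x_t = alpha_bar.sqrt() * image + (1.0 - alpha_bar).sqrt() * noise

        # Inverse generative modeling: the conditioning that best explains the
        # observed image is the one that minimizes the denoising objective.
        loss = ((denoiser(x_t, t, cond) - noise) ** 2).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    return cond.detach()
```

In the compositional setting described in the abstract, `denoiser` would itself be a composition of smaller per-object or per-factor models, and `cond` would hold one parameter block per component.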
Related papers
- MCGM: Mask Conditional Text-to-Image Generative Model [1.909929271850469]
We propose MCGM, a novel Mask Conditional Text-to-Image Generative Model.
Our model builds upon the success of the Break-a-scene [1] model in generating new scenes using a single image with multiple subjects.
By introducing this additional level of control, MCGM offers a flexible and intuitive approach for generating specific poses for one or more subjects learned from a single image.
arXiv Detail & Related papers (2024-10-01T08:13:47Z)
- UpFusion: Novel View Diffusion from Unposed Sparse View Observations [66.36092764694502]
UpFusion can perform novel view synthesis and infer 3D representations for an object given a sparse set of reference images.
We show that this mechanism allows generating high-fidelity novel views while improving the synthesis quality given additional (unposed) images.
arXiv Detail & Related papers (2023-12-11T18:59:55Z)
- Diffusion Self-Guidance for Controllable Image Generation [106.59989386924136]
Self-guidance provides greater control over generated images by guiding the internal representations of diffusion models.
We show how a simple set of properties can be composed to perform challenging image manipulations.
We also show that self-guidance can be used to edit real images.
arXiv Detail & Related papers (2023-06-01T17:59:56Z)
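As a hedged sketch of the self-guidance mechanism above (the notation here is ours, not necessarily the paper's), it can be viewed as adding to the usual noise prediction a gradient term computed from the denoiser's internal representations:

$$\hat{\epsilon}_t \;=\; \epsilon_\theta(x_t, t, c) \;+\; s\,\sigma_t\,\nabla_{x_t}\, g\big(\Psi_\theta(x_t, t, c)\big),$$

where $\Psi_\theta$ denotes internal activations such as attention maps, $g$ is a differentiable property defined on them (e.g., an object's position or size), and $s$ is a guidance scale; composing several such properties yields the image manipulations described above.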
- Multi-object Video Generation from Single Frame Layouts [84.55806837855846]
We propose a video generative framework capable of synthesizing global scenes with local objects.
Our framework is a non-trivial adaptation of image generation methods and is new to this field.
Our model has been evaluated on two widely used video recognition benchmarks.
arXiv Detail & Related papers (2023-05-06T09:07:01Z)
- Compositional Visual Generation with Composable Diffusion Models [80.75258849913574]
We propose an alternative structured approach for compositional generation using diffusion models.
An image is generated by composing a set of diffusion models, with each of them modeling a certain component of the image.
The proposed method can generate scenes at test time that are substantially more complex than those seen in training.
arXiv Detail & Related papers (2022-06-03T17:47:04Z)
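As a rough illustration of this composition (the exact operators are defined in the paper), the conjunction of conditions $c_1, \dots, c_n$ can be sampled from a combined noise prediction of the form

$$\tilde{\epsilon}(x_t, t) \;=\; \epsilon_\theta(x_t, t) \;+\; \sum_{i=1}^{n} w_i \big(\epsilon_\theta(x_t, t \mid c_i) - \epsilon_\theta(x_t, t)\big),$$

where each $c_i$ conditions on one component of the image and the weight $w_i$ controls its influence; the combined prediction is then used in the standard reverse diffusion update.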
- Learning Generative Models of Textured 3D Meshes from Real-World Images [26.353307246909417]
We propose a GAN framework for generating textured triangle meshes without relying on keypoint annotations.
We show that the performance of our approach is on par with prior work that relies on ground-truth keypoints.
arXiv Detail & Related papers (2021-03-29T14:07:37Z)
- Dynamic Neural Radiance Fields for Monocular 4D Facial Avatar Reconstruction [9.747648609960185]
We present dynamic neural radiance fields for modeling the appearance and dynamics of a human face.
In particular, for telepresence applications in AR or VR, a faithful reproduction of the appearance, including novel viewpoints and head poses, is required.
arXiv Detail & Related papers (2020-12-05T16:01:16Z)
- Neural Scene Graphs for Dynamic Scenes [57.65413768984925]
We present the first neural rendering method that decomposes dynamic scenes into scene graphs.
We learn implicitly encoded scenes, combined with a jointly learned latent representation, to describe objects with a single implicit function.
arXiv Detail & Related papers (2020-11-20T12:37:10Z)
- Towards causal generative scene models via competition of experts [26.181132737834826]
We present an alternative approach that uses an inductive bias encouraging modularity by training an ensemble of generative models (experts).
During training, experts compete to explain parts of a scene and thus specialise on different object classes, with objects being identified as parts that re-occur across multiple scenes.
Our model allows for controllable sampling of individual objects and recombination of experts in physically plausible ways.
arXiv Detail & Related papers (2020-04-27T16:10:04Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content and is not responsible for any consequences of its use.