Beyond Surface Statistics: Scene Representations in a Latent Diffusion
Model
- URL: http://arxiv.org/abs/2306.05720v2
- Date: Sat, 4 Nov 2023 19:22:35 GMT
- Title: Beyond Surface Statistics: Scene Representations in a Latent Diffusion
Model
- Authors: Yida Chen, Fernanda Viégas, Martin Wattenberg
- Abstract summary: Latent diffusion models (LDMs) produce realistic images, yet the inner workings of these models remain mysterious.
In this work, we investigate a basic interpretability question: does an LDM create and use an internal representation of simple scene geometry?
Using linear probes, we find evidence that the internal activations of the LDM encode linear representations of both 3D depth data and a salient-object / background distinction.
- Score: 52.634378583311054
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Latent diffusion models (LDMs) exhibit an impressive ability to produce
realistic images, yet the inner workings of these models remain mysterious.
Even when trained purely on images without explicit depth information, they
typically output coherent pictures of 3D scenes. In this work, we investigate a
basic interpretability question: does an LDM create and use an internal
representation of simple scene geometry? Using linear probes, we find evidence
that the internal activations of the LDM encode linear representations of both
3D depth data and a salient-object / background distinction. These
representations appear surprisingly early in the denoising process, well
before a human can easily make sense of the noisy images. Intervention
experiments further indicate these representations play a causal role in image
synthesis, and may be used for simple high-level editing of an LDM's output.
Project page: https://yc015.github.io/scene-representation-diffusion-model/
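The abstract refers to linear probes trained on internal activations. As a rough illustration of that general technique only (not the authors' code), the sketch below fits a ridge-regularised linear map from per-location activation vectors to a scalar target such as depth. The array layout, the function names `fit_linear_probe` and `probe_predict`, the regularisation strength, and the synthetic data are all assumptions made for this example; no real LDM activations are used.

```python
import numpy as np


def fit_linear_probe(acts, targets, l2=1e-3):
    """Fit a ridge-regularised linear probe (hypothetical helper).

    acts:    (n_locations, n_channels) activation vectors (assumed layout)
    targets: (n_locations,) scalar label per location, e.g. a depth value
    Returns (weights, bias).
    """
    # Append a column of ones so the bias is learned with the weights.
    X = np.hstack([acts, np.ones((acts.shape[0], 1))])
    reg = l2 * np.eye(X.shape[1])
    reg[-1, -1] = 0.0  # leave the bias term unpenalised
    theta = np.linalg.solve(X.T @ X + reg, X.T @ targets)
    return theta[:-1], theta[-1]


def probe_predict(acts, weights, bias):
    """Apply the fitted linear probe to new activation vectors."""
    return acts @ weights + bias


# Synthetic stand-in data: the target is linear in the activations by
# construction, which is exactly what a linear probe can recover.
rng = np.random.default_rng(0)
n_locations, n_channels = 2000, 16
true_direction = rng.normal(size=n_channels)
acts = rng.normal(size=(n_locations, n_channels))
depth = acts @ true_direction + 0.5 + 0.01 * rng.normal(size=n_locations)

# Fit on the first 1500 locations, evaluate on the held-out remainder.
w, b = fit_linear_probe(acts[:1500], depth[:1500])
pred = probe_predict(acts[1500:], w, b)
held_out_r = np.corrcoef(pred, depth[1500:])[0, 1]
print(f"held-out correlation: {held_out_r:.4f}")
```

A high held-out correlation on real activations would suggest the target is linearly decodable at that layer and denoising step. It would not by itself show that the model uses the representation, which is why the abstract also reports intervention experiments.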
Related papers
- UMAMI: Unifying Masked Autoregressive Models and Deterministic Rendering for View Synthesis [28.245380116188883]
Novel view synthesis (NVS) seeks to render photorealistic, 3D-consistent images of a scene from unseen camera poses given only a sparse set of posed views. Existing deterministic networks render observed regions quickly but blur unobserved areas, whereas diffusion-based methods hallucinate plausible content yet incur heavy training-time costs. We propose a hybrid framework that unifies the strengths of both paradigms. A bidirectional transformer encodes multi-view image tokens and Plücker-ray embeddings, producing a shared latent representation.
arXiv Detail & Related papers (2025-12-23T07:08:00Z)
- Sampling 3D Gaussian Scenes in Seconds with Latent Diffusion Models [3.9373541926236766]
We present a latent diffusion model over 3D scenes that can be trained using only 2D image data.
We show that our approach enables generating 3D scenes in as little as 0.2 seconds, either from scratch, or from sparse input views.
arXiv Detail & Related papers (2024-06-18T23:14:29Z)
- Probing the 3D Awareness of Visual Foundation Models [56.68380136809413]
We analyze the 3D awareness of visual foundation models.
We conduct experiments using task-specific probes and zero-shot inference procedures on frozen features.
arXiv Detail & Related papers (2024-04-12T17:58:04Z)
- Denoising Diffusion via Image-Based Rendering [54.20828696348574]
We introduce the first diffusion model able to perform fast, detailed reconstruction and generation of real-world 3D scenes.
First, we introduce a new neural scene representation, IB-planes, that can efficiently and accurately represent large 3D scenes.
Second, we propose a denoising-diffusion framework to learn a prior over this novel 3D scene representation, using only 2D images.
arXiv Detail & Related papers (2024-02-05T19:00:45Z)
- WildFusion: Learning 3D-Aware Latent Diffusion Models in View Space [77.92350895927922]
We propose WildFusion, a new approach to 3D-aware image synthesis based on latent diffusion models (LDMs).
Our 3D-aware LDM is trained without any direct supervision from multiview images or 3D geometry.
This opens up promising research avenues for scalable 3D-aware image synthesis and 3D content creation from in-the-wild image data.
arXiv Detail & Related papers (2023-11-22T18:25:51Z)
- MonoNeRD: NeRF-like Representations for Monocular 3D Object Detection [31.58403386994297]
We propose MonoNeRD, a novel detection framework that can infer dense 3D geometry and occupancy.
Specifically, we model scenes with Signed Distance Functions (SDF), facilitating the production of dense 3D representations.
To the best of our knowledge, this work is the first to introduce volume rendering for M3D, and demonstrates the potential of implicit reconstruction for image-based 3D perception.
arXiv Detail & Related papers (2023-08-18T09:39:52Z)
- Learning 3D Photography Videos via Self-supervised Diffusion on Single Images [105.81348348510551]
3D photography renders a static image into a video with appealing 3D visual effects.
Existing approaches typically first conduct monocular depth estimation, then render the input frame to subsequent frames with various viewpoints.
We present a novel task: out-animation, which extends the space and time of input objects.
arXiv Detail & Related papers (2023-02-21T16:18:40Z)
- RenderDiffusion: Image Diffusion for 3D Reconstruction, Inpainting and Generation [68.06991943974195]
We present RenderDiffusion, the first diffusion model for 3D generation and inference, trained using only monocular 2D supervision.
We evaluate RenderDiffusion on FFHQ, AFHQ, ShapeNet and CLEVR datasets, showing competitive performance for generation of 3D scenes and inference of 3D scenes from 2D images.
arXiv Detail & Related papers (2022-11-17T20:17:04Z)
- Representation Learning with Diffusion Models [0.0]
Diffusion models (DMs) have achieved state-of-the-art results for image synthesis tasks as well as density estimation.
We introduce a framework for learning such representations with diffusion models (LRDM).
In particular, the DM and the representation encoder are trained jointly in order to learn rich representations specific to the generative denoising process.
arXiv Detail & Related papers (2022-10-20T07:26:47Z)
- GIRAFFE: Representing Scenes as Compositional Generative Neural Feature Fields [45.21191307444531]
Deep generative models allow for photorealistic image synthesis at high resolutions.
But for many applications, this is not enough: content creation also needs to be controllable.
Our key hypothesis is that incorporating a compositional 3D scene representation into the generative model leads to more controllable image synthesis.
arXiv Detail & Related papers (2020-11-24T14:14:15Z)