Viewpoint Textual Inversion: Discovering Scene Representations and 3D View Control in 2D Diffusion Models
- URL: http://arxiv.org/abs/2309.07986v2
- Date: Fri, 26 Jul 2024 11:14:21 GMT
- Title: Viewpoint Textual Inversion: Discovering Scene Representations and 3D View Control in 2D Diffusion Models
- Authors: James Burgess, Kuan-Chieh Wang, Serena Yeung-Levy
- Abstract summary: We show that certain 3D scene representations are encoded in the text embedding space of models like Stable Diffusion.
We exploit these 3D scene representations for 3D vision tasks, namely view-controlled text-to-image generation and novel view synthesis from a single image.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Text-to-image diffusion models generate impressive and realistic images, but do they learn to represent the 3D world from only 2D supervision? We demonstrate that yes, certain 3D scene representations are encoded in the text embedding space of models like Stable Diffusion. Our approach, Viewpoint Neural Textual Inversion (ViewNeTI), is to discover 3D view tokens; these tokens control the 3D viewpoint - the rendering pose in a scene - of generated images. Specifically, we train a small neural mapper to take continuous camera viewpoint parameters and predict a view token (a word embedding). This token conditions diffusion generation via cross-attention to produce images with the desired camera viewpoint. Using ViewNeTI as an evaluation tool, we report two findings: first, the text latent space has a continuous view-control manifold for particular 3D scenes; second, we find evidence for a generalized view-control manifold across all scenes. We conclude that since the view token controls the 3D 'rendering' viewpoint, there is likely a scene representation embedded in frozen 2D diffusion models. Finally, we exploit these 3D scene representations for 3D vision tasks, namely view-controlled text-to-image generation and novel view synthesis from a single image, where our approach sets the state of the art in LPIPS. Code available at https://github.com/jmhb0/view_neti
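The abstract describes ViewNeTI's core mechanism: a small neural mapper turns continuous camera parameters into a "view token" word embedding, which conditions the frozen diffusion model through the text encoder and cross-attention. The sketch below shows one plausible shape for such a mapper in PyTorch; the class name, layer sizes, camera parametrization, and splicing step are illustrative assumptions, not the paper's actual code (see the linked repository for that).

```python
# Minimal sketch of a ViewNeTI-style view-token mapper (hypothetical
# architecture; the paper's exact design lives in the linked repo).
import torch
import torch.nn as nn

class ViewTokenMapper(nn.Module):
    """Map continuous camera viewpoint parameters to a pseudo-word embedding.

    cam_dim: size of the camera parametrization (e.g. a flattened 3x4
        camera-to-world matrix, assumed here for illustration).
    embed_dim: token embedding size of the text encoder (768 for the CLIP
        encoder used by Stable Diffusion 1.x).
    """
    def __init__(self, cam_dim: int = 12, embed_dim: int = 768, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(cam_dim, hidden),
            nn.SiLU(),
            nn.Linear(hidden, hidden),
            nn.SiLU(),
            nn.Linear(hidden, embed_dim),
        )

    def forward(self, cam_params: torch.Tensor) -> torch.Tensor:
        # The output embedding stands in for a word in the prompt; it is
        # spliced into the token-embedding sequence before the frozen text
        # encoder, so gradients only ever update this small mapper.
        return self.net(cam_params)

# Usage: predict a view token for one camera pose. During training, this
# token replaces a placeholder word in the prompt, and the diffusion
# denoising loss is backpropagated into the mapper while the diffusion
# model itself stays frozen.
mapper = ViewTokenMapper()
cam_params = torch.randn(1, 12)      # placeholder camera parameters
view_token = mapper(cam_params)      # shape (1, 768)
```

Because only the mapper is optimized, viewpoint control is read out of, rather than added to, the pretrained 2D model, which is what lets the authors treat ViewNeTI as a probe for latent 3D scene representations.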
Related papers
- Probing the 3D Awareness of Visual Foundation Models [56.68380136809413]
We analyze the 3D awareness of visual foundation models.
We conduct experiments using task-specific probes and zero-shot inference procedures on frozen features.
arXiv Detail & Related papers (2024-04-12T17:58:04Z)
- WildFusion: Learning 3D-Aware Latent Diffusion Models in View Space [77.92350895927922]
We propose WildFusion, a new approach to 3D-aware image synthesis based on latent diffusion models (LDMs).
Our 3D-aware LDM is trained without any direct supervision from multiview images or 3D geometry.
This opens up promising research avenues for scalable 3D-aware image synthesis and 3D content creation from in-the-wild image data.
arXiv Detail & Related papers (2023-11-22T18:25:51Z)
- Text2Control3D: Controllable 3D Avatar Generation in Neural Radiance Fields using Geometry-Guided Text-to-Image Diffusion Model [39.64952340472541]
We propose a text-to-3D avatar generation method with controllable facial expressions.
Our main strategy is to construct the 3D avatar in Neural Radiance Fields (NeRF) optimized with a set of controlled viewpoint-aware images.
We present empirical results and discuss the effectiveness of our method.
arXiv Detail & Related papers (2023-09-07T08:14:46Z)
- Viewset Diffusion: (0-)Image-Conditioned 3D Generative Models from 2D Data [76.38261311948649]
Viewset Diffusion is a diffusion-based generator that outputs 3D objects while only using multi-view 2D data for supervision.
We train a diffusion model to generate viewsets, but design the neural network generator to reconstruct internally corresponding 3D models.
The model performs reconstruction efficiently, in a feed-forward manner, and is trained with only rendering losses, using as few as three views per viewset.
arXiv Detail & Related papers (2023-06-13T16:18:51Z)
- SceneDreamer: Unbounded 3D Scene Generation from 2D Image Collections [49.802462165826554]
We present SceneDreamer, an unconditional generative model for unbounded 3D scenes.
Our framework is learned from in-the-wild 2D image collections only, without any 3D annotations.
arXiv Detail & Related papers (2023-02-02T18:59:16Z)
- RenderDiffusion: Image Diffusion for 3D Reconstruction, Inpainting and Generation [68.06991943974195]
We present RenderDiffusion, the first diffusion model for 3D generation and inference, trained using only monocular 2D supervision.
We evaluate RenderDiffusion on FFHQ, AFHQ, ShapeNet and CLEVR datasets, showing competitive performance for generation of 3D scenes and inference of 3D scenes from 2D images.
arXiv Detail & Related papers (2022-11-17T20:17:04Z)
- Recognizing Scenes from Novel Viewpoints [99.90914180489456]
Humans can perceive scenes in 3D from a handful of 2D views. For AI agents, the ability to recognize a scene from any viewpoint given only a few images enables them to efficiently interact with the scene and its objects.
We propose a model which takes as input a few RGB images of a new scene and recognizes the scene from novel viewpoints by segmenting it into semantic categories.
arXiv Detail & Related papers (2021-12-02T18:59:40Z)
- CoCoNets: Continuous Contrastive 3D Scene Representations [21.906643302668716]
This paper explores self-supervised learning of amodal 3D feature representations from posed RGB and RGB-D images and videos.
We show the resulting 3D visual feature representations effectively scale across objects and scenes, imagine information occluded or missing from the input viewpoints, track objects over time, align semantically related objects in 3D, and improve 3D object detection.
arXiv Detail & Related papers (2021-04-08T15:50:47Z)
- Continuous Object Representation Networks: Novel View Synthesis without Target View Supervision [26.885846254261626]
Continuous Object Representation Networks (CORN) is a conditional architecture that encodes an input image's geometry and appearance into a 3D-consistent scene representation.
CORN performs well on challenging tasks such as novel view synthesis and single-view 3D reconstruction, achieving performance comparable to state-of-the-art approaches that use direct supervision.
arXiv Detail & Related papers (2020-07-30T17:49:44Z)