Scene123: One Prompt to 3D Scene Generation via Video-Assisted and Consistency-Enhanced MAE
- URL: http://arxiv.org/abs/2408.05477v2
- Date: Tue, 20 Aug 2024 10:16:00 GMT
- Title: Scene123: One Prompt to 3D Scene Generation via Video-Assisted and Consistency-Enhanced MAE
- Authors: Yiying Yang, Fukun Yin, Jiayuan Fan, Xin Chen, Wanzhang Li, Gang Yu,
- Abstract summary: We propose Scene123, a 3D scene generation model that ensures realism and diversity through the video generation framework.
Specifically, we warp the input image (or an image generated from text) to simulate adjacent views, filling the invisible areas with the MAE model.
To further enhance the details and texture fidelity of generated views, we employ a GAN-based loss against images derived from the input image through the video generation model.
- Score: 22.072200443502457
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As Artificial Intelligence Generated Content (AIGC) advances, a variety of methods have been developed to generate text, images, videos, and 3D objects from single or multimodal inputs, contributing efforts to emulate human-like cognitive content creation. However, generating realistic large-scale scenes from a single input presents a challenge due to the complexities involved in ensuring consistency across extrapolated views generated by models. Benefiting from recent video generation models and implicit neural representations, we propose Scene123, a 3D scene generation model that not only ensures realism and diversity through the video generation framework but also uses implicit neural fields combined with Masked Autoencoders (MAE) to effectively ensure the consistency of unseen areas across views. Specifically, we initially warp the input image (or an image generated from text) to simulate adjacent views, filling the invisible areas with the MAE model. However, these filled images usually fail to maintain view consistency, so we use the produced views to optimize a neural radiance field, enhancing geometric consistency. Moreover, to further enhance the details and texture fidelity of the generated views, we employ a GAN-based loss against images derived from the input image through the video generation model. Extensive experiments demonstrate that our method can generate realistic and consistent scenes from a single prompt. Both qualitative and quantitative results indicate that our approach surpasses existing state-of-the-art methods. We show encouraging video examples at https://yiyingyang12.github.io/Scene123.github.io/.
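Read as a pipeline, the abstract describes four steps: warp the input (or text-generated) image to nearby poses, fill the disoccluded regions with an MAE, optimize an implicit radiance field on the filled views for geometric consistency, and sharpen the renders with a GAN-based loss against frames from the video generation model. The sketch below is a toy, hedged illustration of that loop: every module, shape, and weight in it (warp_to_view, Inpainter, Field, the toy discriminator) is a placeholder assumption rather than the authors' implementation, and in the real method the adversarial "real" images would come from the video generation model.

```python
# Schematic sketch of the loop described in the abstract; all components are toy placeholders.
import torch
import torch.nn as nn

H, W = 32, 32  # toy resolution

def warp_to_view(image: torch.Tensor, shift: int):
    """Crude stand-in for geometric warping: shift pixels and mark disoccluded columns."""
    warped = torch.roll(image, shifts=shift, dims=3)
    mask = torch.zeros(1, 1, H, W)
    if shift > 0:
        mask[..., :shift] = 1.0       # newly exposed (invisible) region on the left
    elif shift < 0:
        mask[..., shift:] = 1.0       # newly exposed region on the right
    return warped, mask

class Inpainter(nn.Module):
    """Placeholder for the (pretrained, frozen) MAE that fills the masked region."""
    def __init__(self):
        super().__init__()
        self.net = nn.Conv2d(4, 3, kernel_size=3, padding=1)
    def forward(self, img, mask):
        return torch.sigmoid(self.net(torch.cat([img, mask], dim=1)))

class Field(nn.Module):
    """Placeholder implicit field; a real system would ray-march a radiance field per pose."""
    def __init__(self):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 3))
    def render(self, shift: int) -> torch.Tensor:
        ys, xs = torch.meshgrid(torch.linspace(0, 1, H), torch.linspace(0, 1, W), indexing="ij")
        pose = torch.full_like(xs, float(shift) / W)
        coords = torch.stack([xs, ys, pose], dim=-1).view(-1, 3)
        return torch.sigmoid(self.mlp(coords)).view(1, H, W, 3).permute(0, 3, 1, 2)

field, inpainter = Field(), Inpainter()
disc = nn.Sequential(nn.Flatten(), nn.Linear(3 * H * W, 1), nn.Sigmoid())  # toy discriminator
opt = torch.optim.Adam(field.parameters(), lr=1e-3)

image = torch.rand(1, 3, H, W)                       # the input (or text-generated) image
for step in range(100):
    opt.zero_grad()
    loss = torch.tensor(0.0)
    for shift in (-4, 0, 4):                         # simulated adjacent views
        warped, mask = warp_to_view(image, shift)
        target = inpainter(warped, mask).detach()    # MAE fills the invisible areas
        render = field.render(shift)
        loss = loss + (render - target).pow(2).mean()      # cross-view photometric term
        loss = loss + 0.1 * (1.0 - disc(render)).mean()    # GAN-style fidelity term
    loss.backward()
    opt.step()
```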
Related papers
- Generative Object Insertion in Gaussian Splatting with a Multi-View Diffusion Model [15.936267489962122]
We propose a novel method for object insertion in 3D content represented by Gaussian Splatting.
Our approach introduces a multi-view diffusion model, dubbed MVInpainter, which is built upon a pre-trained stable video diffusion model.
Within MVInpainter, we incorporate a ControlNet-based conditional injection module to enable controlled and more predictable multi-view generation.
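As a generic illustration of what a ControlNet-based conditional injection module does (this is a common pattern, not MVInpainter's actual architecture), a trainable copy of the denoiser blocks processes the condition and feeds zero-initialized residuals into a frozen base network, so training starts from the unmodified pretrained behavior:

```python
# Minimal sketch of ControlNet-style conditional injection; sizes and modules are toy assumptions.
import torch
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
    def forward(self, x):
        return torch.relu(self.conv(x))

class ControlledDenoiser(nn.Module):
    def __init__(self, channels: int = 8, n_blocks: int = 3):
        super().__init__()
        self.base = nn.ModuleList([Block(channels) for _ in range(n_blocks)])  # frozen backbone
        self.ctrl = nn.ModuleList([Block(channels) for _ in range(n_blocks)])  # trainable copy
        self.zero = nn.ModuleList([nn.Conv2d(channels, channels, 1) for _ in range(n_blocks)])
        for z in self.zero:                        # zero-init so injection starts as a no-op
            nn.init.zeros_(z.weight)
            nn.init.zeros_(z.bias)
        for p in self.base.parameters():           # keep the pretrained branch fixed
            p.requires_grad_(False)

    def forward(self, x, cond):
        h, c = x, cond
        for base, ctrl, zero in zip(self.base, self.ctrl, self.zero):
            c = ctrl(c)
            h = base(h) + zero(c)                  # inject condition features as residuals
        return h

noisy_latent = torch.randn(1, 8, 16, 16)           # latent at some denoising step
condition = torch.randn(1, 8, 16, 16)              # e.g. an encoded reference or warped view
out = ControlledDenoiser()(noisy_latent, condition)
```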
arXiv Detail & Related papers (2024-09-25T13:52:50Z)
- HR Human: Modeling Human Avatars with Triangular Mesh and High-Resolution Textures from Videos [52.23323966700072]
We present a framework for acquiring human avatars that are attached with high-resolution physically-based material textures and mesh from monocular video.
Our method introduces a novel information fusion strategy to combine the information from the monocular video and synthesize virtual multi-view images.
Experiments show that our approach outperforms previous representations in terms of high fidelity, and the explicit mesh-and-texture output supports deployment in common triangular-mesh pipelines.
arXiv Detail & Related papers (2024-05-18T11:49:09Z)
- TexPainter: Generative Mesh Texturing with Multi-view Consistency [20.366302413005734]
In this paper, we propose a novel method to enforce multi-view consistency.
We use an optimization-based color-fusion to enforce consistency and indirectly modify the latent codes by gradient back-propagation.
Our method improves consistency and overall quality of the generated textures compared to competing state-of-the-art methods.
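The color-fusion idea can be illustrated with a tiny, hedged sketch: per-view latent codes are decoded differentiably, the decoded colors are fused, and a consistency loss is back-propagated into the latents. The decoder and the averaging fusion below are toy stand-ins, not TexPainter's actual pipeline:

```python
# Schematic sketch of optimization-based color fusion with gradients flowing back to latents.
import torch

n_views, latent_dim, res = 4, 256, 16

def decode(z: torch.Tensor) -> torch.Tensor:
    """Placeholder differentiable decoder: latent code -> per-view color image."""
    return torch.tanh(z).view(n_views, res, res)

latents = torch.randn(n_views, latent_dim, requires_grad=True)   # one latent per view
opt = torch.optim.Adam([latents], lr=1e-2)

for step in range(200):
    opt.zero_grad()
    views = decode(latents)                      # differentiable decoding of every view
    fused = views.mean(dim=0, keepdim=True)      # fuse colors into one shared estimate
    loss = (views - fused).pow(2).mean()         # penalize cross-view disagreement
    loss.backward()                              # gradients reach the latent codes
    opt.step()                                   # latents drift toward mutual consistency
```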
arXiv Detail & Related papers (2024-05-17T18:41:36Z)
- 3D-SceneDreamer: Text-Driven 3D-Consistent Scene Generation [51.64796781728106]
We propose a generative refinement network that synthesizes new content with higher quality by exploiting the natural image prior of a 2D diffusion model and the global 3D information of the current scene.
Our approach supports a wide variety of scene generation and arbitrary camera trajectories with improved visual quality and 3D consistency.
arXiv Detail & Related papers (2024-03-14T14:31:22Z)
- Envision3D: One Image to 3D with Anchor Views Interpolation [18.31796952040799]
We present Envision3D, a novel method for efficiently generating high-quality 3D content from a single image.
It is capable of generating high-quality 3D content in terms of texture and geometry, surpassing previous image-to-3D baseline methods.
arXiv Detail & Related papers (2024-03-13T18:46:33Z)
- ViewDiff: 3D-Consistent Image Generation with Text-to-Image Models [65.22994156658918]
We present a method that learns to generate multi-view images in a single denoising process from real-world data.
We design an autoregressive generation scheme that renders more 3D-consistent images at any viewpoint.
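A toy sketch of the autoregressive pattern (a generic illustration, not ViewDiff's model): each view is generated conditioned on a summary of the views produced so far, which is what lets later views agree with earlier ones. The generator, poses, and conditioning summary below are placeholder assumptions:

```python
# Toy illustration of autoregressive multi-view generation with a placeholder generator.
import torch
import torch.nn as nn

class CondViewGenerator(nn.Module):
    """Placeholder generator: maps a pose and a summary of previous views to an image."""
    def __init__(self, h: int = 16, w: int = 16):
        super().__init__()
        self.h, self.w = h, w
        self.net = nn.Sequential(nn.Linear(3 + h * w * 3, 256), nn.ReLU(),
                                 nn.Linear(256, h * w * 3))
    def forward(self, pose, prev_summary):
        x = torch.cat([pose, prev_summary], dim=-1)
        return torch.sigmoid(self.net(x)).view(3, self.h, self.w)

gen = CondViewGenerator()
poses = [torch.tensor([float(i), 0.0, 1.0]) for i in range(4)]  # toy camera poses

views = []
summary = torch.zeros(16 * 16 * 3)             # running summary of generated views
for pose in poses:
    view = gen(pose, summary)                  # condition on everything generated so far
    views.append(view)
    summary = torch.stack(views).mean(dim=0).flatten()  # update the conditioning summary
```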
arXiv Detail & Related papers (2024-03-04T07:57:05Z)
- Differentiable Blocks World: Qualitative 3D Decomposition by Rendering Primitives [70.32817882783608]
We present an approach that produces a simple, compact, and actionable 3D world representation by means of 3D primitives.
Unlike existing primitive decomposition methods that rely on 3D input data, our approach operates directly on images.
We show that the resulting textured primitives faithfully reconstruct the input images and accurately model the visible 3D points.
arXiv Detail & Related papers (2023-07-11T17:58:31Z)
- NeuralField-LDM: Scene Generation with Hierarchical Latent Diffusion Models [85.20004959780132]
We introduce NeuralField-LDM, a generative model capable of synthesizing complex 3D environments.
We show how NeuralField-LDM can be used for a variety of 3D content creation applications, including conditional scene generation, scene inpainting and scene style manipulation.
arXiv Detail & Related papers (2023-04-19T16:13:21Z)
- Vision Transformer for NeRF-Based View Synthesis from a Single Input Image [49.956005709863355]
We propose to leverage both the global and local features to form an expressive 3D representation.
To synthesize a novel view, we train a multilayer perceptron (MLP) network conditioned on the learned 3D representation to perform volume rendering.
Our method can render novel views from only a single input image and generalize across multiple object categories using a single model.
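The rendering step described above is standard volume rendering with a conditioned MLP; the sketch below illustrates it for a single ray, with a toy feature vector standing in for the learned 3D representation (network sizes and names are assumptions, not the paper's code):

```python
# Generic NeRF-style volume rendering of one ray with an MLP conditioned on a feature vector.
import torch
import torch.nn as nn

feat_dim = 16
mlp = nn.Sequential(nn.Linear(3 + feat_dim, 64), nn.ReLU(), nn.Linear(64, 4))  # -> (rgb, sigma)

def render_ray(origin, direction, feat, n_samples=64, near=0.0, far=4.0):
    t = torch.linspace(near, far, n_samples)
    pts = origin + t[:, None] * direction                     # sample points along the ray
    inp = torch.cat([pts, feat.expand(n_samples, -1)], dim=-1)
    out = mlp(inp)
    rgb = torch.sigmoid(out[:, :3])                           # per-sample color
    sigma = torch.relu(out[:, 3])                             # per-sample density
    delta = (far - near) / n_samples
    alpha = 1.0 - torch.exp(-sigma * delta)                   # per-sample opacity
    trans = torch.cumprod(torch.cat([torch.ones(1), 1.0 - alpha + 1e-10])[:-1], dim=0)
    weights = alpha * trans                                   # compositing weights
    return (weights[:, None] * rgb).sum(dim=0)                # final ray color

color = render_ray(torch.zeros(3), torch.tensor([0.0, 0.0, 1.0]), torch.randn(feat_dim))
```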
arXiv Detail & Related papers (2022-07-12T17:52:04Z)
- Video-driven Neural Physically-based Facial Asset for Production [33.24654834163312]
We present a new learning-based, video-driven approach for generating dynamic facial geometries with high-quality physically-based assets.
Our technique provides higher accuracy and visual fidelity than previous video-driven facial reconstruction and animation methods.
arXiv Detail & Related papers (2022-02-11T13:22:48Z)