VIST3A: Text-to-3D by Stitching a Multi-view Reconstruction Network to a Video Generator
- URL: http://arxiv.org/abs/2510.13454v1
- Date: Wed, 15 Oct 2025 11:55:08 GMT
- Title: VIST3A: Text-to-3D by Stitching a Multi-view Reconstruction Network to a Video Generator
- Authors: Hyojun Go, Dominik Narnhofer, Goutam Bhat, Prune Truong, Federico Tombari, Konrad Schindler
- Abstract summary: A text-to-video generator can be combined with a 3D reconstruction system as a "decoder". We introduce VIST3A, a general framework that does just that. We evaluate the proposed VIST3A approach with different video generators and 3D reconstruction models.
- Score: 69.72818094722186
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The rapid progress of large, pretrained models for both visual content generation and 3D reconstruction opens up new possibilities for text-to-3D generation. Intuitively, one could obtain a formidable 3D scene generator if one were able to combine the power of a modern latent text-to-video model as "generator" with the geometric abilities of a recent (feedforward) 3D reconstruction system as "decoder". We introduce VIST3A, a general framework that does just that, addressing two main challenges. First, the two components must be joined in a way that preserves the rich knowledge encoded in their weights. We revisit model stitching, i.e., we identify the layer in the 3D decoder that best matches the latent representation produced by the text-to-video generator and stitch the two parts together. That operation requires only a small dataset and no labels. Second, the text-to-video generator must be aligned with the stitched 3D decoder, to ensure that the generated latents are decodable into consistent, perceptually convincing 3D scene geometry. To that end, we adapt direct reward finetuning, a popular technique for human preference alignment. We evaluate the proposed VIST3A approach with different video generators and 3D reconstruction models. All tested pairings markedly improve over prior text-to-3D models that output Gaussian splats. Moreover, by choosing a suitable 3D base model, VIST3A also enables high-quality text-to-pointmap generation.
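The two stages described in the abstract lend themselves to a brief illustration. Below is a minimal sketch of the stitching step, not the authors' implementation: `video_vae` and `recon_net` are hypothetical stand-ins for the latent video generator's VAE and a feedforward 3D reconstruction network, and layer matching is approximated with a closed-form least-squares probe fit on a small unlabeled frame set, in the spirit of the abstract's "small dataset and no labels" requirement.

```python
import torch

@torch.no_grad()
def collect_activations(recon_net, frames):
    """Record the input to every block of the 3D reconstruction network.
    `recon_net.embed` and `recon_net.blocks` are assumed interfaces."""
    acts, h = [], recon_net.embed(frames)
    for block in recon_net.blocks:
        acts.append(h.flatten(1))
        h = block(h)
    return acts

@torch.no_grad()
def choose_stitch_layer(video_vae, recon_net, frames):
    """Pick the 3D-network layer whose activations are best predicted,
    linearly, from the video generator's latents, and return the linear
    stitching map for that layer."""
    z = video_vae.encode(frames).flatten(1)  # (batch, d_latent)
    best = None
    for idx, h in enumerate(collect_activations(recon_net, frames)):
        # Closed-form least squares: min_W ||z @ W - h||^2.
        W = torch.linalg.lstsq(z, h).solution
        residual = (z @ W - h).pow(2).mean().item()
        if best is None or residual < best[2]:
            best = (idx, W, residual)
    return best  # (layer index, stitching matrix, fit error)
```

The second stage, aligning the generator so its latents decode into consistent geometry, adapts direct reward finetuning. A similarly hedged sketch, in which `reward_fn`, the differentiable sampler, and the choice of trainable parameters are all assumptions rather than the paper's recipe:

```python
def reward_finetune_step(video_model, stitched_decoder, reward_fn, prompts, opt):
    """One direct-reward-finetuning step: backpropagate a scalar reward
    through the stitched 3D decoder into the generator's parameters."""
    latents = video_model.sample(prompts)      # differentiable sampling assumed
    scene = stitched_decoder(latents)          # e.g., Gaussian splats or pointmaps
    loss = -reward_fn(scene, prompts).mean()   # gradient ascent on the reward
    opt.zero_grad()
    loss.backward()
    opt.step()
    return -loss.item()
```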
Related papers
- PercHead: Perceptual Head Model for Single-Image 3D Head Reconstruction & Editing [51.56943889042673]
PercHead is a method for single-image 3D head reconstruction and semantic 3D editing. We develop a unified base model for reconstructing view-consistent 3D heads from a single input image. We highlight the intuitive and powerful 3D editing capabilities of our model through a lightweight, interactive GUI.
arXiv Detail & Related papers (2025-11-04T17:59:15Z)
- TAR3D: Creating High-Quality 3D Assets via Next-Part Prediction [137.34863114016483]
TAR3D is a novel framework that consists of a 3D-aware Vector Quantized-Variational AutoEncoder (VQ-VAE) and a Generative Pre-trained Transformer (GPT). We show that TAR3D can achieve superior generation quality over existing methods in text-to-3D and image-to-3D tasks.
arXiv Detail & Related papers (2024-12-22T08:28:20Z)
- OneTo3D: One Image to Re-editable Dynamic 3D Model and Video Generation [0.0]
Generating an editable, dynamic 3D model and video from one image is a novel direction in the research area of single-image 3D representation and reconstruction.
We propose OneTo3D, a method that uses a single image to generate an editable 3D model and a targeted, semantically continuous, time-unlimited 3D video.
arXiv Detail & Related papers (2024-05-10T15:44:11Z)
- IM-3D: Iterative Multiview Diffusion and Reconstruction for High-Quality 3D Generation [96.32684334038278]
In this paper, we explore the design space of text-to-3D models.
We significantly improve multi-view generation by considering video instead of image generators.
Our new method, IM-3D, reduces the number of evaluations of the 2D generator network by 10-100x.
arXiv Detail & Related papers (2024-02-13T18:59:51Z)
- GALA3D: Towards Text-to-3D Complex Scene Generation via Layout-guided Generative Gaussian Splatting [52.150502668874495]
We present GALA3D, generative 3D GAussians with LAyout-guided control, for effective compositional text-to-3D generation.
GALA3D is a user-friendly, end-to-end framework for state-of-the-art scene-level 3D content generation and controllable editing.
arXiv Detail & Related papers (2024-02-11T13:40:08Z)
- Improving 3D-aware Image Synthesis with A Geometry-aware Discriminator [68.0533826852601]
3D-aware image synthesis aims at learning a generative model that can render photo-realistic 2D images while capturing decent underlying 3D shapes.
Existing methods often fail to recover even moderately accurate 3D shapes.
We propose a geometry-aware discriminator to improve 3D-aware GANs.
arXiv Detail & Related papers (2022-09-30T17:59:37Z)
- Fine Detailed Texture Learning for 3D Meshes with Generative Models [33.42114674602613]
This paper presents a method to reconstruct high-quality textured 3D models from both multi-view and single-view images.
In the first stage, we focus on learning accurate geometry, whereas in the second stage, we focus on learning the texture with a generative adversarial network.
We demonstrate that our method produces superior textured 3D models compared to previous work.
arXiv Detail & Related papers (2022-03-17T14:50:52Z)