VAR-3D: View-aware Auto-Regressive Model for Text-to-3D Generation via a 3D Tokenizer
- URL: http://arxiv.org/abs/2602.13818v1
- Date: Sat, 14 Feb 2026 15:28:15 GMT
- Title: VAR-3D: View-aware Auto-Regressive Model for Text-to-3D Generation via a 3D Tokenizer
- Authors: Zongcheng Han, Dongyan Cao, Haoran Sun, Yu Hong,
- Abstract summary: We propose a view-aware 3D Vector Quantized-Variational AutoEncoder (VQ-VAE) to convert the complex geometric structure of 3D models into discrete tokens.<n>Experiments demonstrate that VAR-3D significantly outperforms existing methods in both generation quality and text-3D alignment.
- Score: 19.429606246646784
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in auto-regressive transformers have achieved remarkable success in generative modeling. However, text-to-3D generation remains challenging, primarily due to bottlenecks in learning discrete 3D representations. Specifically, existing approaches often suffer from information loss during encoding, causing representational distortion before the quantization process. This effect is further amplified by vector quantization, ultimately degrading the geometric coherence of text-conditioned 3D shapes. Moreover, the conventional two-stage training paradigm induces an objective mismatch between reconstruction and text-conditioned auto-regressive generation. To address these issues, we propose View-aware Auto-Regressive 3D (VAR-3D), which intergrates a view-aware 3D Vector Quantized-Variational AutoEncoder (VQ-VAE) to convert the complex geometric structure of 3D models into discrete tokens. Additionally, we introduce a rendering-supervised training strategy that couples discrete token prediction with visual reconstruction, encouraging the generative process to better preserve visual fidelity and structural consistency relative to the input text. Experiments demonstrate that VAR-3D significantly outperforms existing methods in both generation quality and text-3D alignment.
Related papers
- E-RayZer: Self-supervised 3D Reconstruction as Spatial Visual Pre-training [55.61379509038588]
We present E-RayZer, a self-supervised large 3D Vision model that learns truly 3D-aware representations directly from unlabeled images.<n>E-RayZer operates directly in 3D space, performing self-supervised 3D reconstruction with Explicit geometry.
arXiv Detail & Related papers (2025-12-11T18:59:53Z) - IGGT: Instance-Grounded Geometry Transformer for Semantic 3D Reconstruction [82.53307702809606]
Humans naturally perceive the geometric structure and semantic content of a 3D world as intertwined dimensions.<n>We propose InstanceGrounded Geometry Transformer (IGGT) to unify the knowledge for both spatial reconstruction and instance-level contextual understanding.
arXiv Detail & Related papers (2025-10-26T14:57:44Z) - MAR-3D: Progressive Masked Auto-regressor for High-Resolution 3D Generation [44.94438766074643]
We introduce MAR-3D, which integrates a pyramid variational autoencoder with a cascaded masked auto-regressive transformer.<n>Our architecture employs random masking during training and auto-regressive denoising in random order during inference, naturally accommodating the unordered property of 3D latent tokens.
arXiv Detail & Related papers (2025-03-26T13:00:51Z) - A Generative Approach to High Fidelity 3D Reconstruction from Text Data [0.0]
This research proposes a fully automated pipeline that seamlessly integrates text-to-image generation, various image processing techniques, and deep learning methods for reflection removal and 3D reconstruction.<n>By leveraging state-of-the-art generative models like Stable Diffusion, the methodology translates natural language inputs into detailed 3D models through a multi-stage workflow.<n>This approach addresses key challenges in generative reconstruction, such as maintaining semantic coherence, managing geometric complexity, and preserving detailed visual information.
arXiv Detail & Related papers (2025-03-05T16:54:15Z) - TAR3D: Creating High-Quality 3D Assets via Next-Part Prediction [137.34863114016483]
TAR3D is a novel framework that consists of a 3D-aware Vector Quantized-Variational AutoEncoder (VQ-VAE) and a Generative Pre-trained Transformer (GPT)<n>We show that TAR3D can achieve superior generation quality over existing methods in text-to-3D and image-to-3D tasks.
arXiv Detail & Related papers (2024-12-22T08:28:20Z) - GaussianAnything: Interactive Point Cloud Flow Matching For 3D Object Generation [75.39457097832113]
This paper introduces a novel 3D generation framework, offering scalable, high-quality 3D generation with an interactive Point Cloud-structured Latent space.<n>Our framework employs a Variational Autoencoder with multi-view posed RGB-D(epth)-N(ormal) renderings as input, using a unique latent space design that preserves 3D shape information.<n>The proposed method, GaussianAnything, supports multi-modal conditional 3D generation, allowing for point cloud, caption, and single image inputs.
arXiv Detail & Related papers (2024-11-12T18:59:32Z) - DiffTF++: 3D-aware Diffusion Transformer for Large-Vocabulary 3D Generation [53.20147419879056]
We introduce a diffusion-based feed-forward framework to address challenges with a single model.
Building upon our 3D-aware Diffusion model with TransFormer, we propose a stronger version for 3D generation, i.e., DiffTF++.
Experiments on ShapeNet and OmniObject3D convincingly demonstrate the effectiveness of our proposed modules.
arXiv Detail & Related papers (2024-05-13T17:59:51Z) - Pushing Auto-regressive Models for 3D Shape Generation at Capacity and Scalability [118.26563926533517]
Auto-regressive models have achieved impressive results in 2D image generation by modeling joint distributions in grid space.
We extend auto-regressive models to 3D domains, and seek a stronger ability of 3D shape generation by improving auto-regressive models at capacity and scalability simultaneously.
arXiv Detail & Related papers (2024-02-19T15:33:09Z) - PlankAssembly: Robust 3D Reconstruction from Three Orthographic Views
with Learnt Shape Programs [24.09764733540401]
We develop a new method to automatically convert 2D line drawings from three orthographic views into 3D CAD models.
We leverage the attention mechanism in a Transformer-based sequence generation model to learn flexible mappings between the input and output.
Our method significantly outperforms existing ones when the inputs are noisy or incomplete.
arXiv Detail & Related papers (2023-08-10T17:59:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.