G3PT: Unleash the power of Autoregressive Modeling in 3D Generation via Cross-scale Querying Transformer
- URL: http://arxiv.org/abs/2409.06322v1
- Date: Tue, 10 Sep 2024 08:27:19 GMT
- Title: G3PT: Unleash the power of Autoregressive Modeling in 3D Generation via Cross-scale Querying Transformer
- Authors: Jinzhi Zhang, Feng Xiong, Mu Xu
- Abstract summary: We introduce G3PT, a scalable coarse-to-fine 3D generative model utilizing a cross-scale querying transformer.
The cross-scale querying transformer connects tokens globally across different levels of detail without requiring an ordered sequence.
Experiments demonstrate that G3PT achieves superior generation quality and generalization ability compared to previous 3D generation methods.
- Score: 4.221298212125194
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Autoregressive transformers have revolutionized generative models in language processing and shown substantial promise in image and video generation. However, these models face significant challenges when extended to 3D generation tasks due to their reliance on next-token prediction to learn token sequences, which is incompatible with the unordered nature of 3D data. Instead of imposing an artificial order on 3D data, in this paper, we introduce G3PT, a scalable coarse-to-fine 3D generative model utilizing a cross-scale querying transformer. The key is to map point-based 3D data into discrete tokens with different levels of detail, naturally establishing a sequential relationship between different levels suitable for autoregressive modeling. Additionally, the cross-scale querying transformer connects tokens globally across different levels of detail without requiring an ordered sequence. Benefiting from this approach, G3PT features a versatile 3D generation pipeline that effortlessly supports diverse conditional structures, enabling the generation of 3D shapes from various types of conditions. Extensive experiments demonstrate that G3PT achieves superior generation quality and generalization ability compared to previous 3D generation methods. Most importantly, for the first time in 3D generation, scaling up G3PT reveals distinct power-law scaling behaviors.
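As a rough illustration of the mechanism the abstract describes — not the paper's actual architecture — the NumPy sketch below treats each scale's tokens as vectors and lets query tokens for the next scale cross-attend to all coarser-scale tokens at once. The random queries, token counts, and single-head attention are placeholders for learned components; the final check shows that the output does not depend on any ordering of the context tokens, which is the property the cross-scale querying transformer relies on.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_scale_query(queries, context):
    """Single-head cross-attention: query tokens for the next scale
    attend to ALL tokens from the coarser scales at once."""
    d = queries.shape[-1]
    scores = queries @ context.T / np.sqrt(d)  # (n_query, n_context)
    return softmax(scores) @ context           # (n_query, d)

rng = np.random.default_rng(0)
d = 16
scales = [4, 16, 64]  # illustrative token counts per level of detail

# Coarsest-scale tokens (stand-ins for tokens quantized from point-based 3D data).
generated = [rng.normal(size=(scales[0], d))]
for n_next in scales[1:]:
    context = np.concatenate(generated, axis=0)  # every coarser-scale token
    queries = rng.normal(size=(n_next, d))       # stand-ins for learned queries
    next_tokens = cross_scale_query(queries, context)
    generated.append(next_tokens)

# The attention output is invariant to how the context tokens are ordered,
# matching the claim that no ordered sequence is required within a scale.
perm = rng.permutation(context.shape[0])
assert np.allclose(next_tokens, cross_scale_query(queries, context[perm]))
```

The only sequential dependency in this sketch is between scales (coarse before fine), which is exactly where the paper places the autoregressive factorization.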
Related papers
- OctGPT: Octree-based Multiscale Autoregressive Models for 3D Shape Generation [24.980804600194062]
OctGPT is a novel multiscale autoregressive model for 3D shape generation.
It dramatically improves the efficiency and performance of prior 3D autoregressive approaches.
It offers a new paradigm for high-quality, scalable 3D content creation.
arXiv Detail & Related papers (2025-04-14T08:31:26Z)
- MAR-3D: Progressive Masked Auto-regressor for High-Resolution 3D Generation [44.94438766074643]
We introduce MAR-3D, which integrates a pyramid variational autoencoder with a cascaded masked auto-regressive transformer.
Our architecture employs random masking during training and auto-regressive denoising in random order during inference, naturally accommodating the unordered property of 3D latent tokens.
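The random-order denoising idea in this summary can be caricatured in a few lines; the token values and the "prediction" step below are placeholders, not MAR-3D's actual model, and serve only to show how a fresh random order per sample accommodates unordered latent tokens.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, steps = 12, 4

tokens = np.full(n_tokens, -1)      # -1 marks a masked latent token
order = rng.permutation(n_tokens)   # a fresh random order each generation
for step, idx in enumerate(np.array_split(order, steps)):
    # Predict the chosen positions conditioned on the tokens revealed so far;
    # here the "prediction" is just the step index standing in for model output.
    tokens[idx] = step

assert (tokens >= 0).all()  # every position gets generated exactly once
```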
arXiv Detail & Related papers (2025-03-26T13:00:51Z)
- GSV3D: Gaussian Splatting-based Geometric Distillation with Stable Video Diffusion for Single-Image 3D Object Generation [24.255633621887988]
We propose a method that leverages 2D diffusion models' implicit 3D reasoning ability while ensuring 3D consistency.
Specifically, the proposed Gaussian Splatting Decoder enforces 3D consistency by transforming SV3D latent outputs into an explicit 3D representation.
As a result, our approach simultaneously generates high-quality, multi-view-consistent images and accurate 3D models.
arXiv Detail & Related papers (2025-03-08T09:10:31Z)
- TAR3D: Creating High-Quality 3D Assets via Next-Part Prediction [137.34863114016483]
TAR3D is a novel framework that consists of a 3D-aware Vector Quantized-Variational AutoEncoder (VQ-VAE) and a Generative Pre-trained Transformer (GPT).
We show that TAR3D can achieve superior generation quality over existing methods in text-to-3D and image-to-3D tasks.
arXiv Detail & Related papers (2024-12-22T08:28:20Z)
- 3D-WAG: Hierarchical Wavelet-Guided Autoregressive Generation for High-Fidelity 3D Shapes [20.675695749508353]
We introduce 3D-WAG, an AR model for 3D implicit distance fields that can perform unconditional shape generation.
By redefining the 3D AR generation task as "next-scale" prediction, we reduce the computational cost of generation.
Our results show 3D-WAG achieves superior performance in key metrics like Coverage and MMD.
arXiv Detail & Related papers (2024-11-28T10:33:01Z)
- Any-to-3D Generation via Hybrid Diffusion Supervision [67.54197818071464]
XBind is a unified framework for any-to-3D generation using cross-modal pre-alignment techniques.
XBind integrates a multimodal-aligned encoder with pre-trained diffusion models to generate 3D objects from any modality.
arXiv Detail & Related papers (2024-11-22T03:52:37Z)
- GSD: View-Guided Gaussian Splatting Diffusion for 3D Reconstruction [52.04103235260539]
We present a diffusion model approach based on Gaussian Splatting representation for 3D object reconstruction from a single view.
The model learns to generate 3D objects represented by sets of GS ellipsoids.
The final reconstructed objects explicitly come with high-quality 3D structure and texture, and can be efficiently rendered in arbitrary views.
arXiv Detail & Related papers (2024-07-05T03:43:08Z)
- VividDreamer: Towards High-Fidelity and Efficient Text-to-3D Generation [69.68568248073747]
We propose Pose-dependent Consistency Distillation Sampling (PCDS), a novel yet efficient objective for diffusion-based 3D generation tasks.
PCDS builds the pose-dependent consistency function within diffusion trajectories, allowing true gradients to be approximated through minimal sampling steps.
For efficient generation, we propose a coarse-to-fine optimization strategy, which first utilizes 1-step PCDS to create the basic structure of 3D objects, and then gradually increases PCDS steps to generate fine-grained details.
arXiv Detail & Related papers (2024-06-21T08:21:52Z)
- DiffTF++: 3D-aware Diffusion Transformer for Large-Vocabulary 3D Generation [53.20147419879056]
We introduce a diffusion-based feed-forward framework to address the challenges of large-vocabulary 3D generation with a single model.
Building upon our 3D-aware Diffusion model with TransFormer, we propose a stronger version for 3D generation, i.e., DiffTF++.
Experiments on ShapeNet and OmniObject3D convincingly demonstrate the effectiveness of our proposed modules.
arXiv Detail & Related papers (2024-05-13T17:59:51Z)
- Interactive3D: Create What You Want by Interactive 3D Generation [13.003964182554572]
We introduce Interactive3D, an innovative framework for interactive 3D generation that grants users precise control over the generative process.
Our experiments demonstrate that Interactive3D markedly improves the controllability and quality of 3D generation.
arXiv Detail & Related papers (2024-04-25T11:06:57Z)
- Pushing Auto-regressive Models for 3D Shape Generation at Capacity and Scalability [118.26563926533517]
Auto-regressive models have achieved impressive results in 2D image generation by modeling joint distributions in grid space.
We extend auto-regressive models to 3D domains, and seek a stronger ability of 3D shape generation by improving auto-regressive models at capacity and scalability simultaneously.
arXiv Detail & Related papers (2024-02-19T15:33:09Z)
- Large-Vocabulary 3D Diffusion Model with Transformer [57.076986347047]
We introduce a diffusion-based feed-forward framework for synthesizing massive categories of real-world 3D objects with a single generative model.
We propose a novel triplane-based 3D-aware Diffusion model with TransFormer, DiffTF, for handling challenges via three aspects.
Experiments on ShapeNet and OmniObject3D convincingly demonstrate that a single DiffTF model achieves state-of-the-art large-vocabulary 3D object generation performance.
arXiv Detail & Related papers (2023-09-14T17:59:53Z)
- Octree Transformer: Autoregressive 3D Shape Generation on Hierarchically Structured Sequences [11.09257948735229]
Autoregressive models have proven to be very powerful in NLP text generation tasks.
We introduce an adaptive compression scheme, that significantly reduces sequence lengths.
We demonstrate the performance of our model by comparing against the state-of-the-art in shape generation.
arXiv Detail & Related papers (2021-11-24T13:17:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.