G3PT: Unleash the power of Autoregressive Modeling in 3D Generation via Cross-scale Querying Transformer
- URL: http://arxiv.org/abs/2409.06322v1
- Date: Tue, 10 Sep 2024 08:27:19 GMT
- Title: G3PT: Unleash the power of Autoregressive Modeling in 3D Generation via Cross-scale Querying Transformer
- Authors: Jinzhi Zhang, Feng Xiong, Mu Xu
- Abstract summary: We introduce G3PT, a scalable coarse-to-fine 3D generative model utilizing a cross-scale querying transformer.
The cross-scale querying transformer connects tokens globally across different levels of detail without requiring an ordered sequence.
Experiments demonstrate that G3PT achieves superior generation quality and generalization ability compared to previous 3D generation methods.
- Score: 4.221298212125194
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Autoregressive transformers have revolutionized generative models in language processing and shown substantial promise in image and video generation. However, these models face significant challenges when extended to 3D generation tasks due to their reliance on next-token prediction to learn token sequences, which is incompatible with the unordered nature of 3D data. Instead of imposing an artificial order on 3D data, in this paper, we introduce G3PT, a scalable coarse-to-fine 3D generative model utilizing a cross-scale querying transformer. The key is to map point-based 3D data into discrete tokens with different levels of detail, naturally establishing a sequential relationship between different levels suitable for autoregressive modeling. Additionally, the cross-scale querying transformer connects tokens globally across different levels of detail without requiring an ordered sequence. Benefiting from this approach, G3PT features a versatile 3D generation pipeline that effortlessly supports diverse conditional structures, enabling the generation of 3D shapes from various types of conditions. Extensive experiments demonstrate that G3PT achieves superior generation quality and generalization ability compared to previous 3D generation methods. Most importantly, for the first time in 3D generation, scaling up G3PT reveals distinct power-law scaling behaviors.
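To make the mechanism concrete, the sketch below illustrates next-scale autoregressive prediction with a cross-attention querying step: learned queries for the next, finer scale attend to all tokens from coarser scales, so no ordering within a scale is ever imposed. The module name `CrossScaleQuery`, the shapes, and the hyperparameters are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch (not the authors' code): next-scale autoregressive
# prediction with a cross-attention "querying" step. Tokens at each scale
# are unordered sets; learned queries for scale k+1 attend to all tokens
# from scales <= k, so no within-scale ordering is imposed.
import torch
import torch.nn as nn

class CrossScaleQuery(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, queries: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # queries: (B, N_{k+1}, D) learned slots for the next, finer scale
        # context: (B, sum of N_j for j <= k, D) tokens from all coarser scales
        out, _ = self.attn(queries, context, context)
        return out

# Toy usage: predict 64 fine-scale token embeddings from 16 coarse tokens.
B, D = 2, 256
coarse = torch.randn(B, 16, D)                 # scale-k tokens (unordered)
slots = nn.Parameter(torch.randn(1, 64, D))    # learned queries for scale k+1
fine = CrossScaleQuery(D)(slots.expand(B, -1, -1), coarse)
print(fine.shape)  # torch.Size([2, 64, 256])
```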
Related papers
- TripoSG: High-Fidelity 3D Shape Synthesis using Large-Scale Rectified Flow Models [69.0220314849478]
TripoSG is a new paradigm capable of generating high-fidelity 3D meshes with precise correspondence to input images.
The resulting 3D shapes exhibit enhanced detail due to high-resolution capabilities and demonstrate exceptional fidelity to input images.
To foster progress and innovation in the field of 3D generation, we will make our model publicly available.
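For context, rectified-flow models in general learn a velocity field along a near-straight noise-to-data path and sample by integrating it. The sketch below shows that generic sampler with a toy stand-in velocity, not TripoSG's actual network.

```python
# Hedged sketch of rectified-flow sampling in general (not TripoSG's model):
# a learned velocity field is integrated with Euler steps along the
# near-straight path from noise to data.
import torch

def sample_rectified_flow(velocity_fn, shape, steps: int = 50):
    x = torch.randn(shape)                  # t = 0: pure noise
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((shape[0],), i * dt)
        x = x + velocity_fn(x, t) * dt      # Euler step along dx/dt = v(x, t)
    return x                                # t = 1: sample

# Toy stand-in velocity field pushing toward a fixed target latent.
target = torch.ones(4, 8)
v = lambda x, t: (target - x) / (1.0 - t.view(-1, 1) + 1e-3)
print(sample_rectified_flow(v, (4, 8)).mean())  # approaches 1.0
```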
arXiv Detail & Related papers (2025-02-10T16:07:54Z)
- F3D-Gaus: Feed-forward 3D-aware Generation on ImageNet with Cycle-Consistent Gaussian Splatting [35.625593119642424]
This paper tackles the problem of generalizable 3D-aware generation from monocular datasets.
We propose a novel feed-forward pipeline based on pixel-aligned Gaussian Splatting.
We also introduce a self-supervised cycle-consistent constraint to enforce cross-view consistency in the learned 3D representation.
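A cycle-consistent constraint of this kind typically penalizes a lift-render round trip between views. The sketch below shows one plausible loss structure; `lift`, `render`, and the pose arguments are hypothetical stand-ins, not the paper's components.

```python
# Generic cycle-consistency sketch (assumed structure, not F3D-Gaus code):
# predict 3D from view A, render it at view B, lift again, render back at A,
# and penalize the round-trip error.
import torch

def cycle_loss(lift, render, img_a, pose_a, pose_b):
    g_a = lift(img_a)                       # image -> 3D representation
    img_b = render(g_a, pose_b)             # novel view B
    g_b = lift(img_b)                       # lift the rendered view again
    img_a2 = render(g_b, pose_a)            # back to the original view
    return torch.mean((img_a - img_a2) ** 2)

# Toy stand-ins to show the call pattern.
lift = lambda img: img * 0.5
render = lambda g, pose: g * 2.0
img = torch.rand(1, 3, 32, 32)
print(cycle_loss(lift, render, img, pose_a=0, pose_b=1))  # ~0 for these stand-ins
```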
arXiv Detail & Related papers (2025-01-12T04:44:44Z)
- TAR3D: Creating High-Quality 3D Assets via Next-Part Prediction [137.34863114016483]
TAR3D is a novel framework that consists of a 3D-aware Vector Quantized-Variational AutoEncoder (VQ-VAE) and a Generative Pre-trained Transformer (GPT).
We show that TAR3D can achieve superior generation quality over existing methods in text-to-3D and image-to-3D tasks.
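The VQ-VAE half of such a pipeline discretizes continuous latents by nearest-codebook lookup, producing the token ids the GPT then models. A generic sketch (TAR3D's exact tokenizer may differ):

```python
# Minimal vector-quantization step as used in VQ-VAEs generally (assumed
# detail; TAR3D's actual tokenizer may differ): each latent is snapped to
# its nearest codebook entry, yielding discrete token ids for a GPT.
import torch

def vector_quantize(z: torch.Tensor, codebook: torch.Tensor):
    # z: (N, D) continuous latents, codebook: (K, D) learned entries
    d = torch.cdist(z, codebook)            # (N, K) pairwise distances
    ids = d.argmin(dim=1)                   # discrete token ids
    return codebook[ids], ids

codebook = torch.randn(512, 64)
z = torch.randn(10, 64)
z_q, ids = vector_quantize(z, codebook)
print(ids.shape, z_q.shape)  # torch.Size([10]) torch.Size([10, 64])
```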
arXiv Detail & Related papers (2024-12-22T08:28:20Z)
- Structured 3D Latents for Scalable and Versatile 3D Generation [28.672494137267837]
We introduce a novel 3D generation method for versatile and high-quality 3D asset creation.
The cornerstone is a unified Structured LATent representation which allows decoding to different output formats.
This is achieved by integrating a sparsely-populated 3D grid with dense multiview visual features extracted from a powerful vision foundation model.
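One plausible reading of that integration is to project each active grid cell into every view and pool the sampled features. The sketch below assumes a toy camera model and nearest-pixel gathering; it is not the paper's actual fusion.

```python
# Hedged sketch of one way to fuse multiview image features into a sparse
# voxel grid (the paper's exact projection/fusion is more involved): project
# each active voxel into every view and average the sampled features.
import torch

def fuse_multiview(voxels, feats, project):
    # voxels: (N, 3) active cell centers; feats: (V, C, H, W) per-view features
    V, C, H, W = feats.shape
    acc = torch.zeros(voxels.shape[0], C)
    for v in range(V):
        uv = project(voxels, v).long().clamp(0, H - 1)   # (N, 2) pixel coords (H == W here)
        acc += feats[v, :, uv[:, 1], uv[:, 0]].T          # gather per-voxel features
    return acc / V

# Toy projection: drop z and rescale into the feature map (assumed camera).
project = lambda p, v: (p[:, :2] + 1) * 15.5
out = fuse_multiview(torch.rand(100, 3) * 2 - 1, torch.rand(4, 32, 32, 32), project)
print(out.shape)  # torch.Size([100, 32])
```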
arXiv Detail & Related papers (2024-12-02T13:58:38Z)
- 3D-WAG: Hierarchical Wavelet-Guided Autoregressive Generation for High-Fidelity 3D Shapes [20.675695749508353]
We introduce 3D-WAG, an AR model for 3D implicit distance fields that can perform unconditional shape generation.
By redefining the 3D AR generation task as "next-scale" prediction, we reduce the computational cost of generation.
Our results show 3D-WAG achieves superior performance in key metrics like Coverage and MMD.
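The hierarchical wavelet view can be illustrated with a standard multi-level 3D discrete wavelet transform, whose coarse approximation and per-level detail coefficients yield a natural coarse-to-fine prediction order. pywt is used purely for illustration; 3D-WAG's actual decomposition and AR head are not shown.

```python
# Sketch of the hierarchical wavelet view of a 3D field: the coarse
# approximation coefficients form the first "scale" and the detail
# coefficients the finer ones, matching a next-scale prediction order.
import numpy as np
import pywt

sdf = np.random.rand(32, 32, 32).astype(np.float32)   # stand-in distance field
coeffs = pywt.wavedecn(sdf, wavelet="haar", level=2)  # multi-level 3D DWT
approx = coeffs[0]                                    # coarsest scale
print(approx.shape, len(coeffs) - 1)                  # (8, 8, 8) 2 detail levels
```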
arXiv Detail & Related papers (2024-11-28T10:33:01Z)
- Any-to-3D Generation via Hybrid Diffusion Supervision [67.54197818071464]
XBind is a unified framework for any-to-3D generation using cross-modal pre-alignment techniques.
XBind integrates a multimodal-aligned encoder with pre-trained diffusion models to generate 3D objects from any modality.
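Cross-modal pre-alignment is commonly implemented as a symmetric contrastive (InfoNCE-style) objective over paired embeddings. The following is a generic sketch of that mechanism and is not necessarily XBind's training loss.

```python
# Generic contrastive pre-alignment sketch (assumed mechanism; XBind's
# objective may differ): embeddings from two modalities are pulled together
# for matching pairs via a symmetric InfoNCE loss.
import torch
import torch.nn.functional as F

def alignment_loss(emb_a: torch.Tensor, emb_b: torch.Tensor, tau: float = 0.07):
    a = F.normalize(emb_a, dim=-1)
    b = F.normalize(emb_b, dim=-1)
    logits = a @ b.T / tau                        # (B, B) similarity matrix
    labels = torch.arange(a.shape[0])             # matching pairs on the diagonal
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2

print(alignment_loss(torch.randn(8, 128), torch.randn(8, 128)))
```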
arXiv Detail & Related papers (2024-11-22T03:52:37Z)
- VividDreamer: Towards High-Fidelity and Efficient Text-to-3D Generation [69.68568248073747]
We propose Pose-dependent Consistency Distillation Sampling (PCDS), a novel yet efficient objective for diffusion-based 3D generation tasks.
PCDS builds the pose-dependent consistency function within diffusion trajectories, allowing true gradients to be approximated through minimal sampling steps.
For efficient generation, we propose a coarse-to-fine optimization strategy, which first utilizes 1-step PCDS to create the basic structure of 3D objects, and then gradually increases PCDS steps to generate fine-grained details.
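The schedule itself can be as simple as the sketch below; the step counts and phase boundary are illustrative assumptions, not the paper's settings.

```python
# Sketch of the coarse-to-fine schedule described above (values illustrative):
# early iterations use 1-step PCDS for global structure, later iterations
# more steps for fine-grained detail.
def pcds_steps(iteration: int, total: int, max_steps: int = 4) -> int:
    frac = iteration / total
    if frac < 0.5:
        return 1                 # coarse phase: single-step updates
    # fine phase: ramp from 2 up to max_steps over the second half
    return min(max_steps, 2 + int((frac - 0.5) * 2 * (max_steps - 1)))

print([pcds_steps(i, 1000) for i in (0, 400, 600, 800, 999)])  # [1, 1, 2, 3, 4]
```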
arXiv Detail & Related papers (2024-06-21T08:21:52Z)
- DiffTF++: 3D-aware Diffusion Transformer for Large-Vocabulary 3D Generation [53.20147419879056]
We introduce a diffusion-based feed-forward framework that addresses the challenges of large-vocabulary 3D generation with a single model.
Building upon our 3D-aware Diffusion model with TransFormer, we propose a stronger version for 3D generation, i.e., DiffTF++.
Experiments on ShapeNet and OmniObject3D convincingly demonstrate the effectiveness of our proposed modules.
arXiv Detail & Related papers (2024-05-13T17:59:51Z)
- Large-Vocabulary 3D Diffusion Model with Transformer [57.076986347047]
We introduce a diffusion-based feed-forward framework for synthesizing massive categories of real-world 3D objects with a single generative model.
We propose a novel triplane-based 3D-aware Diffusion model with TransFormer, DiffTF, for handling challenges via three aspects.
Experiments on ShapeNet and OmniObject3D convincingly demonstrate that a single DiffTF model achieves state-of-the-art large-vocabulary 3D object generation performance.
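Triplane representations answer a 3D point query by projecting it onto three axis-aligned feature planes, bilinearly sampling each, and pooling. This standard lookup (not DiffTF's exact decoder) looks roughly as follows:

```python
# Standard triplane lookup sketch (a common 3D-aware representation; DiffTF's
# exact decoder is not reproduced): a 3D point is projected onto the XY, XZ,
# and YZ planes, and the bilinearly sampled features are summed.
import torch
import torch.nn.functional as F

def query_triplane(planes: torch.Tensor, pts: torch.Tensor) -> torch.Tensor:
    # planes: (3, C, R, R) feature maps; pts: (N, 3) in [-1, 1]
    idx = [(0, 1), (0, 2), (1, 2)]                    # XY, XZ, YZ projections
    feats = 0
    for p, (i, j) in enumerate(idx):
        grid = pts[None, None, :, [i, j]]             # (1, 1, N, 2) sample grid
        s = F.grid_sample(planes[p:p+1], grid, align_corners=True)
        feats = feats + s[0, :, 0, :].T               # (N, C) per-point features
    return feats

planes = torch.randn(3, 32, 64, 64)
print(query_triplane(planes, torch.rand(100, 3) * 2 - 1).shape)  # torch.Size([100, 32])
```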
arXiv Detail & Related papers (2023-09-14T17:59:53Z)
- Octree Transformer: Autoregressive 3D Shape Generation on Hierarchically Structured Sequences [11.09257948735229]
Autoregressive models have proven to be very powerful in NLP text generation tasks.
We introduce an adaptive compression scheme, that significantly reduces sequence lengths.
We demonstrate the performance of our model by comparing against the state-of-the-art in shape generation.
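The intuition behind such compression is that uniform subtrees collapse to single tokens. A toy sketch of that idea follows; the paper's adaptive scheme is more elaborate.

```python
# Toy sketch of collapsing uniform subtrees, the intuition behind shortening
# octree sequences (the paper's adaptive scheme is more sophisticated):
# a block of identical occupancy is emitted as a single leaf token.
import numpy as np

def compress(vox: np.ndarray, tokens: list):
    if vox.min() == vox.max():                # uniform block -> one leaf token
        tokens.append(("leaf", int(vox.flat[0])))
        return
    tokens.append(("split",))                 # mixed block -> recurse into 8 octants
    h = vox.shape[0] // 2
    for x in (0, h):
        for y in (0, h):
            for z in (0, h):
                compress(vox[x:x+h, y:y+h, z:z+h], tokens)

vox = np.zeros((8, 8, 8), dtype=np.int64)
vox[:4, :4, :4] = 1                           # one occupied octant
seq = []
compress(vox, seq)
print(len(seq), "tokens vs", 8**3, "voxels")  # 9 tokens vs 512 voxels
```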
arXiv Detail & Related papers (2021-11-24T13:17:16Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.