GaussTR: Foundation Model-Aligned Gaussian Transformer for Self-Supervised 3D Spatial Understanding
- URL: http://arxiv.org/abs/2412.13193v1
- Date: Tue, 17 Dec 2024 18:59:46 GMT
- Title: GaussTR: Foundation Model-Aligned Gaussian Transformer for Self-Supervised 3D Spatial Understanding
- Authors: Haoyi Jiang, Liu Liu, Tianheng Cheng, Xinjie Wang, Tianwei Lin, Zhizhong Su, Wenyu Liu, Xinggang Wang
- Abstract summary: We introduce GaussTR, a novel Gaussian Transformer to advance self-supervised 3D spatial understanding.
GaussTR adopts a Transformer architecture to predict sparse sets of 3D Gaussians that represent scenes in a feed-forward manner.
Empirical evaluations on the Occ3D-nuScenes dataset showcase GaussTR's state-of-the-art zero-shot performance.
- Score: 44.68350305790145
- Abstract: 3D Semantic Occupancy Prediction is fundamental for spatial understanding as it provides a comprehensive semantic cognition of surrounding environments. However, prevalent approaches primarily rely on extensive labeled data and computationally intensive voxel-based modeling, restricting the scalability and generalizability of 3D representation learning. In this paper, we introduce GaussTR, a novel Gaussian Transformer that leverages alignment with foundation models to advance self-supervised 3D spatial understanding. GaussTR adopts a Transformer architecture to predict sparse sets of 3D Gaussians that represent scenes in a feed-forward manner. Through aligning rendered Gaussian features with diverse knowledge from pre-trained foundation models, GaussTR facilitates the learning of versatile 3D representations and enables open-vocabulary occupancy prediction without explicit annotations. Empirical evaluations on the Occ3D-nuScenes dataset showcase GaussTR's state-of-the-art zero-shot performance, achieving 11.70 mIoU while reducing training duration by approximately 50%. These experimental results highlight the significant potential of GaussTR for scalable and holistic 3D spatial understanding, with promising implications for autonomous driving and embodied agents. Code is available at https://github.com/hustvl/GaussTR.
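The abstract describes the recipe only at a high level. Below is a minimal, hypothetical PyTorch sketch of the idea (not the authors' released code at https://github.com/hustvl/GaussTR): a fixed set of learnable queries is decoded against image features into sparse Gaussian parameters plus a per-Gaussian feature vector, and a cosine loss aligns those features with features from a frozen foundation model such as CLIP. All module names, dimensions, and the simplified handling of rendering are illustrative assumptions.

```python
# Minimal sketch of the GaussTR idea (illustrative assumptions, not the paper's code):
# learnable queries -> Transformer decoder -> sparse 3D Gaussian parameters + features,
# supervised by aligning the features with a frozen foundation model (e.g. CLIP).
# Rotation/covariance heads and the differentiable Gaussian renderer are omitted for brevity.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GaussianTransformerSketch(nn.Module):
    def __init__(self, num_gaussians=256, d_model=256, feat_dim=512):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_gaussians, d_model))
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=3)
        # Heads predicting Gaussian parameters and a distillable feature vector.
        self.center_head = nn.Linear(d_model, 3)    # xyz in normalized scene coordinates
        self.scale_head = nn.Linear(d_model, 3)     # per-axis scale (predicted in log-space)
        self.opacity_head = nn.Linear(d_model, 1)
        self.feat_head = nn.Linear(d_model, feat_dim)

    def forward(self, image_tokens):
        # image_tokens: (B, N, d_model) features from a frozen image backbone.
        q = self.queries.unsqueeze(0).expand(image_tokens.size(0), -1, -1)
        x = self.decoder(q, image_tokens)
        return {
            "centers": self.center_head(x).tanh(),
            "scales": self.scale_head(x).exp(),
            "opacities": self.opacity_head(x).sigmoid(),
            "features": self.feat_head(x),
        }

def feature_alignment_loss(pred_feats, teacher_feats):
    # Cosine alignment between per-Gaussian features and foundation-model features
    # (assumed here to be precomputed and matched to the Gaussians).
    return 1.0 - F.cosine_similarity(pred_feats, teacher_feats, dim=-1).mean()

if __name__ == "__main__":
    model = GaussianTransformerSketch()
    tokens = torch.randn(2, 100, 256)    # stand-in for backbone features
    out = model(tokens)
    teacher = torch.randn(2, 256, 512)   # stand-in for CLIP-like teacher features
    loss = feature_alignment_loss(out["features"], teacher)
    loss.backward()
    print(out["centers"].shape, loss.item())
```

At inference time, open-vocabulary occupancy could then be read out by comparing the aligned per-Gaussian features against text embeddings of class names; that step, and the Gaussian-splatting renderer the paper supervises through, are not shown in this sketch.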
Related papers
- GaussianAD: Gaussian-Centric End-to-End Autonomous Driving [23.71316979650116]
Vision-based autonomous driving shows great potential due to its satisfactory performance and low costs.
Most existing methods adopt dense representations (e.g., bird's eye view) or sparse representations (e.g., instance boxes) for decision-making.
This paper explores a Gaussian-centric end-to-end autonomous driving framework and exploits 3D semantic Gaussians to extensively yet sparsely describe the scene.
arXiv Detail & Related papers (2024-12-13T18:59:30Z)
- GaussianFormer-2: Probabilistic Gaussian Superposition for Efficient 3D Occupancy Prediction [55.60972844777044]
3D semantic occupancy prediction is an important task for robust vision-centric autonomous driving.
Most existing methods leverage dense grid-based scene representations, overlooking the spatial sparsity of the driving scenes.
We propose a probabilistic Gaussian superposition model which interprets each Gaussian as a probability distribution of its neighborhood being occupied.
arXiv Detail & Related papers (2024-12-05T17:59:58Z)
- L3DG: Latent 3D Gaussian Diffusion [74.36431175937285]
L3DG is the first approach for generative 3D modeling of 3D Gaussians through a latent 3D Gaussian diffusion formulation.
We employ a sparse convolutional architecture to efficiently operate on room-scale scenes.
By leveraging the 3D Gaussian representation, the generated scenes can be rendered from arbitrary viewpoints in real-time.
arXiv Detail & Related papers (2024-10-17T13:19:32Z)
- Atlas Gaussians Diffusion for 3D Generation [37.68480030996363]
Latent diffusion models have proven effective in developing novel 3D generation techniques.
A key challenge is designing a high-fidelity and efficient representation that links the latent space and the 3D space.
We introduce Atlas Gaussians, a novel representation for feed-forward native 3D generation.
arXiv Detail & Related papers (2024-08-23T13:27:27Z)
- ShapeSplat: A Large-scale Dataset of Gaussian Splats and Their Self-Supervised Pretraining [104.34751911174196]
We build a large-scale dataset of 3D Gaussian splats (3DGS) from the ShapeNet and ModelNet datasets.
Our dataset ShapeSplat consists of 65K objects from 87 unique categories.
We introduce Gaussian-MAE, which highlights the unique benefits of representation learning from Gaussian parameters.
arXiv Detail & Related papers (2024-08-20T14:49:14Z)
- GaussianFormer: Scene as Gaussians for Vision-Based 3D Semantic Occupancy Prediction [70.65250036489128]
3D semantic occupancy prediction aims to obtain 3D fine-grained geometry and semantics of the surrounding scene.
We propose an object-centric representation to describe 3D scenes with sparse 3D semantic Gaussians.
GaussianFormer achieves performance comparable to state-of-the-art methods with only 17.8%-24.8% of their memory consumption.
arXiv Detail & Related papers (2024-05-27T17:59:51Z)
- 3DGSR: Implicit Surface Reconstruction with 3D Gaussian Splatting [58.95801720309658]
In this paper, we present an implicit surface reconstruction method with 3D Gaussian Splatting (3DGS), namely 3DGSR.
The key insight is incorporating an implicit signed distance field (SDF) within 3D Gaussians to enable them to be aligned and jointly optimized.
Our experimental results demonstrate that our 3DGSR method enables high-quality 3D surface reconstruction while preserving the efficiency and rendering quality of 3DGS.
arXiv Detail & Related papers (2024-03-30T16:35:38Z)
- Sparse-view CT Reconstruction with 3D Gaussian Volumetric Representation [13.667470059238607]
Sparse-view CT is a promising strategy for reducing the radiation dose of traditional CT scans.
Recently, 3D Gaussians have been applied to model complex natural scenes.
We investigate their potential for sparse-view CT reconstruction.
arXiv Detail & Related papers (2023-12-25T09:47:33Z)