GaussTR: Foundation Model-Aligned Gaussian Transformer for Self-Supervised 3D Spatial Understanding
- URL: http://arxiv.org/abs/2412.13193v1
- Date: Tue, 17 Dec 2024 18:59:46 GMT
- Title: GaussTR: Foundation Model-Aligned Gaussian Transformer for Self-Supervised 3D Spatial Understanding
- Authors: Haoyi Jiang, Liu Liu, Tianheng Cheng, Xinjie Wang, Tianwei Lin, Zhizhong Su, Wenyu Liu, Xinggang Wang
- Abstract summary: We introduce GaussTR, a novel Gaussian Transformer to advance self-supervised 3D spatial understanding.
GaussTR adopts a Transformer architecture to predict sparse sets of 3D Gaussians that represent scenes in a feed-forward manner.
Empirical evaluations on the Occ3D-nuScenes dataset showcase GaussTR's state-of-the-art zero-shot performance.
- Score: 44.68350305790145
- Abstract: 3D Semantic Occupancy Prediction is fundamental for spatial understanding as it provides a comprehensive semantic cognition of surrounding environments. However, prevalent approaches primarily rely on extensive labeled data and computationally intensive voxel-based modeling, restricting the scalability and generalizability of 3D representation learning. In this paper, we introduce GaussTR, a novel Gaussian Transformer that leverages alignment with foundation models to advance self-supervised 3D spatial understanding. GaussTR adopts a Transformer architecture to predict sparse sets of 3D Gaussians that represent scenes in a feed-forward manner. Through aligning rendered Gaussian features with diverse knowledge from pre-trained foundation models, GaussTR facilitates the learning of versatile 3D representations and enables open-vocabulary occupancy prediction without explicit annotations. Empirical evaluations on the Occ3D-nuScenes dataset showcase GaussTR's state-of-the-art zero-shot performance, achieving 11.70 mIoU while reducing training duration by approximately 50%. These experimental results highlight the significant potential of GaussTR for scalable and holistic 3D spatial understanding, with promising implications for autonomous driving and embodied agents. Code is available at https://github.com/hustvl/GaussTR.
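The abstract describes the recipe only at a high level. Below is a minimal, hypothetical PyTorch sketch of the idea (not the authors' released code at https://github.com/hustvl/GaussTR): a fixed set of learnable queries is decoded against image features into sparse Gaussian parameters plus a per-Gaussian feature vector, and a cosine loss aligns those features with features from a frozen foundation model such as CLIP. All module names, dimensions, and the simplified handling of rendering are illustrative assumptions.

```python
# Minimal sketch of the GaussTR idea (illustrative assumptions, not the paper's code):
# learnable queries -> Transformer decoder -> sparse 3D Gaussian parameters + features,
# supervised by aligning the features with a frozen foundation model (e.g. CLIP).
# Rotation/covariance heads and the differentiable Gaussian renderer are omitted for brevity.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GaussianTransformerSketch(nn.Module):
    def __init__(self, num_gaussians=256, d_model=256, feat_dim=512):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_gaussians, d_model))
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=3)
        # Heads predicting Gaussian parameters and a distillable feature vector.
        self.center_head = nn.Linear(d_model, 3)    # xyz in normalized scene coordinates
        self.scale_head = nn.Linear(d_model, 3)     # per-axis scale (predicted in log-space)
        self.opacity_head = nn.Linear(d_model, 1)
        self.feat_head = nn.Linear(d_model, feat_dim)

    def forward(self, image_tokens):
        # image_tokens: (B, N, d_model) features from a frozen image backbone.
        q = self.queries.unsqueeze(0).expand(image_tokens.size(0), -1, -1)
        x = self.decoder(q, image_tokens)
        return {
            "centers": self.center_head(x).tanh(),
            "scales": self.scale_head(x).exp(),
            "opacities": self.opacity_head(x).sigmoid(),
            "features": self.feat_head(x),
        }

def feature_alignment_loss(pred_feats, teacher_feats):
    # Cosine alignment between per-Gaussian features and foundation-model features
    # (assumed here to be precomputed and matched to the Gaussians).
    return 1.0 - F.cosine_similarity(pred_feats, teacher_feats, dim=-1).mean()

if __name__ == "__main__":
    model = GaussianTransformerSketch()
    tokens = torch.randn(2, 100, 256)    # stand-in for backbone features
    out = model(tokens)
    teacher = torch.randn(2, 256, 512)   # stand-in for CLIP-like teacher features
    loss = feature_alignment_loss(out["features"], teacher)
    loss.backward()
    print(out["centers"].shape, loss.item())
```

At inference time, open-vocabulary occupancy could then be read out by comparing the aligned per-Gaussian features against text embeddings of class names; that step, and the Gaussian-splatting renderer the paper supervises through, are not shown in this sketch.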
Related papers
- GaussianAD: Gaussian-Centric End-to-End Autonomous Driving [23.71316979650116]
Vision-based autonomous driving shows great potential due to its satisfactory performance and low costs.
Most existing methods adopt dense representations (e.g., bird's eye view) or sparse representations (e.g., instance boxes) for decision-making.
This paper explores a Gaussian-centric end-to-end autonomous driving framework and exploits 3D semantic Gaussians to extensively yet sparsely describe the scene.
arXiv Detail & Related papers (2024-12-13T18:59:30Z)
- GaussianFormer-2: Probabilistic Gaussian Superposition for Efficient 3D Occupancy Prediction [55.60972844777044]
3D semantic occupancy prediction is an important task for robust vision-centric autonomous driving.
Most existing methods leverage dense grid-based scene representations, overlooking the spatial sparsity of the driving scenes.
We propose a probabilistic Gaussian superposition model which interprets each Gaussian as a probability distribution of its neighborhood being occupied.
arXiv Detail & Related papers (2024-12-05T17:59:58Z)
- L3DG: Latent 3D Gaussian Diffusion [74.36431175937285]
L3DG is the first approach for generative 3D modeling of 3D Gaussians through a latent 3D Gaussian diffusion formulation.
We employ a sparse convolutional architecture to efficiently operate on room-scale scenes.
By leveraging the 3D Gaussian representation, the generated scenes can be rendered from arbitrary viewpoints in real-time.
arXiv Detail & Related papers (2024-10-17T13:19:32Z)
- Atlas Gaussians Diffusion for 3D Generation [37.68480030996363]
Latent diffusion models have proven effective in developing novel 3D generation techniques.
A key challenge is designing a high-fidelity and efficient representation that links the latent space and the 3D space.
We introduce Atlas Gaussians, a novel representation for feed-forward native 3D generation.
arXiv Detail & Related papers (2024-08-23T13:27:27Z)
- ShapeSplat: A Large-scale Dataset of Gaussian Splats and Their Self-Supervised Pretraining [104.34751911174196]
We build a large-scale dataset of 3D Gaussian splats (3DGS) from the ShapeNet and ModelNet datasets.
Our dataset ShapeSplat consists of 65K objects from 87 unique categories.
We introduce Gaussian-MAE, which highlights the unique benefits of representation learning from Gaussian parameters.
arXiv Detail & Related papers (2024-08-20T14:49:14Z)
- GaussianFormer: Scene as Gaussians for Vision-Based 3D Semantic Occupancy Prediction [70.65250036489128]
3D semantic occupancy prediction aims to obtain 3D fine-grained geometry and semantics of the surrounding scene.
We propose an object-centric representation to describe 3D scenes with sparse 3D semantic Gaussians.
GaussianFormer achieves performance comparable to state-of-the-art methods with only 17.8%-24.8% of their memory consumption.
arXiv Detail & Related papers (2024-05-27T17:59:51Z)
- 3DGSR: Implicit Surface Reconstruction with 3D Gaussian Splatting [58.95801720309658]
In this paper, we present an implicit surface reconstruction method with 3D Gaussian Splatting (3DGS), namely 3DGSR.
The key insight is incorporating an implicit signed distance field (SDF) within 3D Gaussians to enable them to be aligned and jointly optimized.
Our experimental results demonstrate that our 3DGSR method enables high-quality 3D surface reconstruction while preserving the efficiency and rendering quality of 3DGS.
arXiv Detail & Related papers (2024-03-30T16:35:38Z)
- Sparse-view CT Reconstruction with 3D Gaussian Volumetric Representation [13.667470059238607]
Sparse-view CT is a promising strategy for reducing the radiation dose of traditional CT scans.
Recently, 3D Gaussians have been applied to model complex natural scenes.
We investigate their potential for sparse-view CT reconstruction.
arXiv Detail & Related papers (2023-12-25T09:47:33Z)