GaussTR: Foundation Model-Aligned Gaussian Transformer for Self-Supervised 3D Spatial Understanding
- URL: http://arxiv.org/abs/2412.13193v2
- Date: Mon, 24 Mar 2025 12:45:56 GMT
- Title: GaussTR: Foundation Model-Aligned Gaussian Transformer for Self-Supervised 3D Spatial Understanding
- Authors: Haoyi Jiang, Liu Liu, Tianheng Cheng, Xinjie Wang, Tianwei Lin, Zhizhong Su, Wenyu Liu, Xinggang Wang
- Abstract summary: GaussTR is a novel Transformer framework that unifies sparse 3D modeling with foundation model alignment through Gaussian representations to advance 3D spatial understanding. Experiments on the Occ3D-nuScenes dataset demonstrate GaussTR's state-of-the-art zero-shot performance of 12.27 mIoU, along with a 40% reduction in training time. These results highlight the efficacy of GaussTR for scalable and holistic 3D spatial understanding, with promising implications in autonomous driving and embodied agents.
- Score: 44.68350305790145
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: 3D Semantic Occupancy Prediction is fundamental for spatial understanding, yet existing approaches face challenges in scalability and generalization due to their reliance on extensive labeled data and computationally intensive voxel-wise representations. In this paper, we introduce GaussTR, a novel Gaussian-based Transformer framework that unifies sparse 3D modeling with foundation model alignment through Gaussian representations to advance 3D spatial understanding. GaussTR predicts sparse sets of Gaussians in a feed-forward manner to represent 3D scenes. By splatting the Gaussians into 2D views and aligning the rendered features with foundation models, GaussTR facilitates self-supervised 3D representation learning and enables open-vocabulary semantic occupancy prediction without requiring explicit annotations. Empirical experiments on the Occ3D-nuScenes dataset demonstrate GaussTR's state-of-the-art zero-shot performance of 12.27 mIoU, along with a 40% reduction in training time. These results highlight the efficacy of GaussTR for scalable and holistic 3D spatial understanding, with promising implications in autonomous driving and embodied agents. The code is available at https://github.com/hustvl/GaussTR.
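A minimal, hypothetical PyTorch sketch of the pipeline the abstract describes: a feed-forward head predicts a sparse set of Gaussians, their features are splatted into a 2D view, and the rendering is aligned with features from a frozen 2D foundation model. All module names, dimensions, and the nearest-pixel scatter used here are illustrative assumptions; the actual GaussTR implementation (see the linked repository) uses a proper differentiable Gaussian rasterizer and real teacher features.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GaussianHead(nn.Module):
    """Predicts a sparse set of 3D Gaussians (mean, scale, opacity, feature)
    from learnable queries in a single feed-forward pass (illustrative only)."""
    def __init__(self, num_queries=300, dim=256, feat_dim=512):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.to_mean = nn.Linear(dim, 3)         # Gaussian centers in scene coordinates
        self.to_scale = nn.Linear(dim, 3)        # per-axis scales
        self.to_opacity = nn.Linear(dim, 1)
        self.to_feat = nn.Linear(dim, feat_dim)  # feature aligned with the 2D teacher

    def forward(self):
        q = self.queries
        return {
            "mean": self.to_mean(q),
            "scale": F.softplus(self.to_scale(q)),
            "opacity": torch.sigmoid(self.to_opacity(q)),
            "feat": self.to_feat(q),
        }

def splat_features(gaussians, K, hw=(32, 56)):
    """Very rough stand-in for feature splatting: project Gaussian centers with
    intrinsics K and scatter opacity-weighted features onto a low-res 2D grid.
    A real implementation would rasterize full anisotropic Gaussians."""
    H, W = hw
    mean, feat, opa = gaussians["mean"], gaussians["feat"], gaussians["opacity"]
    cam = mean @ K.T                               # pinhole projection (no extrinsics)
    uv = cam[:, :2] / cam[:, 2:].clamp(min=1e-3)
    u = uv[:, 0].clamp(0, W - 1).long()
    v = uv[:, 1].clamp(0, H - 1).long()
    canvas = torch.zeros(H, W, feat.shape[-1])
    weight = torch.zeros(H, W, 1)
    canvas.index_put_((v, u), feat * opa, accumulate=True)
    weight.index_put_((v, u), opa, accumulate=True)
    return canvas / weight.clamp(min=1e-6)

# Self-supervised alignment: match rendered features to features from a frozen
# 2D foundation model (a random tensor stands in for the teacher here).
head = GaussianHead()
gaussians = head()
K = torch.tensor([[100.0, 0.0, 28.0], [0.0, 100.0, 16.0], [0.0, 0.0, 1.0]])
rendered = splat_features(gaussians, K)            # (H, W, 512)
teacher = torch.randn_like(rendered)               # placeholder foundation-model features
loss = 1 - F.cosine_similarity(rendered, teacher, dim=-1).mean()
loss.backward()
```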
Related papers
- Manboformer: Learning Gaussian Representations via Spatial-temporal Attention Mechanism [0.3277163122167433]
In 3D semantic occupancy prediction for autonomous driving, GaussianFormer proposed describing scenes with sparse, object-centric 3D semantic Gaussians, an alternative to voxel-based grid prediction with lower memory requirements.
In experiments, it is found that the number of Gaussian functions this method requires exceeds the query resolution of the original dense grid network, resulting in impaired performance.
arXiv Detail & Related papers (2025-03-06T09:40:46Z)
- GaussianAD: Gaussian-Centric End-to-End Autonomous Driving [23.71316979650116]
Vision-based autonomous driving shows great potential due to its satisfactory performance and low costs.
Most existing methods adopt dense representations (e.g., bird's eye view) or sparse representations (e.g., instance boxes) for decision-making.
This paper explores a Gaussian-centric end-to-end autonomous driving framework and exploits 3D semantic Gaussians to extensively yet sparsely describe the scene.
arXiv Detail & Related papers (2024-12-13T18:59:30Z)
- GaussianFormer-2: Probabilistic Gaussian Superposition for Efficient 3D Occupancy Prediction [55.60972844777044]
3D semantic occupancy prediction is an important task for robust vision-centric autonomous driving.
Most existing methods leverage dense grid-based scene representations, overlooking the spatial sparsity of driving scenes.
We propose a probabilistic Gaussian superposition model which interprets each Gaussian as a probability distribution of its neighborhood being occupied.
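As a concrete reading of that idea, here is a small, hedged Python sketch in which a query point's occupancy probability is the superposition of per-Gaussian occupancy probabilities: the point is occupied unless every Gaussian leaves it empty. The exact formulation in GaussianFormer-2 may differ; this only illustrates treating each Gaussian as a local occupancy distribution.

```python
import torch

def occupancy_probability(x, means, covs, alphas):
    """Occupancy of point x under a set of Gaussians, each contributing an
    occupancy probability alpha_i scaled by its Mahalanobis falloff.
    Illustrative assumption, not the paper's exact model."""
    diff = x[None, :] - means                       # (G, 3)
    inv = torch.linalg.inv(covs)                    # (G, 3, 3)
    maha = torch.einsum("gi,gij,gj->g", diff, inv, diff)
    p_i = alphas * torch.exp(-0.5 * maha)           # per-Gaussian occupancy probability
    return 1.0 - torch.prod(1.0 - p_i)              # occupied by at least one Gaussian

# Toy example: two unit-covariance Gaussians, query point near the first one.
means = torch.tensor([[0.0, 0.0, 0.0], [5.0, 0.0, 0.0]])
covs = torch.eye(3).expand(2, 3, 3).clone()
alphas = torch.tensor([0.9, 0.8])
print(occupancy_probability(torch.tensor([0.5, 0.0, 0.0]), means, covs, alphas))
```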
arXiv Detail & Related papers (2024-12-05T17:59:58Z)
- L3DG: Latent 3D Gaussian Diffusion [74.36431175937285]
L3DG is the first approach for generative 3D modeling of 3D Gaussians through a latent 3D Gaussian diffusion formulation.
We employ a sparse convolutional architecture to efficiently operate on room-scale scenes.
By leveraging the 3D Gaussian representation, the generated scenes can be rendered from arbitrary viewpoints in real-time.
arXiv Detail & Related papers (2024-10-17T13:19:32Z)
- Atlas Gaussians Diffusion for 3D Generation [37.68480030996363]
Latent diffusion models have proven effective for developing novel 3D generation techniques.
A key challenge is designing a high-fidelity and efficient representation that links the latent space and the 3D space.
We introduce Atlas Gaussians, a novel representation for feed-forward native 3D generation.
arXiv Detail & Related papers (2024-08-23T13:27:27Z)
- GSGAN: Adversarial Learning for Hierarchical Generation of 3D Gaussian Splats [20.833116566243408]
In this paper, we exploit Gaussians as a 3D representation for 3D GANs by leveraging their efficient and explicit characteristics.
We introduce a generator architecture with a hierarchical multi-scale Gaussian representation that effectively regularizes the position and scale of generated Gaussians.
Experimental results demonstrate that our method achieves a significantly faster rendering speed (roughly 100x) compared to state-of-the-art 3D-consistent GANs.
arXiv Detail & Related papers (2024-06-05T05:52:20Z)
- GaussianFormer: Scene as Gaussians for Vision-Based 3D Semantic Occupancy Prediction [70.65250036489128]
3D semantic occupancy prediction aims to obtain 3D fine-grained geometry and semantics of the surrounding scene.
We propose an object-centric representation to describe 3D scenes with sparse 3D semantic Gaussians.
GaussianFormer achieves performance comparable to that of state-of-the-art methods with only 17.8%-24.8% of their memory consumption.
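To see why an object-centric, sparse Gaussian representation can cut memory, here is a back-of-the-envelope Python comparison against a dense voxel grid. The grid size matches the common 200 x 200 x 16 Occ3D-nuScenes layout, but the Gaussian budget and per-Gaussian parameterization are assumptions for illustration, not GaussianFormer's reported configuration or memory figures.

```python
# Dense voxel grid: one semantic logit vector per voxel.
num_classes = 18                                   # e.g. 17 semantic classes + free space
dense_cells = 200 * 200 * 16
dense_floats = dense_cells * num_classes

# Sparse semantic Gaussians: mean, scale, rotation (quaternion), opacity, logits.
num_gaussians = 25_600                             # assumed sparse budget
per_gaussian = 3 + 3 + 4 + 1 + num_classes
sparse_floats = num_gaussians * per_gaussian

print(f"dense grid : {dense_floats / 1e6:.2f} M floats")
print(f"gaussians  : {sparse_floats / 1e6:.2f} M floats")
```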
arXiv Detail & Related papers (2024-05-27T17:59:51Z)
- GaussianCube: A Structured and Explicit Radiance Representation for 3D Generative Modeling [55.05713977022407]
We introduce a radiance representation that is both structured and fully explicit and thus greatly facilitates 3D generative modeling.
We derive GaussianCube using a novel densification-constrained Gaussian fitting algorithm, which yields high-accuracy fitting.
Experiments on unconditional and class-conditioned object generation, digital avatar creation, and text-to-3D all show that our model achieves state-of-the-art generation results.
arXiv Detail & Related papers (2024-03-28T17:59:50Z)
- Mesh-based Gaussian Splatting for Real-time Large-scale Deformation [58.18290393082119]
It is challenging for users to directly deform or manipulate implicit representations with large deformations in real time.
We develop a novel GS-based method that enables interactive deformation.
Our approach achieves high-quality reconstruction and effective deformation, while maintaining promising rendering results at a high frame rate.
arXiv Detail & Related papers (2024-02-07T12:36:54Z)
- Sparse-view CT Reconstruction with 3D Gaussian Volumetric Representation [13.667470059238607]
Sparse-view CT is a promising strategy for reducing the radiation dose of traditional CT scans.
Recently, 3D Gaussians have been applied to model complex natural scenes.
We investigate their potential for sparse-view CT reconstruction.
arXiv Detail & Related papers (2023-12-25T09:47:33Z)
- GIR: 3D Gaussian Inverse Rendering for Relightable Scene Factorization [62.13932669494098]
This paper presents a 3D Gaussian Inverse Rendering (GIR) method, employing 3D Gaussian representations to factorize the scene into material properties, light, and geometry.
We compute the normal of each 3D Gaussian from its shortest eigenvector (the axis of smallest variance), with a directional masking scheme forcing accurate normal estimation without external supervision.
We adopt an efficient voxel-based indirect illumination tracing scheme that stores direction-aware outgoing radiance in each 3D Gaussian to disentangle secondary illumination for approximating multi-bounce light transport.
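A hedged PyTorch sketch of the normal estimation idea above: the eigenvector of each Gaussian's covariance with the smallest eigenvalue (its thinnest axis) serves as the normal, flipped to face the camera. The flipping rule stands in for the directional masking scheme and is an assumption, not GIR's actual code.

```python
import torch

def gaussian_normals(covariances, view_dirs):
    """Normal per Gaussian = eigenvector of its covariance with the smallest
    eigenvalue, oriented toward the viewer.  view_dirs are unit directions
    from the camera to each Gaussian (orientation rule is an assumption)."""
    eigvals, eigvecs = torch.linalg.eigh(covariances)   # eigenvalues in ascending order
    normals = eigvecs[..., :, 0]                         # shortest principal axis
    sign = torch.sign((normals * -view_dirs).sum(-1, keepdim=True))
    return normals * torch.where(sign == 0, torch.ones_like(sign), sign)

# Toy example: four flat "disc" Gaussians (thin along z), viewed along +z.
covs = torch.diag(torch.tensor([1.0, 1.0, 0.01])).expand(4, 3, 3).clone()
dirs = torch.tensor([[0.0, 0.0, 1.0]]).expand(4, 3)
print(gaussian_normals(covs, dirs))
```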
arXiv Detail & Related papers (2023-12-08T16:05:15Z)