Related papers: GaussianCross: Cross-modal Self-supervised 3D Representation Learning via Gaussian Splatting

URL: http://arxiv.org/abs/2508.02172v1
Date: Mon, 04 Aug 2025 08:12:44 GMT
Title: GaussianCross: Cross-modal Self-supervised 3D Representation Learning via Gaussian Splatting
Authors: Lei Yao, Yi Wang, Yi Zhang, Moyun Liu, Lap-Pui Chau
Abstract summary: We present GaussianCross, a novel cross-modal self-supervised 3D representation learning architecture. GaussianCross seamlessly converts scale-inconsistent 3D point clouds into a unified cuboid-normalized Gaussian representation. It achieves superior performance through linear probing (<0.1% parameters) and limited data training (1% of scenes) compared to state-of-the-art methods.
Score: 16.179607149692398
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The significance of informative and robust point representations has been widely acknowledged for 3D scene understanding. Despite existing self-supervised pre-training counterparts demonstrating promising performance, the model collapse and structural information deficiency remain prevalent due to insufficient point discrimination difficulty, yielding unreliable expressions and suboptimal performance. In this paper, we present GaussianCross, a novel cross-modal self-supervised 3D representation learning architecture integrating feed-forward 3D Gaussian Splatting (3DGS) techniques to address current challenges. GaussianCross seamlessly converts scale-inconsistent 3D point clouds into a unified cuboid-normalized Gaussian representation without missing details, enabling stable and generalizable pre-training. Subsequently, a tri-attribute adaptive distillation splatting module is incorporated to construct a 3D feature field, facilitating synergetic feature capturing of appearance, geometry, and semantic cues to maintain cross-modal consistency. To validate GaussianCross, we perform extensive evaluations on various benchmarks, including ScanNet, ScanNet200, and S3DIS. In particular, GaussianCross shows a prominent parameter and data efficiency, achieving superior performance through linear probing (<0.1% parameters) and limited data training (1% of scenes) compared to state-of-the-art methods. Furthermore, GaussianCross demonstrates strong generalization capabilities, improving the full fine-tuning accuracy by 9.3% mIoU and 6.1% AP$_{50}$ on ScanNet200 semantic and instance segmentation tasks, respectively, supporting the effectiveness of our approach. The code, weights, and visualizations are publicly available at https://rayyoh.github.io/GaussianCross/.
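
For readers unfamiliar with the mIoU figure quoted in the abstract, the following is a minimal sketch of how mean intersection-over-union is commonly computed for semantic segmentation. It is not taken from the GaussianCross code: the function name `mean_iou` and the toy labels are illustrative assumptions, and real benchmarks such as ScanNet typically accumulate counts over the whole validation set and skip an "ignore" label, which this sketch leaves out.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over the classes that appear in pred or target.

    pred and target are equal-length sequences of integer class ids
    in the range [0, num_classes). Hypothetical helper for illustration.
    """
    intersection = [0] * num_classes  # true positives per class
    union = [0] * num_classes         # TP + FP + FN per class
    for p, t in zip(pred, target):
        if p == t:
            intersection[p] += 1
            union[p] += 1
        else:
            union[p] += 1  # false positive for the predicted class
            union[t] += 1  # false negative for the true class
    # Classes absent from both pred and target are excluded from the mean.
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    return sum(ious) / len(ious) if ious else 0.0


# Toy example: six points, three classes.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
print(mean_iou(pred, target, num_classes=3))  # per-class IoU: 1/3, 2/3, 1/2
```

Because every class contributes equally to the mean regardless of how many points it covers, rare classes weigh as much as common ones, which is one reason the metric is used on long-tailed label sets such as ScanNet200.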

Related papers

Joint Semantic and Rendering Enhancements in 3D Gaussian Modeling with Anisotropic Local Encoding [86.55824709875598]
We propose a joint enhancement framework for 3D semantic Gaussian modeling that synergizes both semantic and rendering branches. Unlike conventional point cloud shape encoding, we introduce an anisotropic 3D Gaussian Chebyshev descriptor to capture fine-grained 3D shape details. We employ a cross-scene knowledge transfer module to continuously update learned shape patterns, enabling faster convergence and robust representations.
arXiv Detail & Related papers (2026-01-05T18:33:50Z)
GauSSmart: Enhanced 3D Reconstruction through 2D Foundation Models and Geometric Filtering [50.675710727721786]
We propose GauSSmart, a hybrid method that bridges 2D foundational models and 3D Gaussian Splatting reconstruction. Our approach integrates established 2D computer vision techniques, including convex filtering and semantic feature supervision. We validate our approach across three datasets, where GauSSmart consistently outperforms existing Gaussian Splatting.
arXiv Detail & Related papers (2025-10-16T03:38:26Z)
LabelGS: Label-Aware 3D Gaussian Splatting for 3D Scene Segmentation [56.4321049923868]
3D Gaussian Splatting (3DGS) has emerged as a novel explicit representation for 3D scenes, offering both high-fidelity reconstruction and efficient rendering. We propose Label-aware 3D Gaussian Splatting (LabelGS), a method that augments the Gaussian representation with object label. LabelGS achieves a remarkable 22X speedup in training compared to Feature-3DGS, at a resolution of 1440X1080.
arXiv Detail & Related papers (2025-08-27T09:07:38Z)
3DGEER: Exact and Efficient Volumetric Rendering with 3D Gaussians [15.776720879897345]
We introduce 3DGEER, an Exact and Efficient Volumetric Gaussian Rendering method. Our method consistently outperforms prior methods, establishing a new state-of-the-art in real-time neural rendering.
arXiv Detail & Related papers (2025-05-29T22:52:51Z)
GaussTR: Foundation Model-Aligned Gaussian Transformer for Self-Supervised 3D Spatial Understanding [44.68350305790145]
GaussTR is a novel Transformer framework that unifies sparse 3D modeling with foundation model alignment through Gaussian representations to advance 3D spatial understanding. Experiments on the Occ3D-nuScenes dataset demonstrate GaussTR's state-of-the-art zero-shot performance of 12.27 mIoU, along with a 40% reduction in training time. These results highlight the efficacy of GaussTR for scalable and holistic 3D spatial understanding, with promising implications in autonomous driving and embodied agents.
arXiv Detail & Related papers (2024-12-17T18:59:46Z)
GaussianFormer-2: Probabilistic Gaussian Superposition for Efficient 3D Occupancy Prediction [55.60972844777044]
3D semantic occupancy prediction is an important task for robust vision-centric autonomous driving. Most existing methods leverage dense grid-based scene representations, overlooking the spatial sparsity of the driving scenes. We propose a probabilistic Gaussian superposition model which interprets each Gaussian as a probability distribution of its neighborhood being occupied.
arXiv Detail & Related papers (2024-12-05T17:59:58Z)
ShapeSplat: A Large-scale Dataset of Gaussian Splats and Their Self-Supervised Pretraining [104.34751911174196]
We build a large-scale dataset of 3DGS using ShapeNet and ModelNet datasets. Our dataset ShapeSplat consists of 65K objects from 87 unique categories. We introduce Gaussian-MAE, which highlights the unique benefits of representation learning from Gaussian parameters.
arXiv Detail & Related papers (2024-08-20T14:49:14Z)
GaussianFormer: Scene as Gaussians for Vision-Based 3D Semantic Occupancy Prediction [70.65250036489128]
3D semantic occupancy prediction aims to obtain 3D fine-grained geometry and semantics of the surrounding scene. We propose an object-centric representation to describe 3D scenes with sparse 3D semantic Gaussians. GaussianFormer achieves comparable performance with state-of-the-art methods with only 17.8% - 24.8% of their memory consumption.
arXiv Detail & Related papers (2024-05-27T17:59:51Z)
CLIP-GS: CLIP-Informed Gaussian Splatting for View-Consistent 3D Indoor Semantic Understanding [17.440124130814166]
Exploiting 3D Gaussian Splatting (3DGS) with Contrastive Language-Image Pre-Training (CLIP) models for open-vocabulary 3D semantic understanding of indoor scenes has emerged as an attractive research focus. We present CLIP-GS, efficiently achieving a coherent semantic understanding of 3D indoor scenes via the proposed Semantic Attribute Compactness (SAC) and 3D Coherent Regularization (3DCR). Our method remarkably surpasses existing state-of-the-art approaches, achieving mIoU improvements of 21.20% and 13.05% on ScanNet and Replica datasets, respectively.
arXiv Detail & Related papers (2024-04-22T15:01:32Z)
3DGSR: Implicit Surface Reconstruction with 3D Gaussian Splatting [58.95801720309658]
In this paper, we present an implicit surface reconstruction method with 3D Gaussian Splatting (3DGS), namely 3DGSR. The key insight is incorporating an implicit signed distance field (SDF) within 3D Gaussians to enable them to be aligned and jointly optimized. Our experimental results demonstrate that our 3DGSR method enables high-quality 3D surface reconstruction while preserving the efficiency and rendering quality of 3DGS.
arXiv Detail & Related papers (2024-03-30T16:35:38Z)
latentSplat: Autoencoding Variational Gaussians for Fast Generalizable 3D Reconstruction [48.86083272054711]
latentSplat is a method to predict semantic Gaussians in a 3D latent space that can be splatted and decoded by a light-weight generative 2D architecture. We show that latentSplat outperforms previous works in reconstruction quality and generalization, while being fast and scalable to high-resolution data.
arXiv Detail & Related papers (2024-03-24T20:48:36Z)
Learning Segmented 3D Gaussians via Efficient Feature Unprojection for Zero-shot Neural Scene Segmentation [16.57158278095853]
Zero-shot neural scene segmentation serves as an effective way for scene understanding. Existing models, especially the efficient 3D Gaussian-based methods, struggle to produce compact segmentation results. Our work proposes the Feature Unprojection and Fusion module as the segmentation field. We show that our model surpasses baselines on zero-shot semantic segmentation task, improving by 10% mIoU over the best baseline.
arXiv Detail & Related papers (2024-01-11T14:05:01Z)
GS-SLAM: Dense Visual SLAM with 3D Gaussian Splatting [51.96353586773191]
We introduce GS-SLAM that first utilizes 3D Gaussian representation in the Simultaneous Localization and Mapping system. Our method utilizes a real-time differentiable splatting rendering pipeline that offers significant speedup to map optimization and RGB-D rendering. Our method achieves competitive performance compared with existing state-of-the-art real-time methods on the Replica, TUM-RGBD datasets.
arXiv Detail & Related papers (2023-11-20T12:08:23Z)

This list is automatically generated from the titles and abstracts of the papers on this site.