CLIP-GS: Unifying Vision-Language Representation with 3D Gaussian Splatting
- URL: http://arxiv.org/abs/2412.19142v1
- Date: Thu, 26 Dec 2024 09:54:25 GMT
- Title: CLIP-GS: Unifying Vision-Language Representation with 3D Gaussian Splatting
- Authors: Siyu Jiao, Haoye Dong, Yuyang Yin, Zequn Jie, Yinlong Qian, Yao Zhao, Humphrey Shi, Yunchao Wei
- Abstract summary: We present CLIP-GS, a novel multimodal representation learning framework grounded in 3DGS.
We develop an efficient way to generate triplets of 3DGS, images, and text, facilitating CLIP-GS in learning unified multimodal representations.
- Score: 88.24743308058441
- License:
- Abstract: Recent works in 3D multimodal learning have made remarkable progress. However, 3D multimodal models are typically only capable of handling point clouds. Compared to the emerging 3D representation technique of 3D Gaussian Splatting (3DGS), spatially sparse point clouds cannot capture the texture information of 3D objects, resulting in inferior reconstruction capability. This limitation constrains the potential of point cloud-based 3D multimodal representation learning. In this paper, we present CLIP-GS, a novel multimodal representation learning framework grounded in 3DGS. We introduce the GS Tokenizer to generate serialized Gaussian tokens, which are then processed through transformer layers pre-initialized with weights from point cloud models, yielding the 3DGS embeddings. CLIP-GS leverages a contrastive loss between the 3DGS embeddings and the visual-text embeddings of CLIP, and we introduce an image voting loss to guide the directionality and convergence of gradient optimization. Furthermore, we develop an efficient way to generate triplets of 3DGS, images, and text, facilitating CLIP-GS in learning unified multimodal representations. Leveraging the well-aligned multimodal representations, CLIP-GS demonstrates versatility and outperforms point cloud-based models on various 3D tasks, including multimodal retrieval, zero-shot classification, and few-shot classification.
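The abstract names the training components (GS Tokenizer, transformer layers initialized from point cloud models, a 3DGS-to-CLIP contrastive loss, and an image voting loss) but not their exact formulation. As a rough, non-authoritative sketch of how such an objective could be wired together, the PyTorch snippet below combines a symmetric InfoNCE loss against frozen CLIP image/text embeddings with one plausible reading of the image voting loss, where rendered views "vote" via their text similarity. The function names and the voting formulation are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE between two batches of embeddings (rows are paired)."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def clip_gs_losses(gs_embed, img_embeds, txt_embed, temperature=0.07):
    """Hypothetical CLIP-GS objective (sketch, not the authors' code).

    gs_embed:   (B, D)    3DGS embeddings from the GS Tokenizer + transformer
    img_embeds: (B, V, D) frozen CLIP image embeddings of V rendered views per object
    txt_embed:  (B, D)    frozen CLIP text embeddings
    """
    # Contrastive alignment of 3DGS embeddings with the CLIP image and text spaces.
    loss_3d_img = info_nce(gs_embed, img_embeds.mean(dim=1), temperature)
    loss_3d_txt = info_nce(gs_embed, txt_embed, temperature)

    # Assumed form of the image voting loss: each rendered view "votes" with its
    # similarity to the text embedding, and the 3DGS embedding is pulled toward
    # the vote-weighted view embedding.
    sims = torch.einsum('bvd,bd->bv',
                        F.normalize(img_embeds, dim=-1),
                        F.normalize(txt_embed, dim=-1))
    votes = sims.softmax(dim=-1)                               # (B, V)
    voted_img = torch.einsum('bv,bvd->bd', votes, img_embeds)  # (B, D)
    loss_vote = 1.0 - F.cosine_similarity(gs_embed, voted_img, dim=-1).mean()

    return loss_3d_img + loss_3d_txt + loss_vote
```

In this sketch the CLIP encoders stay frozen and only the GS Tokenizer and the transformer layers producing gs_embed would be trained, which mirrors the abstract's description of aligning 3DGS embeddings to CLIP's visual-text space.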
Related papers
- Hyperbolic Contrastive Learning for Hierarchical 3D Point Cloud Embedding [21.50985015159827]
We extend the 3D Point Cloud modality in hyperbolic multi-modal contrastive pre-training.
We also explore the entailment, modality gap, and alignment regularizers for learning hierarchical 3D embeddings.
arXiv Detail & Related papers (2025-01-04T13:27:18Z) - GS-PT: Exploiting 3D Gaussian Splatting for Comprehensive Point Cloud Understanding via Self-supervised Learning [15.559369116540097]
Self-supervised learning of point clouds aims to leverage unlabeled 3D data to learn meaningful representations without relying on manual annotations.
We propose GS-PT, which integrates 3D Gaussian Splatting (3DGS) into point cloud self-supervised learning for the first time.
Our pipeline utilizes transformers as the backbone for self-supervised pre-training and introduces novel contrastive learning tasks through 3DGS.
arXiv Detail & Related papers (2024-09-08T03:46:47Z) - WE-GS: An In-the-wild Efficient 3D Gaussian Representation for Unconstrained Photo Collections [8.261637198675151]
Novel View Synthesis (NVS) from unconstrained photo collections is challenging in computer graphics.
We propose an efficient point-based differentiable rendering framework for scene reconstruction from photo collections.
Our approach outperforms existing methods in the rendering quality of novel view and appearance synthesis, with fast convergence and rendering speed.
arXiv Detail & Related papers (2024-06-04T15:17:37Z) - LP-3DGS: Learning to Prune 3D Gaussian Splatting [71.97762528812187]
We propose learning-to-prune 3DGS, where a trainable binary mask applied to the importance scores automatically finds the optimal pruning ratio.
Experiments show that LP-3DGS consistently achieves a good balance between efficiency and rendering quality.
arXiv Detail & Related papers (2024-05-29T05:58:34Z) - SAGS: Structure-Aware 3D Gaussian Splatting [53.6730827668389]
We propose a structure-aware Gaussian Splatting method (SAGS) that implicitly encodes the geometry of the scene.
SAGS achieves state-of-the-art rendering performance and reduced storage requirements on benchmark novel-view synthesis datasets.
arXiv Detail & Related papers (2024-04-29T23:26:30Z) - GS-CLIP: Gaussian Splatting for Contrastive Language-Image-3D
Pretraining from Real-World Data [73.06536202251915]
3D shapes represented as point clouds have driven advances in multimodal pre-training that aligns images and language descriptions.
We propose GS-CLIP, the first attempt to introduce 3DGS into multimodal pre-training to enhance 3D representation.
arXiv Detail & Related papers (2024-02-09T05:46:47Z) - ULIP: Learning a Unified Representation of Language, Images, and Point
Clouds for 3D Understanding [110.07170245531464]
Current 3D models are limited by datasets with a small amount of annotated data and a pre-defined set of categories.
Recent advances have shown that similar problems can be significantly alleviated by employing knowledge from other modalities, such as language.
We learn a unified representation of images, texts, and 3D point clouds by pre-training with object triplets from the three modalities.
arXiv Detail & Related papers (2022-12-10T01:34:47Z) - PartSLIP: Low-Shot Part Segmentation for 3D Point Clouds via Pretrained
Image-Language Models [56.324516906160234]
Generalizable 3D part segmentation is important but challenging in vision and robotics.
This paper explores an alternative way for low-shot part segmentation of 3D point clouds by leveraging a pretrained image-language model, GLIP.
We transfer the rich knowledge from 2D to 3D through GLIP-based part detection on point cloud renderings and a novel 2D-to-3D label lifting algorithm.
arXiv Detail & Related papers (2022-12-03T06:59:01Z)