GS-CLIP: Gaussian Splatting for Contrastive Language-Image-3D
Pretraining from Real-World Data
- URL: http://arxiv.org/abs/2402.06198v2
- Date: Tue, 13 Feb 2024 15:33:41 GMT
- Title: GS-CLIP: Gaussian Splatting for Contrastive Language-Image-3D
Pretraining from Real-World Data
- Authors: Haoyuan Li, Yanpeng Zhou, Yihan Zeng, Hang Xu, Xiaodan Liang
- Abstract summary: 3D shapes represented as point clouds have achieved advancements in multimodal pre-training to align images and language descriptions.
We propose GS-CLIP as the first attempt to introduce 3DGS into multimodal pre-training to enhance 3D representation.
- Score: 73.06536202251915
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: 3D shapes represented as point clouds have achieved advancements in multimodal
pre-training to align image and language descriptions, which is crucial to
object identification, classification, and retrieval. However, the discrete
representation of a point cloud loses the object's surface shape information and
creates a gap between rendering results and 2D correspondences. To address this
problem, we propose GS-CLIP as the first attempt to introduce 3DGS (3D
Gaussian Splatting) into multimodal pre-training to enhance 3D representation.
GS-CLIP leverages a pre-trained vision-language model for a common visual and
textual space learned on massive real-world image-text pairs, and then learns
a 3D encoder to align the 3DGS optimized per object. Additionally, a novel
Gaussian-Aware Fusion is proposed to extract and fuse the global explicit feature.
As a general framework for language-image-3D pre-training, GS-CLIP is agnostic
to 3D backbone networks. Experiments on challenging benchmarks show that GS-CLIP
significantly improves on the state of the art, outperforming the previously best
results.
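The abstract describes the alignment pipeline only at a high level. Below is a minimal sketch of how such a language-image-3D contrastive objective is commonly set up, assuming a frozen CLIP-style image/text encoder and an unspecified point/set backbone over per-object Gaussian parameters; the GaussianAwareFusion interface, function names, and tensor shapes here are our assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch of a GS-CLIP-style alignment objective.
# Assumptions (not from the paper): a frozen CLIP image/text encoder produces
# unit-norm embeddings z_img, z_txt; a point/set backbone turns the per-object
# 3DGS parameters (position, scale, rotation, opacity, SH color) into
# per-Gaussian features of shape (B, N, feat_dim).
import torch
import torch.nn as nn
import torch.nn.functional as F


class GaussianAwareFusion(nn.Module):
    """Hypothetical fusion head: pools per-Gaussian features into one global embedding."""

    def __init__(self, feat_dim: int, embed_dim: int):
        super().__init__()
        self.proj = nn.Linear(feat_dim, embed_dim)

    def forward(self, per_gaussian_feats: torch.Tensor) -> torch.Tensor:
        pooled = per_gaussian_feats.mean(dim=1)           # global explicit feature
        return F.normalize(self.proj(pooled), dim=-1)     # unit-norm 3D embedding


def language_image_3d_loss(z_3d, z_img, z_txt, temperature: float = 0.07):
    """Symmetric InfoNCE: align each object's 3D embedding with its image and text."""
    labels = torch.arange(z_3d.size(0), device=z_3d.device)
    logits_img = z_3d @ z_img.t() / temperature
    logits_txt = z_3d @ z_txt.t() / temperature
    return 0.5 * (F.cross_entropy(logits_img, labels) +
                  F.cross_entropy(logits_txt, labels))
```

The symmetric InfoNCE term mirrors CLIP's image-text loss: the 3D embedding of each object is pulled toward both the image embedding and the caption embedding of the same object while being pushed away from the other objects in the batch.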
Related papers
- OVGaussian: Generalizable 3D Gaussian Segmentation with Open Vocabularies [112.80292725951921]
OVGaussian is a generalizable Open-Vocabulary 3D semantic segmentation framework based on the 3D Gaussian representation.
We first construct a large-scale 3D scene dataset based on 3DGS, dubbed SegGaussian, which provides detailed semantic and instance annotations for both Gaussian points and multi-view images.
To promote semantic generalization across scenes, we introduce Generalizable Semantic Rasterization (GSR).
arXiv Detail & Related papers (2024-12-31T07:55:35Z) - CLIP-GS: Unifying Vision-Language Representation with 3D Gaussian Splatting [88.24743308058441]
We present CLIP-GS, a novel multimodal representation learning framework grounded in 3DGS.
We develop an efficient way to generate triplets of 3DGS, images, and text, facilitating CLIP-GS in learning unified multimodal representations.
arXiv Detail & Related papers (2024-12-26T09:54:25Z) - GSemSplat: Generalizable Semantic 3D Gaussian Splatting from Uncalibrated Image Pairs [33.74118487769923]
We introduce GSemSplat, a framework that learns semantic representations linked to 3D Gaussians without per-scene optimization, dense image collections or calibration.
We employ a dual-feature approach that leverages both region-specific and context-aware semantic features as supervision in the 2D space.
arXiv Detail & Related papers (2024-12-22T09:06:58Z) - Sculpting Holistic 3D Representation in Contrastive Language-Image-3D Pre-training [51.632418297156605]
We introduce MixCon3D, a method aiming to sculpt holistic 3D representation in contrastive language-image-3D pre-training.
We develop the 3D object-level representation from complementary perspectives, e.g., multi-view rendered images with the point cloud.
Then, MixCon3D performs language-3D contrastive learning, comprehensively depicting real-world 3D objects and bolstering text alignment.
arXiv Detail & Related papers (2023-11-03T06:05:36Z) - CLIP$^2$: Contrastive Language-Image-Point Pretraining from Real-World
Point Cloud Data [80.42480679542697]
We propose Contrastive Language-Image-Point Cloud Pretraining (CLIP$^2$) to learn transferable 3D point cloud representations in realistic scenarios.
Specifically, we exploit the naturally existing correspondences between 2D and 3D scenarios, and build well-aligned, instance-based text-image-point proxies from those complex scenarios.
arXiv Detail & Related papers (2023-03-22T09:32:45Z) - ULIP: Learning a Unified Representation of Language, Images, and Point
Clouds for 3D Understanding [110.07170245531464]
Current 3D models are limited by datasets with only a small amount of annotated data and a pre-defined set of categories.
Recent advances have shown that similar problems can be significantly alleviated by employing knowledge from other modalities, such as language.
We learn a unified representation of images, texts, and 3D point clouds by pre-training with object triplets from the three modalities.
arXiv Detail & Related papers (2022-12-10T01:34:47Z)
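A downstream recipe shared by the works listed above, and by the identification, classification, and retrieval use cases named in the abstract, is zero-shot scoring against text prompts once the 3D encoder lives in the CLIP embedding space. A minimal usage sketch, with prompt construction and names purely illustrative:

```python
# Hypothetical zero-shot classification with an aligned 3D encoder:
# score each object's 3D embedding against text embeddings of category prompts
# (e.g., "a 3D model of a chair") and pick the most similar class.
import torch


@torch.no_grad()
def zero_shot_classify(z_3d: torch.Tensor, class_text_embeds: torch.Tensor) -> torch.Tensor:
    """z_3d: (B, D) unit-norm 3D embeddings; class_text_embeds: (K, D) unit-norm
    prompt embeddings from a frozen text encoder. Returns (B,) predicted class ids."""
    sims = z_3d @ class_text_embeds.t()   # cosine similarity (inputs already normalized)
    return sims.argmax(dim=-1)
```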