Self-Supervised and Generalizable Tokenization for CLIP-Based 3D Understanding
- URL: http://arxiv.org/abs/2505.18819v1
- Date: Sat, 24 May 2025 18:26:30 GMT
- Title: Self-Supervised and Generalizable Tokenization for CLIP-Based 3D Understanding
- Authors: Guofeng Mei, Bin Ren, Juan Liu, Luigi Riz, Xiaoshui Huang, Xu Zheng, Yongshun Gong, Ming-Hsuan Yang, Nicu Sebe, Fabio Poiesi
- Abstract summary: We present a universal 3D tokenizer designed for scale-invariant representation learning with a frozen CLIP backbone.
S4Token is a tokenization pipeline that produces semantically-informed tokens regardless of scene scale.
- Score: 87.68271178167373
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-language models like CLIP offer a promising foundation for 3D scene understanding when extended with 3D tokenizers. However, standard approaches, such as k-nearest-neighbor or radius-based tokenization, struggle with cross-domain generalization because they are sensitive to dataset-specific spatial scales. We present a universal 3D tokenizer designed for scale-invariant representation learning with a frozen CLIP backbone. Through extensive experimental analysis, we show that combining superpoint-based grouping with coordinate scale normalization consistently outperforms conventional methods. Specifically, we introduce S4Token, a tokenization pipeline that produces semantically-informed tokens regardless of scene scale. Our tokenizer is trained without annotations using masked point modeling and clustering-based objectives, along with cross-modal distillation to align 3D tokens with 2D multi-view image features. For dense prediction tasks, we propose a superpoint-level feature propagation module to recover point-level detail from sparse tokens.
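To make the two scale-handling ingredients concrete, here is a minimal sketch of coordinate scale normalization followed by superpoint mean-pooling. All names, shapes, and the pooling choice are illustrative assumptions; the actual S4Token pipeline additionally involves masked point modeling, clustering objectives, and cross-modal distillation.

```python
# Hedged sketch: scale-invariant superpoint tokenization (hypothetical names).
import numpy as np

def normalize_coords(xyz: np.ndarray) -> np.ndarray:
    """Center the cloud and rescale it to a unit sphere, removing
    dataset-specific spatial scale before tokenization."""
    centered = xyz - xyz.mean(axis=0, keepdims=True)
    radius = np.linalg.norm(centered, axis=1).max()
    return centered / max(radius, 1e-8)

def superpoint_tokens(feats: np.ndarray, superpoint_ids: np.ndarray) -> np.ndarray:
    """Average point features within each precomputed superpoint to form one
    token per superpoint (geometric grouping, not kNN/radius-based)."""
    tokens = [feats[superpoint_ids == sp].mean(axis=0)
              for sp in np.unique(superpoint_ids)]
    return np.stack(tokens)  # (num_superpoints, feat_dim)

# Toy usage: 1000 points, 32-dim features, 50 superpoints.
xyz = np.random.rand(1000, 3) * 40.0        # e.g., an outdoor-scale scene
feats = np.random.rand(1000, 32).astype(np.float32)
sp_ids = np.random.randint(0, 50, size=1000)
print(normalize_coords(xyz).shape, superpoint_tokens(feats, sp_ids).shape)
```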
Related papers
- OpenSplat3D: Open-Vocabulary 3D Instance Segmentation using Gaussian Splatting [52.40697058096931]
3D Gaussian Splatting (3DGS) has emerged as a powerful representation for neural scene reconstruction.
We introduce OpenSplat3D, an approach for open-vocabulary 3D instance segmentation that requires no manual labeling.
We show results on LERF-mask and LERF-OVS, as well as the full ScanNet++ validation set, demonstrating the effectiveness of our approach.
arXiv Detail & Related papers (2025-06-09T12:37:15Z)
- Pts3D-LLM: Studying the Impact of Token Structure for 3D Scene Understanding With Large Language Models [9.658828841170472]
This work presents a rigorous study of 3D token structures, systematically comparing video-based and point-based representations.
We propose a novel approach that enriches visual tokens by incorporating 3D point cloud features from a Sonata-pretrained Point Transformer V3 encoder.
arXiv Detail & Related papers (2025-06-06T02:35:26Z)
- OVGaussian: Generalizable 3D Gaussian Segmentation with Open Vocabularies [112.80292725951921]
OVGaussian is a generalizable Open-Vocabulary 3D semantic segmentation framework based on the 3D Gaussian representation.
We first construct a large-scale 3D scene dataset based on 3DGS, dubbed SegGaussian, which provides detailed semantic and instance annotations for both Gaussian points and multi-view images.
To promote semantic generalization across scenes, we introduce Generalizable Semantic Rasterization (GSR), which leverages a...
arXiv Detail & Related papers (2024-12-31T07:55:35Z)
- OpenSU3D: Open World 3D Scene Understanding using Foundation Models [2.1262749936758216]
We present a novel, scalable approach for constructing open-set, instance-level 3D scene representations.
Existing methods require pre-constructed 3D scenes and face scalability issues due to per-point feature vector learning.
We evaluate our approach on multiple scenes from the ScanNet and Replica datasets, demonstrating zero-shot generalization capabilities.
arXiv Detail & Related papers (2024-07-19T13:01:12Z)
- AutoInst: Automatic Instance-Based Segmentation of LiDAR 3D Scans [41.17467024268349]
Making sense of 3D environments requires fine-grained scene understanding.
We propose to predict instance segmentations for 3D scenes in an unsupervised way.
Our approach attains 13.3% higher Average Precision and 9.1% higher F1 score compared to the best-performing baseline.
arXiv Detail & Related papers (2024-03-24T22:53:16Z)
- PointSeg: A Training-Free Paradigm for 3D Scene Segmentation via Foundation Models [51.24979014650188]
We present PointSeg, a training-free paradigm that leverages off-the-shelf vision foundation models to address 3D scene perception tasks.
PointSeg can segment anything in a 3D scene by acquiring accurate 3D prompts and aligning their corresponding pixels across frames (the underlying projection step is sketched after this entry).
Our approach significantly surpasses the state-of-the-art training-free specialist model by 14.1%, 12.3%, and 12.6% mAP on the ScanNet, ScanNet++, and KITTI-360 datasets.
arXiv Detail & Related papers (2024-03-11T03:28:20Z)
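As referenced above, PointSeg's frame alignment rests on projecting 3D prompts into each camera view. Below is a minimal pinhole-projection sketch under assumed intrinsics/extrinsics; the function and variable names are hypothetical and not from the paper.

```python
# Hedged sketch: project a world-frame 3D prompt point into a 2D frame so
# the corresponding pixel can seed a 2D foundation-model segmenter.
import numpy as np

def project_to_pixel(point_w, K, T_cw):
    """point_w: (3,) world coords; K: (3, 3) intrinsics; T_cw: (4, 4)
    world-to-camera extrinsics. Returns (u, v) or None if behind camera."""
    p_c = (T_cw @ np.append(point_w, 1.0))[:3]  # world -> camera frame
    if p_c[2] <= 0:                             # behind the image plane
        return None
    uvw = K @ p_c
    return uvw[:2] / uvw[2]                     # perspective divide

K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
T_cw = np.eye(4)                                # camera at world origin
print(project_to_pixel(np.array([0.2, -0.1, 2.0]), K, T_cw))  # ~[370, 215]
```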
- Transferring CLIP's Knowledge into Zero-Shot Point Cloud Semantic Segmentation [17.914290294935427]
Traditional 3D segmentation methods can only recognize a fixed range of classes that appear in the training set.
Large-scale visual-language pre-trained models, such as CLIP, have shown strong generalization in zero-shot 2D vision tasks.
We propose a simple yet effective baseline that transfers the visual-linguistic knowledge implied in CLIP to a point cloud encoder (a minimal version of this matching is sketched after this entry).
arXiv Detail & Related papers (2023-12-12T12:35:59Z)
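The matching referenced above can be sketched as follows: point features distilled into CLIP's embedding space are labeled by cosine similarity against CLIP text embeddings of class prompts. The random arrays below are stand-ins for real encoder outputs; this is an assumption-laden illustration, not the paper's exact method.

```python
# Hedged sketch: zero-shot point labeling via cosine similarity to CLIP-style
# text embeddings (random stand-ins; a real pipeline uses CLIP's encoders).
import numpy as np

def zero_shot_labels(point_feats: np.ndarray, text_embeds: np.ndarray) -> np.ndarray:
    """Assign each point the class whose text embedding is most similar."""
    p = point_feats / np.linalg.norm(point_feats, axis=1, keepdims=True)
    t = text_embeds / np.linalg.norm(text_embeds, axis=1, keepdims=True)
    return (p @ t.T).argmax(axis=1)  # (num_points,) class indices

points = np.random.randn(2048, 512).astype(np.float32)  # CLIP-aligned feats
texts = np.random.randn(20, 512).astype(np.float32)     # 20 class prompts
print(zero_shot_labels(points, texts).shape)             # (2048,)
```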
- Generalized Robot 3D Vision-Language Model with Fast Rendering and Pre-Training Vision-Language Alignment [55.11291053011696]
This work presents a framework for 3D scene understanding when labeled scenes are quite limited.
To extract knowledge for novel categories from pre-trained vision-language models, we propose a hierarchical feature-aligned pre-training and knowledge distillation strategy.
In the limited reconstruction case, our proposed approach, termed WS3D++, ranks 1st on the large-scale ScanNet benchmark.
arXiv Detail & Related papers (2023-12-01T15:47:04Z)
- U3DS$^3$: Unsupervised 3D Semantic Scene Segmentation [19.706172244951116]
This paper presents U3DS$^3$ as a step towards completely unsupervised point cloud segmentation for holistic 3D scenes.
Our approach first generates superpoints based on the geometric characteristics of each scene.
Learning then proceeds through spatial clustering, followed by iterative training with pseudo-labels derived from the cluster centroids (a minimal version of this loop is sketched after this entry).
arXiv Detail & Related papers (2023-11-10T12:05:35Z)
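The clustering-with-pseudo-labels loop referenced in the U3DS$^3$ entry can be illustrated with plain k-means over per-point features, whose cluster assignments serve as pseudo-labels for the next training round. This is a hedged sketch; the paper's actual objectives and training schedule differ.

```python
# Hedged sketch: k-means cluster ids as pseudo-labels for iterative training.
import numpy as np

def kmeans_pseudo_labels(feats, k=10, iters=10, seed=0):
    """Plain k-means; returns per-point cluster ids usable as pseudo-labels."""
    rng = np.random.default_rng(seed)
    centroids = feats[rng.choice(len(feats), size=k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        dists = np.linalg.norm(feats[:, None] - centroids[None], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute centroids; empty clusters keep their old centroid.
        for c in range(k):
            if np.any(labels == c):
                centroids[c] = feats[labels == c].mean(axis=0)
    return labels

feats = np.random.rand(500, 16).astype(np.float32)
print(np.bincount(kmeans_pseudo_labels(feats), minlength=10))
```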
- CLIP2Scene: Towards Label-efficient 3D Scene Understanding by CLIP [55.864132158596206]
Contrastive Language-Image Pre-training (CLIP) achieves promising results in 2D zero-shot and few-shot learning.
We make the first attempt to investigate how CLIP knowledge benefits 3D scene understanding.
We propose CLIP2Scene, a framework that transfers CLIP knowledge from 2D image-text pre-trained models to a 3D point cloud network.
arXiv Detail & Related papers (2023-01-12T10:42:39Z)
- ConDor: Self-Supervised Canonicalization of 3D Pose for Partial Shapes [55.689763519293464]
ConDor is a self-supervised method that learns to canonicalize the 3D orientation and position for full and partial 3D point clouds.
During inference, our method takes an unseen full or partial 3D point cloud at an arbitrary pose and outputs an equivariant canonical pose.
arXiv Detail & Related papers (2022-01-19T18:57:21Z)
- DH3D: Deep Hierarchical 3D Descriptors for Robust Large-Scale 6DoF Relocalization [56.15308829924527]
We propose a Siamese network that jointly learns 3D local feature detection and description directly from raw 3D points.
For detecting 3D keypoints, we predict the discriminativeness of the local descriptors in an unsupervised manner.
Experiments on various benchmarks demonstrate that our method achieves competitive results for both global point cloud retrieval and local point cloud registration.
arXiv Detail & Related papers (2020-07-17T20:21:22Z)