Generalized Robot 3D Vision-Language Model with Fast Rendering and Pre-Training Vision-Language Alignment
- URL: http://arxiv.org/abs/2312.00663v2
- Date: Wed, 19 Feb 2025 09:52:00 GMT
- Title: Generalized Robot 3D Vision-Language Model with Fast Rendering and Pre-Training Vision-Language Alignment
- Authors: Kangcheng Liu, Yong-Jin Liu, Baoquan Chen
- Abstract summary: This work presents a framework for dealing with 3D scene understanding when the labeled scenes are quite limited.
To extract knowledge for novel categories from the pre-trained vision-language models, we propose a hierarchical feature-aligned pre-training and knowledge distillation strategy.
In the limited reconstruction case, our proposed approach, termed WS3D++, ranks 1st on the large-scale ScanNet benchmark.
- Score: 55.11291053011696
- Abstract: Deep neural network models have achieved remarkable progress in 3D scene understanding when trained in the closed-set setting with full labels. The major bottleneck, however, is that these models cannot recognize unseen novel classes beyond the training categories in diverse real-world applications. A framework is therefore urgently needed that is simultaneously applicable to both 3D point cloud segmentation and detection, particularly in circumstances where labels are scarce. This work presents a generalized and straightforward framework for 3D scene understanding when labeled scenes are quite limited. To extract knowledge for novel categories from pre-trained vision-language models, we propose a hierarchical feature-aligned pre-training and knowledge distillation strategy that distills meaningful information from large-scale vision-language models, benefiting open-vocabulary scene understanding tasks. To encourage latent instance discrimination and to guarantee efficiency, we propose an unsupervised region-level semantic contrastive learning scheme for point clouds, using confident predictions of the neural network to discriminate intermediate feature embeddings at multiple stages. In the limited reconstruction case, our proposed approach, termed WS3D++, ranks 1st on the large-scale ScanNet benchmark on both semantic segmentation and instance segmentation. Extensive experiments on both indoor and outdoor scenes demonstrate the effectiveness of our approach in both data-efficient learning and open-world few-shot learning. The code is made publicly available at: https://drive.google.com/drive/folders/1M58V-PtR8DBEwD296zJkNg_m2qq-MTAP?usp=sharing.
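To make the two training signals in the abstract concrete, below is a minimal PyTorch sketch of (i) a cosine-similarity distillation loss aligning 3D point features with back-projected vision-language features at multiple backbone stages, and (ii) a confidence-masked region-level contrastive loss driven by the network's own confident predictions. All tensor names, the temperature, and the 0.9 confidence threshold are illustrative assumptions, not the paper's released implementation.

```python
# Illustrative sketch only -- not the authors' released WS3D++ code.
# Assumed (hypothetical) inputs: per-point 3D backbone features, vision-
# language features back-projected onto the same points, and the network's
# own per-point pseudo-labels with confidence scores.
import torch
import torch.nn.functional as F

def feature_alignment_loss(point_feats, vl_feats):
    """Distillation: pull each 3D point embedding toward its paired
    vision-language embedding via cosine similarity."""
    p = F.normalize(point_feats, dim=-1)
    v = F.normalize(vl_feats, dim=-1)
    return (1.0 - (p * v).sum(dim=-1)).mean()

def hierarchical_alignment_loss(stage_feats, stage_vl_feats):
    """Reads 'hierarchical' as summing the alignment loss over several
    backbone stages, per the abstract's multi-stage wording."""
    return sum(feature_alignment_loss(f, v)
               for f, v in zip(stage_feats, stage_vl_feats))

def region_contrastive_loss(feats, pseudo_labels, confidence,
                            tau=0.07, thresh=0.9):
    """Unsupervised region-level contrast: only confidently predicted
    points take part; same pseudo-label attracts, different repels."""
    keep = confidence > thresh                    # drop uncertain points
    f = F.normalize(feats[keep], dim=-1)
    labels = pseudo_labels[keep]
    sim = f @ f.t() / tau                         # pairwise similarity logits
    eye = torch.eye(f.size(0), dtype=torch.bool, device=f.device)
    sim = sim.masked_fill(eye, -1e9)              # exclude self-pairs
    pos = (labels[:, None] == labels[None, :]).float().masked_fill(eye, 0.0)
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    denom = pos.sum(dim=1).clamp(min=1.0)         # avoid divide-by-zero
    return -((pos * log_prob).sum(dim=1) / denom).mean()
```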
Related papers
- ZeroKey: Point-Level Reasoning and Zero-Shot 3D Keypoint Detection from Large Language Models [57.57832348655715]
We propose a novel zero-shot approach for keypoint detection on 3D shapes.
Our method utilizes the rich knowledge embedded within Multi-Modal Large Language Models.
arXiv Detail & Related papers (2024-12-09T08:31:57Z)
- CLIPose: Category-Level Object Pose Estimation with Pre-trained Vision-Language Knowledge [18.57081150228812]
We propose a novel 6D pose framework that employs a pre-trained vision-language model to better learn object category information.
CLIPose achieves state-of-the-art performance on two mainstream benchmark datasets, REAL275 and CAMERA25, and runs in real time during inference (40 FPS).
arXiv Detail & Related papers (2024-02-24T05:31:53Z)
- A Review and A Robust Framework of Data-Efficient 3D Scene Parsing with Traditional/Learned 3D Descriptors [10.497309421830671]
Existing state-of-the-art 3D point cloud understanding methods perform well only in the fully supervised setting.
This work presents a general and simple framework to tackle point cloud understanding when labels are limited.
arXiv Detail & Related papers (2023-12-03T02:51:54Z)
- Leveraging Large-Scale Pretrained Vision Foundation Models for Label-Efficient 3D Point Cloud Segmentation [67.07112533415116]
We present a novel framework that adapts various foundational models for the 3D point cloud segmentation task.
Our approach involves making initial predictions of 2D semantic masks using different large vision models.
To generate robust 3D semantic pseudo-labels, we introduce a semantic label fusion strategy that effectively combines all the results via voting (a minimal sketch follows this entry).
arXiv Detail & Related papers (2023-11-03T15:41:15Z)
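The voting-based label fusion described in the entry above might look like the following hedged sketch; the function name, the ignore_index convention, and the unique-majority tie rule are assumptions for illustration, not details from the paper.

```python
# Hypothetical sketch of the voting step, assuming each of K vision models
# has already produced a per-point semantic prediction (one int array of
# shape [N] per model) by projecting its 2D masks onto the N points.
import numpy as np

def fuse_labels_by_voting(per_model_labels, ignore_index=-1):
    """Majority vote across models; ties and points no model labeled fall
    back to ignore_index so they do not supervise the 3D network."""
    votes = np.stack(per_model_labels, axis=0)            # [K, N]
    fused = np.full(votes.shape[1], ignore_index, dtype=np.int64)
    for i in range(votes.shape[1]):
        valid = votes[:, i][votes[:, i] != ignore_index]  # labeled votes only
        if valid.size == 0:
            continue
        counts = np.bincount(valid)
        winners = np.flatnonzero(counts == counts.max())
        if winners.size == 1:                             # unique majority only
            fused[i] = winners[0]
    return fused
```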
- Lowis3D: Language-Driven Open-World Instance-Level 3D Scene Understanding [57.47315482494805]
Open-world instance-level scene understanding aims to locate and recognize unseen object categories that are not present in the annotated dataset.
This task is challenging because the model needs to both localize novel 3D objects and infer their semantic categories.
We propose to harness pre-trained vision-language (VL) foundation models that encode extensive knowledge from image-text pairs to generate captions for 3D scenes.
arXiv Detail & Related papers (2023-08-01T07:50:14Z)
- Weakly Supervised 3D Open-vocabulary Segmentation [104.07740741126119]
We tackle the challenges in 3D open-vocabulary segmentation by exploiting pre-trained foundation models CLIP and DINO in a weakly supervised manner.
We distill the open-vocabulary multimodal knowledge and object reasoning capability of CLIP and DINO into a neural radiance field (NeRF).
A notable aspect of our approach is that it does not require any manual segmentation annotations for either the foundation models or the distillation process.
arXiv Detail & Related papers (2023-05-23T14:16:49Z)
- RegionPLC: Regional Point-Language Contrastive Learning for Open-World 3D Scene Understanding [46.253711788685536]
We introduce a 3D-aware SFusion strategy that fuses 3D vision-language pairs derived from multiple 2D foundation models.
We devise a region-aware point-discriminative contrastive learning objective to enable robust and effective 3D learning.
Our model outperforms prior 3D open-world scene understanding approaches by an average of 17.2% and 9.1% on semantic and instance segmentation, respectively.
arXiv Detail & Related papers (2023-04-03T13:30:04Z)
- CLIP-FO3D: Learning Free Open-world 3D Scene Representations from 2D Dense CLIP [19.66617835750012]
Training a 3D scene understanding model requires complicated human annotations.
Vision-language pre-training models (e.g., CLIP) have shown remarkable open-world reasoning properties.
We propose directly transferring CLIP's feature space to a 3D scene understanding model without any form of supervision (a projection sketch follows this entry).
arXiv Detail & Related papers (2023-03-08T17:30:58Z)
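As a rough illustration of how dense 2D CLIP features could be transferred to 3D points without supervision, as in the entry above, here is a single-view pinhole back-projection sketch; the camera inputs and all shapes are assumptions, and the actual CLIP-FO3D pipeline involves more than this one gather step.

```python
# Hedged single-view sketch: lift a dense 2D feature map onto 3D points with
# a pinhole camera. The inputs (intrinsics K, world-to-camera T_wc, feature
# map) are assumptions for illustration, not artifacts of CLIP-FO3D itself.
import torch

def lift_features_to_points(points, feat_map, K, T_wc):
    """points: [N, 3] world xyz; feat_map: [C, H, W]; K: [3, 3]; T_wc: [4, 4].
    Returns ([N, C] per-point features, [N] boolean visibility mask)."""
    N = points.shape[0]
    ones = torch.ones(N, 1, dtype=points.dtype, device=points.device)
    cam = (T_wc @ torch.cat([points, ones], dim=1).t()).t()[:, :3]
    z = cam[:, 2].clamp(min=1e-6)                 # guard against divide-by-zero
    uvw = (K @ cam.t()).t()                       # perspective projection
    u = (uvw[:, 0] / z).round().long()            # pixel column
    v = (uvw[:, 1] / z).round().long()            # pixel row
    C, H, W = feat_map.shape
    vis = (cam[:, 2] > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    feats = torch.zeros(N, C, dtype=feat_map.dtype, device=feat_map.device)
    feats[vis] = feat_map[:, v[vis], u[vis]].t()  # gather [C, M] -> [M, C]
    return feats, vis
```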
- PointContrast: Unsupervised Pre-training for 3D Point Cloud Understanding [107.02479689909164]
In this work, we aim to facilitate research on 3D representation learning.
We measure the effect of unsupervised pre-training on a large source set of 3D scenes (see the loss sketch after this entry).
arXiv Detail & Related papers (2020-07-21T17:59:22Z)
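For reference, the core idea of PointContrast can be written as a small InfoNCE loss over matched points from two augmented views of the same scene; this is a minimal sketch with illustrative names, not the paper's released training code.

```python
# Minimal sketch of a PointContrast-style InfoNCE objective: features of the
# same physical point observed in two augmented views form the positives,
# and all other matched points in the batch act as negatives.
import torch
import torch.nn.functional as F

def point_info_nce(feats_a, feats_b, tau=0.07):
    """feats_a, feats_b: [M, C] embeddings of M matched point pairs,
    one row per pair (row i of each tensor is the same physical point)."""
    a = F.normalize(feats_a, dim=-1)
    b = F.normalize(feats_b, dim=-1)
    logits = a @ b.t() / tau                      # [M, M] similarity logits
    targets = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, targets)       # diagonal = positive pairs
```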