CLIPose: Category-Level Object Pose Estimation with Pre-trained
Vision-Language Knowledge
- URL: http://arxiv.org/abs/2402.15726v1
- Date: Sat, 24 Feb 2024 05:31:53 GMT
- Title: CLIPose: Category-Level Object Pose Estimation with Pre-trained
Vision-Language Knowledge
- Authors: Xiao Lin, Minghao Zhu, Ronghao Dang, Guangliang Zhou, Shaolong Shu,
Feng Lin, Chengju Liu and Qijun Chen
- Abstract summary: We propose a novel 6D pose framework that employs a pre-trained vision-language model to better learn object category information.
CLIPose achieves state-of-the-art performance on two mainstream benchmark datasets, REAL275 and CAMERA25, and runs in real time during inference (40 FPS).
- Score: 18.57081150228812
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Most existing category-level object pose estimation methods are
devoted to learning object category information from the point cloud modality.
However, the scale of 3D datasets is limited due to the high cost of 3D data
collection and annotation. Consequently, the category features extracted from
these limited point cloud samples may not be comprehensive. This motivates us
to investigate whether we can draw on knowledge from other modalities to
obtain category information. Inspired by this motivation, we propose CLIPose,
a novel 6D pose framework that employs a pre-trained vision-language model to
better learn object category information, fully leveraging the abundant
semantic knowledge in the image and text modalities. To make the 3D encoder
learn category-specific features more efficiently, we align the
representations of the three modalities in feature space via multi-modal
contrastive learning. In addition to exploiting the pre-trained knowledge of
the CLIP model, we also expect it to be more sensitive to pose parameters.
Therefore, we introduce a prompt tuning approach to fine-tune the image
encoder, and we incorporate rotation and translation information into the
text descriptions. CLIPose achieves state-of-the-art performance on two
mainstream benchmark datasets, REAL275 and CAMERA25, and runs in real time
during inference (40 FPS).
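The abstract names two concrete mechanisms: aligning image, text, and point cloud features via multi-modal contrastive learning, and embedding rotation/translation values into the text prompts. The following is a minimal, hypothetical PyTorch-style sketch of both ideas; the prompt template, encoder interfaces, temperature, and the pairwise InfoNCE formulation are assumptions made for exposition, not the authors' released implementation.

```python
# Minimal sketch (not the authors' code): pose-aware prompts and a
# tri-modal contrastive alignment loss in the spirit of the abstract.
import torch
import torch.nn.functional as F

def pose_aware_prompt(category, rotation_euler_deg, translation_m):
    """Hypothetical text template that embeds rotation/translation values."""
    rx, ry, rz = rotation_euler_deg
    tx, ty, tz = translation_m
    return (f"a photo of a {category}, rotated {rx:.0f}, {ry:.0f}, {rz:.0f} degrees, "
            f"translated to {tx:.2f}, {ty:.2f}, {tz:.2f} meters")

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE between two batches of matched embeddings."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature                    # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)  # matched pairs on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def trimodal_alignment_loss(img_feat, txt_feat, pts_feat):
    """Pull the 3D encoder's features toward CLIP's image and text features."""
    return (info_nce(pts_feat, img_feat) +
            info_nce(pts_feat, txt_feat) +
            info_nce(img_feat, txt_feat))
```

In the full framework the aligned 3D encoder presumably feeds the pose estimation head; this sketch covers only the alignment objective described in the abstract.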
Related papers
- Cross-Modal Self-Training: Aligning Images and Pointclouds to Learn Classification without Labels [69.55622471172941]
Large-scale 2D vision-language models, such as CLIP, can be aligned with a 3D encoder to learn generalizable (open-vocabulary) 3D vision models.
We propose an optimization framework, Cross-MoST (Cross-Modal Self-Training), to improve the label-free classification performance of a zero-shot 3D vision model.
arXiv Detail & Related papers (2024-04-15T21:30:50Z) - Transferring CLIP's Knowledge into Zero-Shot Point Cloud Semantic
Segmentation [17.914290294935427]
Traditional 3D segmentation methods can only recognize a fixed range of classes that appear in the training set.
Large-scale visual-language pre-trained models, such as CLIP, have shown their generalization ability in the zero-shot 2D vision tasks.
We propose a simple yet effective baseline to transfer the visual-linguistic knowledge implied in CLIP to a point cloud encoder.
arXiv Detail & Related papers (2023-12-12T12:35:59Z) - Generalized Label-Efficient 3D Scene Parsing via Hierarchical Feature
Aligned Pre-Training and Region-Aware Fine-tuning [55.517000360348725]
This work presents a framework for dealing with 3D scene understanding when the labeled scenes are quite limited.
To extract knowledge for novel categories from the pre-trained vision-language models, we propose a hierarchical feature-aligned pre-training and knowledge distillation strategy.
Experiments with both indoor and outdoor scenes demonstrated the effectiveness of our approach in both data-efficient learning and open-world few-shot learning.
arXiv Detail & Related papers (2023-12-01T15:47:04Z) - MV-CLIP: Multi-View CLIP for Zero-shot 3D Shape Recognition [49.52436478739151]
Large-scale pre-trained models have demonstrated impressive performance in vision and language tasks within open-world scenarios.
Recent methods utilize language-image pre-training to realize zero-shot 3D shape recognition.
This paper aims to improve prediction confidence via view selection and hierarchical prompts.
arXiv Detail & Related papers (2023-11-30T09:51:53Z) - Leveraging Large-Scale Pretrained Vision Foundation Models for
Label-Efficient 3D Point Cloud Segmentation [67.07112533415116]
We present a novel framework that adapts various foundational models for the 3D point cloud segmentation task.
Our approach involves making initial predictions of 2D semantic masks using different large vision models.
To generate robust 3D semantic pseudo labels, we introduce a semantic label fusion strategy that effectively combines all the results via voting.
arXiv Detail & Related papers (2023-11-03T15:41:15Z) - CLIP-FO3D: Learning Free Open-world 3D Scene Representations from 2D
Dense CLIP [19.66617835750012]
Training a 3D scene understanding model requires complicated human annotations.
Vision-language pre-training models (e.g., CLIP) have shown remarkable open-world reasoning properties.
We propose directly transferring CLIP's feature space to a 3D scene understanding model without any form of supervision.
arXiv Detail & Related papers (2023-03-08T17:30:58Z) - 3D Point Cloud Pre-training with Knowledge Distillation from 2D Images [128.40422211090078]
We propose a knowledge distillation method for 3D point cloud pre-trained models to acquire knowledge directly from the 2D representation learning model.
Specifically, we introduce a cross-attention mechanism to extract concept features from the 3D point cloud and compare them with the semantic information from 2D images.
In this scheme, the point cloud pre-trained models learn directly from the rich information contained in 2D teacher models (a rough illustrative sketch of this idea appears after this list).
arXiv Detail & Related papers (2022-12-17T23:21:04Z) - ULIP: Learning a Unified Representation of Language, Images, and Point
Clouds for 3D Understanding [110.07170245531464]
Current 3D models are limited by datasets with a small amount of annotated data and a pre-defined set of categories.
Recent advances have shown that similar problems can be significantly alleviated by employing knowledge from other modalities, such as language.
We learn a unified representation of images, texts, and 3D point clouds by pre-training with object triplets from the three modalities.
arXiv Detail & Related papers (2022-12-10T01:34:47Z)
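The knowledge-distillation entry above describes pooling concept features from the point cloud with cross-attention and comparing them against 2D image features. The sketch below is a loose, hypothetical rendering of that idea; the module layout, dimensions, and the cosine-based distillation loss are assumptions, not the cited paper's code.

```python
# Rough, hypothetical sketch of cross-attention distillation from a frozen
# 2D teacher to a 3D point cloud student (not the cited paper's implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConceptCrossAttention(nn.Module):
    """Learnable concept queries attend over point tokens to pool concept features."""
    def __init__(self, dim=512, num_concepts=16, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_concepts, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, point_tokens):                      # point_tokens: (B, N, dim)
        q = self.queries.unsqueeze(0).expand(point_tokens.size(0), -1, -1)
        concepts, _ = self.attn(q, point_tokens, point_tokens)
        return concepts                                   # (B, num_concepts, dim)

def distill_loss(point_concepts, image_feat):
    """Pull the pooled 3D concept features toward the 2D teacher's image feature."""
    student = F.normalize(point_concepts.mean(dim=1), dim=-1)  # (B, dim)
    teacher = F.normalize(image_feat, dim=-1)                  # (B, dim)
    return (1 - (student * teacher).sum(dim=-1)).mean()        # cosine distance

# Usage (hypothetical): loss = distill_loss(ConceptCrossAttention()(point_tokens),
#                                           clip_image_features)
```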