MV-CLIP: Multi-View CLIP for Zero-shot 3D Shape Recognition
- URL: http://arxiv.org/abs/2311.18402v3
- Date: Wed, 11 Sep 2024 12:21:04 GMT
- Title: MV-CLIP: Multi-View CLIP for Zero-shot 3D Shape Recognition
- Authors: Dan Song, Xinwei Fu, Ning Liu, Weizhi Nie, Wenhui Li, Lanjun Wang, You Yang, Anan Liu
- Abstract summary: Large-scale pre-trained models have demonstrated impressive performance in vision and language tasks within open-world scenarios.
Recent methods utilize language-image pre-training to realize zero-shot 3D shape recognition.
This paper aims to improve the confidence with view selection and hierarchical prompts.
- Score: 49.52436478739151
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large-scale pre-trained models have demonstrated impressive performance in vision and language tasks within open-world scenarios. Due to the lack of comparable pre-trained models for 3D shapes, recent methods utilize language-image pre-training to realize zero-shot 3D shape recognition. However, due to the modality gap, pretrained language-image models are not confident enough in the generalization to 3D shape recognition. Consequently, this paper aims to improve the confidence with view selection and hierarchical prompts. Leveraging the CLIP model as an example, we employ view selection on the vision side by identifying views with high prediction confidence from multiple rendered views of a 3D shape. On the textual side, the strategy of hierarchical prompts is proposed for the first time. The first layer prompts several classification candidates with traditional class-level descriptions, while the second layer refines the prediction based on function-level descriptions or further distinctions between the candidates. Remarkably, without the need for additional training, our proposed method achieves impressive zero-shot 3D classification accuracies of 84.44%, 91.51%, and 66.17% on ModelNet40, ModelNet10, and ShapeNet Core55, respectively. Furthermore, we will make the code publicly available to facilitate reproducibility and further research in this area.
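The pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' released code: it assumes the rendered views are already preprocessed with CLIP's transform, and the prompt templates, the `function_descriptions` dict, and the numbers of retained views and candidates are illustrative placeholders rather than the paper's exact choices.

```python
# Rough sketch of multi-view zero-shot classification with view selection and
# two-layer prompts, using the open-source `clip` package (openai/CLIP).
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

@torch.no_grad()
def zero_shot_classify(view_images, class_names, function_descriptions,
                       top_views=4, top_candidates=5):
    """view_images: [V, 3, 224, 224] tensor of preprocessed rendered views."""
    # ---- layer 1: class-level prompts scored against every view ----
    prompts = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)
    text_feat = model.encode_text(prompts)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)

    img_feat = model.encode_image(view_images.to(device))
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    probs = (100.0 * img_feat @ text_feat.T).softmax(dim=-1)        # [V, C]

    # View selection: keep the views whose top-1 probability is highest.
    confidence, _ = probs.max(dim=-1)                               # [V]
    keep = confidence.topk(min(top_views, len(confidence))).indices
    class_probs = probs[keep].mean(dim=0)                           # [C]

    # Candidate classes handed to the second prompt layer.
    cand = class_probs.topk(min(top_candidates, len(class_names))).indices.tolist()

    # ---- layer 2: function-level descriptions refine the candidates ----
    refine_prompts = clip.tokenize(
        [f"a photo of a {class_names[i]}, {function_descriptions[class_names[i]]}"
         for i in cand]
    ).to(device)
    refine_feat = model.encode_text(refine_prompts)
    refine_feat = refine_feat / refine_feat.norm(dim=-1, keepdim=True)
    refine_probs = (100.0 * img_feat[keep] @ refine_feat.T).softmax(dim=-1).mean(dim=0)

    return class_names[cand[refine_probs.argmax().item()]]
```

In this sketch, view selection simply keeps the views with the highest top-1 softmax probability, and the second prompt layer re-scores only the shortlisted candidates, mirroring the coarse-to-fine strategy described in the abstract; no parameters are updated at any point.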
Related papers
- Cross-Modal Self-Training: Aligning Images and Pointclouds to Learn Classification without Labels [69.55622471172941]
Large-scale 2D vision-language models such as CLIP can be aligned with a 3D encoder to learn generalizable (open-vocabulary) 3D vision models.
We propose Cross-MoST (Cross-Modal Self-Training), an optimization framework that improves the label-free classification performance of a zero-shot 3D vision model.
arXiv Detail & Related papers (2024-04-15T21:30:50Z)
- CLIPose: Category-Level Object Pose Estimation with Pre-trained Vision-Language Knowledge [18.57081150228812]
We propose a novel 6D pose framework that employs the pre-trained vision-language model to develop better learning of object category information.
CLIPose achieves state-of-the-art performance on two mainstream benchmark datasets, REAL275 and CAMERA25, and runs in real time during inference (40 FPS).
arXiv Detail & Related papers (2024-02-24T05:31:53Z)
- Leveraging Large-Scale Pretrained Vision Foundation Models for Label-Efficient 3D Point Cloud Segmentation [67.07112533415116]
We present a novel framework that adapts various foundational models for the 3D point cloud segmentation task.
Our approach involves making initial predictions of 2D semantic masks using different large vision models.
To generate robust 3D semantic pseudo labels, we introduce a semantic label fusion strategy that effectively combines all the results via voting.
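The voting-based label fusion mentioned in this summary can be illustrated generically. The snippet below is not the paper's implementation: it assumes each 2D foundation model has already assigned a label to every 3D point via point-to-pixel projection, and the function name and tie-breaking rule are illustrative choices.

```python
# Generic majority-vote fusion of per-point labels from several 2D models;
# ties are broken toward the lowest class index by argmax.
import numpy as np

def fuse_pseudo_labels(per_model_labels, num_classes, ignore_index=-1):
    """per_model_labels: list of [N] int arrays, one per 2D foundation model."""
    votes = np.zeros((per_model_labels[0].shape[0], num_classes), dtype=np.int32)
    for labels in per_model_labels:
        valid = labels != ignore_index                 # skip points a model left unlabeled
        votes[np.arange(len(labels))[valid], labels[valid]] += 1

    fused = votes.argmax(axis=1)                       # majority class per point
    fused[votes.sum(axis=1) == 0] = ignore_index       # no votes at all: keep unlabeled
    return fused

# Example: three models disagree on a few points of a 5-point cloud.
a = np.array([0, 1, 2, 2, -1])
b = np.array([0, 1, 1, 2, 3])
c = np.array([0, 2, 1, 2, 3])
print(fuse_pseudo_labels([a, b, c], num_classes=4))    # -> [0 1 1 2 3]
```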
arXiv Detail & Related papers (2023-11-03T15:41:15Z)
- Beyond First Impressions: Integrating Joint Multi-modal Cues for Comprehensive 3D Representation [72.94143731623117]
Existing methods simply align 3D representations with single-view 2D images and coarse-grained parent-category text.
This insufficient synergy neglects the idea that a robust 3D representation should align with the joint vision-language space.
We propose a multi-view joint modality modeling approach, termed JM3D, to obtain a unified representation for point cloud, text, and image.
arXiv Detail & Related papers (2023-08-06T01:11:40Z)
- OpenShape: Scaling Up 3D Shape Representation Towards Open-World Understanding [53.21204584976076]
We introduce OpenShape, a method for learning multi-modal joint representations of text, image, and point clouds.
We scale up training data by ensembling multiple 3D datasets and propose several strategies to automatically filter and enrich noisy text descriptions.
We evaluate OpenShape on zero-shot 3D classification benchmarks and demonstrate its superior capabilities for open-world recognition.
arXiv Detail & Related papers (2023-05-18T07:07:19Z)
- CLIP-FO3D: Learning Free Open-world 3D Scene Representations from 2D Dense CLIP [19.66617835750012]
Training a 3D scene understanding model requires complicated human annotations.
Vision-language pre-training models (e.g., CLIP) have shown remarkable open-world reasoning properties.
We propose directly transferring CLIP's feature space to a 3D scene understanding model without any form of supervision.
arXiv Detail & Related papers (2023-03-08T17:30:58Z)
- ULIP: Learning a Unified Representation of Language, Images, and Point Clouds for 3D Understanding [110.07170245531464]
Current 3D models are limited by datasets with a small amount of annotated data and a pre-defined set of categories.
Recent advances have shown that similar problems can be significantly alleviated by employing knowledge from other modalities, such as language.
We learn a unified representation of images, texts, and 3D point clouds by pre-training with object triplets from the three modalities.
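The triplet pre-training summarized above amounts to contrastively aligning a trainable 3D encoder's features with frozen image and text features. The sketch below is a generic illustration with an assumed standard InfoNCE objective and equal loss weights, not ULIP's exact architecture or hyper-parameters.

```python
# Generic tri-modal alignment loss: pull each object's 3D feature toward its
# (frozen) image and text features from the same triplet.
import torch
import torch.nn.functional as F

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE between two batches of L2-normalized features [B, D]."""
    logits = a @ b.t() / temperature
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def triplet_alignment_loss(point_feat, image_feat, text_feat):
    """Only the 3D point-cloud encoder is trained; image/text features stay frozen."""
    p = F.normalize(point_feat, dim=-1)
    i = F.normalize(image_feat, dim=-1)
    t = F.normalize(text_feat, dim=-1)
    return info_nce(p, i) + info_nce(p, t)   # equal weights assumed for illustration
```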
arXiv Detail & Related papers (2022-12-10T01:34:47Z)
- Planes vs. Chairs: Category-guided 3D shape learning without any 3D cues [42.59825584255742]
We present a novel 3D shape reconstruction method which learns to predict an implicit 3D shape representation from a single RGB image.
Our approach uses a set of single-view images of multiple object categories without viewpoint annotation.
We are the first to examine and quantify the benefit of class information in single-view supervised 3D shape reconstruction.
arXiv Detail & Related papers (2022-04-21T16:13:31Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.