PEVA-Net: Prompt-Enhanced View Aggregation Network for Zero/Few-Shot Multi-View 3D Shape Recognition
- URL: http://arxiv.org/abs/2404.19168v1
- Date: Tue, 30 Apr 2024 00:16:59 GMT
- Title: PEVA-Net: Prompt-Enhanced View Aggregation Network for Zero/Few-Shot Multi-View 3D Shape Recognition
- Authors: Dongyun Lin, Yi Cheng, Shangbo Mao, Aiyuan Guo, Yiqun Li
- Abstract summary: We focus on exploiting the large vision-language model, i.e., CLIP, to address zero/few-shot 3D shape recognition.
We propose the Prompt-Enhanced View Aggregation Network (PEVA-Net) to address both tasks simultaneously.
- Score: 8.15444057380305
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large vision-language models have substantially improved the performance of 2D visual recognition under zero/few-shot scenarios. In this paper, we focus on exploiting a large vision-language model, i.e., CLIP, to address zero/few-shot 3D shape recognition based on multi-view representations. The key challenge for both tasks is to generate a discriminative descriptor of the 3D shape, represented by multiple view images, either without explicit training (zero-shot 3D shape recognition) or with only a limited amount of training data (few-shot 3D shape recognition). We observe that the two tasks are closely related and can be addressed jointly: leveraging the descriptor that is effective for zero-shot inference to guide the tuning of the aggregated descriptor during few-shot training can significantly improve few-shot learning efficacy. Hence, we propose the Prompt-Enhanced View Aggregation Network (PEVA-Net) to simultaneously address zero/few-shot 3D shape recognition. Under the zero-shot scenario, we leverage prompts built from the candidate categories to enhance the aggregation of the view-associated visual features; the resulting aggregated feature serves as an effective descriptor for zero-shot recognition of 3D shapes. Under the few-shot scenario, we first exploit a transformer encoder to aggregate the view-associated visual features into a global descriptor. To tune the encoder, together with the main classification loss, we propose a self-distillation scheme in which a feature distillation loss treats the zero-shot descriptor as the guidance signal for the few-shot descriptor. This scheme significantly enhances few-shot learning efficacy.
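The abstract describes two coupled mechanisms: a prompt-enhanced aggregation of CLIP view features for zero-shot recognition, and a transformer-based aggregator trained with a classification loss plus a feature-distillation loss that uses the zero-shot descriptor as guidance. The PyTorch sketch below illustrates one plausible reading of that pipeline; the function and module names, the cosine-similarity-based view weighting, the temperature, and the loss weighting are assumptions made for illustration, not the authors' released implementation.

```python
# Hypothetical sketch of the PEVA-Net pipeline as described in the abstract.
# CLIP view/text features are assumed to be precomputed (e.g., with open_clip);
# the aggregation weights, module sizes, and loss balance are illustrative guesses.
import torch
import torch.nn as nn
import torch.nn.functional as F


def zero_shot_descriptor(view_feats, text_feats):
    """Prompt-enhanced aggregation (zero-shot path).

    view_feats: (V, D) CLIP visual features of the V rendered views.
    text_feats: (C, D) CLIP text features of prompts built from candidate categories.
    Returns a single (D,) descriptor in which each view is weighted by how strongly
    it responds to the category prompts (one plausible reading of "prompt-enhanced").
    """
    view_feats = F.normalize(view_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    # Per-view relevance: maximum similarity to any category prompt.
    relevance = (view_feats @ text_feats.t()).max(dim=-1).values      # (V,)
    weights = torch.softmax(relevance / 0.07, dim=0)                  # temperature is a guess
    desc = (weights.unsqueeze(-1) * view_feats).sum(dim=0)
    return F.normalize(desc, dim=-1)


class FewShotAggregator(nn.Module):
    """Transformer encoder that fuses view features into a global descriptor."""

    def __init__(self, dim=512, num_layers=2, num_heads=8, num_classes=40):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.cls_head = nn.Linear(dim, num_classes)

    def forward(self, view_feats):                                    # (B, V, D)
        fused = self.encoder(view_feats).mean(dim=1)                  # (B, D)
        return fused, self.cls_head(fused)


def few_shot_loss(few_shot_desc, logits, labels, zero_shot_desc, distill_weight=1.0):
    """Main classification loss plus self-distillation: the zero-shot descriptor
    serves as the guidance signal for the learned few-shot descriptor."""
    cls_loss = F.cross_entropy(logits, labels)
    distill_loss = 1.0 - F.cosine_similarity(
        F.normalize(few_shot_desc, dim=-1), zero_shot_desc, dim=-1).mean()
    return cls_loss + distill_weight * distill_loss
```

Under these assumptions, zero-shot inference would classify a shape by comparing the aggregated descriptor against the normalized prompt features, while the few-shot path reuses that same descriptor as the distillation target.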
Related papers
- Cohere3D: Exploiting Temporal Coherence for Unsupervised Representation Learning of Vision-based Autonomous Driving [73.3702076688159]
We propose a novel contrastive learning algorithm, Cohere3D, to learn coherent instance representations in a long-term input sequence.
We evaluate our algorithm by finetuning the pretrained model on various downstream perception, prediction, and planning tasks.
arXiv Detail & Related papers (2024-02-23T19:43:01Z)
- Leveraging Large-Scale Pretrained Vision Foundation Models for Label-Efficient 3D Point Cloud Segmentation [67.07112533415116]
We present a novel framework that adapts various foundational models for the 3D point cloud segmentation task.
Our approach involves making initial predictions of 2D semantic masks using different large vision models.
To generate robust 3D semantic pseudo labels, we introduce a semantic label fusion strategy that effectively combines all the results via voting.
arXiv Detail & Related papers (2023-11-03T15:41:15Z)
- Renderers are Good Zero-Shot Representation Learners: Exploring Diffusion Latents for Metric Learning [1.0152838128195467]
We use retrieval as a proxy for measuring the metric learning properties of the latent spaces of Shap-E.
We find that Shap-E representations outperform those of a classical EfficientNet baseline in the zero-shot setting.
These findings give preliminary indication that 3D-based rendering and generative models can yield useful representations for discriminative tasks in our innately 3D-native world.
arXiv Detail & Related papers (2023-06-19T06:41:44Z)
- PartSLIP: Low-Shot Part Segmentation for 3D Point Clouds via Pretrained Image-Language Models [56.324516906160234]
Generalizable 3D part segmentation is important but challenging in vision and robotics.
This paper explores an alternative way for low-shot part segmentation of 3D point clouds by leveraging a pretrained image-language model, GLIP.
We transfer the rich knowledge from 2D to 3D through GLIP-based part detection on point cloud rendering and a novel 2D-to-3D label lifting algorithm.
arXiv Detail & Related papers (2022-12-03T06:59:01Z)
- Vision Transformer for NeRF-Based View Synthesis from a Single Input Image [49.956005709863355]
We propose to leverage both the global and local features to form an expressive 3D representation.
To synthesize a novel view, we train a multilayer perceptron (MLP) network conditioned on the learned 3D representation to perform volume rendering.
Our method can render novel views from only a single input image and generalize across multiple object categories using a single model.
arXiv Detail & Related papers (2022-07-12T17:52:04Z)
- Voint Cloud: Multi-View Point Cloud Representation for 3D Understanding [80.04281842702294]
We introduce the concept of the multi-view point cloud (Voint cloud) representing each 3D point as a set of features extracted from several view-points.
This novel 3D Voint cloud representation combines the compactness of 3D point cloud representation with the natural view-awareness of multi-view representation.
We deploy a Voint neural network (VointNet) with a theoretically established functional form to learn representations in the Voint space.
arXiv Detail & Related papers (2021-11-30T13:08:19Z)
- Vis2Mesh: Efficient Mesh Reconstruction from Unstructured Point Clouds of Large Scenes with Learned Virtual View Visibility [17.929307870456416]
We present a novel framework for mesh reconstruction from unstructured point clouds.
We take advantage of the learned visibility of the 3D points in the virtual views and traditional graph-cut based mesh generation.
arXiv Detail & Related papers (2021-08-18T20:28:16Z)
- End-to-End Learning Local Multi-view Descriptors for 3D Point Clouds [67.27510166559563]
We propose an end-to-end framework to learn local multi-view descriptors for 3D point clouds.
Our method outperforms existing local descriptors both quantitatively and qualitatively.
arXiv Detail & Related papers (2020-03-12T15:41:34Z)
- MANet: Multimodal Attention Network based Point-View fusion for 3D Shape Recognition [0.5371337604556311]
This paper proposes a fusion network based on a multimodal attention mechanism for 3D shape recognition.
Considering the limitations of multi-view data, we introduce a soft attention scheme, which can use the global point-cloud features to filter the multi-view features.
More specifically, we obtain the enhanced multi-view features by mining the contribution of each multi-view image to the overall shape recognition.
arXiv Detail & Related papers (2020-02-28T07:00:14Z)
- Learning and Matching Multi-View Descriptors for Registration of Point Clouds [48.25586496457587]
We first propose a multi-view local descriptor, which is learned from the images of multiple views, for the description of 3D keypoints.
Then, we develop a robust matching approach that aims to reject outlier matches based on efficient inference.
We have demonstrated the boost of our approaches to registration on the public scanning and multi-view stereo datasets.
arXiv Detail & Related papers (2018-07-16T01:58:27Z)
This list is automatically generated from the titles and abstracts of the papers on this site.