ViPFormer: Efficient Vision-and-Pointcloud Transformer for Unsupervised
Pointcloud Understanding
- URL: http://arxiv.org/abs/2303.14376v1
- Date: Sat, 25 Mar 2023 06:47:12 GMT
- Authors: Hongyu Sun, Yongcai Wang, Xudong Cai, Xuewei Bai and Deying Li
- Abstract summary: We propose a lightweight Vision-and-Pointcloud Transformer (ViPFormer) to unify image and point cloud processing in a single architecture.
ViPFormer learns in an unsupervised manner by optimizing intra-modal and cross-modal contrastive objectives.
Experiments on different datasets show that ViPFormer surpasses previous state-of-the-art unsupervised methods, achieving higher accuracy with lower model complexity and runtime latency.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Recently, a growing number of works have designed unsupervised
paradigms for point cloud processing to alleviate the limitations of
supervised methods, namely expensive manual annotation and poor
transferability. Among them, CrossPoint follows the contrastive learning
framework and exploits image and point cloud data for unsupervised point
cloud understanding. Although it achieves promising performance, its
unbalanced architecture makes it unnecessarily complex and inefficient. For
example, the image branch in CrossPoint is $\sim$8.3x heavier than the point
cloud branch, leading to higher complexity and latency. To
address this problem, in this paper, we propose a lightweight
Vision-and-Pointcloud Transformer (ViPFormer) to unify image and point cloud
processing in a single architecture. ViPFormer learns in an unsupervised manner
by optimizing intra-modal and cross-modal contrastive objectives. Then the
pretrained model is transferred to various downstream tasks, including 3D shape
classification and semantic segmentation. Experiments on different datasets
show that ViPFormer surpasses previous state-of-the-art unsupervised methods,
achieving higher accuracy with lower model complexity and runtime latency.
Finally, the
effectiveness of each component in ViPFormer is validated by extensive ablation
studies. The implementation of the proposed method is available at
https://github.com/auniquesun/ViPFormer.
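For concreteness, here is a minimal sketch of the intra-modal and cross-modal contrastive objectives the abstract describes, in the spirit of InfoNCE. The shared `encoder`, the equal loss weighting, and the temperature value are illustrative assumptions, not ViPFormer's published settings.

```python
import torch
import torch.nn.functional as F

def info_nce(z_a, z_b, temperature=0.07):
    """Symmetric InfoNCE loss between two batches of embeddings.

    z_a, z_b: (B, D) tensors; z_a[i] and z_b[i] form a positive pair,
    and all other pairings in the batch act as negatives.
    """
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(z_a.size(0), device=z_a.device)
    # Cross-entropy in both directions makes the loss symmetric.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def contrastive_step(encoder, pc_view1, pc_view2, images):
    """One unsupervised step: an intra-modal loss between two augmented
    views of the same point clouds, plus a cross-modal loss between point
    clouds and their corresponding images. `encoder` is a hypothetical
    unified model that handles both modalities, standing in for ViPFormer.
    """
    z_pc1 = encoder(pc_view1)     # (B, D) point cloud embeddings
    z_pc2 = encoder(pc_view2)
    z_img = encoder(images)       # (B, D) image embeddings
    loss_intra = info_nce(z_pc1, z_pc2)
    loss_cross = info_nce(z_pc1, z_img)
    return loss_intra + loss_cross
```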
Related papers
- ModaLink: Unifying Modalities for Efficient Image-to-PointCloud Place Recognition [16.799067323119644]
We introduce a fast and lightweight framework to encode images and point clouds into place-distinctive descriptors.
We propose an effective Field of View (FoV) transformation module to convert point clouds into an image-like modality (a generic range-image projection sketch in this spirit follows the list below).
We also design a non-negative factorization-based encoder to extract mutually consistent semantic features between point clouds and images.
arXiv Detail & Related papers (2024-03-27T17:01:10Z)
- PosDiffNet: Positional Neural Diffusion for Point Cloud Registration in a Large Field of View with Perturbations [27.45001809414096]
PosDiffNet is a model for point cloud registration in 3D computer vision.
We leverage a graph neural partial differential equation (PDE) based on Beltrami flow to obtain high-dimensional features.
We employ the multi-level correspondence derived from the high feature similarity scores to facilitate alignment between point clouds.
We evaluate PosDiffNet on several 3D point cloud datasets, verifying that it achieves state-of-the-art (SOTA) performance for point cloud registration in large fields of view with perturbations.
arXiv Detail & Related papers (2024-01-06T08:58:15Z)
- Instance-aware Dynamic Prompt Tuning for Pre-trained Point Cloud Models [64.49254199311137]
We propose a novel Instance-aware Dynamic Prompt Tuning (IDPT) strategy for pre-trained point cloud models.
The essence of IDPT is to develop a dynamic prompt generation module to perceive semantic prior features of each point cloud instance.
In experiments, IDPT outperforms full fine-tuning in most tasks with a mere 7% of the trainable parameters.
arXiv Detail & Related papers (2023-04-14T16:03:09Z)
- AdaPoinTr: Diverse Point Cloud Completion with Adaptive Geometry-Aware Transformers [94.11915008006483]
We present a new method that reformulates point cloud completion as a set-to-set translation problem.
We design a new model, called PoinTr, which adopts a Transformer encoder-decoder architecture for point cloud completion.
Our method attains 6.53 CD on PCN, 0.81 CD on ShapeNet-55 and 0.392 MMD on real-world KITTI.
arXiv Detail & Related papers (2023-01-11T16:14:12Z)
- EPCL: Frozen CLIP Transformer is An Efficient Point Cloud Encoder [60.52613206271329]
This paper introduces Efficient Point Cloud Learning (EPCL) for training high-quality point cloud models with a frozen CLIP transformer.
Our EPCL connects the 2D and 3D modalities by semantically aligning the image features and point cloud features without paired 2D-3D data.
arXiv Detail & Related papers (2022-12-08T06:27:11Z)
- Let Images Give You More: Point Cloud Cross-Modal Training for Shape Analysis [43.13887916301742]
This paper introduces a simple but effective point cloud cross-modality training (PointCMT) strategy to boost point cloud analysis.
To effectively acquire auxiliary knowledge from view images, we develop a teacher-student framework and formulate the cross modal learning as a knowledge distillation problem.
We verify significant gains on various datasets using appealing backbones, i.e., PointNet++ and PointMLP equipped with PointCMT (a minimal distillation sketch follows the list below).
arXiv Detail & Related papers (2022-10-09T09:35:22Z)
- CloudAttention: Efficient Multi-Scale Attention Scheme For 3D Point Cloud Learning [81.85951026033787]
We adopt transformers in this work and incorporate them into a hierarchical framework for shape classification and part and scene segmentation.
We also compute efficient and dynamic global cross attentions by leveraging sampling and grouping at each iteration.
The proposed hierarchical model achieves state-of-the-art shape classification in mean accuracy and yields results on par with previous segmentation methods.
arXiv Detail & Related papers (2022-07-31T21:39:15Z)
- Voint Cloud: Multi-View Point Cloud Representation for 3D Understanding [80.04281842702294]
We introduce the concept of the multi-view point cloud (Voint cloud), representing each 3D point as a set of features extracted from several viewpoints.
This novel 3D Voint cloud representation combines the compactness of 3D point cloud representation with the natural view-awareness of multi-view representation.
We deploy a Voint neural network (VointNet) with a theoretically established functional form to learn representations in the Voint space.
arXiv Detail & Related papers (2021-11-30T13:08:19Z)
- Unsupervised Representation Learning for 3D Point Cloud Data [66.92077180228634]
We propose a simple yet effective approach for unsupervised point cloud learning.
In particular, we identify a very useful transformation which generates a good contrastive version of an original point cloud.
We conduct experiments on three downstream tasks: 3D object classification, shape part segmentation, and scene segmentation.
arXiv Detail & Related papers (2021-10-13T10:52:45Z)
- Pointly-supervised 3D Scene Parsing with Viewpoint Bottleneck [3.2790748006553643]
Given that point-wise semantic annotation is expensive, in this paper, we address the challenge of learning models with extremely sparse labels.
We propose a self-supervised 3D representation learning framework named viewpoint bottleneck.
arXiv Detail & Related papers (2021-09-17T13:54:20Z)
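ModaLink's FoV transformation is only named in the summary above; the sketch below shows one common way such a point-cloud-to-image conversion can work, via spherical projection onto a range image. The function name, image resolution, and FoV bounds are illustrative assumptions for a typical 64-beam LiDAR, not details from the ModaLink paper.

```python
import numpy as np

def range_image_projection(points, h=64, w=900,
                           fov_up=np.radians(3.0), fov_down=np.radians(-25.0)):
    """Project an (N, 3) LiDAR point cloud onto an (h, w) range image.

    Each pixel stores the depth of a point that falls into it (later
    points overwrite earlier ones); empty pixels stay at 0.
    """
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    depth = np.linalg.norm(points, axis=1)
    yaw = np.arctan2(y, x)                        # horizontal angle in [-pi, pi]
    pitch = np.arcsin(z / np.maximum(depth, 1e-8))
    # Normalize angles to [0, 1] image coordinates.
    u = 0.5 * (1.0 - yaw / np.pi)                 # column coordinate
    v = (fov_up - pitch) / (fov_up - fov_down)    # row coordinate
    cols = np.clip((u * w).astype(int), 0, w - 1)
    rows = np.clip((v * h).astype(int), 0, h - 1)
    image = np.zeros((h, w), dtype=np.float32)
    image[rows, cols] = depth
    return image
```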
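Similarly, PointCMT's teacher-student formulation is summarized above without detail; the following is a minimal sketch of generic soft-label knowledge distillation of the kind that cross-modal training can build on. The temperature and loss form are standard KD choices, not PointCMT's exact recipe.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=4.0):
    """Soft-label knowledge distillation (Hinton et al. style).

    The student (a point cloud network) is trained to match the softened
    class distribution of the teacher (an image network).
    """
    t = temperature
    soft_teacher = F.softmax(teacher_logits / t, dim=-1)
    log_student = F.log_softmax(student_logits / t, dim=-1)
    # KL divergence scaled by t^2 to keep gradient magnitudes stable.
    return F.kl_div(log_student, soft_teacher, reduction="batchmean") * (t * t)

# Usage in a training step (teacher frozen, names hypothetical):
#   student_logits = point_student(points)
#   with torch.no_grad():
#       teacher_logits = image_teacher(rendered_views)
#   loss = F.cross_entropy(student_logits, labels) \
#          + distillation_loss(student_logits, teacher_logits)
```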