MM-Point: Multi-View Information-Enhanced Multi-Modal Self-Supervised 3D Point Cloud Understanding
- URL: http://arxiv.org/abs/2402.10002v3
- Date: Sun, 25 Feb 2024 07:58:07 GMT
- Title: MM-Point: Multi-View Information-Enhanced Multi-Modal Self-Supervised 3D Point Cloud Understanding
- Authors: Hai-Tao Yu, Mofei Song
- Abstract summary: Multi-view 2D information can provide superior self-supervised signals for 3D objects.
MM-Point is driven by intra-modal and inter-modal similarity objectives.
It achieves a peak accuracy of 92.4% on the synthetic dataset ModelNet40, and a top accuracy of 87.8% on the real-world dataset ScanObjectNN.
- Score: 4.220064723125481
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In perception, multiple sources of sensory information are integrated
to map visual information from 2D views onto 3D objects, which benefits
understanding in 3D environments. However, any single 2D view, rendered from one
particular angle, provides only limited partial information. The richness of
multi-view 2D information can therefore provide superior self-supervised signals
for 3D objects. In this paper, we propose a novel self-supervised point cloud
representation learning method, MM-Point, driven by intra-modal and inter-modal
similarity objectives. The core of MM-Point lies in simultaneous multi-modal
interaction and information transfer between a 3D object and its multiple 2D
views. To carry out the consistent cross-modal objective over 2D multi-view
information more effectively under contrastive learning, we further propose
Multi-MLP and Multi-level Augmentation strategies. Through carefully designed
transformation strategies, we further learn multi-level invariance across 2D
multi-views. MM-Point demonstrates state-of-the-art (SOTA) performance in
various downstream tasks.
For instance, it achieves a peak accuracy of 92.4% on the synthetic dataset
ModelNet40, and a top accuracy of 87.8% on the real-world dataset ScanObjectNN,
comparable to fully supervised methods. Additionally, we demonstrate its
effectiveness in tasks such as few-shot classification, 3D part segmentation
and 3D semantic segmentation.
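The abstract's core objective, intra-modal agreement between two augmentations of a point cloud plus inter-modal agreement between the point cloud and each of its rendered 2D views, can be made concrete with a small sketch. The snippet below is only an illustration under assumed design choices (InfoNCE losses, one projection head per view in the spirit of Multi-MLP, equal loss weights); names such as `info_nce`, `mm_point_style_loss`, and `view_heads` are hypothetical and not taken from the authors' code.

```python
# Hedged sketch of a multi-view, multi-modal contrastive objective in the spirit of
# MM-Point's intra-/inter-modal similarity losses. All names and design choices here
# are assumptions for illustration, not the authors' implementation.
import torch
import torch.nn.functional as F

def info_nce(anchor, positive, temperature=0.07):
    """Standard InfoNCE: the i-th anchor should match the i-th positive in the batch."""
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    logits = anchor @ positive.t() / temperature                  # (B, B) similarities
    targets = torch.arange(anchor.size(0), device=anchor.device)
    return F.cross_entropy(logits, targets)

def mm_point_style_loss(point_emb_a, point_emb_b, view_embs, point_head, view_heads):
    """
    point_emb_a, point_emb_b: (B, D) embeddings of two augmentations of the same clouds.
    view_embs: list of V tensors of shape (B, D), one per rendered 2D view.
    point_head: projection MLP for the point-cloud branch.
    view_heads: list of V projection MLPs (one head per view, a Multi-MLP-like choice).
    """
    # Intra-modal term: two augmentations of the same point cloud should agree.
    z_a, z_b = point_head(point_emb_a), point_head(point_emb_b)
    loss = info_nce(z_a, z_b)

    # Inter-modal terms: the point cloud should agree with each of its 2D views.
    for head, view in zip(view_heads, view_embs):
        loss = loss + info_nce(z_a, head(view))
    return loss / (1 + len(view_embs))
```

In this sketch the per-view heads could just as well be shared or replaced by a single MLP; using one head per view here only mirrors the "Multi-MLP" wording in the abstract.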
Related papers
- Zero-Shot Dual-Path Integration Framework for Open-Vocabulary 3D Instance Segmentation [19.2297264550686]
Open-vocabulary 3D instance segmentation transcends traditional closed-vocabulary methods.
We introduce Zero-Shot Dual-Path Integration Framework that equally values the contributions of both 3D and 2D modalities.
Our framework, utilizing pre-trained models in a zero-shot manner, is model-agnostic and demonstrates superior performance on both seen and unseen data.
arXiv Detail & Related papers (2024-08-16T07:52:00Z)
- Beyond First Impressions: Integrating Joint Multi-modal Cues for Comprehensive 3D Representation [72.94143731623117]
Existing methods simply align 3D representations with single-view 2D images and coarse-grained parent category text.
This insufficient synergy neglects the idea that a robust 3D representation should align with the joint vision-language space.
We propose a multi-view joint modality modeling approach, termed JM3D, to obtain a unified representation for point cloud, text, and image.
arXiv Detail & Related papers (2023-08-06T01:11:40Z)
- SCA-PVNet: Self-and-Cross Attention Based Aggregation of Point Cloud and Multi-View for 3D Object Retrieval [8.74845857766369]
Multi-modality 3D object retrieval is rarely developed and analyzed on large-scale datasets.
We propose self-and-cross attention based aggregation of point cloud and multi-view images (SCA-PVNet) for 3D object retrieval.
arXiv Detail & Related papers (2023-07-20T05:46:32Z)
- MMRDN: Consistent Representation for Multi-View Manipulation Relationship Detection in Object-Stacked Scenes [62.20046129613934]
We propose a novel multi-view fusion framework, namely the multi-view MRD network (MMRDN).
We project the 2D data from different views into a common hidden space and fit the embeddings with a set of von Mises-Fisher distributions.
We select a set of $K$ Maximum Vertical Neighbors (KMVN) points from the point cloud of each object pair, which encodes the relative position of these two objects.
arXiv Detail & Related papers (2023-04-25T05:55:29Z)
- Joint-MAE: 2D-3D Joint Masked Autoencoders for 3D Point Cloud Pre-training [65.75399500494343]
Masked Autoencoders (MAE) have shown promising performance in self-supervised learning for 2D and 3D computer vision.
We propose Joint-MAE, a 2D-3D joint MAE framework for self-supervised 3D point cloud pre-training.
arXiv Detail & Related papers (2023-02-27T17:56:18Z)
- MVTN: Learning Multi-View Transformations for 3D Understanding [60.15214023270087]
We introduce the Multi-View Transformation Network (MVTN), which uses differentiable rendering to determine optimal view-points for 3D shape recognition.
MVTN can be trained end-to-end with any multi-view network for 3D shape recognition.
Our approach demonstrates state-of-the-art performance in 3D classification and shape retrieval on several benchmarks.
arXiv Detail & Related papers (2022-12-27T12:09:16Z)
- PointMCD: Boosting Deep Point Cloud Encoders via Multi-view Cross-modal Distillation for 3D Shape Recognition [55.38462937452363]
We propose a unified multi-view cross-modal distillation architecture, including a pretrained deep image encoder as the teacher and a deep point encoder as the student.
By pair-wise aligning multi-view visual and geometric descriptors, we can obtain more powerful deep point encoders without exhaustive and complicated network modifications.
arXiv Detail & Related papers (2022-07-07T07:23:20Z)
- Multimodal Semi-Supervised Learning for 3D Objects [19.409295848915388]
This paper explores how the coherence of different modalities of 3D data can be used to improve data efficiency for both 3D classification and retrieval tasks.
We propose a novel multimodal semi-supervised learning framework by introducing instance-level consistency constraint and a novel multimodal contrastive prototype (M2CP) loss.
Our proposed framework significantly outperforms all the state-of-the-art counterparts for both classification and retrieval tasks by a large margin on the ModelNet10 and ModelNet40 datasets.
arXiv Detail & Related papers (2021-10-22T05:33:16Z)
- Multi-Task Multi-Sensor Fusion for 3D Object Detection [93.68864606959251]
We present an end-to-end learnable architecture that reasons about 2D and 3D object detection as well as ground estimation and depth completion.
Our experiments show that all these tasks are complementary and help the network learn better representations by fusing information at various levels.
arXiv Detail & Related papers (2020-12-22T22:49:15Z)
- Self-supervised Feature Learning by Cross-modality and Cross-view Correspondences [32.01548991331616]
This paper presents a novel self-supervised learning approach to learn both 2D image features and 3D point cloud features.
It exploits cross-modality and cross-view correspondences without using any annotated human labels.
The effectiveness of the learned 2D and 3D features is evaluated by transferring them on five different tasks.
arXiv Detail & Related papers (2020-04-13T02:57:25Z)
- MANet: Multimodal Attention Network based Point-View fusion for 3D Shape Recognition [0.5371337604556311]
This paper proposes a fusion network based on multimodal attention mechanism for 3D shape recognition.
Considering the limitations of multi-view data, we introduce a soft attention scheme that uses the global point-cloud features to filter the multi-view features.
More specifically, we obtain the enhanced multi-view features by mining the contribution of each view image to the overall shape recognition (a minimal sketch of this idea follows the list).
arXiv Detail & Related papers (2020-02-28T07:00:14Z)
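To make the point-view soft-attention idea in the MANet entry above more tangible, here is a minimal, purely illustrative sketch in which a global point-cloud feature scores each 2D view feature before fusion. The class name `PointViewSoftAttention`, the two-layer scoring MLP, and the concatenation-based fusion are assumptions, not the paper's actual architecture.

```python
# Illustrative point-view soft attention (assumed design, not MANet's exact layers):
# a global point-cloud feature scores each 2D view feature, the scores re-weight the
# views, and the weighted sum is fused with the point feature.
import torch
import torch.nn as nn

class PointViewSoftAttention(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        # Scores one view feature conditioned on the global point-cloud feature.
        self.score = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1)
        )

    def forward(self, point_feat, view_feats):
        # point_feat: (B, D) global point-cloud feature
        # view_feats: (B, V, D) per-view image features
        num_views = view_feats.size(1)
        query = point_feat.unsqueeze(1).expand(-1, num_views, -1)    # (B, V, D)
        scores = self.score(torch.cat([query, view_feats], dim=-1))  # (B, V, 1)
        weights = torch.softmax(scores, dim=1)                       # attention over views
        fused_views = (weights * view_feats).sum(dim=1)              # (B, D)
        return torch.cat([point_feat, fused_views], dim=-1)          # (B, 2D) descriptor
```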
This list is automatically generated from the titles and abstracts of the papers on this site.