Multimodal Semi-Supervised Learning for 3D Objects
- URL: http://arxiv.org/abs/2110.11601v2
- Date: Mon, 25 Oct 2021 02:35:34 GMT
- Title: Multimodal Semi-Supervised Learning for 3D Objects
- Authors: Zhimin Chen, Longlong Jing, Yang Liang, YingLi Tian, Bing Li
- Abstract summary: This paper explores how the coherence of different modalities of 3D data can be used to improve data efficiency for both 3D classification and retrieval tasks.
We propose a novel multimodal semi-supervised learning framework by introducing an instance-level consistency constraint and a novel multimodal contrastive prototype (M2CP) loss.
Our proposed framework significantly outperforms all the state-of-the-art counterparts for both classification and retrieval tasks by a large margin on the ModelNet10 and ModelNet40 datasets.
- Score: 19.409295848915388
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In recent years, semi-supervised learning has been widely explored and shows
excellent data efficiency for 2D data. There is an emerging need to improve
data efficiency for 3D tasks due to the scarcity of labeled 3D data. This paper
explores how the coherence of different modalities of 3D data (e.g. point
cloud, image, and mesh) can be used to improve data efficiency for both 3D
classification and retrieval tasks. We propose a novel multimodal
semi-supervised learning framework by introducing an instance-level consistency
constraint and a novel multimodal contrastive prototype (M2CP) loss. The
instance-level consistency constraint forces the network to generate consistent
representations for multimodal data of the same object regardless of its
modality. The M2CP loss maintains a multimodal prototype for each class and learns
features with small intra-class variations by minimizing the feature distance
of each object to its prototype while maximizing the distance to the other prototypes.
Our proposed framework significantly outperforms all the state-of-the-art
counterparts for both classification and retrieval tasks by a large margin on
the ModelNet10 and ModelNet40 datasets.
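The abstract describes the two training signals only in words. As a rough illustration, the Python sketch below shows one way they could be written down; it is not the authors' implementation. The squared-distance form of the consistency term, the softmax-over-prototypes form of the prototype loss, the temperature, the momentum update of the prototypes, and all function names are assumptions made for this example.

```python
import math


def l2_normalize(v):
    """Scale a feature vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def instance_consistency(features_by_modality):
    """Assumed consistency term: mean squared distance between every pair of
    modality features (e.g. point cloud, image, mesh) of the SAME object."""
    feats = [l2_normalize(f) for f in features_by_modality]
    pairs = [(i, j) for i in range(len(feats)) for j in range(i + 1, len(feats))]
    if not pairs:  # a single modality has nothing to be consistent with
        return 0.0
    return sum(squared_distance(feats[i], feats[j]) for i, j in pairs) / len(pairs)


def prototype_contrastive_loss(feature, label, prototypes, temperature=0.1):
    """Assumed prototype term: cross-entropy over cosine similarities to the
    class prototypes. It pulls the feature toward its own class prototype and
    pushes it away from the other prototypes."""
    f = l2_normalize(feature)
    logits = [
        sum(a * b for a, b in zip(f, l2_normalize(p))) / temperature
        for p in prototypes
    ]
    m = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def update_prototype(prototype, feature, momentum=0.9):
    """Assumed bookkeeping: exponential moving average of a class prototype,
    fed with features from any modality so the prototype is multimodal."""
    return [momentum * p + (1.0 - momentum) * x for p, x in zip(prototype, feature)]


if __name__ == "__main__":
    # Hypothetical 2-D features of one object from three modalities.
    point_cloud, image, mesh = [1.0, 0.0], [0.9, 0.1], [0.8, 0.2]
    prototypes = [[1.0, 0.0], [0.0, 1.0]]  # one prototype per class
    print("consistency:", instance_consistency([point_cloud, image, mesh]))
    print("prototype loss:", prototype_contrastive_loss(image, 0, prototypes))
```

In this sketch the consistency term needs no labels, so it could be applied to unlabeled objects, while the prototype term needs a class label (or, presumably, a pseudo-label) to select the target prototype.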
Related papers
- LargeAD: Large-Scale Cross-Sensor Data Pretraining for Autonomous Driving [52.83707400688378]
LargeAD is a versatile and scalable framework designed for large-scale 3D pretraining across diverse real-world driving datasets.
Our framework leverages VFMs to extract semantically rich superpixels from 2D images, which are aligned with LiDAR point clouds to generate high-quality contrastive samples.
Our approach delivers significant performance improvements over state-of-the-art methods in both linear probing and fine-tuning tasks for both LiDAR-based segmentation and object detection.
arXiv Detail & Related papers (2025-01-07T18:59:59Z)
- CL3DOR: Contrastive Learning for 3D Large Multimodal Models via Odds Ratio on High-Resolution Point Clouds [1.9643285694999641]
We propose Contrastive Learning for 3D large multimodal models via Odds ratio on high-Resolution point clouds.
CL3DOR achieves state-of-the-art performance in 3D scene understanding and reasoning benchmarks.
arXiv Detail & Related papers (2025-01-07T15:42:32Z)
- SM3Det: A Unified Model for Multi-Modal Remote Sensing Object Detection [73.49799596304418]
This paper introduces a new task called Multi-Modal datasets and Multi-Task Object Detection (M2Det) for remote sensing.
It is designed to accurately detect horizontal or oriented objects from any sensor modality.
This task poses challenges due to 1) the trade-offs involved in managing multi-modal modelling and 2) the complexities of multi-task optimization.
arXiv Detail & Related papers (2024-12-30T02:47:51Z)
- GEAL: Generalizable 3D Affordance Learning with Cross-Modal Consistency [50.11520458252128]
Existing 3D affordance learning methods struggle with generalization and robustness due to limited annotated data.
We propose GEAL, a novel framework designed to enhance the generalization and robustness of 3D affordance learning by leveraging large-scale pre-trained 2D models.
GEAL consistently outperforms existing methods across seen and novel object categories, as well as corrupted data.
arXiv Detail & Related papers (2024-12-12T17:59:03Z)
- Enhancing Generalizability of Representation Learning for Data-Efficient 3D Scene Understanding [50.448520056844885]
We propose a generative Bayesian network to produce diverse synthetic scenes with real-world patterns.
A series of experiments consistently demonstrates our method's superiority over existing state-of-the-art pre-training approaches.
arXiv Detail & Related papers (2024-06-17T07:43:53Z)
- MM-Point: Multi-View Information-Enhanced Multi-Modal Self-Supervised 3D Point Cloud Understanding [4.220064723125481]
Multi-view 2D information can provide superior self-supervised signals for 3D objects.
MM-Point is driven by intra-modal and inter-modal similarity objectives.
It achieves a peak accuracy of 92.4% on the synthetic dataset ModelNet40, and a top accuracy of 87.8% on the real-world dataset ScanObjectNN.
arXiv Detail & Related papers (2024-02-15T15:10:17Z)
- Leveraging Large-Scale Pretrained Vision Foundation Models for Label-Efficient 3D Point Cloud Segmentation [67.07112533415116]
We present a novel framework that adapts various foundational models for the 3D point cloud segmentation task.
Our approach involves making initial predictions of 2D semantic masks using different large vision models.
To generate robust 3D semantic pseudo labels, we introduce a semantic label fusion strategy that effectively combines all the results via voting.
arXiv Detail & Related papers (2023-11-03T15:41:15Z)
- Towards Large-scale 3D Representation Learning with Multi-dataset Point Prompt Training [44.790636524264]
Point Prompt Training is a novel framework for multi-dataset synergistic learning in the context of 3D representation learning.
It can overcome the negative transfer associated with synergistic learning and produce generalizable representations.
It achieves state-of-the-art performance on each dataset using a single weight-shared model with supervised multi-dataset training.
arXiv Detail & Related papers (2023-08-18T17:59:57Z)
- Beyond First Impressions: Integrating Joint Multi-modal Cues for Comprehensive 3D Representation [72.94143731623117]
Existing methods simply align 3D representations with single-view 2D images and coarse-grained parent category text.
Insufficient synergy neglects the idea that a robust 3D representation should align with the joint vision-language space.
We propose a multi-view joint modality modeling approach, termed JM3D, to obtain a unified representation for point cloud, text, and image.
arXiv Detail & Related papers (2023-08-06T01:11:40Z)