Autoencoders as Cross-Modal Teachers: Can Pretrained 2D Image
Transformers Help 3D Representation Learning?
- URL: http://arxiv.org/abs/2212.08320v1
- Date: Fri, 16 Dec 2022 07:46:53 GMT
- Title: Autoencoders as Cross-Modal Teachers: Can Pretrained 2D Image
Transformers Help 3D Representation Learning?
- Authors: Runpei Dong, Zekun Qi, Linfeng Zhang, Junbo Zhang, Jianjian Sun, Zheng
Ge, Li Yi, Kaisheng Ma
- Abstract summary: We show that foundational Transformers pretrained with 2D images or natural language can help self-supervised 3D representation learning through training Autoencoders as Cross-Modal Teachers (ACT).
Our ACT pretrained 3D learner achieves state-of-the-art generalization capacity across various downstream benchmarks, e.g., 88.21% overall accuracy on ScanObjectNN.
- Score: 30.59796205121887
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The success of deep learning heavily relies on large-scale data with
comprehensive labels, which are far more expensive and time-consuming to acquire
for 3D data than for 2D images or natural language. This motivates using models
pretrained on data-richer modalities than 3D as teachers for cross-modal
knowledge transfer. In this paper, we revisit masked modeling in a unified
knowledge-distillation framework, and we show that foundational Transformers
pretrained on 2D images or natural language can help self-supervised 3D
representation learning through training Autoencoders as Cross-Modal Teachers
(ACT). The pretrained Transformers are transferred as cross-modal 3D teachers
using discrete variational autoencoding self-supervision, during which the
Transformers are kept frozen and adapted with prompt tuning for better knowledge
inheritance. The latent features encoded by the 3D teachers are used as the
targets of masked point modeling, wherein the dark knowledge is distilled to the
3D Transformer students as foundational geometry understanding. Our ACT
pretrained 3D learner achieves state-of-the-art generalization capacity across
various downstream benchmarks, e.g., 88.21% overall accuracy on ScanObjectNN.
Code will be released at https://github.com/RunpeiDong/ACT.
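The abstract describes a two-stage pipeline: first, a frozen, prompt-tuned 2D or language Transformer is trained as a discrete-variational-autoencoder (dVAE) teacher on point clouds; second, a 3D Transformer student is pretrained with masked point modeling, regressing the teacher's latent features at masked positions. The PyTorch sketch below shows one way such a pipeline could be wired up. All names, dimensions, the toy grouping/tokenizer, and the use of nn.TransformerEncoder as a stand-in for a pretrained image Transformer are assumptions for illustration, not the authors' released implementation.
```python
# A minimal sketch of the two-stage ACT pipeline described in the abstract.
# Module names, dimensions, and the stand-in Transformer are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

D = 384           # embedding width (assumed)
NUM_GROUPS = 64   # local point groups (tokens) per cloud (assumed)
GROUP_SIZE = 32   # points per group (assumed)

def group_points(pts):
    """Toy grouping: split N points into NUM_GROUPS equal patches.
    (A real pipeline would use FPS + kNN grouping.)"""
    B, N, _ = pts.shape
    return pts.view(B, NUM_GROUPS, N // NUM_GROUPS, 3)

class PointTokenizer(nn.Module):
    """Embed each local point patch into a token (simplified mini-PointNet)."""
    def __init__(self):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3, D), nn.GELU(), nn.Linear(D, D))
    def forward(self, groups):                        # (B, G, S, 3)
        return self.mlp(groups).max(dim=2).values     # (B, G, D)

class CrossModalTeacher(nn.Module):
    """Stage 1: dVAE-style autoencoder whose encoder is a frozen pretrained
    Transformer adapted with learnable prompt tokens."""
    def __init__(self, pretrained_blocks, num_prompts=8, vocab=1024):
        super().__init__()
        self.tokenizer = PointTokenizer()
        self.blocks = pretrained_blocks               # frozen 2D/NLP Transformer
        for p in self.blocks.parameters():
            p.requires_grad = False
        self.prompts = nn.Parameter(torch.zeros(1, num_prompts, D))
        self.codebook = nn.Embedding(vocab, D)        # discrete latent codes
        self.decoder = nn.Sequential(nn.Linear(D, D), nn.GELU(),
                                     nn.Linear(D, GROUP_SIZE * 3))
    def encode(self, pts):
        tok = self.tokenizer(group_points(pts))       # (B, G, D)
        x = torch.cat([self.prompts.expand(tok.size(0), -1, -1), tok], dim=1)
        x = self.blocks(x)[:, self.prompts.size(1):]  # drop prompt positions
        return x                                      # teacher latent features
    def forward(self, pts):
        z = self.encode(pts)
        # Gumbel-softmax quantization over the codebook (dVAE-style, simplified).
        one_hot = F.gumbel_softmax(z @ self.codebook.weight.t(), tau=1.0, hard=True)
        q = one_hot @ self.codebook.weight
        rec = self.decoder(q).view(pts.size(0), NUM_GROUPS, GROUP_SIZE, 3)
        return rec, z

class PointStudent(nn.Module):
    """Stage 2: 3D Transformer student trained with masked point modeling,
    regressing the teacher's latent features at masked positions."""
    def __init__(self):
        super().__init__()
        self.tokenizer = PointTokenizer()
        self.mask_token = nn.Parameter(torch.zeros(1, 1, D))
        layer = nn.TransformerEncoderLayer(D, nhead=6, dim_feedforward=4 * D,
                                           batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=6)
        self.head = nn.Linear(D, D)
    def forward(self, pts, mask):                     # mask: (B, G) bool
        tok = self.tokenizer(group_points(pts))
        tok = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(tok), tok)
        return self.head(self.blocks(tok))            # predicted latents

def teacher_step(teacher, pts):
    """Stage-1 objective (simplified): reconstruct the grouped point patches."""
    rec, _ = teacher(pts)
    return F.mse_loss(rec, group_points(pts))

def distill_step(teacher, student, pts, mask_ratio=0.6):
    """Stage-2 step: the teacher's latents ("dark knowledge") serve as the
    masked point modeling targets for the student."""
    with torch.no_grad():
        target = teacher.encode(pts)
    mask = torch.rand(pts.size(0), NUM_GROUPS, device=pts.device) < mask_ratio
    pred = student(pts, mask)
    return F.smooth_l1_loss(pred[mask], target[mask])

# Stand-in "pretrained" Transformer (in practice, e.g. a frozen ViT or BERT):
layer = nn.TransformerEncoderLayer(D, nhead=6, dim_feedforward=4 * D, batch_first=True)
teacher = CrossModalTeacher(nn.TransformerEncoder(layer, num_layers=6))
student = PointStudent()
pts = torch.randn(2, NUM_GROUPS * GROUP_SIZE, 3)
print(teacher_step(teacher, pts).item(), distill_step(teacher, student, pts).item())
```
The uniform grouping, the reconstruction target, and the simple quantization above are placeholders; the actual method's tokenization, dVAE training, and distillation objectives are specified in the paper and the released code.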
Related papers
- Any2Point: Empowering Any-modality Large Models for Efficient 3D Understanding [83.63231467746598]
We introduce Any2Point, a parameter-efficient method to empower any-modality large models (vision, language, audio) for 3D understanding.
We propose a 3D-to-any (1D or 2D) virtual projection strategy that correlates the input 3D points with the original 1D or 2D positions within the source modality (see the projection sketch after this list).
arXiv Detail & Related papers (2024-04-11T17:59:45Z)
- NeRF-MAE: Masked AutoEncoders for Self-Supervised 3D Representation Learning for Neural Radiance Fields [57.617972778377215]
We show how to generate effective 3D representations from posed RGB images.
We pretrain this representation at scale on our proposed curated posed-RGB data, totaling over 1.8 million images.
Our novel self-supervised pretraining for NeRFs, NeRF-MAE, scales remarkably well and improves performance on various challenging 3D tasks.
arXiv Detail & Related papers (2024-04-01T17:59:55Z)
- PonderV2: Pave the Way for 3D Foundation Model with A Universal Pre-training Paradigm [114.47216525866435]
We introduce a novel universal 3D pre-training framework designed to facilitate the acquisition of efficient 3D representations.
For the first time, PonderV2 achieves state-of-the-art performance on 11 indoor and outdoor benchmarks, demonstrating its effectiveness.
arXiv Detail & Related papers (2023-10-12T17:59:57Z)
- Learning 3D Representations from 2D Pre-trained Models via Image-to-Point Masked Autoencoders [52.91248611338202]
We propose an alternative way to obtain superior 3D representations from 2D pre-trained models via Image-to-Point Masked Autoencoders, named I2P-MAE.
Through self-supervised pre-training, we leverage the well-learned 2D knowledge to guide 3D masked autoencoding.
I2P-MAE attains state-of-the-art 90.11% accuracy, +3.68% over the second-best, demonstrating superior transfer capability.
arXiv Detail & Related papers (2022-12-13T17:59:20Z)
- 3D Vision with Transformers: A Survey [114.86385193388439]
The success of the Transformer architecture in natural language processing has attracted attention in the computer vision field.
We present a systematic and thorough review of more than 100 Transformer-based methods for different 3D vision tasks.
We discuss Transformer designs in 3D vision that allow them to process data with various 3D representations.
arXiv Detail & Related papers (2022-08-08T17:59:11Z)
- Data Efficient 3D Learner via Knowledge Transferred from 2D Model [30.077342050473515]
We deal with the data scarcity challenge of 3D tasks by transferring knowledge from strong 2D models via RGB-D images.
We utilize a strong and well-trained semantic segmentation model for 2D images to augment RGB-D images with pseudo-labels.
Our method already outperforms existing state-of-the-art methods tailored for 3D label efficiency.
arXiv Detail & Related papers (2022-03-16T09:14:44Z)
- X-Trans2Cap: Cross-Modal Knowledge Transfer using Transformer for 3D Dense Captioning [71.36623596807122]
3D dense captioning aims to describe individual objects in 3D scenes with natural language, where 3D scenes are usually represented as RGB-D scans or point clouds.
In this study, we investigate cross-modal knowledge transfer using a Transformer for 3D dense captioning, X-Trans2Cap, to effectively boost the performance of single-modal 3D captioning.
arXiv Detail & Related papers (2022-03-02T03:35:37Z)
- Learning from 2D: Pixel-to-Point Knowledge Transfer for 3D Pretraining [21.878815180924832]
We present a novel 3D pretraining method by leveraging 2D networks learned from rich 2D datasets.
Our experiments show that 3D models pretrained with 2D knowledge boost performance across various real-world 3D downstream tasks.
arXiv Detail & Related papers (2021-04-10T05:40:42Z)
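For the Any2Point entry above, the following is a minimal sketch of the 2D case of its virtual projection idea: each 3D point is projected onto a virtual image plane and assigned the positional embedding of the pixel it lands on, so a frozen 2D model's positional priors can be reused for point tokens. The orthographic view, grid size, and function name are assumptions for illustration, not the paper's exact formulation.
```python
# Hypothetical 3D-to-2D virtual projection sketch (Any2Point-style reuse of a
# pretrained 2D positional-embedding table); all parameters are illustrative.
import torch

def virtual_project_2d(points, pos_embed_2d, grid=14):
    """points: (B, N, 3) with coordinates normalized to [-1, 1];
    pos_embed_2d: (grid*grid, D) table from a pretrained 2D Transformer.
    Returns a 2D positional embedding for every 3D point: (B, N, D)."""
    uv = points[..., :2].clamp(-1, 1)                 # orthographic view onto the xy-plane
    ij = ((uv + 1) / 2 * (grid - 1)).round().long()   # pixel indices on the grid
    flat = ij[..., 1] * grid + ij[..., 0]             # flatten (row, col) -> token index
    return pos_embed_2d[flat]

pos_table = torch.randn(14 * 14, 384)                 # e.g. a frozen ViT positional table
pts = torch.rand(2, 1024, 3) * 2 - 1
point_pos = virtual_project_2d(pts, pos_table)        # (2, 1024, 384)
```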