Cross-BERT for Point Cloud Pretraining
- URL: http://arxiv.org/abs/2312.04891v1
- Date: Fri, 8 Dec 2023 08:18:12 GMT
- Title: Cross-BERT for Point Cloud Pretraining
- Authors: Xin Li, Peng Li, Zeyong Wei, Zhe Zhu, Mingqiang Wei, Junhui Hou,
Liangliang Nan, Jing Qin, Haoran Xie, and Fu Lee Wang
- Abstract summary: We propose a new cross-modal BERT-style self-supervised learning paradigm, called Cross-BERT.
To facilitate pretraining for irregular and sparse point clouds, we design two self-supervised tasks to boost cross-modal interaction.
Our work highlights the effectiveness of leveraging cross-modal 2D knowledge to strengthen 3D point cloud representation and the transferable capability of BERT across modalities.
- Score: 61.762046503448936
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Introducing BERT into cross-modal settings raises difficulties in its
optimization for handling multiple modalities. Both the BERT architecture and
training objective need to be adapted to incorporate and model information from
different modalities. In this paper, we address these challenges by exploring
the implicit semantic and geometric correlations between 2D and 3D data of the
same objects/scenes. We propose a new cross-modal BERT-style self-supervised
learning paradigm, called Cross-BERT. To facilitate pretraining for irregular
and sparse point clouds, we design two self-supervised tasks to boost
cross-modal interaction. The first task, referred to as Point-Image Alignment,
aims to align features between unimodal and cross-modal representations to
capture the correspondences between the 2D and 3D modalities. The second task,
termed Masked Cross-modal Modeling, further improves mask modeling of BERT by
incorporating high-dimensional semantic information obtained by cross-modal
interaction. By performing cross-modal interaction, Cross-BERT can smoothly
reconstruct the masked tokens during pretraining, leading to notable
performance enhancements for downstream tasks. Through empirical evaluation, we
demonstrate that Cross-BERT outperforms existing state-of-the-art methods in 3D
downstream applications. Our work highlights the effectiveness of leveraging
cross-modal 2D knowledge to strengthen 3D point cloud representation and the
transferable capability of BERT across modalities.
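As a rough illustration only: the two pretraining objectives described above are commonly implemented as a contrastive alignment loss between the 2D and 3D features of the same object, plus a BERT-style masked-token prediction loss on the point branch. The sketch below is not the authors' code; the function names, the temperature value, and the tensor shapes are illustrative assumptions.
```python
# Minimal sketch of two Cross-BERT-style objectives (illustrative only).
import torch
import torch.nn.functional as F

def point_image_alignment_loss(point_feats, image_feats, temperature=0.07):
    """Contrastive alignment of 3D (point) and 2D (image) features of the
    same objects; each batch element is its own positive pair."""
    p = F.normalize(point_feats, dim=-1)            # (B, D)
    i = F.normalize(image_feats, dim=-1)            # (B, D)
    logits = p @ i.t() / temperature                # (B, B) similarities
    targets = torch.arange(p.size(0), device=p.device)
    # symmetric InfoNCE: point-to-image and image-to-point
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def masked_cross_modal_loss(pred_logits, target_ids, mask):
    """BERT-style masked modeling: predict the discrete tokens of masked
    point patches from a decoder that has attended to the image modality."""
    # pred_logits: (B, N, V), target_ids: (B, N), mask: (B, N) boolean
    return F.cross_entropy(pred_logits[mask], target_ids[mask])
```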
Related papers
- Contrastive masked auto-encoders based self-supervised hashing for 2D image and 3D point cloud cross-modal retrieval [5.965791109321719]
Cross-modal hashing between 2D images and 3D point-cloud data is a growing concern in real-world retrieval systems.
We propose contrastive masked autoencoders based self-supervised hashing (CMAH) for retrieval between images and point-cloud data.
arXiv Detail & Related papers (2024-08-11T07:03:21Z) - GeminiFusion: Efficient Pixel-wise Multimodal Fusion for Vision Transformer [44.44603063754173]
Cross-modal transformers have demonstrated superiority in various vision tasks by effectively integrating different modalities.
We propose GeminiFusion, a pixel-wise fusion approach that capitalizes on aligned cross-modal representations.
We employ layer-adaptive noise to control the cross-modal interplay on a per-layer basis, thereby achieving a harmonized fusion process.
arXiv Detail & Related papers (2024-06-03T11:24:15Z) - M$^{3}$3D: Learning 3D priors using Multi-Modal Masked Autoencoders for
2D image and video understanding [5.989397492717352]
We present M$^{3}$3D ($\underline{M}$ulti-$\underline{M}$odal $\underline{M}$asked $\underline{3D}$), built on multi-modal masked autoencoders.
We integrate two major self-supervised learning frameworks: Masked Image Modeling (MIM) and contrastive learning.
Experiments show that M$^{3}$3D outperforms the existing state-of-the-art approaches on ScanNet, NYUv2, UCF-101 and OR-AR.
arXiv Detail & Related papers (2023-09-26T23:52:09Z) - Cross-modal Orthogonal High-rank Augmentation for RGB-Event
Transformer-trackers [58.802352477207094]
We explore the great potential of a pre-trained vision Transformer (ViT) to bridge the vast distribution gap between two modalities.
We propose a mask modeling strategy that randomly masks a specific modality of some tokens to enforce proactive interaction between tokens from different modalities.
Experiments demonstrate that our plug-and-play training augmentation techniques can significantly boost state-of-the-art one-stream and two-stream trackers in terms of both tracking precision and success rate.
arXiv Detail & Related papers (2023-07-09T08:58:47Z) - Modeling Continuous Motion for 3D Point Cloud Object Tracking [54.48716096286417]
This paper presents a novel approach that views each tracklet as a continuous stream.
At each timestamp, only the current frame is fed into the network to interact with multi-frame historical features stored in a memory bank.
To enhance the utilization of multi-frame features for robust tracking, a contrastive sequence enhancement strategy is proposed.
arXiv Detail & Related papers (2023-03-14T02:58:27Z) - Image Understands Point Cloud: Weakly Supervised 3D Semantic
Segmentation via Association Learning [59.64695628433855]
We propose a novel cross-modality weakly supervised method for 3D segmentation, incorporating complementary information from unlabeled images.
We design a dual-branch network equipped with an active labeling strategy to make the most of a tiny fraction of labels.
Our method even outperforms the state-of-the-art fully supervised competitors with less than 1% actively selected annotations.
arXiv Detail & Related papers (2022-09-16T07:59:04Z) - CMD: Self-supervised 3D Action Representation Learning with Cross-modal
Mutual Distillation [130.08432609780374]
In 3D action recognition, there exists rich complementary information between skeleton modalities.
We propose a new Cross-modal Mutual Distillation (CMD) framework with the following designs.
Our approach outperforms existing self-supervised methods and sets a series of new records.
arXiv Detail & Related papers (2022-08-26T06:06:09Z) - COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for
Cross-Modal Retrieval [59.15034487974549]
We propose a novel COllaborative Two-Stream vision-language pretraining model termed COTS for image-text retrieval.
Our COTS achieves the highest performance among all two-stream methods and comparable performance while being 10,800X faster in inference.
Importantly, our COTS is also applicable to text-to-video retrieval, yielding new state-of-the-art performance on the widely-used MSR-VTT dataset.
arXiv Detail & Related papers (2022-04-15T12:34:47Z)