X-Trans2Cap: Cross-Modal Knowledge Transfer using Transformer for 3D
Dense Captioning
- URL: http://arxiv.org/abs/2203.00843v1
- Date: Wed, 2 Mar 2022 03:35:37 GMT
- Title: X-Trans2Cap: Cross-Modal Knowledge Transfer using Transformer for 3D
Dense Captioning
- Authors: Zhihao Yuan, Xu Yan, Yinghong Liao, Yao Guo, Guanbin Li, Zhen Li,
Shuguang Cui
- Abstract summary: 3D dense captioning aims to describe individual objects in 3D scenes with natural language, where the scenes are usually represented as RGB-D scans or point clouds.
In this study, we investigate cross-modal knowledge transfer using a Transformer for 3D dense captioning, X-Trans2Cap, to effectively boost the performance of single-modal 3D captioning.
- Score: 71.36623596807122
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: 3D dense captioning aims to describe individual objects by natural language
in 3D scenes, where 3D scenes are usually represented as RGB-D scans or point
clouds. However, previous approaches that exploit only single-modal information,
e.g., point clouds, fail to produce faithful descriptions. Although aggregating
2D features into point clouds can be beneficial, it introduces an extra
computational burden, especially during inference. In this study, we
investigate cross-modal knowledge transfer using a Transformer for 3D dense
captioning, X-Trans2Cap, to effectively boost the performance of single-modal
3D captioning through knowledge distillation with a teacher-student framework. In
practice, during the training phase, the teacher network exploits the auxiliary 2D
modality and guides the student network, which takes only point clouds as input,
through feature consistency constraints. Owing to the well-designed
cross-modal feature fusion module and the feature alignment in the training
phase, X-Trans2Cap acquires rich appearance information embedded in 2D images
with ease. Thus, a more faithful caption can be generated using only point
clouds during inference. Qualitative and quantitative results confirm that
X-Trans2Cap outperforms the previous state-of-the-art by a large margin, i.e.,
about +21 and +16 in absolute CIDEr score on the ScanRefer and Nr3D datasets,
respectively.
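
To make the teacher-student scheme above concrete, here is a minimal PyTorch-style sketch of cross-modal distillation with a feature consistency constraint. This is not the authors' implementation: the attention-based fusion module, student encoder, caption head, loss weights, vocabulary size, and tensor shapes are simplified placeholders, and per-object 2D and 3D features are assumed to come from upstream backbones.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalFusion(nn.Module):
    """Toy stand-in for a cross-modal fusion module: the teacher attends
    from 3D object features to auxiliary 2D image features."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, feat_3d, feat_2d):
        fused, _ = self.attn(feat_3d, feat_2d, feat_2d)
        return self.norm(feat_3d + fused)

def distillation_step(feat_3d, feat_2d, captions, fusion, student, head, alpha=1.0):
    """One training step: the teacher branch sees fused 2D+3D features, the
    student branch sees only 3D features; an L2 feature-consistency term
    (an assumption here) pulls the student toward the teacher."""
    teacher_feats = fusion(feat_3d, feat_2d)           # 2D-aware teacher features
    student_feats = student(feat_3d)                   # point-cloud-only student path
    consistency = F.mse_loss(student_feats, teacher_feats.detach())
    # Both branches are supervised with ground-truth caption tokens.
    t_loss = F.cross_entropy(head(teacher_feats).flatten(0, 1), captions.flatten())
    s_loss = F.cross_entropy(head(student_feats).flatten(0, 1), captions.flatten())
    return t_loss + s_loss + alpha * consistency

# Toy usage: batch of 2 scenes with 8 detected objects each, 256-d features.
fusion = CrossModalFusion()
student = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 256))
head = nn.Linear(256, 3000)                            # per-object vocabulary logits
f3d, f2d = torch.randn(2, 8, 256), torch.randn(2, 8, 256)
caps = torch.randint(0, 3000, (2, 8))                  # one placeholder token per object
loss = distillation_step(f3d, f2d, caps, fusion, student, head)
loss.backward()
```

At inference time, only the student path and the caption head would be used, which mirrors the claim that the 2D modality is needed only during training.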
Related papers
- Self-supervised Learning of LiDAR 3D Point Clouds via 2D-3D Neural Calibration [107.61458720202984]
This paper introduces a novel self-supervised learning framework for enhancing 3D perception in autonomous driving scenes.
We propose the learnable transformation alignment to bridge the domain gap between image and point cloud data.
We establish dense 2D-3D correspondences to estimate the rigid pose.
arXiv Detail & Related papers (2024-01-23T02:41:06Z)
- Cross-Modal Information-Guided Network using Contrastive Learning for Point Cloud Registration [17.420425069785946]
We present a novel Cross-Modal Information-Guided Network (CMIGNet) for point cloud registration.
We first incorporate the projected images from the point clouds and fuse the cross-modal features using the attention mechanism.
We employ two contrastive learning strategies, namely overlapping contrastive learning and cross-modal contrastive learning.
arXiv Detail & Related papers (2023-11-02T12:56:47Z)
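
As a rough illustration of the cross-modal contrastive learning mentioned in the CMIGNet entry above, the sketch below computes a symmetric InfoNCE-style loss between paired point-cloud and projected-image embeddings. The temperature and feature dimensions are illustrative assumptions, not details taken from that paper.

```python
import torch
import torch.nn.functional as F

def cross_modal_info_nce(point_feats, image_feats, temperature=0.07):
    """Symmetric InfoNCE between paired point-cloud and projected-image
    embeddings: matched pairs are positives, all other pairs are negatives."""
    p = F.normalize(point_feats, dim=-1)
    i = F.normalize(image_feats, dim=-1)
    logits = p @ i.t() / temperature             # (B, B) similarity matrix
    targets = torch.arange(p.size(0), device=p.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Toy usage with random embeddings standing in for encoder outputs.
loss = cross_modal_info_nce(torch.randn(16, 128), torch.randn(16, 128))
```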
- Intrinsic Image Decomposition Using Point Cloud Representation [13.771632868567277]
We introduce Point Intrinsic Net (PoInt-Net), which leverages 3D point cloud data to concurrently estimate albedo and shading maps.
PoInt-Net is efficient, achieving consistent performance across point clouds of any size while requiring training only on small-scale point clouds.
arXiv Detail & Related papers (2023-07-20T14:51:28Z)
- CLIP$^2$: Contrastive Language-Image-Point Pretraining from Real-World Point Cloud Data [80.42480679542697]
We propose Contrastive Language-Image-Point Cloud Pretraining (CLIP$2$) to learn the transferable 3D point cloud representation in realistic scenarios.
Specifically, we exploit naturally existing correspondences between 2D and 3D scenarios, and build well-aligned, instance-based text-image-point proxies from those complex scenarios.
arXiv Detail & Related papers (2023-03-22T09:32:45Z)
- PointVST: Self-Supervised Pre-training for 3D Point Clouds via View-Specific Point-to-Image Translation [64.858505571083]
This paper proposes a translative pre-training framework, namely PointVST.
It is driven by a novel self-supervised pretext task of cross-modal translation from 3D point clouds to their corresponding diverse forms of 2D rendered images.
arXiv Detail & Related papers (2022-12-29T07:03:29Z)
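
The PointVST entry above describes a pretext task of translating point clouds into view-specific 2D renderings. The toy sketch below shows one way such a point-to-image translation objective could look; the encoder, view conditioning, and image resolution are hypothetical choices for illustration, not PointVST's actual architecture.

```python
import torch
import torch.nn as nn

class PointToImage(nn.Module):
    """Toy translative pretext model: point cloud -> view-conditioned 2D image."""
    def __init__(self, num_views=6, code_dim=256, img_size=32):
        super().__init__()
        self.point_mlp = nn.Sequential(nn.Linear(3, 128), nn.ReLU(),
                                       nn.Linear(128, code_dim))
        self.view_embed = nn.Embedding(num_views, code_dim)
        self.decoder = nn.Sequential(nn.Linear(code_dim, 512), nn.ReLU(),
                                     nn.Linear(512, img_size * img_size))
        self.img_size = img_size

    def forward(self, points, view_id):
        # Per-point features, then max-pool to a global shape code.
        code = self.point_mlp(points).max(dim=1).values
        code = code + self.view_embed(view_id)
        return self.decoder(code).view(-1, 1, self.img_size, self.img_size)

# Toy training step: predict a 32x32 grayscale rendering of each input cloud.
model = PointToImage()
pts = torch.randn(4, 1024, 3)                  # batch of 4 clouds, 1024 points each
views = torch.randint(0, 6, (4,))
target = torch.rand(4, 1, 32, 32)              # pre-rendered views (placeholder data)
loss = nn.functional.mse_loss(model(pts, views), target)
loss.backward()
```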
- Autoencoders as Cross-Modal Teachers: Can Pretrained 2D Image Transformers Help 3D Representation Learning? [30.59796205121887]
We show that foundational Transformers pretrained on 2D images or natural language can help self-supervised 3D representation learning through training Autoencoders as Cross-Modal Teachers (ACT).
Our ACT pretrained 3D learner achieves state-of-the-art generalization capacity across various downstream benchmarks, e.g., 88.21% overall accuracy on ScanObjectNN.
arXiv Detail & Related papers (2022-12-16T07:46:53Z)
- Learning 3D Representations from 2D Pre-trained Models via Image-to-Point Masked Autoencoders [52.91248611338202]
We propose an alternative to obtain superior 3D representations from 2D pre-trained models via Image-to-Point Masked Autoencoders, named I2P-MAE.
Through self-supervised pre-training, we leverage well-learned 2D knowledge to guide 3D masked autoencoding.
I2P-MAE attains a state-of-the-art 90.11% accuracy, +3.68% over the second-best, demonstrating superior transfer capability.
arXiv Detail & Related papers (2022-12-13T17:59:20Z)
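
For the I2P-MAE entry above, which guides 3D masked autoencoding with 2D knowledge, here is a minimal sketch of one plausible formulation: masked point tokens are reconstructed while visible-token latents are regressed toward 2D features projected onto the same tokens. The masking ratio, token dimensions, and guidance loss are assumptions for illustration only, not the paper's design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedPointAutoencoder(nn.Module):
    """Toy masked autoencoder over point tokens (e.g. grouped point patches)."""
    def __init__(self, dim=256):
        super().__init__()
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), num_layers=2)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.decoder = nn.Linear(dim, dim)

    def forward(self, tokens, mask):
        # Replace masked tokens with a learnable mask token, then encode.
        x = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(tokens), tokens)
        latent = self.encoder(x)
        return latent, self.decoder(latent)

def i2p_style_loss(model, tokens, mask, feats_2d, beta=0.5):
    """Reconstruct masked tokens + align visible latents with projected 2D features."""
    latent, recon = model(tokens, mask)
    rec_loss = F.mse_loss(recon[mask], tokens[mask])            # 3D reconstruction
    guide_loss = F.mse_loss(latent[~mask], feats_2d[~mask])     # 2D guidance target
    return rec_loss + beta * guide_loss

# Toy usage: 2 clouds x 64 point tokens, roughly 60% of tokens masked.
model = MaskedPointAutoencoder()
toks = torch.randn(2, 64, 256)
msk = torch.rand(2, 64) < 0.6
f2d = torch.randn(2, 64, 256)                  # stand-in for projected 2D features
loss = i2p_style_loss(model, toks, msk, f2d)
loss.backward()
```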
- Neural Feature Fusion Fields: 3D Distillation of Self-Supervised 2D Image Representations [92.88108411154255]
We present a method that improves dense 2D image feature extractors when the latter are applied to the analysis of multiple images reconstructible as a 3D scene.
We show that our method not only enables semantic understanding in the context of scene-specific neural fields without the use of manual labels, but also consistently improves over the self-supervised 2D baselines.
arXiv Detail & Related papers (2022-09-07T23:24:09Z)
- CrossPoint: Self-Supervised Cross-Modal Contrastive Learning for 3D Point Cloud Understanding [2.8661021832561757]
CrossPoint is a simple cross-modal contrastive learning approach to learn transferable 3D point cloud representations.
Our approach outperforms the previous unsupervised learning methods on a diverse range of downstream tasks including 3D object classification and segmentation.
arXiv Detail & Related papers (2022-03-01T18:59:01Z)