Three Pillars improving Vision Foundation Model Distillation for Lidar
- URL: http://arxiv.org/abs/2310.17504v2
- Date: Mon, 19 Feb 2024 20:19:37 GMT
- Title: Three Pillars improving Vision Foundation Model Distillation for Lidar
- Authors: Gilles Puy, Spyros Gidaris, Alexandre Boulch, Oriane Siméoni, Corentin Sautier, Patrick Pérez, Andrei Bursuc, Renaud Marlet
- Abstract summary: We study the effect of three pillars for distillation: the 3D backbone, the pretrained 2D backbones, and the pretraining dataset.
Thanks to our scalable distillation method named ScaLR, we show that scaling the 2D and 3D backbones and pretraining on diverse datasets leads to a substantial improvement of the feature quality.
- Score: 61.56521056618988
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Self-supervised image backbones can be used to address complex 2D tasks
(e.g., semantic segmentation, object discovery) very efficiently and with
little or no downstream supervision. Ideally, 3D backbones for lidar should be
able to inherit these properties after distillation of these powerful 2D
features. The most recent methods for image-to-lidar distillation on autonomous
driving data show promising results, obtained thanks to distillation methods
that keep improving. Yet, we still notice a large performance gap when
measuring the quality of distilled and fully supervised features by linear
probing. In this work, instead of focusing only on the distillation method, we
study the effect of three pillars for distillation: the 3D backbone, the
pretrained 2D backbones, and the pretraining dataset. In particular, thanks to
our scalable distillation method named ScaLR, we show that scaling the 2D and
3D backbones and pretraining on diverse datasets leads to a substantial
improvement of the feature quality. This allows us to significantly reduce the
gap between the quality of distilled and fully-supervised 3D features, and to
improve the robustness of the pretrained backbones to domain gaps and
perturbations.
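To make the distillation setup concrete, the sketch below shows the generic recipe behind image-to-lidar feature distillation: lidar points are projected into a camera image, the frozen 2D features are sampled at the projected pixels, and the trainable 3D features are pulled toward them. This is a minimal illustration under assumed interfaces (the projection inputs, feature shapes, and cosine loss are assumptions), not the authors' ScaLR implementation.

```python
import torch
import torch.nn.functional as F

def image_to_lidar_distillation_loss(point_feats, image_feats, proj_uv):
    """Generic point-to-pixel distillation loss (illustrative sketch only).

    point_feats : (N, C) features of N lidar points from the trainable 3D
                  backbone, already mapped to the 2D feature dimension C
    image_feats : (C, H, W) frozen features from a pretrained 2D backbone
    proj_uv     : (N, 2) pixel coordinates of each lidar point projected into
                  the image, normalized to [-1, 1]; points falling outside the
                  image are assumed to be filtered out beforehand
    """
    # Sample the 2D feature map at the projected pixel locations.
    grid = proj_uv.view(1, 1, -1, 2)                              # (1, 1, N, 2)
    targets = F.grid_sample(image_feats.unsqueeze(0), grid,
                            align_corners=False)                  # (1, C, 1, N)
    targets = targets.squeeze(0).squeeze(1).t()                   # (N, C)

    # Pull each 3D point feature toward its (frozen) 2D target feature.
    return 1.0 - F.cosine_similarity(point_feats, targets.detach(), dim=-1).mean()
```

Feature quality is then typically measured by linear probing, i.e. training only a linear classifier on top of the frozen distilled point features for a downstream task such as semantic segmentation.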
Related papers
- Image-to-Lidar Relational Distillation for Autonomous Driving Data [4.893568782260855]
2D foundation models excel at addressing 2D tasks with little or no downstream supervision, owing to their robust representations.
The emergence of 2D-to-3D distillation frameworks has extended these capabilities to 3D models.
But distilling 3D representations for autonomous driving datasets presents challenges like self-similarity, class imbalance, and point cloud sparsity.
We propose a relational distillation framework enforcing intra-modal and cross-modal constraints, resulting in distilled 3D representations that closely capture the structure of the 2D representation (a generic sketch of such constraints follows this entry).
arXiv Detail & Related papers (2024-09-01T21:26:32Z)
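As a rough illustration of the relational idea in that entry, one common formulation matches the pairwise similarity structure of the two modalities in addition to aligning paired features. The sketch below is a generic formulation under assumed shapes and loss choices, not the constraints actually used in the paper.

```python
import torch
import torch.nn.functional as F

def relational_distillation_loss(feat_3d, feat_2d):
    """Generic relational (structure-matching) loss sketch.

    feat_3d : (N, C) distilled 3D point features
    feat_2d : (N, C) corresponding frozen 2D features
    """
    f3 = F.normalize(feat_3d, dim=-1)
    f2 = F.normalize(feat_2d, dim=-1)

    # Cross-modal term: each 3D feature should match its paired 2D feature.
    cross = 1.0 - (f3 * f2.detach()).sum(dim=-1).mean()

    # Intra-modal term: the pairwise similarity structure of the 3D features
    # should mirror that of the 2D features.
    sim_3d = f3 @ f3.t()                          # (N, N)
    sim_2d = f2 @ f2.t()                          # (N, N)
    intra = F.mse_loss(sim_3d, sim_2d.detach())

    return cross + intra
```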
- Improving 2D Feature Representations by 3D-Aware Fine-Tuning [17.01280751430423]
Current visual foundation models are trained purely on unstructured 2D data.
We show that fine-tuning on 3D-aware data improves the quality of emerging semantic features.
arXiv Detail & Related papers (2024-07-29T17:59:21Z)
- DeCoTR: Enhancing Depth Completion with 2D and 3D Attentions [41.55908366474901]
We introduce a novel approach that harnesses both 2D and 3D attentions to enable highly accurate depth completion.
We evaluate our method, DeCoTR, on established depth completion benchmarks.
arXiv Detail & Related papers (2024-03-18T19:22:55Z)
- 3D Point Cloud Pre-training with Knowledge Distillation from 2D Images [128.40422211090078]
We propose a knowledge distillation method for 3D point cloud pre-trained models to acquire knowledge directly from the 2D representation learning model.
Specifically, we introduce a cross-attention mechanism to extract concept features from the 3D point cloud and compare them with the semantic information from 2D images.
In this scheme, the point cloud pre-trained models learn directly from the rich information contained in 2D teacher models (a generic cross-attention sketch follows this entry).
arXiv Detail & Related papers (2022-12-17T23:21:04Z)
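As a rough, generic illustration of using cross-attention to pool point-cloud features into a small set of "concept" vectors, the sketch below uses learnable queries attending over point features; the query count, dimensions, and the way the output would be compared against 2D features are assumptions, not the paper's actual design.

```python
import torch
import torch.nn as nn

class ConceptCrossAttention(nn.Module):
    """Sketch: learnable queries cross-attend to 3D point features to produce
    a fixed set of concept vectors, which could then be compared (e.g. with a
    cosine loss) against semantic features from a 2D teacher."""

    def __init__(self, dim=256, num_concepts=16, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_concepts, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, point_feats):
        # point_feats: (B, N, dim) features from the 3D point-cloud backbone.
        q = self.queries.unsqueeze(0).expand(point_feats.size(0), -1, -1)
        concepts, _ = self.attn(q, point_feats, point_feats)  # (B, num_concepts, dim)
        return concepts
```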
- RiCS: A 2D Self-Occlusion Map for Harmonizing Volumetric Objects [68.85305626324694]
Ray-marching in Camera Space (RiCS) is a new method that represents the self-occlusions of 3D foreground objects as a 2D self-occlusion map.
We show that our representation map not only allows us to enhance the image quality but also to model temporally coherent complex shadow effects.
arXiv Detail & Related papers (2022-05-14T05:35:35Z)
- Homography Loss for Monocular 3D Object Detection [54.04870007473932]
A differentiable loss function, termed Homography Loss, is proposed that exploits both 2D and 3D information.
Our method outperforms the other state-of-the-art approaches by a large margin on the KITTI 3D datasets.
arXiv Detail & Related papers (2022-04-02T03:48:03Z)
- Synthetic Training for Monocular Human Mesh Recovery [100.38109761268639]
This paper aims to estimate the 3D mesh of multiple body parts with large-scale differences from a single RGB image.
The main challenge is the lack of training data with complete 3D annotations of all body parts in 2D images.
We propose a depth-to-scale (D2S) projection to incorporate the depth difference into the projection function to derive per-joint scale variants.
arXiv Detail & Related papers (2020-10-27T03:31:35Z)
- Exemplar Fine-Tuning for 3D Human Model Fitting Towards In-the-Wild 3D Human Pose Estimation [107.07047303858664]
Large-scale human datasets with 3D ground-truth annotations are difficult to obtain in the wild.
We address this problem by augmenting existing 2D datasets with high-quality 3D pose fits.
The resulting annotations are sufficient to train 3D pose regressor networks from scratch that outperform the current state-of-the-art on in-the-wild benchmarks.
arXiv Detail & Related papers (2020-04-07T20:21:18Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.