RangeViT: Towards Vision Transformers for 3D Semantic Segmentation in
Autonomous Driving
- URL: http://arxiv.org/abs/2301.10222v2
- Date: Tue, 25 Apr 2023 13:11:42 GMT
- Title: RangeViT: Towards Vision Transformers for 3D Semantic Segmentation in
Autonomous Driving
- Authors: Angelika Ando, Spyros Gidaris, Andrei Bursuc, Gilles Puy, Alexandre
Boulch, Renaud Marlet
- Abstract summary: Vision transformers (ViTs) have achieved state-of-the-art results in many image-based benchmarks.
ViTs are notoriously hard to train and require a lot of training data to learn powerful representations.
We show that our method, called RangeViT, outperforms existing projection-based methods on nuScenes and SemanticKITTI.
- Score: 80.14669385741202
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Casting semantic segmentation of outdoor LiDAR point clouds as a 2D problem,
e.g., via range projection, is an effective and popular approach. These
projection-based methods usually benefit from fast computations and, when
combined with techniques which use other point cloud representations, achieve
state-of-the-art results. Today, projection-based methods leverage 2D CNNs but
recent advances in computer vision show that vision transformers (ViTs) have
achieved state-of-the-art results in many image-based benchmarks. In this work,
we question if projection-based methods for 3D semantic segmentation can
benefit from these latest improvements on ViTs. We answer positively but only
after combining them with three key ingredients: (a) ViTs are notoriously hard
to train and require a lot of training data to learn powerful representations.
By preserving the same backbone architecture as for RGB images, we can exploit
the knowledge from long training on large image collections that are much
cheaper to acquire and annotate than point clouds. We reach our best results
with pre-trained ViTs on large image datasets. (b) We compensate ViTs' lack of
inductive bias by substituting a tailored convolutional stem for the classical
linear embedding layer. (c) We refine pixel-wise predictions with a
convolutional decoder and a skip connection from the convolutional stem to
combine low-level but fine-grained features of the convolutional stem with
the high-level but coarse predictions of the ViT encoder. With these
ingredients, we show that our method, called RangeViT, outperforms existing
projection-based methods on nuScenes and SemanticKITTI. The code is available
at https://github.com/valeoai/rangevit.
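The three ingredients above (a pre-trainable ViT encoder, a convolutional stem in place of the linear patch embedding, and a convolutional decoder with a skip connection from the stem) can be pictured with a minimal PyTorch sketch. All layer sizes, the patch shape, the channel counts and the class count below are illustrative assumptions rather than the authors' configuration; the official implementation is at https://github.com/valeoai/rangevit.
```python
# Minimal sketch of a RangeViT-style pipeline: conv stem -> ViT encoder -> conv
# decoder with a skip connection from the stem. Hyperparameters are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ConvStem(nn.Module):
    """Convolutional stem: adds local inductive bias and produces patch tokens."""
    def __init__(self, in_ch=5, embed_dim=384, patch=(2, 8)):
        super().__init__()
        self.features = nn.Sequential(              # fine-grained, low-level features
            nn.Conv2d(in_ch, 64, 3, padding=1), nn.BatchNorm2d(64), nn.GELU(),
            nn.Conv2d(64, 128, 3, padding=1), nn.BatchNorm2d(128), nn.GELU(),
        )
        # Strided projection turns the feature map into ViT tokens (one per patch).
        self.to_tokens = nn.Conv2d(128, embed_dim, kernel_size=patch, stride=patch)

    def forward(self, x):
        feats = self.features(x)                    # (B, 128, H, W), kept for the skip
        tokens = self.to_tokens(feats)              # (B, D, H/ph, W/pw)
        return feats, tokens


class RangeViTSketch(nn.Module):
    def __init__(self, num_classes=17, in_ch=5, embed_dim=384, depth=6, heads=6):
        super().__init__()
        self.stem = ConvStem(in_ch, embed_dim)
        layer = nn.TransformerEncoderLayer(embed_dim, heads, 4 * embed_dim,
                                           batch_first=True, norm_first=True)
        # In the paper this encoder would be initialized from a ViT pre-trained on
        # large image collections; here it is randomly initialized for brevity, and
        # positional embeddings are omitted.
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.decode = nn.Sequential(                # convolutional decoder head
            nn.Conv2d(embed_dim + 128, 128, 3, padding=1),
            nn.BatchNorm2d(128), nn.GELU(),
            nn.Conv2d(128, num_classes, 1),
        )

    def forward(self, range_img):
        # range_img: projected LiDAR scan, e.g. channels (range, x, y, z, intensity).
        feats, tokens = self.stem(range_img)
        B, D, h, w = tokens.shape
        x = self.encoder(tokens.flatten(2).transpose(1, 2))      # (B, h*w, D)
        coarse = x.transpose(1, 2).reshape(B, D, h, w)
        # Upsample coarse ViT features and fuse with fine stem features (skip).
        up = F.interpolate(coarse, size=feats.shape[-2:], mode="bilinear",
                           align_corners=False)
        return self.decode(torch.cat([up, feats], dim=1))        # per-pixel logits


if __name__ == "__main__":
    logits = RangeViTSketch()(torch.randn(2, 5, 32, 384))        # toy range image
    print(logits.shape)                                          # (2, 17, 32, 384)
```
In this sketch the encoder is randomly initialized; ingredient (a) corresponds to loading it from a ViT pre-trained on large image datasets instead.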
Related papers
- Heuristical Comparison of Vision Transformers Against Convolutional Neural Networks for Semantic Segmentation on Remote Sensing Imagery [0.0]
Vision Transformers (ViT) have recently brought a new wave of research in the field of computer vision.
This paper focuses on the comparison of three key factors of using (or not using) ViT for semantic segmentation of remote sensing aerial images on the iSAID benchmark.
arXiv Detail & Related papers (2024-11-14T00:18:04Z)
- SIGMA: Sinkhorn-Guided Masked Video Modeling [69.31715194419091]
Sinkhorn-guided Masked Video Modeling (SIGMA) is a novel video pretraining method.
We distribute features of space-time tubes evenly across a limited number of learnable clusters.
Experimental results on ten datasets validate the effectiveness of SIGMA in learning more performant, temporally-aware, and robust video representations.
arXiv Detail & Related papers (2024-07-22T08:04:09Z)
- Triplane Meets Gaussian Splatting: Fast and Generalizable Single-View 3D Reconstruction with Transformers [37.14235383028582]
We introduce a novel approach for single-view reconstruction that efficiently generates a 3D model from a single image via feed-forward inference.
Our method utilizes two transformer-based networks, namely a point decoder and a triplane decoder, to reconstruct 3D objects using a hybrid Triplane-Gaussian intermediate representation.
arXiv Detail & Related papers (2023-12-14T17:18:34Z)
- A Strong Transfer Baseline for RGB-D Fusion in Vision Transformers [0.0]
We propose a recipe for transferring pretrained ViTs in RGB-D domains for single-view 3D object recognition.
We show that our adapted ViTs score up to 95.1% top-1 accuracy on the Washington RGB-D benchmark, achieving new state-of-the-art results.
arXiv Detail & Related papers (2022-10-03T12:08:09Z)
- Long-Short Temporal Contrastive Learning of Video Transformers [62.71874976426988]
Self-supervised pretraining of video transformers on video-only datasets can lead to action recognition results on par with or better than those obtained with supervised pretraining on large-scale image datasets.
Our approach, named Long-Short Temporal Contrastive Learning, enables video transformers to learn an effective clip-level representation by predicting temporal context captured from a longer temporal extent.
arXiv Detail & Related papers (2021-06-17T02:30:26Z)
- Intriguing Properties of Vision Transformers [114.28522466830374]
Vision transformers (ViT) have demonstrated impressive performance across various machine vision problems.
We systematically study the properties behind this success via an extensive set of experiments and comparisons with a high-performing convolutional neural network (CNN).
We show that the effective features of ViTs are due to flexible and dynamic receptive fields made possible by the self-attention mechanism.
arXiv Detail & Related papers (2021-05-21T17:59:18Z)
- DeepViT: Towards Deeper Vision Transformer [92.04063170357426]
Vision transformers (ViTs) have been successfully applied in image classification tasks recently.
We show that, unlike convolutional neural networks (CNNs), which can be improved by stacking more convolutional layers, the performance of ViTs saturates quickly when they are scaled deeper.
We propose a simple yet effective method, named Re-attention, to re-generate the attention maps to increase their diversity.
arXiv Detail & Related papers (2021-03-22T14:32:07Z)
- Lightweight Multi-View 3D Pose Estimation through Camera-Disentangled Representation [57.11299763566534]
We present a solution to recover 3D pose from multi-view images captured with spatially calibrated cameras.
We exploit 3D geometry to fuse input images into a unified latent representation of pose, which is disentangled from camera view-points.
Our architecture then conditions the learned representation on camera projection operators to produce accurate per-view 2D detections.
arXiv Detail & Related papers (2020-04-05T12:52:29Z)