Label-Efficient LiDAR Semantic Segmentation with 2D-3D Vision Transformer Adapters
- URL: http://arxiv.org/abs/2503.03299v1
- Date: Wed, 05 Mar 2025 09:30:49 GMT
- Title: Label-Efficient LiDAR Semantic Segmentation with 2D-3D Vision Transformer Adapters
- Authors: Julia Hindel, Rohit Mohan, Jelena Bratulic, Daniele Cattaneo, Thomas Brox, Abhinav Valada
- Abstract summary: BALViT is a novel approach that leverages frozen vision models as amodal feature encoders for learning strong LiDAR encoders. We make the code and models publicly available at: http://balvit.cs.uni-freiburg.de.
- Score: 32.21090169762889
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: LiDAR semantic segmentation models are typically trained from random initialization as universal pre-training is hindered by the lack of large, diverse datasets. Moreover, most point cloud segmentation architectures incorporate custom network layers, limiting the transferability of advances from vision-based architectures. Inspired by recent advances in universal foundation models, we propose BALViT, a novel approach that leverages frozen vision models as amodal feature encoders for learning strong LiDAR encoders. Specifically, BALViT incorporates both range-view and bird's-eye-view LiDAR encoding mechanisms, which we combine through a novel 2D-3D adapter. While the range-view features are processed through a frozen image backbone, our bird's-eye-view branch enhances them through multiple cross-attention interactions. Thereby, we continuously improve the vision network with domain-dependent knowledge, resulting in a strong label-efficient LiDAR encoding mechanism. Extensive evaluations of BALViT on the SemanticKITTI and nuScenes benchmarks demonstrate that it outperforms state-of-the-art methods on small data regimes. We make the code and models publicly available at: http://balvit.cs.uni-freiburg.de.
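To make the adapter idea concrete, here is a minimal sketch of how such a 2D-3D cross-attention adapter might look: learnable bird's-eye-view tokens query the features produced by the frozen range-view image backbone. All module names, shapes, and hyperparameters are illustrative assumptions, not the authors' implementation.

```python
import torch.nn as nn

class CrossAttentionAdapter(nn.Module):
    """Hypothetical adapter: BEV tokens attend to frozen range-view features."""
    def __init__(self, dim=768, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)

    def forward(self, bev_tokens, range_tokens):
        # bev_tokens:   (B, N_bev, C) trainable 3D branch
        # range_tokens: (B, N_rv,  C) output of the frozen image backbone
        q = self.norm_q(bev_tokens)
        kv = self.norm_kv(range_tokens)
        attended, _ = self.attn(q, kv, kv, need_weights=False)
        return bev_tokens + attended  # residual update of the BEV stream
```

Because only the adapter and the BEV branch receive gradients while the image backbone stays frozen, the number of trainable parameters stays small, which is consistent with the label-efficiency argument above.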
Related papers
- Mapping and Localization Using LiDAR Fiducial Markers [0.8702432681310401]
This dissertation proposes a novel framework for mapping and localization using LiDAR fiducial markers.
An Intensity Image-based LiDAR Fiducial Marker (IFM) system is introduced, using thin, letter-sized markers compatible with visual fiducial markers.
A new LFM-based mapping and localization method registers unordered, low-overlap point clouds.
arXiv Detail & Related papers (2025-02-05T17:33:59Z)
- LargeAD: Large-Scale Cross-Sensor Data Pretraining for Autonomous Driving [52.83707400688378]
LargeAD is a versatile and scalable framework designed for large-scale 3D pretraining across diverse real-world driving datasets.
Our framework leverages VFMs to extract semantically rich superpixels from 2D images, which are aligned with LiDAR point clouds to generate high-quality contrastive samples (a sketch of this objective follows this entry).
Our approach delivers significant performance improvements over state-of-the-art methods in both linear probing and fine-tuning, for LiDAR-based segmentation as well as object detection.
arXiv Detail & Related papers (2025-01-07T18:59:59Z)
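As a rough illustration of the superpixel-to-point alignment LargeAD describes, the sketch below pairs each 2D superpixel embedding with the pooled embedding of its LiDAR points under an InfoNCE objective; the function name, shapes, and temperature are assumptions rather than details from the paper.

```python
import torch
import torch.nn.functional as F

def superpixel_point_infonce(sp_feats, pt_feats, temperature=0.07):
    # sp_feats: (N, C) superpixel embeddings from a vision foundation model
    # pt_feats: (N, C) LiDAR point embeddings pooled inside each superpixel
    sp = F.normalize(sp_feats, dim=1)
    pt = F.normalize(pt_feats, dim=1)
    logits = sp @ pt.t() / temperature           # (N, N) cosine similarities
    targets = torch.arange(sp.size(0), device=sp.device)
    return F.cross_entropy(logits, targets)      # matched pairs on the diagonal
```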
- Multimodal Autoregressive Pre-training of Large Vision Encoders [85.39154488397931]
We present AIMV2, a family of generalist vision encoders characterized by a straightforward pre-training process.
Our encoders excel not only in multimodal evaluations but also in vision benchmarks such as localization, grounding, and classification.
arXiv Detail & Related papers (2024-11-21T18:31:25Z)
- Learning Shared RGB-D Fields: Unified Self-supervised Pre-training for Label-efficient LiDAR-Camera 3D Perception [17.11366229887873]
We introduce a unified pre-training strategy, NeRF-Supervised Masked AutoEncoder (NS-MAE).
NS-MAE exploits NeRF's ability to encode both appearance and geometry, enabling efficient masked reconstruction of multi-modal data.
Results: NS-MAE outperforms prior SOTA pre-training methods that employ separate strategies for each modality (its masking step is sketched after this entry).
arXiv Detail & Related papers (2024-05-28T08:13:49Z)
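NS-MAE's masked reconstruction presupposes a masking step like the one sketched below: a random subset of tokens is kept visible and encoded, and the rest are reconstructed. The NeRF-supervised rendering targets are beyond this sketch; the mask ratio and names are assumptions.

```python
import torch

def random_masking(tokens, mask_ratio=0.75):
    # tokens: (B, N, C) patch or voxel embeddings; keep a random subset.
    B, N, C = tokens.shape
    n_keep = max(1, int(N * (1 - mask_ratio)))
    noise = torch.rand(B, N, device=tokens.device)
    keep_idx = noise.argsort(dim=1)[:, :n_keep]   # random visible indices
    visible = torch.gather(tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, C))
    return visible, keep_idx  # encode `visible`; reconstruct the rest
```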
- Weak-to-Strong 3D Object Detection with X-Ray Distillation [75.47580744933724]
We propose a versatile technique that seamlessly integrates into any existing framework for 3D Object Detection.
X-Ray Distillation with Object-Complete Frames is suitable for both supervised and semi-supervised settings.
Our proposed methods surpass the state of the art in semi-supervised learning by 1-1.5 mAP.
arXiv Detail & Related papers (2024-03-31T13:09:06Z)
- Leveraging Large-Scale Pretrained Vision Foundation Models for Label-Efficient 3D Point Cloud Segmentation [67.07112533415116]
We present a novel framework that adapts various foundational models for the 3D point cloud segmentation task.
Our approach involves making initial predictions of 2D semantic masks using different large vision models.
To generate robust 3D semantic pseudo labels, we introduce a semantic label fusion strategy that effectively combines all the results via voting (sketched after this entry).
arXiv Detail & Related papers (2023-11-03T15:41:15Z)
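The voting-based fusion above reduces to a per-point majority vote over the class ids that the different 2D models assign to each projected point; a minimal sketch, assuming integer labels have already been lifted to the points:

```python
import numpy as np

def fuse_pseudo_labels(per_model_labels, num_classes):
    # per_model_labels: (M, N) class ids predicted for N points by M models.
    N = per_model_labels.shape[1]
    votes = np.zeros((num_classes, N), dtype=np.int64)
    for labels in per_model_labels:        # accumulate one model's votes
        votes[labels, np.arange(N)] += 1
    return votes.argmax(axis=0)            # per-point majority class
```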
"Many-to-one" mapping, semantic incoherence, and shape deformation are possible impediments against effective learning from range view projections.
We present RangeFormer, a full-cycle framework comprising novel designs across network architecture, data augmentation, and post-processing.
We show that, for the first time, a range-view method is able to surpass its point, voxel, and multi-view fusion counterparts on competitive LiDAR semantic and panoptic segmentation benchmarks.
arXiv Detail & Related papers (2023-03-09T16:13:27Z)
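For context, the sketch below shows the standard spherical range-view projection that range-view methods build on; because several 3D points can land in the same pixel, it exhibits exactly the "many-to-one" mapping discussed above. The image size and field-of-view bounds are typical 64-beam values, not taken from the paper.

```python
import numpy as np

def range_projection(points, H=64, W=2048, fov_up=3.0, fov_down=-25.0):
    # points: (N, 3) xyz coordinates in the sensor frame.
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    depth = np.linalg.norm(points, axis=1)
    yaw = np.arctan2(y, x)                              # azimuth in [-pi, pi]
    pitch = np.arcsin(z / np.clip(depth, 1e-8, None))
    up, down = np.radians(fov_up), np.radians(fov_down)
    u = np.clip((0.5 * (1.0 - yaw / np.pi) * W).astype(np.int32), 0, W - 1)
    v = np.clip(((up - pitch) / (up - down) * H).astype(np.int32), 0, H - 1)
    image = np.full((H, W), -1.0, dtype=np.float32)
    order = depth.argsort()[::-1]             # write far points first so the
    image[v[order], u[order]] = depth[order]  # closest point wins each pixel
    return image
```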
- MAELi: Masked Autoencoder for Large-Scale LiDAR Point Clouds [13.426810473131642]
Masked AutoEncoder for LiDAR point clouds (MAELi) intuitively leverages the sparsity of LiDAR point clouds in both the encoder and decoder during reconstruction.
In a novel reconstruction approach, MAELi distinguishes between empty and occluded space.
In this way, without any ground truth and trained on single frames only, MAELi obtains an understanding of the underlying 3D scene geometry and semantics (the empty/occluded distinction is sketched after this entry).
arXiv Detail & Related papers (2022-12-14T13:10:27Z)
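The empty-versus-occluded distinction can be pictured per LiDAR ray: space the ray traversed before its return is known to be empty, whereas space behind the return is occluded and therefore unknown. The sketch below is a conceptual simplification, not MAELi's actual reconstruction target.

```python
import numpy as np

EMPTY, OCCUPIED, OCCLUDED = 0, 1, 2

def classify_samples_along_ray(endpoint, step=0.5, margin=2.0):
    # endpoint: (3,) LiDAR return; the ray starts at the sensor origin.
    dist = np.linalg.norm(endpoint)
    direction = endpoint / dist
    samples = []
    for t in np.arange(step, dist + margin, step):
        if t < dist - step:
            samples.append((direction * t, EMPTY))     # ray passed through
        elif t <= dist + step:
            samples.append((direction * t, OCCUPIED))  # at the return
        else:
            samples.append((direction * t, OCCLUDED))  # behind the surface
    return samples
```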
- Image Understands Point Cloud: Weakly Supervised 3D Semantic Segmentation via Association Learning [59.64695628433855]
We propose a novel cross-modality weakly supervised method for 3D segmentation, incorporating complementary information from unlabeled images.
We design a dual-branch network equipped with an active labeling strategy to make the most of a tiny fraction of labels.
Our method even outperforms state-of-the-art fully supervised competitors with less than 1% actively selected annotations (the point-to-pixel association it relies on is sketched after this entry).
arXiv Detail & Related papers (2022-09-16T07:59:04Z)
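Cross-modality weak supervision of this kind rests on associating LiDAR points with image pixels so that 2D predictions can supervise the 3D branch. A minimal projection sketch follows; the intrinsics K and extrinsics T are placeholder inputs, not values from the paper.

```python
import numpy as np

def project_points_to_image(points, K, T, H, W):
    # points: (N, 3) in the LiDAR frame; T: (4, 4) LiDAR-to-camera; K: (3, 3).
    homo = np.hstack([points, np.ones((len(points), 1))])
    cam = (T @ homo.T).T[:, :3]              # points in the camera frame
    in_front = cam[:, 2] > 0.1               # keep points ahead of the camera
    uv = (K @ cam[in_front].T).T
    uv = uv[:, :2] / uv[:, 2:3]              # perspective division
    valid = (uv[:, 0] >= 0) & (uv[:, 0] < W) & (uv[:, 1] >= 0) & (uv[:, 1] < H)
    return uv[valid].astype(np.int32), np.flatnonzero(in_front)[valid]
```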
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences of its use.