Faster Training, Fewer Labels: Self-Supervised Pretraining for Fine-Grained BEV Segmentation
- URL: http://arxiv.org/abs/2602.18066v1
- Date: Fri, 20 Feb 2026 08:37:58 GMT
- Title: Faster Training, Fewer Labels: Self-Supervised Pretraining for Fine-Grained BEV Segmentation
- Authors: Daniel Busch, Christian Bohn, Thomas Kurbiel, Klaus Friedrichs, Richard Meyes, Tobias Meisen
- Abstract summary: We propose a two-phase training strategy for fine-grained road marking segmentation. During the self-supervised pretraining, BEVFormer predictions are differentiably reprojected into the image plane. The subsequent supervised fine-tuning phase requires only 50% of the dataset and significantly less training time.
- Score: 6.399280002773129
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Dense Bird's Eye View (BEV) semantic maps are central to autonomous driving, yet current multi-camera methods depend on costly, inconsistently annotated BEV ground truth. We address this limitation with a two-phase training strategy for fine-grained road marking segmentation that removes full supervision during pretraining and halves the amount of training data during fine-tuning while still outperforming the comparable supervised baseline model. During the self-supervised pretraining, BEVFormer predictions are differentiably reprojected into the image plane and trained against multi-view semantic pseudo-labels generated by the widely used semantic segmentation model Mask2Former. A temporal loss encourages consistency across frames. The subsequent supervised fine-tuning phase requires only 50% of the dataset and significantly less training time. With our method, fine-tuning benefits from rich priors learned during pretraining, boosting performance and BEV segmentation quality (up to +2.5 pp mIoU over the fully supervised baseline) on nuScenes. It simultaneously halves the amount of annotated data used and reduces total training time by up to two thirds. The results demonstrate that differentiable reprojection plus camera-perspective pseudo-labels yields transferable BEV features and a scalable path toward reduced-label autonomous perception.
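For illustration, the sketch below shows how the pretraining objective described in the abstract could look in code: BEV class logits are sampled at the ground-plane intersections of camera rays (a differentiable reprojection) and compared against camera-view pseudo-labels, with a simple temporal term for cross-frame consistency. The flat-ground assumption, tensor layouts, function names, and the exact form of the temporal loss are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the pretraining losses described in the abstract.
# Assumptions (not from the paper): a flat ground plane at z = 0, an
# ego-from-camera extrinsic matrix, and a plain MSE temporal term.
import torch
import torch.nn.functional as F

def reprojection_loss(bev_logits, pseudo_labels, intrinsics, ego_from_cam,
                      bev_range=(-50.0, 50.0)):
    """bev_logits: (B, C, Hb, Wb) class logits on the BEV grid.
    pseudo_labels: (B, Hi, Wi) integer maps from a 2D segmenter such as Mask2Former.
    intrinsics: (B, 3, 3); ego_from_cam: (B, 4, 4) camera-to-ego transform."""
    B, C, Hb, Wb = bev_logits.shape
    _, Hi, Wi = pseudo_labels.shape
    device = bev_logits.device

    # Back-project every image pixel into a viewing ray in camera coordinates.
    v, u = torch.meshgrid(torch.arange(Hi, device=device, dtype=torch.float32),
                          torch.arange(Wi, device=device, dtype=torch.float32),
                          indexing="ij")
    pix = torch.stack([u, v, torch.ones_like(u)], dim=-1).reshape(1, -1, 3)
    rays_cam = pix @ torch.inverse(intrinsics).transpose(1, 2)              # (B, N, 3)

    # Rotate rays into the ego frame and intersect them with the ground plane z = 0.
    R, t = ego_from_cam[:, :3, :3], ego_from_cam[:, :3, 3]
    rays_ego = rays_cam @ R.transpose(1, 2)
    denom = rays_ego[..., 2:3]
    denom = torch.where(denom.abs() < 1e-6, torch.full_like(denom, -1e-6), denom)
    scale = -t[:, None, 2:3] / denom                                        # distance along the ray
    ground_xy = t[:, None, :2] + scale * rays_ego[..., :2]                  # (B, N, 2) ego x/y

    # Normalise ground coordinates to [-1, 1] and sample BEV logits differentiably.
    lo, hi = bev_range
    grid = (((ground_xy - lo) / (hi - lo)) * 2.0 - 1.0).reshape(B, Hi, Wi, 2)
    reprojected = F.grid_sample(bev_logits, grid, align_corners=False)      # (B, C, Hi, Wi)

    # Ignore pixels above the horizon or falling outside the BEV range.
    valid = (scale.reshape(B, Hi, Wi) > 0) & (grid.abs().amax(dim=-1) <= 1.0)
    labels = pseudo_labels.clone()
    labels[~valid] = -100
    return F.cross_entropy(reprojected, labels, ignore_index=-100)

def temporal_loss(bev_logits_t, bev_logits_t1):
    # Illustrative consistency term between consecutive BEV predictions; the paper's
    # temporal loss may additionally account for ego-motion between frames.
    return F.mse_loss(bev_logits_t.softmax(dim=1), bev_logits_t1.softmax(dim=1))
```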
Related papers
- RESAR-BEV: An Explainable Progressive Residual Autoregressive Approach for Camera-Radar Fusion in BEV Segmentation [4.043972974168962]
Bird's-Eye-View (BEV) semantic segmentation provides comprehensive environmental perception for autonomous driving. We propose RESAR-BEV, a progressive refinement framework that advances beyond single-step end-to-end approaches. Experiments on nuScenes demonstrate RESAR-BEV's state-of-the-art performance with 54.0% mIoU across 7 essential driving-scene categories.
arXiv Detail & Related papers (2025-05-10T05:10:07Z) - Should VLMs be Pre-trained with Image Data? [54.50406730361859]
We find that pre-training with a mixture of image and text data allows models to perform better on vision-language tasks. On an average of 6 diverse tasks, we find that for a 1B model, introducing visual tokens 80% of the way through pre-training results in a 2% average improvement over introducing visual tokens to a fully pre-trained model.
arXiv Detail & Related papers (2025-03-10T17:58:19Z) - Unified Human Localization and Trajectory Prediction with Monocular Vision [64.19384064365431]
MonoTransmotion is a Transformer-based framework that uses only a monocular camera to jointly solve localization and prediction tasks. We show that by jointly training both tasks with our unified framework, our method is more robust in real-world scenarios with noisy inputs.
arXiv Detail & Related papers (2025-03-05T14:18:39Z) - RendBEV: Semantic Novel View Synthesis for Self-Supervised Bird's Eye View Segmentation [9.72227798086777]
We present RendBEV, a new method for the self-supervised training of Bird's Eye View semantic segmentation networks. Our method enables zero-shot BEV semantic segmentation, and already delivers competitive results.
arXiv Detail & Related papers (2025-02-20T18:11:44Z) - LetsMap: Unsupervised Representation Learning for Semantic BEV Mapping [23.366388601110913]
We propose the first unsupervised representation learning approach to generate semantic BEV maps from a monocular frontal view (FV) image in a label-efficient manner.
Our approach pretrains the network to independently reason about scene geometry and scene semantics using two disjoint neural pathways in an unsupervised manner.
We achieve label-free pretraining by exploiting spatial and temporal consistency of FV images to learn scene geometry while relying on a novel temporal masked autoencoder formulation to encode the scene representation.
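As a rough illustration of the temporal masked autoencoding idea mentioned above, the toy module below reconstructs masked patches of one frame from its visible patches plus a neighbouring frame. The architecture, patch size, and masking ratio are placeholder choices, not the LetsMap design.

```python
# Toy temporal masked autoencoder (illustrative only; not the LetsMap architecture).
import torch
import torch.nn as nn
import torch.nn.functional as F

def patchify(img, p=16):
    # (B, 3, H, W) -> (B, N, 3*p*p) non-overlapping patches in raster order
    B, C, H, W = img.shape
    x = img.reshape(B, C, H // p, p, W // p, p)
    return x.permute(0, 2, 4, 1, 3, 5).reshape(B, (H // p) * (W // p), C * p * p)

class TinyTemporalMAE(nn.Module):
    """Masked patches of frame t are predicted from the visible patches of frame t
    plus all patches of frame t+1; masked positions are queried through their
    positional embeddings."""
    def __init__(self, patch_dim, num_patches, width=256):
        super().__init__()
        self.embed = nn.Linear(patch_dim, width)
        self.pos = nn.Parameter(torch.randn(1, num_patches, width) * 0.02)
        layer = nn.TransformerEncoderLayer(width, nhead=4, batch_first=True)
        self.mix = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(width, patch_dim)

    def forward(self, frame_t, frame_t1, mask_ratio=0.75, p=16):
        tok_t, tok_t1 = patchify(frame_t, p), patchify(frame_t1, p)
        B, N, D = tok_t.shape
        keep = int(N * (1 - mask_ratio))
        order = torch.rand(B, N, device=tok_t.device).argsort(dim=1)
        vis_idx, msk_idx = order[:, :keep], order[:, keep:]
        W = self.pos.size(-1)
        x_t = self.embed(tok_t) + self.pos
        x_t1 = self.embed(tok_t1) + self.pos
        visible = torch.gather(x_t, 1, vis_idx.unsqueeze(-1).expand(-1, -1, W))
        queries = torch.gather(self.pos.expand(B, -1, -1), 1,
                               msk_idx.unsqueeze(-1).expand(-1, -1, W))
        mixed = self.mix(torch.cat([visible, x_t1, queries], dim=1))
        pred = self.head(mixed[:, -(N - keep):])               # predictions at query slots
        target = torch.gather(tok_t, 1, msk_idx.unsqueeze(-1).expand(-1, -1, D))
        return F.mse_loss(pred, target)
```

For a pair of consecutive frames of size H x W, `TinyTemporalMAE(patch_dim=3*16*16, num_patches=(H//16)*(W//16))(frame_t, frame_t1)` returns the reconstruction loss.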
arXiv Detail & Related papers (2024-05-29T08:03:36Z) - U-BEV: Height-aware Bird's-Eye-View Segmentation and Neural Map-based Relocalization [81.76044207714637]
Relocalization is essential for intelligent vehicles when GPS reception is insufficient or sensor-based localization fails. Recent advances in Bird's-Eye-View (BEV) segmentation allow for accurate estimation of local scene appearance. This paper presents U-BEV, a U-Net inspired architecture that extends the current state of the art by allowing the BEV to reason about the scene on multiple height layers before flattening the BEV features.
arXiv Detail & Related papers (2023-10-20T18:57:38Z) - Semi-Supervised Learning for Visual Bird's Eye View Semantic Segmentation [16.3996408206659]
We present a novel semi-supervised framework for visual BEV semantic segmentation that boosts performance by exploiting unlabeled images during training.
A consistency loss that makes full use of unlabeled data is then proposed to constrain the model not only on the semantic prediction but also on the BEV feature.
Experiments on the nuScenes and Argoverse datasets show that our framework can effectively improve prediction accuracy.
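As a sketch of the kind of consistency objective this summary refers to, the snippet below penalises disagreement between a student and a fixed teacher (e.g. an EMA model or a weakly augmented view) at both the BEV-feature and the semantic-prediction level on unlabeled images. The teacher/student split and the loss weights are assumptions, not the paper's exact formulation.

```python
# Hedged sketch of a semi-supervised consistency loss on unlabeled images;
# teacher targets are detached and the weights are illustrative.
import torch
import torch.nn.functional as F

def bev_consistency_loss(student_feat, student_logits, teacher_feat, teacher_logits,
                         w_feat=1.0, w_sem=1.0):
    """All tensors are (B, C, H, W); the teacher outputs act as fixed targets."""
    feat_term = F.mse_loss(student_feat, teacher_feat.detach())
    sem_term = F.kl_div(F.log_softmax(student_logits, dim=1),
                        F.softmax(teacher_logits.detach(), dim=1),
                        reduction="batchmean")
    return w_feat * feat_term + w_sem * sem_term
```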
arXiv Detail & Related papers (2023-08-28T12:23:36Z) - SkyEye: Self-Supervised Bird's-Eye-View Semantic Mapping Using Monocular Frontal View Images [26.34702432184092]
We propose the first self-supervised approach for generating a Bird's-Eye-View (BEV) semantic map using a single monocular image from the frontal view (FV).
In training, we overcome the need for BEV ground truth annotations by leveraging the more easily available FV semantic annotations of video sequences.
Our approach performs on par with the state-of-the-art fully supervised methods and achieves competitive results using only 1% of direct supervision in the BEV.
arXiv Detail & Related papers (2023-02-08T18:02:09Z) - Jigsaw Clustering for Unsupervised Visual Representation Learning [68.09280490213399]
We propose a new jigsaw clustering pretext task in this paper.
Our method makes use of information from both intra- and inter-images.
It is even comparable to contrastive learning methods when only half of the training batches are used.
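The helper below sketches how such a jigsaw pretext batch could be built: patches from different images in a batch are shuffled into montage images, and each patch's source image and original position become free labels for clustering and location heads. The grid size and the montage assembly are illustrative assumptions, not the paper's exact construction.

```python
# Hedged sketch of jigsaw-style batch construction for a pretext task.
import torch

def make_jigsaw_batch(images, grid=2):
    """images: (B, 3, H, W) with H, W divisible by `grid`.
    Returns montage images of the original size plus per-patch source/position labels."""
    B, C, H, W = images.shape
    ph, pw = H // grid, W // grid
    # Split every image into grid*grid patches in raster order.
    patches = images.reshape(B, C, grid, ph, grid, pw).permute(0, 2, 4, 1, 3, 5)
    patches = patches.reshape(B * grid * grid, C, ph, pw)
    src = torch.arange(B).repeat_interleave(grid * grid)      # which image a patch came from
    pos = torch.arange(grid * grid).repeat(B)                  # where it sat in that image
    perm = torch.randperm(B * grid * grid)
    patches, src, pos = patches[perm], src[perm], pos[perm]
    # Reassemble the shuffled patches into B montage images.
    montages = patches.reshape(B, grid, grid, C, ph, pw).permute(0, 3, 1, 4, 2, 5)
    montages = montages.reshape(B, C, H, W)
    return montages, src, pos
```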
arXiv Detail & Related papers (2021-04-01T08:09:26Z) - Two-phase Pseudo Label Densification for Self-training based Domain Adaptation [93.03265290594278]
We propose a novel Two-phase Pseudo Label Densification framework, referred to as TPLD.
In the first phase, we use sliding window voting to propagate the confident predictions, utilizing intrinsic spatial-correlations in the images.
In the second phase, we perform a confidence-based easy-hard classification.
To ease the training process and avoid noisy predictions, we introduce the bootstrapping mechanism to the original self-training loss.
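A minimal sketch of the first-phase idea, sliding-window voting, is given below: low-confidence pixels inherit the class voted by confident neighbours within a local window. The window size, threshold, and averaging scheme are illustrative assumptions rather than the exact TPLD procedure.

```python
# Hedged sketch of sliding-window voting for pseudo-label densification.
import torch
import torch.nn.functional as F

def sliding_window_vote(probs, conf_thresh=0.9, window=7):
    """probs: (B, C, H, W) softmax outputs. Returns densified pseudo-labels (B, H, W)
    with -100 where neither the pixel nor its window provides a confident vote."""
    conf, hard = probs.max(dim=1)                               # (B, H, W)
    confident = (conf >= conf_thresh).unsqueeze(1).float()
    # Accumulate per-class votes from confident pixels inside the window.
    votes = F.avg_pool2d(probs * confident, window, stride=1, padding=window // 2)
    support = F.avg_pool2d(confident, window, stride=1, padding=window // 2)
    voted = votes.argmax(dim=1)                                 # (B, H, W)
    labels = torch.where(conf >= conf_thresh, hard, voted)
    labels[(conf < conf_thresh) & (support.squeeze(1) <= 0)] = -100   # no vote available
    return labels
```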
arXiv Detail & Related papers (2020-12-09T02:35:25Z) - Improving Semantic Segmentation via Self-Training [75.07114899941095]
We show that we can obtain state-of-the-art results using a semi-supervised approach, specifically a self-training paradigm.
We first train a teacher model on labeled data, and then generate pseudo labels on a large set of unlabeled data.
Our robust training framework can digest human-annotated and pseudo labels jointly and achieve top performances on Cityscapes, CamVid and KITTI datasets.
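The sketch below illustrates that recipe: a trained teacher produces confidence-filtered pseudo-labels on unlabeled images, and a student is then optimised on human labels and pseudo-labels jointly. The model and batch names, the confidence threshold, and the pseudo-label weight are placeholders, not the paper's exact pipeline.

```python
# Hedged sketch of self-training with teacher-generated pseudo-labels.
import torch
import torch.nn.functional as F

@torch.no_grad()
def generate_pseudo_labels(teacher, unlabeled_images, conf_thresh=0.9):
    """Returns per-pixel class indices with -100 where the teacher is unsure."""
    teacher.eval()
    probs = teacher(unlabeled_images).softmax(dim=1)            # (B, C, H, W)
    conf, labels = probs.max(dim=1)
    labels[conf < conf_thresh] = -100                           # ignored by cross_entropy
    return labels

def joint_training_step(student, optimizer, labeled_batch, pseudo_batch, w_pseudo=0.5):
    images_l, targets_l = labeled_batch                         # human-annotated
    images_u, targets_u = pseudo_batch                          # teacher-annotated
    loss = F.cross_entropy(student(images_l), targets_l, ignore_index=-100) \
         + w_pseudo * F.cross_entropy(student(images_u), targets_u, ignore_index=-100)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```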
arXiv Detail & Related papers (2020-04-30T17:09:17Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.