SkyEye: Self-Supervised Bird's-Eye-View Semantic Mapping Using Monocular
Frontal View Images
- URL: http://arxiv.org/abs/2302.04233v1
- Date: Wed, 8 Feb 2023 18:02:09 GMT
- Title: SkyEye: Self-Supervised Bird's-Eye-View Semantic Mapping Using Monocular
Frontal View Images
- Authors: Nikhil Gosala, Kürsat Petek, Paulo L. J. Drews-Jr, Wolfram Burgard,
Abhinav Valada
- Abstract summary: We propose the first self-supervised approach for generating a Bird's-Eye-View (BEV) semantic map using a single monocular image from the frontal view (FV).
During training, we overcome the need for BEV ground truth annotations by leveraging the more easily available FV semantic annotations of video sequences.
Our approach performs on par with the state-of-the-art fully supervised methods and achieves competitive results using only 1% of direct supervision in the BEV.
- Score: 26.34702432184092
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Bird's-Eye-View (BEV) semantic maps have become an essential component of
automated driving pipelines due to the rich representation they provide for
decision-making tasks. However, existing approaches for generating these maps
still follow a fully supervised training paradigm and hence rely on large
amounts of annotated BEV data. In this work, we address this limitation by
proposing the first self-supervised approach for generating a BEV semantic map
using a single monocular image from the frontal view (FV). During training, we
overcome the need for BEV ground truth annotations by leveraging the more
easily available FV semantic annotations of video sequences. Thus, we propose
the SkyEye architecture that learns based on two modes of self-supervision,
namely, implicit supervision and explicit supervision. Implicit supervision
trains the model by enforcing spatial consistency of the scene over time based
on FV semantic sequences, while explicit supervision exploits BEV pseudolabels
generated from FV semantic annotations and self-supervised depth estimates.
Extensive evaluations on the KITTI-360 dataset demonstrate that our
self-supervised approach performs on par with the state-of-the-art fully
supervised methods and achieves competitive results using only 1% of direct
supervision in the BEV compared to fully supervised approaches. Finally, we
publicly release both our code and the BEV datasets generated from the
KITTI-360 and Waymo datasets.
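As one concrete illustration of the explicit supervision described above, the sketch below lifts FV semantic labels into 3D using self-supervised depth estimates and splats them onto a ground-plane grid to form BEV pseudolabels. The function name, tensor shapes, grid resolution, and the absence of occlusion handling are simplifying assumptions for illustration, not the authors' implementation.

```python
import torch

def fv_to_bev_pseudolabels(fv_sem, depth, K, bev_size=(128, 128),
                           fwd_range=(0.0, 50.0), lat_range=(-25.0, 25.0)):
    """Lift FV semantic labels into 3D with predicted depth and splat them onto
    a ground-plane grid (illustrative sketch; shapes and ranges are assumptions).

    fv_sem: (H, W) integer semantic labels in the frontal view
    depth:  (H, W) self-supervised metric depth estimates
    K:      (3, 3) camera intrinsics
    """
    H, W = fv_sem.shape
    u, v = torch.meshgrid(torch.arange(W), torch.arange(H), indexing="xy")
    pix = torch.stack([u, v, torch.ones_like(u)], dim=-1).float()        # (H, W, 3)
    rays = pix @ torch.linalg.inv(K.float()).T                           # back-projected rays
    pts = rays * depth.float().unsqueeze(-1)                             # (H, W, 3) camera coords

    # Camera frame: x right, z forward. BEV grid: rows = forward (z), cols = lateral (x).
    z, x = pts[..., 2].flatten(), pts[..., 0].flatten()
    labels = fv_sem.flatten().long()

    rows = ((z - fwd_range[0]) / (fwd_range[1] - fwd_range[0]) * bev_size[0]).long()
    cols = ((x - lat_range[0]) / (lat_range[1] - lat_range[0]) * bev_size[1]).long()
    valid = (rows >= 0) & (rows < bev_size[0]) & (cols >= 0) & (cols < bev_size[1])

    bev = torch.full(bev_size, 255, dtype=torch.long)        # 255 = unknown / ignore label
    bev[rows[valid], cols[valid]] = labels[valid]            # later points overwrite earlier ones
    return bev
```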
Related papers
- VQ-Map: Bird's-Eye-View Map Layout Estimation in Tokenized Discrete Space via Vector Quantization [108.68014173017583]
Bird's-eye-view (BEV) map layout estimation requires an accurate and full understanding of the semantics of the environmental elements around the ego car.
We propose to utilize a generative model similar to the Vector Quantized-Variational AutoEncoder (VQ-VAE) to acquire prior knowledge for the high-level BEV semantics in the tokenized discrete space.
Thanks to the obtained BEV tokens, accompanied by a codebook embedding that encapsulates the semantics of the different BEV elements in the ground-truth maps, we are able to directly align the sparse backbone image features with the obtained BEV tokens.
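As a rough illustration of the vector-quantization step this entry refers to, the snippet below snaps continuous BEV features to their nearest codebook entry to produce discrete BEV tokens; the class name, codebook size, and straight-through gradient trick are generic VQ-VAE conventions, not details taken from the paper.

```python
import torch
import torch.nn as nn

class BEVTokenizer(nn.Module):
    """Generic VQ-VAE-style quantizer over BEV features (illustrative only)."""

    def __init__(self, num_tokens=512, dim=256):
        super().__init__()
        self.codebook = nn.Embedding(num_tokens, dim)

    def forward(self, bev_feat):                               # bev_feat: (B, H, W, dim)
        flat = bev_feat.reshape(-1, bev_feat.shape[-1])        # (B*H*W, dim)
        dists = torch.cdist(flat, self.codebook.weight)        # distance to every code
        idx = dists.argmin(dim=1)                              # discrete BEV tokens
        quant = self.codebook(idx).reshape(bev_feat.shape)
        quant = bev_feat + (quant - bev_feat).detach()         # straight-through estimator
        return quant, idx.reshape(bev_feat.shape[:-1])
```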
arXiv Detail & Related papers (2024-11-03T16:09:47Z)
- LetsMap: Unsupervised Representation Learning for Semantic BEV Mapping [23.366388601110913]
We propose the first unsupervised representation learning approach to generate semantic BEV maps from a monocular frontal view (FV) image in a label-efficient manner.
Our approach pretrains the network to independently reason about scene geometry and scene semantics using two disjoint neural pathways in an unsupervised manner.
We achieve label-free pretraining by exploiting spatial and temporal consistency of FV images to learn scene geometry while relying on a novel temporal masked autoencoder formulation to encode the scene representation.
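For intuition only, a generic masked-autoencoder-style masking step over an FV clip is sketched below; the per-frame random patch masking is a stand-in for the paper's temporal formulation, and the names, patch size, and mask ratio are assumptions.

```python
import torch

def temporal_patch_mask(frames, mask_ratio=0.75, patch=16):
    """Mask random patches in each frame of a clip (generic MAE-style stand-in).

    frames: (T, C, H, W) consecutive FV images; H and W are assumed divisible
    by `patch`.
    """
    T, C, H, W = frames.shape
    keep = torch.rand(T, H // patch, W // patch) > mask_ratio     # True = visible patch
    mask = keep.repeat_interleave(patch, 1).repeat_interleave(patch, 2)
    return frames * mask.unsqueeze(1), mask                       # masked clip + visibility map
```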
arXiv Detail & Related papers (2024-05-29T08:03:36Z)
- Zero-BEV: Zero-shot Projection of Any First-Person Modality to BEV Maps [13.524499163234342]
We propose a new model capable of performing zero-shot projections of any modality available in a first person view to the corresponding BEV map.
We experimentally show that the model outperforms competing methods, in particular the widely used baseline resorting to monocular depth estimation.
arXiv Detail & Related papers (2024-02-21T14:50:24Z)
- Semi-Supervised Learning for Visual Bird's Eye View Semantic Segmentation [16.3996408206659]
We present a novel semi-supervised framework for visual BEV semantic segmentation that boosts performance by exploiting unlabeled images during training.
A consistency loss that makes full use of unlabeled data is then proposed to constrain the model on not only the semantic prediction but also the BEV feature.
Experiments on the nuScenes and Argoverse datasets show that our framework can effectively improve prediction accuracy.
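A hedged sketch of how such a consistency loss could couple the semantic prediction and the BEV feature on unlabeled images is given below; the teacher/student split, the assumed model interface, and the loss weights are illustrative choices, not the paper's exact design.

```python
import torch
import torch.nn.functional as F

def consistency_loss(model, img, aug_img, w_feat=1.0, w_sem=1.0):
    """Encourage agreement between two views of the same unlabeled image on
    both the BEV feature and the semantic prediction. `model` is assumed to
    return a (bev_feat, sem_logits) tuple.
    """
    with torch.no_grad():                           # teacher pass on the clean view
        feat_t, sem_t = model(img)
    feat_s, sem_s = model(aug_img)                  # student pass on the augmented view
    loss_feat = F.mse_loss(feat_s, feat_t)
    loss_sem = F.kl_div(F.log_softmax(sem_s, dim=1),
                        F.softmax(sem_t, dim=1), reduction="batchmean")
    return w_feat * loss_feat + w_sem * loss_sem
```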
arXiv Detail & Related papers (2023-08-28T12:23:36Z)
- Bird's-Eye-View Scene Graph for Vision-Language Navigation [85.72725920024578]
Vision-language navigation (VLN) requires an agent to navigate 3D environments by following human instructions.
We present a BEV Scene Graph (BSG), which leverages multi-step BEV representations to encode the scene layouts and geometric cues of indoor environments.
Based on BSG, the agent predicts a local BEV grid-level decision score and a global graph-level decision score, combined with a sub-view selection score on panoramic views.
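Purely as an illustration of how the three scores could be combined, a weighted fusion over candidate viewpoints might look as follows; the weighted-sum form and weights are assumptions, not BSG's actual scheme.

```python
import torch

def fuse_navigation_scores(bev_grid_scores, graph_scores, subview_scores,
                           weights=(1.0, 1.0, 1.0)):
    """Fuse per-candidate scores and pick the best candidate (illustrative only).

    All inputs: (num_candidates,) tensors scoring the same candidate viewpoints.
    """
    fused = (weights[0] * bev_grid_scores
             + weights[1] * graph_scores
             + weights[2] * subview_scores)
    return fused.argmax()              # index of the candidate the agent would choose
```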
arXiv Detail & Related papers (2023-08-09T07:48:20Z)
- Tri-Perspective View for Vision-Based 3D Semantic Occupancy Prediction [84.94140661523956]
We propose a tri-perspective view (TPV) representation which accompanies BEV with two additional perpendicular planes.
We model each point in the 3D space by summing its projected features on the three planes.
Experiments show that our model trained with sparse supervision effectively predicts the semantic occupancy for all voxels.
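The plane-feature summation described above can be sketched as below, with each 3D point bilinearly sampling three perpendicular feature planes; the plane layout, shapes, and coordinate normalization are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def query_tpv(points, tpv_xy, tpv_xz, tpv_yz, bounds):
    """Sum a 3D point's bilinearly sampled features from three perpendicular
    planes (an illustrative reading of the tri-perspective-view idea).

    points: (B, N, 3) xyz coordinates
    tpv_*:  (B, C, H, W) feature planes for the xy, xz and yz projections
    bounds: (xmin, xmax, ymin, ymax, zmin, zmax) of the covered volume
    """
    xmin, xmax, ymin, ymax, zmin, zmax = bounds
    x = 2 * (points[..., 0] - xmin) / (xmax - xmin) - 1     # normalize to [-1, 1]
    y = 2 * (points[..., 1] - ymin) / (ymax - ymin) - 1
    z = 2 * (points[..., 2] - zmin) / (zmax - zmin) - 1

    def sample(plane, u, v):                                # bilinear lookup on one plane
        grid = torch.stack([u, v], dim=-1).unsqueeze(2)     # (B, N, 1, 2)
        return F.grid_sample(plane, grid, align_corners=False).squeeze(-1)   # (B, C, N)

    return sample(tpv_xy, x, y) + sample(tpv_xz, x, z) + sample(tpv_yz, y, z)
```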
arXiv Detail & Related papers (2023-02-15T17:58:10Z)
- BEV-MAE: Bird's Eye View Masked Autoencoders for Point Cloud Pre-training in Autonomous Driving Scenarios [51.285561119993105]
We present BEV-MAE, an efficient masked autoencoder pre-training framework for LiDAR-based 3D object detection in autonomous driving.
Specifically, we propose a bird's eye view (BEV) guided masking strategy to guide the 3D encoder in learning feature representations.
We introduce a learnable point token to maintain a consistent receptive field size of the 3D encoder.
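A minimal sketch of a BEV-guided masking step, assuming whole BEV cells (pillars) are dropped rather than individual points; the cell size and mask ratio are illustrative, not the paper's settings.

```python
import torch

def bev_guided_mask(points, cell_size=0.5, mask_ratio=0.7):
    """Mask a LiDAR point cloud by dropping whole BEV cells instead of
    individual points (illustrative sketch).

    points: (N, 3+) point cloud with x, y in the first two columns
    """
    cells = torch.div(points[:, :2], cell_size, rounding_mode="floor").long()
    uniq, inverse = torch.unique(cells, dim=0, return_inverse=True)
    keep_cell = torch.rand(uniq.shape[0]) > mask_ratio      # per-cell keep/drop decision
    keep = keep_cell[inverse]                               # broadcast decision to points
    return points[keep], points[~keep]                      # (visible points, masked points)
```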
arXiv Detail & Related papers (2022-12-12T08:15:03Z)
- BEVFormer v2: Adapting Modern Image Backbones to Bird's-Eye-View Recognition via Perspective Supervision [101.36648828734646]
We present a novel bird's-eye-view (BEV) detector with perspective supervision, which converges faster and better suits modern image backbones.
The proposed method is verified with a wide spectrum of traditional and modern image backbones and achieves new SoTA results on the large-scale nuScenes dataset.
arXiv Detail & Related papers (2022-11-18T18:59:48Z)
- BEVerse: Unified Perception and Prediction in Birds-Eye-View for Vision-Centric Autonomous Driving [92.05963633802979]
We present BEVerse, a unified framework for 3D perception and prediction based on multi-camera systems.
We show that the multi-task BEVerse outperforms single-task methods on 3D object detection, semantic map construction, and motion prediction.
arXiv Detail & Related papers (2022-05-19T17:55:35Z)
- Bird's-Eye-View Panoptic Segmentation Using Monocular Frontal View Images [4.449481309681663]
We present the first end-to-end learning approach for directly predicting dense panoptic segmentation maps in the Bird's-Eye-View (BEV).
Our architecture follows the top-down paradigm and incorporates a novel dense transformer module.
We derive a mathematical formulation for the sensitivity of the FV-BEV transformation which allows us to intelligently weight pixels in the BEV space.
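One simple stand-in for such sensitivity-based weighting is to down-weight distant BEV cells, which are covered by fewer FV pixels; the inverse-distance form below is an assumption, whereas the paper derives its weights analytically from the FV-BEV transformation.

```python
import torch

def distance_based_bev_weights(bev_size=(128, 128), fwd_range=(0.0, 50.0)):
    """Per-cell BEV loss weights that decay with forward distance (illustrative
    stand-in for an analytically derived sensitivity weighting)."""
    rows = torch.arange(bev_size[0]).float()
    dist = fwd_range[0] + (rows + 0.5) / bev_size[0] * (fwd_range[1] - fwd_range[0])
    w = 1.0 / dist.clamp(min=1.0)                           # (H,) weight per row (forward distance)
    return w.unsqueeze(1).expand(bev_size[0], bev_size[1])  # broadcast across lateral cells
```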
arXiv Detail & Related papers (2021-08-06T17:59:11Z)
This list is automatically generated from the titles and abstracts of the papers on this site.