An Initial Study of Bird's-Eye View Generation for Autonomous Vehicles using Cross-View Transformers
- URL: http://arxiv.org/abs/2508.12520v1
- Date: Sun, 17 Aug 2025 23:05:00 GMT
- Title: An Initial Study of Bird's-Eye View Generation for Autonomous Vehicles using Cross-View Transformers
- Authors: Felipe Carlos dos Santos, Eric Aislan Antonelo, Gustavo Claudio Karl Couto
- Abstract summary: We employ Cross-View Transformers (CVT) to learn a mapping from camera images to three Bird's-Eye View (BEV) maps. Our study examines generalization to unseen towns, the effect of different camera layouts, and two loss formulations.
- Score: 1.4474137122906163
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Bird's-Eye View (BEV) maps provide a structured, top-down abstraction that is crucial for autonomous-driving perception. In this work, we employ Cross-View Transformers (CVT) to learn a mapping from camera images to three BEV channels - road, lane markings, and planned trajectory - using a realistic simulator for urban driving. Our study examines generalization to unseen towns, the effect of different camera layouts, and two loss formulations (focal and L1). Using training data from only a single town, a four-camera CVT trained with the L1 loss delivers the most robust test performance when evaluated in a new town. Overall, our results underscore CVT's promise for mapping camera inputs to reasonably accurate BEV maps.
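The two loss formulations compared in the abstract can be sketched as follows. This is a minimal NumPy illustration of a binary focal loss and an L1 loss applied to toy 3-channel BEV maps (road, lane markings, trajectory); the map shapes and the `alpha`/`gamma` values are illustrative assumptions, not the paper's settings, and no CVT model is implemented here.

```python
import numpy as np

def focal_loss(pred, target, alpha=0.25, gamma=2.0, eps=1e-7):
    """Binary focal loss over a BEV occupancy map.

    Down-weights easy (well-classified) pixels via the (1 - pt)^gamma factor.
    """
    pred = np.clip(pred, eps, 1 - eps)
    pt = np.where(target == 1, pred, 1 - pred)       # prob. of the true class
    a = np.where(target == 1, alpha, 1 - alpha)      # class-balance weight
    return float(np.mean(-a * (1 - pt) ** gamma * np.log(pt)))

def l1_loss(pred, target):
    """Mean absolute error between predicted and ground-truth BEV maps."""
    return float(np.mean(np.abs(pred - target)))

# Toy 3-channel BEV maps: road, lane markings, planned trajectory.
rng = np.random.default_rng(0)
target = (rng.random((3, 8, 8)) > 0.7).astype(float)
pred = np.clip(target + rng.normal(0.0, 0.1, target.shape), 0.0, 1.0)

print("L1:", l1_loss(pred, target))
print("focal:", focal_loss(pred, target))
```

In practice the choice between the two matters because focal loss treats each BEV cell as a classification target, while L1 penalizes the raw per-cell regression error; the abstract reports that the L1 formulation generalized better to the unseen town.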
Related papers
- Bridging Perspectives: Foundation Model Guided BEV Maps for 3D Object Detection and Tracking [16.90910171943142]
Camera-based 3D object detection and tracking are essential for perception in autonomous driving. Current state-of-the-art approaches often rely exclusively on either perspective-view (PV) or bird's-eye-view (BEV) features. We propose DualViewDistill, a hybrid detection and tracking framework that incorporates both PV and BEV camera image features.
arXiv Detail & Related papers (2025-10-11T17:01:42Z) - BEV-VLM: Trajectory Planning via Unified BEV Abstraction [6.603679803036061]
This paper introduces a novel framework for trajectory planning in autonomous driving that leverages Vision-Language Models (VLMs) with Bird's-Eye View (BEV) feature maps as visual inputs. Our method utilizes highly compressed and informative BEV representations, which are generated by fusing multi-modal sensor data (e.g., camera and LiDAR) and aligning them with HD maps. Experimental results on the nuScenes dataset demonstrate a 44.8% improvement in planning accuracy and complete collision avoidance.
arXiv Detail & Related papers (2025-09-27T07:13:55Z) - RopeBEV: A Multi-Camera Roadside Perception Network in Bird's-Eye-View [3.165441652093544]
This paper systematically analyzes the key challenges in multi-camera BEV perception for roadside scenarios compared to vehicle-side.
RopeBEV introduces BEV augmentation to address the training balance issues caused by diverse camera poses.
Our method ranks 1st on the real-world highway dataset RoScenes.
arXiv Detail & Related papers (2024-09-18T05:16:34Z) - RoadBEV: Road Surface Reconstruction in Bird's Eye View [55.0558717607946]
Road surface conditions, especially geometry profiles, enormously affect driving performance of autonomous vehicles. Vision-based online road reconstruction promisingly captures road information in advance.
Bird's-Eye-View (BEV) perception offers great potential for more reliable and accurate reconstruction.
This paper proposes two simple yet effective models for road elevation reconstruction in BEV, named RoadBEV-mono and RoadBEV-stereo.
arXiv Detail & Related papers (2024-04-09T20:24:29Z) - An Efficient Transformer for Simultaneous Learning of BEV and Lane Representations in 3D Lane Detection [55.281369497158515]
We propose an efficient transformer for 3D lane detection.
Different from the vanilla transformer, our model contains a cross-attention mechanism to simultaneously learn lane and BEV representations.
Our method obtains 2D and 3D lane predictions by applying the lane features to the image-view and BEV features, respectively.
arXiv Detail & Related papers (2023-06-08T04:18:31Z) - Tri-Perspective View for Vision-Based 3D Semantic Occupancy Prediction [84.94140661523956]
We propose a tri-perspective view (TPV) representation which accompanies BEV with two additional perpendicular planes.
We model each point in the 3D space by summing its projected features on the three planes.
Experiments show that our model trained with sparse supervision effectively predicts the semantic occupancy for all voxels.
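The per-point aggregation described above (modeling each 3D point by summing its projected features on three orthogonal planes) can be sketched as follows; the grid shapes, plane names, and indexing conventions are illustrative assumptions, not taken from the TPV paper's code.

```python
import numpy as np

def tpv_point_feature(point, plane_xy, plane_xz, plane_yz):
    """Sum the features of a 3D point's projections onto three planes.

    point: integer voxel index (x, y, z).
    plane_*: feature grids of shape (H, W, C), one per orthogonal plane.
    Projecting onto a plane simply drops the coordinate normal to it.
    """
    x, y, z = point
    return plane_xy[x, y] + plane_xz[x, z] + plane_yz[y, z]

# Toy planes with constant features so the sum is easy to check.
C = 4
xy = np.full((8, 8, C), 1.0)
xz = np.full((8, 8, C), 2.0)
yz = np.full((8, 8, C), 3.0)
feat = tpv_point_feature((1, 2, 3), xy, xz, yz)
print(feat)
```

The design point is that three 2D grids are far cheaper to store than a dense 3D voxel grid, while the summed projections still give every voxel a distinct feature in general.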
arXiv Detail & Related papers (2023-02-15T17:58:10Z) - Street-View Image Generation from a Bird's-Eye View Layout [95.36869800896335]
Bird's-Eye View (BEV) Perception has received increasing attention in recent years.
Data-driven simulation for autonomous driving has been a focal point of recent research.
We propose BEVGen, a conditional generative model that synthesizes realistic and spatially consistent surrounding images.
arXiv Detail & Related papers (2023-01-11T18:39:34Z) - Monocular BEV Perception of Road Scenes via Front-to-Top View Projection [57.19891435386843]
We present a novel framework that reconstructs a local map formed by road layout and vehicle occupancy in the bird's-eye view.
Our model runs at 25 FPS on a single GPU, which is efficient and applicable for real-time panorama HD map reconstruction.
arXiv Detail & Related papers (2022-11-15T13:52:41Z) - Structured Bird's-Eye-View Traffic Scene Understanding from Onboard Images [128.881857704338]
We study the problem of extracting a directed graph representing the local road network in BEV coordinates, from a single onboard camera image.
We show that the method can be extended to detect dynamic objects on the BEV plane.
We validate our approach against powerful baselines and show that our network achieves superior performance.
arXiv Detail & Related papers (2021-10-05T12:40:33Z) - Monocular 3D Vehicle Detection Using Uncalibrated Traffic Cameras through Homography [12.062095895630563]
This paper proposes a method to extract the position and pose of vehicles in the 3D world from a single traffic camera.
We observe that the homography between the road plane and the image plane is essential to 3D vehicle detection.
We propose a new regression target called tailed r-box and a dual-view network architecture, which boosts detection accuracy on warped BEV images.
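The road-plane-to-image-plane homography this entry relies on can be illustrated with a minimal point-warping sketch: a 3x3 homography maps homogeneous pixel coordinates to road-plane (BEV) coordinates. The matrices below are placeholder sanity checks, not a calibrated camera's homography.

```python
import numpy as np

def warp_points(H, pts):
    """Apply a 3x3 homography H to an (N, 2) array of pixel coordinates.

    Points are lifted to homogeneous coordinates, transformed, then
    dehomogenized by dividing by the third component.
    """
    pts_h = np.hstack([pts, np.ones((len(pts), 1))])  # (N, 3) homogeneous
    mapped = pts_h @ H.T
    return mapped[:, :2] / mapped[:, 2:3]

pts = np.array([[10.0, 20.0], [30.0, 40.0]])
identity = warp_points(np.eye(3), pts)        # unchanged
scaled = warp_points(np.diag([2.0, 2.0, 1.0]), pts)  # uniformly doubled
print(identity)
print(scaled)
```

In the uncalibrated-camera setting, such an `H` would be estimated from point correspondences between the image and the known road plane rather than constructed by hand.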
arXiv Detail & Related papers (2021-03-29T02:57:37Z) - S-BEV: Semantic Birds-Eye View Representation for Weather and Lighting Invariant 3-DoF Localization [5.668124846154997]
We describe a light-weight, weather and lighting invariant, Semantic Bird's Eye View (S-BEV) signature for vision-based vehicle re-localization.
arXiv Detail & Related papers (2021-01-23T19:37:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.