Epipolar Attention Field Transformers for Bird's Eye View Semantic Segmentation
- URL: http://arxiv.org/abs/2412.01595v1
- Date: Mon, 02 Dec 2024 15:15:10 GMT
- Title: Epipolar Attention Field Transformers for Bird's Eye View Semantic Segmentation
- Authors: Christian Witte, Jens Behley, Cyrill Stachniss, Marvin Raaijmakers,
- Abstract summary: This paper addresses the dependency on learned positional encodings to correlate image and BEV feature map elements for transformer-based methods.
We propose leveraging epipolar geometric constraints to model the relationship between cameras and the BEV by Epipolar Attention Fields.
Experiments show that our method EAFormer outperforms previous BEV approaches by 2% mIoU for map semantic segmentation.
- Score: 26.245188807280684
- License:
- Abstract: Spatial understanding of the semantics of the surroundings is a key capability needed by autonomous cars to enable safe driving decisions. Recently, purely vision-based solutions have gained increasing research interest. In particular, approaches extracting a bird's eye view (BEV) from multiple cameras have demonstrated great performance for spatial understanding. This paper addresses the dependency on learned positional encodings to correlate image and BEV feature map elements for transformer-based methods. We propose leveraging epipolar geometric constraints to model the relationship between cameras and the BEV by Epipolar Attention Fields. They are incorporated into the attention mechanism as a novel attribution term, serving as an alternative to learned positional encodings. Experiments show that our method EAFormer outperforms previous BEV approaches by 2% mIoU for map semantic segmentation and exhibits superior generalization capabilities compared to implicitly learning the camera configuration.
Related papers
- BEVPose: Unveiling Scene Semantics through Pose-Guided Multi-Modal BEV Alignment [8.098296280937518]
We present BEVPose, a framework that integrates BEV representations from camera and lidar data, using sensor pose as a guiding supervisory signal.
By leveraging pose information, we align and fuse multi-modal sensory inputs, facilitating the learning of latent BEV embeddings that capture both geometric and semantic aspects of the environment.
arXiv Detail & Related papers (2024-10-28T12:40:27Z) - OE-BevSeg: An Object Informed and Environment Aware Multimodal Framework for Bird's-eye-view Vehicle Semantic Segmentation [57.2213693781672]
Bird's-eye-view (BEV) semantic segmentation is becoming crucial in autonomous driving systems.
We propose OE-BevSeg, an end-to-end multimodal framework that enhances BEV segmentation performance.
Our approach achieves state-of-the-art results by a large margin on the nuScenes dataset for vehicle segmentation.
arXiv Detail & Related papers (2024-07-18T03:48:22Z) - Improving Bird's Eye View Semantic Segmentation by Task Decomposition [42.57351039508863]
We decompose the original BEV segmentation task into two stages, namely BEV map reconstruction and RGB-BEV feature alignment.
Our approach simplifies the complexity of combining perception and generation into distinct steps, equipping the model to handle intricate and challenging scenes effectively.
arXiv Detail & Related papers (2024-04-02T13:19:45Z) - An Efficient Transformer for Simultaneous Learning of BEV and Lane
Representations in 3D Lane Detection [55.281369497158515]
We propose an efficient transformer for 3D lane detection.
Different from the vanilla transformer, our model contains a cross-attention mechanism to simultaneously learn lane and BEV representations.
Our method obtains 2D and 3D lane predictions by applying the lane features to the image-view and BEV features, respectively.
arXiv Detail & Related papers (2023-06-08T04:18:31Z) - A Cross-Scale Hierarchical Transformer with Correspondence-Augmented
Attention for inferring Bird's-Eye-View Semantic Segmentation [13.013635162859108]
Inferring BEV semantic segmentation conditioned on multi-camera-view images is a popular scheme in the community as cheap devices and real-time processing.
We propose a novel cross-scale hierarchical Transformer with correspondence-augmented attention for semantic segmentation inferring.
Our method has state-of-the-art performance in inferring BEV semantic segmentation conditioned on multi-camera-view images.
arXiv Detail & Related papers (2023-04-07T13:52:47Z) - Geometric-aware Pretraining for Vision-centric 3D Object Detection [77.7979088689944]
We propose a novel geometric-aware pretraining framework called GAPretrain.
GAPretrain serves as a plug-and-play solution that can be flexibly applied to multiple state-of-the-art detectors.
We achieve 46.2 mAP and 55.5 NDS on the nuScenes val set using the BEVFormer method, with a gain of 2.7 and 2.1 points, respectively.
arXiv Detail & Related papers (2023-04-06T14:33:05Z) - Monocular BEV Perception of Road Scenes via Front-to-Top View Projection [57.19891435386843]
We present a novel framework that reconstructs a local map formed by road layout and vehicle occupancy in the bird's-eye view.
Our model runs at 25 FPS on a single GPU, which is efficient and applicable for real-time panorama HD map reconstruction.
arXiv Detail & Related papers (2022-11-15T13:52:41Z) - Delving into the Devils of Bird's-eye-view Perception: A Review,
Evaluation and Recipe [115.31507979199564]
Learning powerful representations in bird's-eye-view (BEV) for perception tasks is trending and drawing extensive attention both from industry and academia.
As sensor configurations get more complex, integrating multi-source information from different sensors and representing features in a unified view come of vital importance.
The core problems for BEV perception lie in (a) how to reconstruct the lost 3D information via view transformation from perspective view to BEV; (b) how to acquire ground truth annotations in BEV grid; and (d) how to adapt and generalize algorithms as sensor configurations vary across different scenarios.
arXiv Detail & Related papers (2022-09-12T15:29:13Z) - LaRa: Latents and Rays for Multi-Camera Bird's-Eye-View Semantic
Segmentation [43.12994451281451]
We present 'LaRa', an efficient encoder-decoder, transformer-based model for vehicle semantic segmentation from multiple cameras.
Our approach uses a system of cross-attention to aggregate information over multiple sensors into a compact, yet rich, collection of latent representations.
arXiv Detail & Related papers (2022-06-27T13:37:50Z) - GitNet: Geometric Prior-based Transformation for Birds-Eye-View
Segmentation [105.19949897812494]
Birds-eye-view (BEV) semantic segmentation is critical for autonomous driving.
We present a novel two-stage Geometry Prior-based Transformation framework named GitNet.
arXiv Detail & Related papers (2022-04-16T06:46:45Z) - BEVSegFormer: Bird's Eye View Semantic Segmentation From Arbitrary
Camera Rigs [3.5728676902207988]
We present an effective transformer-based method for BEV semantic segmentation from arbitrary camera rigs.
Specifically, our method first encodes image features from arbitrary cameras with a shared backbone.
An efficient multi-camera deformable attention unit is designed to carry out the BEV-to-image view transformation.
arXiv Detail & Related papers (2022-03-08T12:39:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.