X-Align: Cross-Modal Cross-View Alignment for Bird's-Eye-View
Segmentation
- URL: http://arxiv.org/abs/2210.06778v1
- Date: Thu, 13 Oct 2022 06:42:46 GMT
- Title: X-Align: Cross-Modal Cross-View Alignment for Bird's-Eye-View
Segmentation
- Authors: Shubhankar Borse, Marvin Klingner, Varun Ravi Kumar, Hong Cai,
Abdulaziz Almuzairee, Senthil Yogamani, Fatih Porikli
- Abstract summary: X-Align is a novel end-to-end cross-modal and cross-view learning framework for BEV segmentation.
X-Align significantly outperforms the state-of-the-art by 3 absolute mIoU points on nuScenes.
- Score: 44.95630790801856
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Bird's-eye-view (BEV) grid is a common representation for the perception of
road components, e.g., drivable area, in autonomous driving. Most existing
approaches rely on cameras only to perform segmentation in BEV space, which is
fundamentally constrained by the absence of reliable depth information. Latest
works leverage both camera and LiDAR modalities, but sub-optimally fuse their
features using simple, concatenation-based mechanisms.
In this paper, we address these problems by enhancing the alignment of the
unimodal features in order to aid feature fusion, as well as enhancing the
alignment between the cameras' perspective view (PV) and BEV representations.
We propose X-Align, a novel end-to-end cross-modal and cross-view learning
framework for BEV segmentation consisting of the following components: (i) a
novel Cross-Modal Feature Alignment (X-FA) loss, (ii) an attention-based
Cross-Modal Feature Fusion (X-FF) module to align multi-modal BEV features
implicitly, and (iii) an auxiliary PV segmentation branch with Cross-View
Segmentation Alignment (X-SA) losses to improve the PV-to-BEV transformation.
We evaluate our proposed method across two commonly used benchmark datasets,
i.e., nuScenes and KITTI-360. Notably, X-Align significantly outperforms the
state-of-the-art by 3 absolute mIoU points on nuScenes. We also provide
extensive ablation studies to demonstrate the effectiveness of the individual
components.
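
To make the alignment and fusion ideas above more concrete, below is a minimal PyTorch sketch of how a cross-modal feature-alignment loss and an attention-based BEV fusion module could look. The abstract does not give X-Align's exact formulations, so the cosine-similarity loss, the AttentionFusion module, and all tensor shapes here are illustrative assumptions rather than the paper's implementation.

```python
# Illustrative sketch only: the abstract does not specify X-Align's exact
# losses or fusion layers, so the formulations below are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

def cross_modal_alignment_loss(cam_bev, lidar_bev):
    """Hypothetical X-FA-style loss: encourage camera and LiDAR BEV feature
    maps of shape (B, C, H, W) to agree via per-cell cosine similarity."""
    cam = F.normalize(cam_bev, dim=1)
    lidar = F.normalize(lidar_bev, dim=1)
    cos = (cam * lidar).sum(dim=1)      # (B, H, W), values in [-1, 1]
    return (1.0 - cos).mean()           # 0 when the features are aligned

class AttentionFusion(nn.Module):
    """Hypothetical X-FF-style fusion: predict per-cell weights for the two
    modalities instead of fusing them by plain concatenation."""
    def __init__(self, channels):
        super().__init__()
        self.gate = nn.Conv2d(2 * channels, 2, kernel_size=1)

    def forward(self, cam_bev, lidar_bev):
        weights = torch.softmax(
            self.gate(torch.cat([cam_bev, lidar_bev], dim=1)), dim=1)
        return weights[:, :1] * cam_bev + weights[:, 1:] * lidar_bev

# Usage with dummy BEV feature maps (batch 2, 64 channels, 200x200 grid):
cam = torch.randn(2, 64, 200, 200)
lidar = torch.randn(2, 64, 200, 200)
fused = AttentionFusion(64)(cam, lidar)
loss = cross_modal_alignment_loss(cam, lidar)
```

The key design idea reflected here is that the fusion weights and the alignment objective both operate per BEV cell, so the network can rely more on LiDAR where camera depth is unreliable and vice versa.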
Related papers
- LSSInst: Improving Geometric Modeling in LSS-Based BEV Perception with Instance Representation [10.434754671492723]
We propose LSSInst, a two-stage object detector incorporating BEV and instance representations in tandem.
The proposed detector exploits fine-grained pixel-level features that can be flexibly integrated into existing LSS-based BEV networks.
The proposed framework offers strong generalization ability and performance, boosting modern LSS-based BEV perception methods without bells and whistles.
arXiv Detail & Related papers (2024-11-09T13:03:54Z)
- An Efficient Transformer for Simultaneous Learning of BEV and Lane Representations in 3D Lane Detection [55.281369497158515]
We propose an efficient transformer for 3D lane detection.
Different from the vanilla transformer, our model contains a cross-attention mechanism to simultaneously learn lane and BEV representations (a generic sketch of this kind of cross-attention appears after this list).
Our method obtains 2D and 3D lane predictions by applying the lane features to the image-view and BEV features, respectively.
arXiv Detail & Related papers (2023-06-08T04:18:31Z)
- X-Align++: cross-modal cross-view alignment for Bird's-eye-view segmentation [44.58686493878629]
X-Align is a novel end-to-end cross-modal and cross-view learning framework for BEV segmentation.
X-Align significantly outperforms the state-of-the-art by 3 absolute mIoU points on nuScenes and KITTI-360 datasets.
arXiv Detail & Related papers (2023-06-06T15:52:55Z)
- Leveraging BEV Representation for 360-degree Visual Place Recognition [14.497501941931759]
This paper investigates the advantages of using Bird's Eye View representation in 360-degree visual place recognition (VPR).
We propose a novel network architecture that utilizes the BEV representation in feature extraction, feature aggregation, and vision-LiDAR fusion.
The proposed BEV-based method is evaluated in ablation and comparative studies on two datasets.
arXiv Detail & Related papers (2023-05-23T08:29:42Z)
- A Cross-Scale Hierarchical Transformer with Correspondence-Augmented Attention for inferring Bird's-Eye-View Semantic Segmentation [13.013635162859108]
Inferring BEV semantic segmentation conditioned on multi-camera-view images is a popular scheme in the community, as it requires only cheap devices and supports real-time processing.
We propose a novel cross-scale hierarchical Transformer with correspondence-augmented attention for semantic segmentation inferring.
Our method has state-of-the-art performance in inferring BEV semantic segmentation conditioned on multi-camera-view images.
arXiv Detail & Related papers (2023-04-07T13:52:47Z)
- Rethinking Range View Representation for LiDAR Segmentation [66.73116059734788]
"Many-to-one" mapping, semantic incoherence, and shape deformation are possible impediments against effective learning from range view projections.
We present RangeFormer, a full-cycle framework comprising novel designs across network architecture, data augmentation, and post-processing.
We show that, for the first time, a range view method is able to surpass its point, voxel, and multi-view fusion counterparts on competitive LiDAR semantic and panoptic segmentation benchmarks.
arXiv Detail & Related papers (2023-03-09T16:13:27Z)
- CoBEVT: Cooperative Bird's Eye View Semantic Segmentation with Sparse Transformers [36.838065731893735]
CoBEVT is the first generic multi-agent perception framework that can cooperatively generate BEV map predictions.
CoBEVT achieves state-of-the-art performance for cooperative BEV semantic segmentation.
arXiv Detail & Related papers (2022-07-05T17:59:28Z)
- Voxel Field Fusion for 3D Object Detection [140.6941303279114]
We present a conceptually simple framework for cross-modality 3D object detection, named voxel field fusion.
The proposed approach aims to maintain cross-modality consistency by representing and fusing augmented image features as a ray in the voxel field.
The framework is demonstrated to achieve consistent gains in various benchmarks and outperforms previous fusion-based methods on KITTI and nuScenes datasets.
arXiv Detail & Related papers (2022-05-31T16:31:36Z)
- GitNet: Geometric Prior-based Transformation for Birds-Eye-View Segmentation [105.19949897812494]
Birds-eye-view (BEV) semantic segmentation is critical for autonomous driving.
We present a novel two-stage Geometry Prior-based Transformation framework named GitNet.
arXiv Detail & Related papers (2022-04-16T06:46:45Z)
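
For the 3D lane detection entry above, the following is a minimal PyTorch sketch of a generic cross-attention step in which lane queries attend to flattened BEV features. The summary does not describe that paper's actual transformer design, so the LaneBEVCrossAttention module, head count, and tensor shapes are illustrative assumptions.

```python
# Illustrative sketch only: a generic cross-attention step in which lane
# queries attend to flattened BEV features; the actual architecture of the
# 3D-lane-detection transformer summarized above is not specified here.
import torch
import torch.nn as nn

class LaneBEVCrossAttention(nn.Module):
    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, lane_queries, bev_feats):
        # lane_queries: (B, N_lanes, C); bev_feats: (B, C, H, W)
        b, c, h, w = bev_feats.shape
        kv = bev_feats.flatten(2).transpose(1, 2)   # (B, H*W, C)
        out, _ = self.attn(lane_queries, kv, kv)    # lanes attend to BEV cells
        return out

# Usage with dummy tensors: 20 lane queries over a 50x50 BEV grid.
queries = torch.randn(2, 20, 128)
bev = torch.randn(2, 128, 50, 50)
refined = LaneBEVCrossAttention(128)(queries, bev)
```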