X-Align++: Cross-Modal Cross-View Alignment for Bird's-Eye-View Segmentation
- URL: http://arxiv.org/abs/2306.03810v1
- Date: Tue, 6 Jun 2023 15:52:55 GMT
- Title: X-Align++: Cross-Modal Cross-View Alignment for Bird's-Eye-View Segmentation
- Authors: Shubhankar Borse, Senthil Yogamani, Marvin Klingner, Varun Ravi, Hong
Cai, Abdulaziz Almuzairee and Fatih Porikli
- Abstract summary: X-Align is a novel end-to-end cross-modal and cross-view learning framework for BEV segmentation.
X-Align is evaluated on the nuScenes and KITTI-360 datasets and significantly outperforms the state-of-the-art by 3 absolute mIoU points on nuScenes.
- Score: 44.58686493878629
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Bird's-eye-view (BEV) grid is a typical representation of the perception of
road components, e.g., drivable area, in autonomous driving. Most existing
approaches rely on cameras only to perform segmentation in BEV space, which is
fundamentally constrained by the absence of reliable depth information. The
latest works leverage both camera and LiDAR modalities but suboptimally fuse
their features using simple, concatenation-based mechanisms. In this paper, we
address these problems by enhancing the alignment of the unimodal features in
order to aid feature fusion, as well as enhancing the alignment between the
cameras' perspective view (PV) and BEV representations. We propose X-Align, a
novel end-to-end cross-modal and cross-view learning framework for BEV
segmentation consisting of the following components: (i) a novel Cross-Modal
Feature Alignment (X-FA) loss, (ii) an attention-based Cross-Modal Feature
Fusion (X-FF) module to align multi-modal BEV features implicitly, and (iii) an
auxiliary PV segmentation branch with Cross-View Segmentation Alignment (X-SA)
losses to improve the PV-to-BEV transformation. We evaluate our proposed method
across two commonly used benchmark datasets, i.e., nuScenes and KITTI-360.
Notably, X-Align significantly outperforms the state-of-the-art by 3 absolute
mIoU points on nuScenes. We also provide extensive ablation studies to
demonstrate the effectiveness of the individual components.
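To make the three components more concrete, here is a minimal PyTorch sketch of what a cross-modal alignment loss and an attention-based fusion module could look like. The abstract does not give exact formulations, so the cosine-similarity form of the loss, the convex-combination gating, and the names cross_modal_alignment_loss and AttentionFusion are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch only: the X-FA loss and X-FF module are described but not
# specified in the abstract; the formulations below are plausible stand-ins.
import torch
import torch.nn as nn
import torch.nn.functional as F

def cross_modal_alignment_loss(cam_bev: torch.Tensor, lidar_bev: torch.Tensor) -> torch.Tensor:
    """X-FA-style loss (assumed form): pull camera-BEV and LiDAR-BEV features
    together by maximizing per-cell cosine similarity. Inputs: (B, C, H, W)."""
    cam = F.normalize(cam_bev, dim=1)
    lid = F.normalize(lidar_bev, dim=1)
    return 1.0 - (cam * lid).sum(dim=1).mean()  # 1 - mean cosine similarity

class AttentionFusion(nn.Module):
    """X-FF-style fusion (assumed form): predict per-cell modality weights and
    take a convex combination of the two BEV feature maps."""
    def __init__(self, channels: int):
        super().__init__()
        self.gate = nn.Conv2d(2 * channels, 2, kernel_size=1)

    def forward(self, cam_bev: torch.Tensor, lidar_bev: torch.Tensor) -> torch.Tensor:
        weights = torch.softmax(self.gate(torch.cat([cam_bev, lidar_bev], dim=1)), dim=1)
        return weights[:, :1] * cam_bev + weights[:, 1:] * lidar_bev

# Usage on dummy BEV grids:
cam, lid = torch.randn(2, 64, 100, 100), torch.randn(2, 64, 100, 100)
fused = AttentionFusion(64)(cam, lid)        # (2, 64, 100, 100)
loss = cross_modal_alignment_loss(cam, lid)  # scalar
```

The auxiliary PV branch with X-SA losses would add a standard segmentation head on the perspective-view features; it is omitted here since the abstract gives no further detail.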
Related papers
- LSSInst: Improving Geometric Modeling in LSS-Based BEV Perception with Instance Representation [10.434754671492723]
We propose LSSInst, a two-stage object detector incorporating BEV and instance representations in tandem.
The proposed detector exploits fine-grained pixel-level features that can be flexibly integrated into existing LSS-based BEV networks.
The proposed framework shows excellent generalization ability and performance, boosting modern LSS-based BEV perception methods without bells and whistles.
arXiv Detail & Related papers (2024-11-09T13:03:54Z)
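Since LSSInst builds on LSS-based (Lift-Splat-style) BEV networks, a minimal sketch of the standard "lift" step may help; this shows the generic technique, not LSSInst's actual code, and the function name lift is hypothetical.

```python
# Generic Lift-Splat "lift" step (standard technique, not LSSInst's code):
# weight each image feature by a predicted per-pixel depth distribution,
# producing a frustum of 3D features that is later splatted onto the BEV grid.
import torch

def lift(image_feats: torch.Tensor, depth_logits: torch.Tensor) -> torch.Tensor:
    # image_feats: (B, C, H, W); depth_logits: (B, D, H, W) over D depth bins
    depth_probs = depth_logits.softmax(dim=1)
    return depth_probs.unsqueeze(2) * image_feats.unsqueeze(1)  # (B, D, C, H, W)
```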
- OE-BevSeg: An Object Informed and Environment Aware Multimodal Framework for Bird's-eye-view Vehicle Semantic Segmentation [57.2213693781672]
Bird's-eye-view (BEV) semantic segmentation is becoming crucial in autonomous driving systems.
We propose OE-BevSeg, an end-to-end multimodal framework that enhances BEV segmentation performance.
Our approach achieves state-of-the-art results by a large margin on the nuScenes dataset for vehicle segmentation.
arXiv Detail & Related papers (2024-07-18T03:48:22Z)
- An Efficient Transformer for Simultaneous Learning of BEV and Lane Representations in 3D Lane Detection [55.281369497158515]
We propose an efficient transformer for 3D lane detection.
Different from the vanilla transformer, our model contains a cross-attention mechanism to simultaneously learn lane and BEV representations.
Our method obtains 2D and 3D lane predictions by applying the lane features to the image-view and BEV features, respectively.
arXiv Detail & Related papers (2023-06-08T04:18:31Z)
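A generic sketch of the cross-attention idea in the entry above, where lane queries attend to flattened BEV features; the class name LaneBEVCrossAttention and the shapes are assumptions, not the paper's code.

```python
# Generic cross-attention between lane queries and BEV features
# (illustrative sketch only; names and shapes are hypothetical).
import torch
import torch.nn as nn

class LaneBEVCrossAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, lane_queries: torch.Tensor, bev_feats: torch.Tensor) -> torch.Tensor:
        # lane_queries: (B, N, C); bev_feats: (B, C, H, W)
        kv = bev_feats.flatten(2).transpose(1, 2)  # (B, H*W, C)
        out, _ = self.attn(lane_queries, kv, kv)   # each lane query attends to all BEV cells
        return out
```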
- Leveraging BEV Representation for 360-degree Visual Place Recognition [14.497501941931759]
This paper investigates the advantages of using Bird's Eye View representation in 360-degree visual place recognition (VPR).
We propose a novel network architecture that utilizes the BEV representation in feature extraction, feature aggregation, and vision-LiDAR fusion.
The proposed BEV-based method is evaluated in ablation and comparative studies on two datasets.
arXiv Detail & Related papers (2023-05-23T08:29:42Z)
- A Cross-Scale Hierarchical Transformer with Correspondence-Augmented Attention for inferring Bird's-Eye-View Semantic Segmentation [13.013635162859108]
Inferring BEV semantic segmentation from multi-camera-view images is a popular scheme in the community, as it relies on cheap devices and supports real-time processing.
We propose a novel cross-scale hierarchical Transformer with correspondence-augmented attention for BEV semantic segmentation inference.
Our method has state-of-the-art performance in inferring BEV semantic segmentation conditioned on multi-camera-view images.
arXiv Detail & Related papers (2023-04-07T13:52:47Z)
- Rethinking Range View Representation for LiDAR Segmentation [66.73116059734788]
"Many-to-one" mapping, semantic incoherence, and shape deformation are possible impediments against effective learning from range view projections.
We present RangeFormer, a full-cycle framework comprising novel designs across network architecture, data augmentation, and post-processing.
We show that, for the first time, a range view method is able to surpass its point-, voxel-, and multi-view-fusion-based counterparts on competitive LiDAR semantic and panoptic segmentation benchmarks.
arXiv Detail & Related papers (2023-03-09T16:13:27Z)
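The "many-to-one" mapping that RangeFormer addresses comes from the spherical range-view projection itself; below is a generic numpy sketch of that projection (standard KITTI-style preprocessing, not RangeFormer's actual code; the 64 x 2048 image size is a typical assumption for a 64-beam sensor).

```python
# Generic spherical (range-view) projection of a LiDAR point cloud.
import numpy as np

def range_projection(points, h=64, w=2048, fov_up=3.0, fov_down=-25.0):
    """Project (N, 3) points to an (h, w) range image. Several 3D points can
    land on the same pixel, which is the 'many-to-one' issue noted above."""
    fov_up, fov_down = np.radians(fov_up), np.radians(fov_down)
    fov = abs(fov_up) + abs(fov_down)
    r = np.linalg.norm(points, axis=1)
    yaw = -np.arctan2(points[:, 1], points[:, 0])
    pitch = np.arcsin(points[:, 2] / np.maximum(r, 1e-8))
    u = ((yaw / np.pi + 1.0) * 0.5 * w).astype(np.int32) % w
    # points outside the vertical FOV are clamped here; a real pipeline
    # would typically mask them out instead
    v = ((1.0 - (pitch + abs(fov_down)) / fov) * h).clip(0, h - 1).astype(np.int32)
    img = np.full((h, w), -1.0, dtype=np.float32)
    order = np.argsort(r)[::-1]       # write far-to-near so the nearest
    img[v[order], u[order]] = r[order]  # point wins each contested pixel
    return img
```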
- X-Align: Cross-Modal Cross-View Alignment for Bird's-Eye-View Segmentation [44.95630790801856]
X-Align is a novel end-to-end cross-modal and cross-view learning framework for BEV segmentation.
X-Align significantly outperforms the state-of-the-art by 3 absolute mIoU points on nuScenes.
arXiv Detail & Related papers (2022-10-13T06:42:46Z)
- CoBEVT: Cooperative Bird's Eye View Semantic Segmentation with Sparse Transformers [36.838065731893735]
CoBEVT is the first generic multi-agent perception framework that can cooperatively generate BEV map predictions.
CoBEVT achieves state-of-the-art performance for cooperative BEV semantic segmentation.
arXiv Detail & Related papers (2022-07-05T17:59:28Z)
- GitNet: Geometric Prior-based Transformation for Birds-Eye-View Segmentation [105.19949897812494]
Birds-eye-view (BEV) semantic segmentation is critical for autonomous driving.
We present a novel two-stage Geometry Prior-based Transformation framework named GitNet.
arXiv Detail & Related papers (2022-04-16T06:46:45Z)