X-Align++: Cross-Modal Cross-View Alignment for Bird's-Eye-View Segmentation
- URL: http://arxiv.org/abs/2306.03810v1
- Date: Tue, 6 Jun 2023 15:52:55 GMT
- Title: X-Align++: Cross-Modal Cross-View Alignment for Bird's-Eye-View Segmentation
- Authors: Shubhankar Borse, Senthil Yogamani, Marvin Klingner, Varun Ravi, Hong
Cai, Abdulaziz Almuzairee and Fatih Porikli
- Abstract summary: X-Align is a novel end-to-end cross-modal and cross-view learning framework for BEV segmentation.
X-Align is evaluated on the nuScenes and KITTI-360 datasets and significantly outperforms the state-of-the-art by 3 absolute mIoU points on nuScenes.
- Score: 44.58686493878629
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Bird's-eye-view (BEV) grid is a typical representation of the perception of
road components, e.g., drivable area, in autonomous driving. Most existing
approaches rely on cameras only to perform segmentation in BEV space, which is
fundamentally constrained by the absence of reliable depth information. The
latest works leverage both camera and LiDAR modalities but suboptimally fuse
their features using simple, concatenation-based mechanisms. In this paper, we
address these problems by enhancing the alignment of the unimodal features in
order to aid feature fusion, as well as enhancing the alignment between the
cameras' perspective view (PV) and BEV representations. We propose X-Align, a
novel end-to-end cross-modal and cross-view learning framework for BEV
segmentation consisting of the following components: (i) a novel Cross-Modal
Feature Alignment (X-FA) loss, (ii) an attention-based Cross-Modal Feature
Fusion (X-FF) module to align multi-modal BEV features implicitly, and (iii) an
auxiliary PV segmentation branch with Cross-View Segmentation Alignment (X-SA)
losses to improve the PV-to-BEV transformation. We evaluate our proposed method
across two commonly used benchmark datasets, i.e., nuScenes and KITTI-360.
Notably, X-Align significantly outperforms the state-of-the-art by 3 absolute
mIoU points on nuScenes. We also provide extensive ablation studies to
demonstrate the effectiveness of the individual components.
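To make the three components more concrete, here is a minimal PyTorch sketch of what a cross-modal alignment loss and an attention-based fusion module could look like. The abstract does not give exact formulations, so the cosine-similarity form of the loss, the convex-combination gating, and the names cross_modal_alignment_loss and AttentionFusion are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch only: the X-FA loss and X-FF module are described but not
# specified in the abstract; the formulations below are plausible stand-ins.
import torch
import torch.nn as nn
import torch.nn.functional as F

def cross_modal_alignment_loss(cam_bev: torch.Tensor, lidar_bev: torch.Tensor) -> torch.Tensor:
    """X-FA-style loss (assumed form): pull camera-BEV and LiDAR-BEV features
    together by maximizing per-cell cosine similarity. Inputs: (B, C, H, W)."""
    cam = F.normalize(cam_bev, dim=1)
    lid = F.normalize(lidar_bev, dim=1)
    return 1.0 - (cam * lid).sum(dim=1).mean()  # 1 - mean cosine similarity

class AttentionFusion(nn.Module):
    """X-FF-style fusion (assumed form): predict per-cell modality weights and
    take a convex combination of the two BEV feature maps."""
    def __init__(self, channels: int):
        super().__init__()
        self.gate = nn.Conv2d(2 * channels, 2, kernel_size=1)

    def forward(self, cam_bev: torch.Tensor, lidar_bev: torch.Tensor) -> torch.Tensor:
        weights = torch.softmax(self.gate(torch.cat([cam_bev, lidar_bev], dim=1)), dim=1)
        return weights[:, :1] * cam_bev + weights[:, 1:] * lidar_bev

# Usage on dummy BEV grids:
cam, lid = torch.randn(2, 64, 100, 100), torch.randn(2, 64, 100, 100)
fused = AttentionFusion(64)(cam, lid)        # (2, 64, 100, 100)
loss = cross_modal_alignment_loss(cam, lid)  # scalar
```

The auxiliary PV branch with X-SA losses would add a standard segmentation head on the perspective-view features; it is omitted here since the abstract gives no further detail.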
Related papers
- LSSInst: Improving Geometric Modeling in LSS-Based BEV Perception with Instance Representation [10.434754671492723]
We propose LSSInst, a two-stage object detector incorporating BEV and instance representations in tandem.
The proposed detector exploits fine-grained pixel-level features that can be flexibly integrated into existing LSS-based BEV networks.
The proposed framework shows excellent generalization ability and performance, boosting modern LSS-based BEV perception methods without bells and whistles.
arXiv Detail & Related papers (2024-11-09T13:03:54Z)
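Since LSSInst builds on LSS-based (Lift-Splat-style) BEV networks, a minimal sketch of the standard "lift" step may help; this shows the generic technique, not LSSInst's actual code, and the function name lift is hypothetical.

```python
# Generic Lift-Splat "lift" step (standard technique, not LSSInst's code):
# weight each image feature by a predicted per-pixel depth distribution,
# producing a frustum of 3D features that is later splatted onto the BEV grid.
import torch

def lift(image_feats: torch.Tensor, depth_logits: torch.Tensor) -> torch.Tensor:
    # image_feats: (B, C, H, W); depth_logits: (B, D, H, W) over D depth bins
    depth_probs = depth_logits.softmax(dim=1)
    return depth_probs.unsqueeze(2) * image_feats.unsqueeze(1)  # (B, D, C, H, W)
```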
- OE-BevSeg: An Object Informed and Environment Aware Multimodal Framework for Bird's-eye-view Vehicle Semantic Segmentation [57.2213693781672]
Bird's-eye-view (BEV) semantic segmentation is becoming crucial in autonomous driving systems.
We propose OE-BevSeg, an end-to-end multimodal framework that enhances BEV segmentation performance.
Our approach achieves state-of-the-art results by a large margin on the nuScenes dataset for vehicle segmentation.
arXiv Detail & Related papers (2024-07-18T03:48:22Z)
- An Efficient Transformer for Simultaneous Learning of BEV and Lane Representations in 3D Lane Detection [55.281369497158515]
We propose an efficient transformer for 3D lane detection.
Different from the vanilla transformer, our model contains a cross-attention mechanism to simultaneously learn lane and BEV representations.
Our method obtains 2D and 3D lane predictions by applying the lane features to the image-view and BEV features, respectively.
arXiv Detail & Related papers (2023-06-08T04:18:31Z)
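A generic sketch of the cross-attention idea in the entry above, where lane queries attend to flattened BEV features; the class name LaneBEVCrossAttention and the shapes are assumptions, not the paper's code.

```python
# Generic cross-attention between lane queries and BEV features
# (illustrative sketch only; names and shapes are hypothetical).
import torch
import torch.nn as nn

class LaneBEVCrossAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, lane_queries: torch.Tensor, bev_feats: torch.Tensor) -> torch.Tensor:
        # lane_queries: (B, N, C); bev_feats: (B, C, H, W)
        kv = bev_feats.flatten(2).transpose(1, 2)  # (B, H*W, C)
        out, _ = self.attn(lane_queries, kv, kv)   # each lane query attends to all BEV cells
        return out
```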
- Leveraging BEV Representation for 360-degree Visual Place Recognition [14.497501941931759]
This paper investigates the advantages of using Bird's Eye View representation in 360-degree visual place recognition (VPR).
We propose a novel network architecture that utilizes the BEV representation in feature extraction, feature aggregation, and vision-LiDAR fusion.
The proposed BEV-based method is evaluated in ablation and comparative studies on two datasets.
arXiv Detail & Related papers (2023-05-23T08:29:42Z)
- A Cross-Scale Hierarchical Transformer with Correspondence-Augmented Attention for inferring Bird's-Eye-View Semantic Segmentation [13.013635162859108]
Inferring BEV semantic segmentation from multi-camera-view images is a popular scheme in the community, as it relies on cheap devices and supports real-time processing.
We propose a novel cross-scale hierarchical Transformer with correspondence-augmented attention for BEV semantic segmentation inference.
Our method has state-of-the-art performance in inferring BEV semantic segmentation conditioned on multi-camera-view images.
arXiv Detail & Related papers (2023-04-07T13:52:47Z)
- Rethinking Range View Representation for LiDAR Segmentation [66.73116059734788]
"Many-to-one" mapping, semantic incoherence, and shape deformation are possible impediments against effective learning from range view projections.
We present RangeFormer, a full-cycle framework comprising novel designs across network architecture, data augmentation, and post-processing.
We show that, for the first time, a range view method is able to surpass its point-, voxel-, and multi-view-fusion-based counterparts on competitive LiDAR semantic and panoptic segmentation benchmarks.
arXiv Detail & Related papers (2023-03-09T16:13:27Z)
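The "many-to-one" mapping that RangeFormer addresses comes from the spherical range-view projection itself; below is a generic numpy sketch of that projection (standard KITTI-style preprocessing, not RangeFormer's actual code; the 64 x 2048 image size is a typical assumption for a 64-beam sensor).

```python
# Generic spherical (range-view) projection of a LiDAR point cloud.
import numpy as np

def range_projection(points, h=64, w=2048, fov_up=3.0, fov_down=-25.0):
    """Project (N, 3) points to an (h, w) range image. Several 3D points can
    land on the same pixel, which is the 'many-to-one' issue noted above."""
    fov_up, fov_down = np.radians(fov_up), np.radians(fov_down)
    fov = abs(fov_up) + abs(fov_down)
    r = np.linalg.norm(points, axis=1)
    yaw = -np.arctan2(points[:, 1], points[:, 0])
    pitch = np.arcsin(points[:, 2] / np.maximum(r, 1e-8))
    u = ((yaw / np.pi + 1.0) * 0.5 * w).astype(np.int32) % w
    # points outside the vertical FOV are clamped here; a real pipeline
    # would typically mask them out instead
    v = ((1.0 - (pitch + abs(fov_down)) / fov) * h).clip(0, h - 1).astype(np.int32)
    img = np.full((h, w), -1.0, dtype=np.float32)
    order = np.argsort(r)[::-1]       # write far-to-near so the nearest
    img[v[order], u[order]] = r[order]  # point wins each contested pixel
    return img
```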
- X-Align: Cross-Modal Cross-View Alignment for Bird's-Eye-View Segmentation [44.95630790801856]
X-Align is a novel end-to-end cross-modal and cross-view learning framework for BEV segmentation.
X-Align significantly outperforms the state-of-the-art by 3 absolute mIoU points on nuScenes.
arXiv Detail & Related papers (2022-10-13T06:42:46Z)
- CoBEVT: Cooperative Bird's Eye View Semantic Segmentation with Sparse Transformers [36.838065731893735]
CoBEVT is the first generic multi-agent perception framework that can cooperatively generate BEV map predictions.
CoBEVT achieves state-of-the-art performance for cooperative BEV semantic segmentation.
arXiv Detail & Related papers (2022-07-05T17:59:28Z)
- GitNet: Geometric Prior-based Transformation for Birds-Eye-View Segmentation [105.19949897812494]
Birds-eye-view (BEV) semantic segmentation is critical for autonomous driving.
We present a novel two-stage Geometry Prior-based Transformation framework named GitNet.
arXiv Detail & Related papers (2022-04-16T06:46:45Z)