CoBEVT: Cooperative Bird's Eye View Semantic Segmentation with Sparse Transformers
- URL: http://arxiv.org/abs/2207.02202v1
- Date: Tue, 5 Jul 2022 17:59:28 GMT
- Title: CoBEVT: Cooperative Bird's Eye View Semantic Segmentation with Sparse Transformers
- Authors: Runsheng Xu, Zhengzhong Tu, Hao Xiang, Wei Shao, Bolei Zhou, Jiaqi Ma
- Abstract summary: CoBEVT is the first generic multi-agent perception framework that can cooperatively generate BEV map predictions.
CoBEVT achieves state-of-the-art performance for cooperative BEV semantic segmentation.
- Score: 36.838065731893735
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Bird's eye view (BEV) semantic segmentation plays a crucial role in spatial
sensing for autonomous driving. Although recent literature has made significant
progress on BEV map understanding, existing methods are all based on
single-agent camera systems, which struggle to handle occlusions and to detect
distant objects in complex traffic scenes. Vehicle-to-Vehicle (V2V)
communication technologies have enabled autonomous vehicles to share sensing
information, which can dramatically improve the perception performance and
range as compared to single-agent systems. In this paper, we propose CoBEVT,
the first generic multi-agent multi-camera perception framework that can
cooperatively generate BEV map predictions. To efficiently fuse camera features
from multi-view and multi-agent data in an underlying Transformer architecture,
we design a fused axial attention (FAX) module, which sparsely captures local
and global spatial interactions across views and agents. Extensive
experiments on the V2V perception dataset, OPV2V, demonstrate that CoBEVT
achieves state-of-the-art performance for cooperative BEV semantic
segmentation. Moreover, CoBEVT is shown to be generalizable to other tasks,
including 1) BEV segmentation with single-agent multi-camera systems and 2) 3D
object detection with multi-agent LiDAR systems, and achieves state-of-the-art
performance with real-time inference speed.
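
To make the FAX idea concrete, below is a minimal PyTorch sketch of a fused axial attention block as the abstract describes it: self-attention inside local windows, followed by self-attention over a strided global grid, so tokens across the BEV map interact sparsely rather than through full quadratic attention. The class name, tensor layout, and hyperparameters are illustrative assumptions, not the authors' released implementation (which also fuses features across camera views and agents).

import torch
import torch.nn as nn

class FusedAxialAttention(nn.Module):
    """Minimal sketch of a fused axial attention (FAX) block: sparse local
    attention inside non-overlapping windows, then sparse global attention
    over a strided grid, so every token can reach the whole BEV map at
    sub-quadratic cost. Assumed feature layout: (B, H, W, C)."""

    def __init__(self, dim: int, num_heads: int = 4, window: int = 4):
        super().__init__()
        self.window = window
        self.local_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.global_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_local = nn.LayerNorm(dim)
        self.norm_global = nn.LayerNorm(dim)

    def _attend(self, x, attn, grid: bool):
        B, H, W, C = x.shape
        w = self.window
        if grid:   # grid partition: each group holds tokens strided across the map
            x = x.view(B, w, H // w, w, W // w, C).permute(0, 2, 4, 1, 3, 5)
        else:      # block partition: each group is a contiguous w x w window
            x = x.view(B, H // w, w, W // w, w, C).permute(0, 1, 3, 2, 4, 5)
        nh, nw = x.shape[1], x.shape[2]
        t = x.reshape(B * nh * nw, w * w, C)
        out, _ = attn(t, t, t)          # self-attention within each sparse group
        out = out.view(B, nh, nw, w, w, C)
        # Undo the partition to restore the (B, H, W, C) layout.
        out = out.permute(0, 3, 1, 4, 2, 5) if grid else out.permute(0, 1, 3, 2, 4, 5)
        return out.reshape(B, H, W, C)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self._attend(self.norm_local(x), self.local_attn, grid=False)
        x = x + self._attend(self.norm_global(x), self.global_attn, grid=True)
        return x

# Example: a 32x32 BEV feature map with 64 channels (H and W must be
# divisible by the window size).
fax = FusedAxialAttention(dim=64)
bev = torch.randn(2, 32, 32, 64)
out = fax(bev)   # -> torch.Size([2, 32, 32, 64])

The two-branch design is the key point: the local branch models fine detail inside each window, while the strided global branch lets distant windows (e.g., features contributed by different agents) exchange information in the same layer.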
Related papers
- OE-BevSeg: An Object Informed and Environment Aware Multimodal Framework for Bird's-eye-view Vehicle Semantic Segmentation [57.2213693781672]
Bird's-eye-view (BEV) semantic segmentation is becoming crucial in autonomous driving systems.
We propose OE-BevSeg, an end-to-end multimodal framework that enhances BEV segmentation performance.
Our approach achieves state-of-the-art results by a large margin on the nuScenes dataset for vehicle segmentation.
arXiv Detail & Related papers (2024-07-18T03:48:22Z)
- IFTR: An Instance-Level Fusion Transformer for Visual Collaborative Perception [9.117534139771738]
Multi-agent collaborative perception has emerged as a widely recognized technology in the field of autonomous driving.
Current collaborative perception predominantly relies on LiDAR point clouds, with significantly less attention given to methods using camera images.
This work proposes an instance-level fusion transformer for visual collaborative perception.
arXiv Detail & Related papers (2024-07-13T11:38:15Z)
- CoBEVFusion: Cooperative Perception with LiDAR-Camera Bird's-Eye View Fusion [0.0]
Recent approaches to cooperative perception share only single-sensor information, such as camera or LiDAR data.
We present a framework, called CoBEVFusion, that fuses LiDAR and camera data to create a Bird's-Eye View (BEV) representation.
Our framework was evaluated on the cooperative perception dataset OPV2V for two perception tasks: BEV semantic segmentation and 3D object detection.
arXiv Detail & Related papers (2023-10-09T17:52:26Z)
- ViT-BEVSeg: A Hierarchical Transformer Network for Monocular Birds-Eye-View Segmentation [2.70519393940262]
We evaluate vision transformers (ViT) as a backbone architecture for generating bird's-eye-view (BEV) maps.
Our network architecture, ViT-BEVSeg, employs standard vision transformers to generate a multi-scale representation of the input image.
We evaluate our approach on the nuScenes dataset demonstrating a considerable improvement relative to state-of-the-art approaches.
arXiv Detail & Related papers (2022-05-31T10:18:36Z)
- BEVerse: Unified Perception and Prediction in Birds-Eye-View for Vision-Centric Autonomous Driving [92.05963633802979]
We present BEVerse, a unified framework for 3D perception and prediction based on multi-camera systems.
We show that the multi-task BEVerse outperforms single-task methods on 3D object detection, semantic map construction, and motion prediction.
arXiv Detail & Related papers (2022-05-19T17:55:35Z)
- M^2BEV: Multi-Camera Joint 3D Detection and Segmentation with Unified Birds-Eye View Representation [145.6041893646006]
M^2BEV is a unified framework that jointly performs 3D object detection and map segmentation.
M^2BEV handles both tasks with a single unified model, improving efficiency.
arXiv Detail & Related papers (2022-04-11T13:43:25Z)
- V2X-ViT: Vehicle-to-Everything Cooperative Perception with Vision Transformer [58.71845618090022]
We build a holistic attention model, namely V2X-ViT, to fuse information across on-road agents.
V2X-ViT consists of alternating layers of heterogeneous multi-agent self-attention and multi-scale window self-attention (a minimal sketch of this alternating pattern follows the list below).
To validate our approach, we create a large-scale V2X perception dataset.
arXiv Detail & Related papers (2022-03-20T20:18:25Z)
- Fine-Grained Vehicle Perception via 3D Part-Guided Visual Data Augmentation [77.60050239225086]
We propose an effective training data generation process by fitting a 3D car model with dynamic parts to vehicles in real images.
Our approach is fully automatic without any human interaction.
We present a multi-task network for VUS parsing and a multi-stream network for VHI parsing.
arXiv Detail & Related papers (2020-12-15T03:03:38Z)
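
As a companion to the FAX sketch above, the following snippet illustrates the alternating-layer pattern described in the V2X-ViT entry: per-location attention across agents interleaved with spatial attention within each agent's map. All names and the (batch, agents, tokens, channels) layout are hypothetical assumptions; a faithful implementation would add heterogeneous agent-type embeddings and multi-scale windows as in that paper.

import torch
import torch.nn as nn

class AlternatingV2XEncoder(nn.Module):
    """Hypothetical sketch of V2X-ViT's alternating pattern: layers that
    fuse features across agents alternate with layers that attend over
    spatial tokens. Assumed input layout: (B, A, N, C), with A agents
    and N spatial tokens per agent."""

    def __init__(self, dim: int, depth: int = 2, num_heads: int = 4):
        super().__init__()
        self.agent_layers = nn.ModuleList(
            nn.MultiheadAttention(dim, num_heads, batch_first=True) for _ in range(depth))
        self.spatial_layers = nn.ModuleList(
            nn.MultiheadAttention(dim, num_heads, batch_first=True) for _ in range(depth))
        self.norms = nn.ModuleList(nn.LayerNorm(dim) for _ in range(2 * depth))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, A, N, C = x.shape
        for i, (agent_attn, spatial_attn) in enumerate(
                zip(self.agent_layers, self.spatial_layers)):
            # Cross-agent attention: at each spatial location, the A agent
            # features attend to one another (sequence length = A).
            t = self.norms[2 * i](x).permute(0, 2, 1, 3).reshape(B * N, A, C)
            out, _ = agent_attn(t, t, t)
            x = x + out.view(B, N, A, C).permute(0, 2, 1, 3)
            # Spatial attention: each agent's N tokens attend over the map
            # (a real implementation would restrict this to local windows).
            t = self.norms[2 * i + 1](x).reshape(B * A, N, C)
            out, _ = spatial_attn(t, t, t)
            x = x + out.view(B, A, N, C)
        return x

# Example: 2 agents, an 8x8 BEV map flattened to 64 tokens, 32 channels.
enc = AlternatingV2XEncoder(dim=32)
feats = torch.randn(1, 2, 64, 32)
fused = enc(feats)   # -> torch.Size([1, 2, 64, 32])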
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the accuracy of this information and is not responsible for any consequences of its use.