CoBEVT: Cooperative Bird's Eye View Semantic Segmentation with Sparse Transformers
- URL: http://arxiv.org/abs/2207.02202v1
- Date: Tue, 5 Jul 2022 17:59:28 GMT
- Title: CoBEVT: Cooperative Bird's Eye View Semantic Segmentation with Sparse Transformers
- Authors: Runsheng Xu, Zhengzhong Tu, Hao Xiang, Wei Shao, Bolei Zhou, Jiaqi Ma
- Abstract summary: CoBEVT is the first generic multi-agent perception framework that can cooperatively generate BEV map predictions.
CoBEVT achieves state-of-the-art performance for cooperative BEV semantic segmentation.
- Score: 36.838065731893735
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Bird's eye view (BEV) semantic segmentation plays a crucial role in spatial
sensing for autonomous driving. Although recent literature has made significant
progress on BEV map understanding, existing approaches are all based on single-agent camera systems, which struggle to handle occlusions and to detect distant objects in complex traffic scenes. Vehicle-to-Vehicle (V2V)
communication technologies have enabled autonomous vehicles to share sensing
information, which can dramatically improve perception performance and range compared with single-agent systems. In this paper, we propose CoBEVT,
the first generic multi-agent multi-camera perception framework that can
cooperatively generate BEV map predictions. To efficiently fuse camera features
from multi-view and multi-agent data in an underlying Transformer architecture,
we design a fused axial attention (FAX) module, which sparsely captures local and global spatial interactions across views and agents. Extensive experiments on the V2V perception dataset OPV2V demonstrate that CoBEVT
achieves state-of-the-art performance for cooperative BEV semantic
segmentation. Moreover, CoBEVT is shown to be generalizable to other tasks,
including 1) BEV segmentation with single-agent multi-camera systems and 2) 3D object detection with multi-agent LiDAR systems, and achieves state-of-the-art
performance with real-time inference speed.
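The FAX design described in the abstract follows a local-window plus sparse-grid attention split. The snippet below is a minimal, single-agent PyTorch sketch of that pattern; the class names, the `window` size, the residual wiring, and the use of `einops` are illustrative assumptions rather than the authors' released implementation, and the cross-agent/cross-view axis that the real FAX module additionally attends over is omitted for brevity.

```python
import torch
import torch.nn as nn
from einops import rearrange


class _GroupAttention(nn.Module):
    """Plain multi-head self-attention over one group of tokens."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, tokens):                     # tokens: (groups, tokens_per_group, dim)
        out, _ = self.attn(tokens, tokens, tokens)
        return out


class FAXBlockSketch(nn.Module):
    """Local window attention followed by sparse global (grid) attention on a BEV map.
    A simplified, single-agent sketch: normalization/MLP sublayers and the agent axis
    of the actual FAX module are left out."""
    def __init__(self, dim, window=8, heads=4):
        super().__init__()
        self.window = window
        self.local_attn = _GroupAttention(dim, heads)
        self.global_attn = _GroupAttention(dim, heads)

    def forward(self, x):                          # x: (B, H, W, C), H and W divisible by window
        b, h, w, _ = x.shape
        p = self.window
        # 1) Local: tokens inside each non-overlapping P x P window attend to each other.
        local = rearrange(x, 'b (nh p1) (nw p2) c -> (b nh nw) (p1 p2) c', p1=p, p2=p)
        local = self.local_attn(local)
        x = x + rearrange(local, '(b nh nw) (p1 p2) c -> b (nh p1) (nw p2) c',
                          b=b, nh=h // p, p1=p, p2=p)
        # 2) Global: tokens sampled on a regular strided grid spanning the whole map
        #    attend to each other, giving sparse long-range interaction.
        glob = rearrange(x, 'b (p1 nh) (p2 nw) c -> (b nh nw) (p1 p2) c', p1=p, p2=p)
        glob = self.global_attn(glob)
        x = x + rearrange(glob, '(b nh nw) (p1 p2) c -> b (p1 nh) (p2 nw) c',
                          b=b, nh=h // p, p1=p, p2=p)
        return x


bev = torch.randn(2, 32, 32, 64)                   # e.g. a fused 32x32 BEV feature map, 64 channels
out = FAXBlockSketch(dim=64, window=8)(bev)
print(out.shape)                                   # torch.Size([2, 32, 32, 64])
```

Stacking such blocks lets every BEV location exchange information with its neighbours and, sparsely, with distant regions at modest cost, which is the property the abstract attributes to FAX.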
Related papers
- Towards Intelligent Transportation with Pedestrians and Vehicles In-the-Loop: A Surveillance Video-Assisted Federated Digital Twin Framework [62.47416496137193]
We propose a surveillance video assisted federated digital twin (SV-FDT) framework to empower ITSs with pedestrians and vehicles in-the-loop.
The architecture consists of three layers: (i) the end layer, which collects traffic surveillance videos from multiple sources; (ii) the edge layer, responsible for semantic segmentation-based visual understanding, twin agent-based interaction modeling, and local digital twin system (LDTS) creation in local regions; and (iii) the cloud layer, which integrates LDTSs across different regions to construct a global DT model in real time.
arXiv Detail & Related papers (2025-03-06T07:36:06Z)
- OE-BevSeg: An Object Informed and Environment Aware Multimodal Framework for Bird's-eye-view Vehicle Semantic Segmentation [57.2213693781672]
Bird's-eye-view (BEV) semantic segmentation is becoming crucial in autonomous driving systems.
We propose OE-BevSeg, an end-to-end multimodal framework that enhances BEV segmentation performance.
Our approach achieves state-of-the-art results by a large margin on the nuScenes dataset for vehicle segmentation.
arXiv Detail & Related papers (2024-07-18T03:48:22Z)
- IFTR: An Instance-Level Fusion Transformer for Visual Collaborative Perception [9.117534139771738]
Multi-agent collaborative perception has emerged as a widely recognized technology in the field of autonomous driving.
Current collaborative perception predominantly relies on LiDAR point clouds, with significantly less attention given to methods using camera images.
This work proposes an instance-level fusion transformer for visual collaborative perception.
arXiv Detail & Related papers (2024-07-13T11:38:15Z)
- BEVWorld: A Multimodal World Simulator for Autonomous Driving via Scene-Level BEV Latents [56.33989853438012]
We propose BEVWorld, a framework that transforms multimodal sensor inputs into a unified and compact Bird's Eye View latent space for holistic environment modeling.
The proposed world model consists of two main components: a multi-modal tokenizer and a latent BEV sequence diffusion model.
arXiv Detail & Related papers (2024-07-08T07:26:08Z)
- DuoSpaceNet: Leveraging Both Bird's-Eye-View and Perspective View Representations for 3D Object Detection [3.526990431236137]
Multi-view camera-only 3D object detection largely follows two primary paradigms: exploiting bird's-eye-view (BEV) representations or focusing on perspective-view (PV) features.
We propose DuoSpaceNet, a novel framework that fully unifies BEV and PV feature spaces within a single detection pipeline for comprehensive 3D perception.
arXiv Detail & Related papers (2024-05-17T07:04:29Z)
- CoBEVFusion: Cooperative Perception with LiDAR-Camera Bird's-Eye View Fusion [0.0]
Recent approaches in cooperative perception share only single-sensor information, such as camera or LiDAR data.
We present a framework, called CoBEVFusion, that fuses LiDAR and camera data to create a Bird's-Eye View (BEV) representation.
Our framework was evaluated on the cooperative perception dataset OPV2V for two perception tasks: BEV semantic segmentation and 3D object detection.
arXiv Detail & Related papers (2023-10-09T17:52:26Z)
- ViT-BEVSeg: A Hierarchical Transformer Network for Monocular Birds-Eye-View Segmentation [2.70519393940262]
We evaluate the use of vision transformers (ViT) as a backbone architecture to generate Bird's Eye View (BEV) maps.
Our network architecture, ViT-BEVSeg, employs standard vision transformers to generate a multi-scale representation of the input image.
We evaluate our approach on the nuScenes dataset demonstrating a considerable improvement relative to state-of-the-art approaches.
arXiv Detail & Related papers (2022-05-31T10:18:36Z)
- BEVerse: Unified Perception and Prediction in Birds-Eye-View for Vision-Centric Autonomous Driving [92.05963633802979]
We present BEVerse, a unified framework for 3D perception and prediction based on multi-camera systems.
We show that the multi-task BEVerse outperforms single-task methods on 3D object detection, semantic map construction, and motion prediction.
arXiv Detail & Related papers (2022-05-19T17:55:35Z)
- M^2BEV: Multi-Camera Joint 3D Detection and Segmentation with Unified Birds-Eye View Representation [145.6041893646006]
M^2BEV is a unified framework that jointly performs 3D object detection and map segmentation.
M^2BEV infers both tasks with a unified model and improves efficiency.
arXiv Detail & Related papers (2022-04-11T13:43:25Z)
- V2X-ViT: Vehicle-to-Everything Cooperative Perception with Vision Transformer [58.71845618090022]
We build a holistic attention model, namely V2X-ViT, to fuse information across on-road agents.
V2X-ViT consists of alternating layers of heterogeneous multi-agent self-attention and multi-scale window self-attention.
To validate our approach, we create a large-scale V2X perception dataset.
arXiv Detail & Related papers (2022-03-20T20:18:25Z)
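As a rough structural illustration of the alternating-layer pattern mentioned in the V2X-ViT entry above, the sketch below interleaves attention across the agent axis with per-agent spatial self-attention. All class names and shapes are assumptions made for illustration; the heterogeneity (agent-type-aware parameters) and the multi-scale windowing of the actual model are omitted.

```python
import torch
import torch.nn as nn


class AgentAttentionSketch(nn.Module):
    """Attention across agents at each spatial location (simplified, homogeneous)."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                               # x: (B, A, N, C)
        b, a, n, c = x.shape
        t = x.permute(0, 2, 1, 3).reshape(b * n, a, c)  # one sequence of A agents per location
        t, _ = self.attn(t, t, t)
        return x + t.reshape(b, n, a, c).permute(0, 2, 1, 3)


class SpatialAttentionSketch(nn.Module):
    """Self-attention over each agent's own tokens (windowing and multi-scale omitted)."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                               # x: (B, A, N, C)
        b, a, n, c = x.shape
        t = x.reshape(b * a, n, c)
        t, _ = self.attn(t, t, t)
        return x + t.reshape(b, a, n, c)


class V2XEncoderSketch(nn.Module):
    """Alternating agent-wise and spatial attention layers."""
    def __init__(self, dim=64, depth=2):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.ModuleList([AgentAttentionSketch(dim), SpatialAttentionSketch(dim)])
            for _ in range(depth)
        )

    def forward(self, x):                               # x: (B, A, N, C) shared agent features
        for agent_attn, spatial_attn in self.layers:
            x = agent_attn(x)                           # fuse information across agents
            x = spatial_attn(x)                         # refine each agent's spatial map
        return x


feats = torch.randn(1, 3, 256, 64)                      # 3 agents, 256 BEV tokens, 64 channels
print(V2XEncoderSketch()(feats).shape)                  # torch.Size([1, 3, 256, 64])
```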
- Fine-Grained Vehicle Perception via 3D Part-Guided Visual Data Augmentation [77.60050239225086]
We propose an effective training data generation process by fitting a 3D car model with dynamic parts to vehicles in real images.
Our approach is fully automatic without any human interaction.
We present a multi-task network for VUS parsing and a multi-stream network for VHI parsing.
arXiv Detail & Related papers (2020-12-15T03:03:38Z)