V2X-ViT: Vehicle-to-Everything Cooperative Perception with Vision
Transformer
- URL: http://arxiv.org/abs/2203.10638v1
- Date: Sun, 20 Mar 2022 20:18:25 GMT
- Title: V2X-ViT: Vehicle-to-Everything Cooperative Perception with Vision
Transformer
- Authors: Runsheng Xu, Hao Xiang, Zhengzhong Tu, Xin Xia, Ming-Hsuan Yang, Jiaqi
Ma
- Abstract summary: We build a holistic attention model, namely V2X-ViT, to fuse information across on-road agents.
V2X-ViT consists of alternating layers of heterogeneous multi-agent self-attention and multi-scale window self-attention.
To validate our approach, we create a large-scale V2X perception dataset.
- Score: 58.71845618090022
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we investigate the application of Vehicle-to-Everything (V2X)
communication to improve the perception performance of autonomous vehicles. We
present a robust cooperative perception framework with V2X communication using
a novel vision Transformer. Specifically, we build a holistic attention model,
namely V2X-ViT, to effectively fuse information across on-road agents (i.e.,
vehicles and infrastructure). V2X-ViT consists of alternating layers of
heterogeneous multi-agent self-attention and multi-scale window self-attention,
which capture inter-agent interaction and per-agent spatial relationships,
respectively.
These key modules are designed in a unified Transformer architecture to handle
common V2X challenges, including asynchronous information sharing, pose errors,
and heterogeneity of V2X components. To validate our approach, we create a
large-scale V2X perception dataset using CARLA and OpenCDA. Extensive
experimental results demonstrate that V2X-ViT sets new state-of-the-art
performance for 3D object detection and achieves robust performance even under
harsh, noisy environments. The dataset, source code, and trained models will be
open-sourced.
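To make the alternating design described in the abstract concrete, below is a minimal PyTorch sketch; it is not the authors' released implementation. `AgentAttention` is a stand-in for heterogeneous multi-agent self-attention (agent-type conditioning, delay compensation, and pose-error handling are omitted), and `WindowAttention` is a stand-in for multi-scale window self-attention, applied here at two window sizes in sequence rather than fused in parallel branches. All class names, parameter names, and the feature-shape convention are illustrative assumptions.
```python
# Structural sketch only: alternating inter-agent and windowed spatial attention.
# Assumed feature shape: (B, A, H, W, C) = batch, agents, BEV height/width, channels.
import torch
import torch.nn as nn


class AgentAttention(nn.Module):
    """Attend across agents at each spatial location (HMSA stand-in)."""

    def __init__(self, dim, heads=8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                                        # x: (B, A, H, W, C)
        B, A, H, W, C = x.shape
        t = x.permute(0, 2, 3, 1, 4).reshape(B * H * W, A, C)    # tokens = agents
        n = self.norm(t)
        t = t + self.attn(n, n, n, need_weights=False)[0]        # residual attention
        return t.reshape(B, H, W, A, C).permute(0, 3, 1, 2, 4)


class WindowAttention(nn.Module):
    """Attend within non-overlapping spatial windows (MSWin stand-in, one scale)."""

    def __init__(self, dim, window, heads=8):
        super().__init__()
        self.window = window
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                                        # H, W divisible by window
        B, A, H, W, C = x.shape
        w = self.window
        t = x.reshape(B, A, H // w, w, W // w, w, C)
        t = t.permute(0, 1, 2, 4, 3, 5, 6).reshape(-1, w * w, C)  # tokens = cells in a window
        n = self.norm(t)
        t = t + self.attn(n, n, n, need_weights=False)[0]
        t = t.reshape(B, A, H // w, W // w, w, w, C)
        return t.permute(0, 1, 2, 4, 3, 5, 6).reshape(B, A, H, W, C)


class V2XViTBlock(nn.Module):
    """One alternating layer: inter-agent attention, then two spatial window scales."""

    def __init__(self, dim, windows=(4, 8)):
        super().__init__()
        self.agent = AgentAttention(dim)
        self.spatial = nn.ModuleList([WindowAttention(dim, w) for w in windows])

    def forward(self, x):
        x = self.agent(x)              # inter-agent interaction
        for win in self.spatial:       # per-agent spatial context at two scales
            x = win(x)
        return x


if __name__ == "__main__":
    feats = torch.randn(1, 3, 32, 32, 64)    # one scene, 3 agents (e.g., 2 vehicles + 1 RSU)
    print(V2XViTBlock(64)(feats).shape)      # torch.Size([1, 3, 32, 32, 64])
```
Stacking several such blocks yields an encoder with the alternating structure named in the abstract; the paper's actual modules additionally account for agent type, time delay, and pose noise.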
Related papers
- CooPre: Cooperative Pretraining for V2X Cooperative Perception [47.00472259100765]
We present a self-supervised learning method for V2X cooperative perception.
We utilize the vast amount of unlabeled 3D V2X data to enhance the perception performance.
arXiv Detail & Related papers (2024-08-20T23:39:26Z)
- DI-V2X: Learning Domain-Invariant Representation for Vehicle-Infrastructure Collaborative 3D Object Detection [78.09431523221458]
DI-V2X aims to learn Domain-Invariant representations through a new distillation framework.
DI-V2X comprises three essential components: a domain-mixing instance augmentation (DMA) module, a progressive domain-invariant distillation (PDD) module, and a domain-adaptive fusion (DAF) module.
arXiv Detail & Related papers (2023-12-25T14:40:46Z)
- Learning Cooperative Trajectory Representations for Motion Forecasting [4.380073528690906]
We propose a forecasting-oriented representation paradigm to utilize motion and interaction features from cooperative information.
We present V2X-Graph, a representative framework to achieve interpretable and end-to-end trajectory feature fusion for cooperative motion forecasting.
To further evaluate on vehicle-to-everything (V2X) scenario, we construct the first real-world V2X motion forecasting dataset V2X-Traj.
arXiv Detail & Related papers (2023-11-01T08:53:05Z)
- HM-ViT: Hetero-modal Vehicle-to-Vehicle Cooperative Perception with Vision Transformer [4.957079586254435]
HM-ViT is the first unified multi-agent hetero-modal cooperative perception framework.
It can collaboratively predict 3D objects for highly dynamic vehicle-to-vehicle (V2V) collaborations with varying numbers and types of agents.
arXiv Detail & Related papers (2023-04-20T20:09:59Z)
- V2V4Real: A Real-world Large-scale Dataset for Vehicle-to-Vehicle Cooperative Perception [49.7212681947463]
Vehicle-to-Vehicle (V2V) cooperative perception has great potential to revolutionize the autonomous driving industry.
We present V2V4Real, the first large-scale real-world multi-modal dataset for V2V perception.
Our dataset covers a driving area of 410 km, comprising 20K LiDAR frames, 40K RGB frames, 240K annotated 3D bounding boxes for 5 classes, and HD maps.
arXiv Detail & Related papers (2023-03-14T02:49:20Z)
- CoBEVT: Cooperative Bird's Eye View Semantic Segmentation with Sparse Transformers [36.838065731893735]
CoBEVT is the first generic multi-agent perception framework that can cooperatively generate BEV map predictions.
CoBEVT achieves state-of-the-art performance for cooperative BEV semantic segmentation.
arXiv Detail & Related papers (2022-07-05T17:59:28Z)
- ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for Image Recognition and Beyond [76.35955924137986]
We propose a Vision Transformer Advanced by Exploring the intrinsic Inductive Bias (IB) from convolutions, i.e., ViTAE.
ViTAE has several spatial pyramid reduction modules to downsample and embed the input image into tokens with rich multi-scale context.
We obtain state-of-the-art classification performance: 88.5% Top-1 accuracy on the ImageNet validation set and the best 91.2% Top-1 accuracy on the ImageNet Real validation set.
arXiv Detail & Related papers (2022-02-21T10:40:05Z)
- V2X-Sim: A Virtual Collaborative Perception Dataset for Autonomous Driving [26.961213523096948]
Vehicle-to-everything (V2X) denotes the collaboration between a vehicle and any entity in its surroundings.
We present the V2X-Sim dataset, the first public large-scale collaborative perception dataset in autonomous driving.
arXiv Detail & Related papers (2022-02-17T05:14:02Z)
- Full-Duplex Strategy for Video Object Segmentation [141.43983376262815]
The full-duplex strategy network (FSNet) is a novel framework for video object segmentation (VOS).
FSNet performs cross-modal feature passing (i.e., transmission and receiving) simultaneously before the fusion and decoding stage.
We show that FSNet outperforms other state-of-the-art methods on both the VOS and video salient object detection tasks.
arXiv Detail & Related papers (2021-08-06T14:50:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences of its use.