V2X-ViT: Vehicle-to-Everything Cooperative Perception with Vision Transformer
- URL: http://arxiv.org/abs/2203.10638v1
- Date: Sun, 20 Mar 2022 20:18:25 GMT
- Title: V2X-ViT: Vehicle-to-Everything Cooperative Perception with Vision Transformer
- Authors: Runsheng Xu, Hao Xiang, Zhengzhong Tu, Xin Xia, Ming-Hsuan Yang, Jiaqi Ma
- Abstract summary: We build a holistic attention model, namely V2X-ViT, to fuse information across on-road agents.
V2X-ViT consists of alternating layers of heterogeneous multi-agent self-attention and multi-scale window self-attention.
To validate our approach, we create a large-scale V2X perception dataset.
- Score: 58.71845618090022
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we investigate the application of Vehicle-to-Everything (V2X)
communication to improve the perception performance of autonomous vehicles. We
present a robust cooperative perception framework with V2X communication using
a novel vision Transformer. Specifically, we build a holistic attention model,
namely V2X-ViT, to effectively fuse information across on-road agents (i.e.,
vehicles and infrastructure). V2X-ViT consists of alternating layers of
heterogeneous multi-agent self-attention and multi-scale window self-attention,
which capture inter-agent interaction and per-agent spatial relationships.
These key modules are designed in a unified Transformer architecture to handle
common V2X challenges, including asynchronous information sharing, pose errors,
and heterogeneity of V2X components. To validate our approach, we create a
large-scale V2X perception dataset using CARLA and OpenCDA. Extensive
experimental results demonstrate that V2X-ViT sets new state-of-the-art
performance for 3D object detection and achieves robust performance even under
harsh, noisy environments. The dataset, source code, and trained models will be
open-sourced.
Related papers
- Mixed Signals: A Diverse Point Cloud Dataset for Heterogeneous LiDAR V2X Collaboration [56.75198775820637]
Vehicle-to-everything (V2X) collaborative perception has emerged as a promising solution to address the limitations of single-vehicle perception systems.
To address these gaps, we present Mixed Signals, a comprehensive V2X dataset featuring 45.1k point clouds and 240.6k bounding boxes.
Our dataset provides precisely aligned point clouds and bounding box annotations across 10 classes, ensuring reliable data for perception training.
arXiv Detail & Related papers (2025-02-19T23:53:00Z)
- V2X-DGPE: Addressing Domain Gaps and Pose Errors for Robust Collaborative 3D Object Detection [18.694510415777632]
V2X-DGPE is a high-accuracy and robust V2X feature-level collaborative perception framework.
The proposed method outperforms existing approaches, achieving state-of-the-art detection performance.
arXiv Detail & Related papers (2025-01-04T19:28:55Z)
- LaVin-DiT: Large Vision Diffusion Transformer [99.98106406059333]
LaVin-DiT is a scalable and unified foundation model designed to tackle over 20 computer vision tasks in a generative framework.
We introduce key innovations to optimize generative performance for vision tasks.
The model is scaled from 0.1B to 3.4B parameters, demonstrating substantial scalability and state-of-the-art performance across diverse vision tasks.
arXiv Detail & Related papers (2024-11-18T12:05:27Z)
- CooPre: Cooperative Pretraining for V2X Cooperative Perception [47.00472259100765]
We present a self-supervised learning method for V2X cooperative perception.
We utilize the vast amount of unlabeled 3D V2X data to enhance the perception performance.
arXiv Detail & Related papers (2024-08-20T23:39:26Z)
- V2X-Real: a Large-Scale Dataset for Vehicle-to-Everything Cooperative Perception [22.3955949838171]
We present V2X-Real, a large-scale dataset that includes a mixture of multiple vehicles and smart infrastructure.
Our dataset contains 33K LiDAR frames and 171K camera frames with over 1.2M annotated bounding boxes of 10 categories in very challenging urban scenarios.
arXiv Detail & Related papers (2024-03-24T06:30:02Z)
- Learning Cooperative Trajectory Representations for Motion Forecasting [4.380073528690906]
We propose a forecasting-oriented representation paradigm to utilize motion and interaction features from cooperative information.
We present V2X-Graph, a representative framework to achieve interpretable and end-to-end trajectory feature fusion for cooperative motion forecasting.
To further evaluate on the vehicle-to-everything (V2X) scenario, we construct the first real-world V2X motion forecasting dataset, V2X-Traj.
arXiv Detail & Related papers (2023-11-01T08:53:05Z)
- HM-ViT: Hetero-modal Vehicle-to-Vehicle Cooperative perception with vision transformer [4.957079586254435]
HM-ViT is the first unified multi-agent hetero-modal cooperative perception framework.
It can collaboratively predict 3D objects for highly dynamic vehicle-to-vehicle (V2V) collaborations with varying numbers and types of agents.
arXiv Detail & Related papers (2023-04-20T20:09:59Z)
- V2V4Real: A Real-world Large-scale Dataset for Vehicle-to-Vehicle Cooperative Perception [49.7212681947463]
Vehicle-to-Vehicle (V2V) cooperative perception systems have great potential to revolutionize the autonomous driving industry.
We present V2V4Real, the first large-scale real-world multi-modal dataset for V2V perception.
Our dataset covers a driving area of 410 km, comprising 20K LiDAR frames, 40K RGB frames, 240K annotated 3D bounding boxes for 5 classes, and HDMaps.
arXiv Detail & Related papers (2023-03-14T02:49:20Z)
- CoBEVT: Cooperative Bird's Eye View Semantic Segmentation with Sparse Transformers [36.838065731893735]
CoBEVT is the first generic multi-agent perception framework that can cooperatively generate BEV map predictions.
CoBEVT achieves state-of-the-art performance for cooperative BEV semantic segmentation.
arXiv Detail & Related papers (2022-07-05T17:59:28Z)
- ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for Image Recognition and Beyond [76.35955924137986]
We propose a Vision Transformer Advanced by Exploring intrinsic inductive bias (IB) from convolutions, i.e., ViTAE.
ViTAE has several spatial pyramid reduction modules to downsample and embed the input image into tokens with rich multi-scale context (see the sketch after this list).
We obtain state-of-the-art classification performance, i.e., 88.5% Top-1 accuracy on the ImageNet validation set and the best 91.2% Top-1 accuracy on the ImageNet Real validation set.
arXiv Detail & Related papers (2022-02-21T10:40:05Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.