VIMI: Vehicle-Infrastructure Multi-view Intermediate Fusion for
Camera-based 3D Object Detection
- URL: http://arxiv.org/abs/2303.10975v1
- Date: Mon, 20 Mar 2023 09:56:17 GMT
- Title: VIMI: Vehicle-Infrastructure Multi-view Intermediate Fusion for
Camera-based 3D Object Detection
- Authors: Zhe Wang, Siqi Fan, Xiaoliang Huo, Tongda Xu, Yan Wang, Jingjing Liu,
Yilun Chen, Ya-Qin Zhang
- Abstract summary: Vehicle-Infrastructure Cooperative 3D Object Detection (VIC3D) makes use of multi-view cameras from both vehicles and traffic infrastructure.
We propose a novel 3D object detection framework, Vehicles-Infrastructure Multi-view Intermediate fusion (VIMI).
VIMI achieves 15.61% overall AP_3D and 21.44% AP_BEV on the new VIC3D dataset, DAIR-V2X-C.
- Score: 17.22491199725569
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In autonomous driving, Vehicle-Infrastructure Cooperative 3D Object Detection
(VIC3D) makes use of multi-view cameras from both vehicles and traffic
infrastructure, providing a global vantage point with rich semantic context of
road conditions beyond a single vehicle viewpoint. Two major challenges prevail
in VIC3D: 1) inherent calibration noise when fusing multi-view images, caused
by time asynchrony across cameras; 2) information loss when projecting 2D
features into 3D space. To address these issues, we propose a novel 3D object
detection framework, Vehicles-Infrastructure Multi-view Intermediate fusion
(VIMI). First, to fully exploit the holistic perspectives from both vehicles
and infrastructure, we propose a Multi-scale Cross Attention (MCA) module that
fuses infrastructure and vehicle features on selective multi-scales to correct
the calibration noise introduced by camera asynchrony. Then, we design a
Camera-aware Channel Masking (CCM) module that uses camera parameters as priors
to augment the fused features. We further introduce a Feature Compression (FC)
module with channel and spatial compression blocks to reduce the size of
transmitted features for enhanced efficiency. Experiments show that VIMI
achieves 15.61% overall AP_3D and 21.44% AP_BEV on the new VIC3D dataset,
DAIR-V2X-C, significantly outperforming state-of-the-art early fusion and late
fusion methods with comparable transmission cost.
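The fusion idea in the abstract can be illustrated with a toy example. The sketch below is not the authors' implementation: it is a minimal single-scale NumPy illustration, assuming flattened feature maps, a plain scaled dot-product cross attention in place of the multi-scale MCA module, and a sigmoid gate computed from a hypothetical flattened camera-parameter vector in place of the CCM module. All names, shapes, and weights (`cross_attention`, `channel_mask`, `cam`, `w`, `b`) are assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(veh, inf):
    """Fuse infrastructure features into vehicle features.

    veh: (Nv, C) vehicle features used as queries.
    inf: (Ni, C) infrastructure features used as keys and values.
    Returns (Nv, C): vehicle features plus an attention-weighted
    sum of infrastructure features (residual connection).
    """
    scale = np.sqrt(veh.shape[-1])
    weights = softmax(veh @ inf.T / scale, axis=-1)  # (Nv, Ni), rows sum to 1
    return veh + weights @ inf


def channel_mask(feat, cam_params, w, b):
    """Re-weight feature channels with a gate derived from camera parameters.

    feat: (N, C) fused features.
    cam_params: (P,) hypothetical flattened intrinsics/extrinsics.
    w: (P, C) and b: (C,) stand in for learned projection weights.
    """
    gate = 1.0 / (1.0 + np.exp(-(cam_params @ w + b)))  # (C,), values in (0, 1)
    return feat * gate  # broadcast the per-channel gate over all positions


rng = np.random.default_rng(0)
veh = rng.normal(size=(6, 8))    # 6 vehicle-side positions, 8 channels
inf = rng.normal(size=(10, 8))   # 10 infrastructure-side positions, 8 channels
cam = rng.normal(size=(12,))     # assumed camera-parameter vector
w = rng.normal(size=(12, 8)) * 0.1
b = np.zeros(8)

fused = cross_attention(veh, inf)
masked = channel_mask(fused, cam, w, b)
print(fused.shape, masked.shape)
```

In this sketch the attention weights let each vehicle-side position draw on whichever infrastructure-side positions match it best rather than on a fixed calibrated correspondence, which is the intuition behind using attention to tolerate calibration noise. The gate only rescales channels, so it never increases the magnitude of a feature.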
Related papers
- EMIFF: Enhanced Multi-scale Image Feature Fusion for
Vehicle-Infrastructure Cooperative 3D Object Detection [23.32916754209488]
Two major challenges persist in vehicle-infrastructure cooperative 3D (VIC3D) object detection.
We propose a novel camera-based 3D detection framework for the VIC3D task, Enhanced Multi-scale Image Feature Fusion (EMIFF).
Experiments show that EMIFF achieves SOTA on DAIR-V2X-C datasets, significantly outperforming previous early-fusion and late-fusion methods with comparable transmission costs.
arXiv Detail & Related papers (2024-02-23T11:35:48Z)
- Multi-target multi-camera vehicle tracking using transformer-based
camera link model and spatial-temporal information [29.34298951501007]
Multi-target multi-camera tracking (MTMCT) of vehicles, i.e. tracking vehicles across multiple cameras, is a crucial application for the development of smart cities and intelligent traffic systems.
The main challenges of vehicle MTMCT include the intra-class variability of the same vehicle and the inter-class similarity between different vehicles.
We propose a transformer-based camera link model with spatial and temporal filtering to conduct cross camera tracking.
arXiv Detail & Related papers (2023-01-18T22:27:08Z)
- DETR4D: Direct Multi-View 3D Object Detection with Sparse Attention [50.11672196146829]
3D object detection with surround-view images is an essential task for autonomous driving.
We propose DETR4D, a Transformer-based framework that explores sparse attention and direct feature query for 3D object detection in multi-view images.
arXiv Detail & Related papers (2022-12-15T14:18:47Z)
- MSMDFusion: Fusing LiDAR and Camera at Multiple Scales with Multi-Depth
Seeds for 3D Object Detection [89.26380781863665]
Fusing LiDAR and camera information is essential for achieving accurate and reliable 3D object detection in autonomous driving systems.
Recent approaches aim at exploring the semantic densities of camera features through lifting points in 2D camera images into 3D space for fusion.
We propose a novel framework that focuses on the multi-scale progressive interaction of the multi-granularity LiDAR and camera features.
arXiv Detail & Related papers (2022-09-07T12:29:29Z)
- A Simple Baseline for Multi-Camera 3D Object Detection [94.63944826540491]
3D object detection with surrounding cameras has been a promising direction for autonomous driving.
We present SimMOD, a Simple baseline for Multi-camera Object Detection.
We conduct extensive experiments on the 3D object detection benchmark of nuScenes to demonstrate the effectiveness of SimMOD.
arXiv Detail & Related papers (2022-08-22T03:38:01Z)
- BEVFusion: Multi-Task Multi-Sensor Fusion with Unified Bird's-Eye View
Representation [116.6111047218081]
We introduce BEVFusion, a generic multi-task multi-sensor fusion framework.
It unifies multi-modal features in the shared bird's-eye view representation space.
It achieves 1.3% higher mAP and NDS on 3D object detection and 13.6% higher mIoU on BEV map segmentation, with 1.9x lower cost.
arXiv Detail & Related papers (2022-05-26T17:59:35Z)
- DAIR-V2X: A Large-Scale Dataset for Vehicle-Infrastructure Cooperative
3D Object Detection [8.681912341444901]
DAIR-V2X is the first large-scale, multi-modality, multi-view dataset from real scenarios for Vehicle-Infrastructure Cooperative Autonomous Driving.
DAIR-V2X comprises 71254 LiDAR frames and 71254 Camera frames, and all frames are captured from real scenes with 3D annotations.
arXiv Detail & Related papers (2022-04-12T07:13:33Z)
- Know Your Surroundings: Panoramic Multi-Object Tracking by Multimodality
Collaboration [56.01625477187448]
We propose a MultiModality PAnoramic multi-object Tracking framework (MMPAT).
It takes both 2D panorama images and 3D point clouds as input and then infers target trajectories using the multimodality data.
We evaluate the proposed method on the JRDB dataset, where the MMPAT achieves the top performance in both the detection and tracking tasks.
arXiv Detail & Related papers (2021-05-31T03:16:38Z)
- Robust 2D/3D Vehicle Parsing in CVIS [54.825777404511605]
We present a novel approach to robustly detect and perceive vehicles in different camera views as part of a cooperative vehicle-infrastructure system (CVIS).
Our formulation is designed for arbitrary camera views and makes no assumptions about intrinsic or extrinsic parameters.
In practice, our approach outperforms SOTA methods on 2D detection, instance segmentation, and 6-DoF pose estimation.
arXiv Detail & Related papers (2021-03-11T03:35:05Z)
- Traffic-Aware Multi-Camera Tracking of Vehicles Based on ReID and Camera
Link Model [43.850588717944916]
Multi-target multi-camera tracking (MTMCT) is a crucial technique for smart city applications.
We propose an effective and reliable MTMCT framework for vehicles.
Our proposed MTMCT is evaluated on the CityFlow dataset and achieves a new state-of-the-art performance with IDF1 of 74.93%.
arXiv Detail & Related papers (2020-08-22T08:54:47Z)
- 3D-CVF: Generating Joint Camera and LiDAR Features Using Cross-View
Spatial Feature Fusion for 3D Object Detection [10.507404260449333]
We propose a new architecture for fusing camera and LiDAR sensors for 3D object detection.
The proposed 3D-CVF achieves state-of-the-art performance in the KITTI benchmark.
arXiv Detail & Related papers (2020-04-27T08:34:46Z)
This list is automatically generated from the titles and abstracts of the papers on this site.