VIMI: Vehicle-Infrastructure Multi-view Intermediate Fusion for
Camera-based 3D Object Detection
- URL: http://arxiv.org/abs/2303.10975v1
- Date: Mon, 20 Mar 2023 09:56:17 GMT
- Title: VIMI: Vehicle-Infrastructure Multi-view Intermediate Fusion for
Camera-based 3D Object Detection
- Authors: Zhe Wang, Siqi Fan, Xiaoliang Huo, Tongda Xu, Yan Wang, Jingjing Liu,
Yilun Chen, Ya-Qin Zhang
- Abstract summary: Vehicle-Infrastructure Cooperative 3D Object Detection (VIC3D) makes use of multi-view cameras from both vehicles and traffic infrastructure.
We propose a novel 3D object detection framework, Vehicle-Infrastructure Multi-view Intermediate Fusion (VIMI).
VIMI achieves 15.61% overall AP_3D and 21.44% AP_BEV on the new VIC3D dataset, DAIR-V2X-C.
- Score: 17.22491199725569
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In autonomous driving, Vehicle-Infrastructure Cooperative 3D Object Detection
(VIC3D) makes use of multi-view cameras from both vehicles and traffic
infrastructure, providing a global vantage point with rich semantic context of
road conditions beyond a single vehicle viewpoint. Two major challenges prevail
in VIC3D: 1) inherent calibration noise when fusing multi-view images, caused
by time asynchrony across cameras; 2) information loss when projecting 2D
features into 3D space. To address these issues, we propose a novel 3D object
detection framework, Vehicle-Infrastructure Multi-view Intermediate Fusion
(VIMI). First, to fully exploit the holistic perspectives from both vehicles
and infrastructure, we propose a Multi-scale Cross Attention (MCA) module that
fuses infrastructure and vehicle features on selective multi-scales to correct
the calibration noise introduced by camera asynchrony. Then, we design a
Camera-aware Channel Masking (CCM) module that uses camera parameters as priors
to augment the fused features. We further introduce a Feature Compression (FC)
module with channel and spatial compression blocks to reduce the size of
transmitted features for enhanced efficiency. Experiments show that VIMI
achieves 15.61% overall AP_3D and 21.44% AP_BEV on the new VIC3D dataset,
DAIR-V2X-C, significantly outperforming state-of-the-art early fusion and late
fusion methods with comparable transmission cost.
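To make the three modules concrete, below is a minimal PyTorch-style sketch of how such a pipeline could be wired. Only the module names (MCA, CCM, FC) come from the abstract; all internals (channel sizes, the SE-style masking head, a single attention scale, the compression ratio) are illustrative assumptions rather than the authors' implementation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class FeatureCompression(nn.Module):
        """FC: shrink infrastructure features before transmission (sizes assumed)."""
        def __init__(self, c_in=256, c_tx=64):
            super().__init__()
            self.channel_down = nn.Conv2d(c_in, c_tx, 1)                        # channel block
            self.spatial_down = nn.Conv2d(c_tx, c_tx, 3, stride=2, padding=1)   # spatial block
            self.channel_up = nn.Conv2d(c_tx, c_in, 1)                          # vehicle-side decoder

        def compress(self, x):
            return self.spatial_down(self.channel_down(x))

        def decompress(self, x):
            x = F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=False)
            return self.channel_up(x)

    class MultiScaleCrossAttention(nn.Module):
        """MCA at one scale: vehicle features attend to received infrastructure
        features; the paper applies this on selected multi-scales."""
        def __init__(self, c=256, heads=8):
            super().__init__()
            self.attn = nn.MultiheadAttention(c, heads, batch_first=True)

        def forward(self, veh, inf):
            b, c, h, w = veh.shape
            q = veh.flatten(2).transpose(1, 2)        # (B, HW, C) queries from vehicle
            kv = inf.flatten(2).transpose(1, 2)       # keys/values from infrastructure
            fused, _ = self.attn(q, kv, kv)
            return fused.transpose(1, 2).reshape(b, c, h, w)

    class CameraAwareChannelMasking(nn.Module):
        """CCM: predict per-channel weights from flattened camera parameters."""
        def __init__(self, c=256, n_params=16):
            super().__init__()
            self.mlp = nn.Sequential(nn.Linear(n_params, c), nn.ReLU(),
                                     nn.Linear(c, c), nn.Sigmoid())

        def forward(self, feat, cam_params):
            return feat * self.mlp(cam_params)[:, :, None, None]

    fc, mca, ccm = FeatureCompression(), MultiScaleCrossAttention(), CameraAwareChannelMasking()
    veh = torch.randn(1, 256, 32, 88)                 # vehicle-side feature map
    inf = torch.randn(1, 256, 32, 88)                 # infrastructure-side feature map
    cam = torch.randn(1, 16)                          # flattened intrinsics/extrinsics
    inf_rx = fc.decompress(fc.compress(inf))          # what the vehicle receives
    fused = ccm(mca(veh, inf_rx), cam)                # (1, 256, 32, 88)

Under these assumed sizes, transmitting the compressed tensor instead of the raw feature map cuts the payload by a factor of 16 (4x in channels, 4x spatially), which is the kind of saving the FC module targets.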
Related papers
- CRT-Fusion: Camera, Radar, Temporal Fusion Using Motion Information for 3D Object Detection [9.509625131289429]
We introduce CRT-Fusion, a novel framework that integrates temporal information into radar-camera fusion.
CRT-Fusion achieves state-of-the-art performance for radar-camera-based 3D object detection.
arXiv Detail & Related papers (2024-11-05T11:25:19Z)
- RCBEVDet++: Toward High-accuracy Radar-Camera Fusion 3D Perception Network [34.45694077040797]
We present a radar-camera fusion 3D object detection framework called RCBEVDet++.
RadarBEVNet encodes sparse radar points into a dense bird's-eye-view feature.
Our method achieves state-of-the-art radar-camera fusion results in 3D object detection, BEV semantic segmentation, and 3D multi-object tracking tasks.
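As a rough illustration of what "encoding sparse radar points into a dense BEV feature" involves, here is a generic point-to-grid scatter; the grid size, feature dimension, and mean pooling are assumptions, not RadarBEVNet's actual design.

    import torch

    def scatter_points_to_bev(xy, feats, grid=(128, 128), extent=50.0):
        """Average per-point features (N, C) into a dense BEV grid (C, H, W).
        xy holds metric x/y positions in [-extent, extent]."""
        H, W = grid
        C = feats.shape[1]
        ij = ((xy + extent) / (2 * extent) * torch.tensor([H, W])).long()
        ij[:, 0] = ij[:, 0].clamp(0, H - 1)
        ij[:, 1] = ij[:, 1].clamp(0, W - 1)
        flat = ij[:, 0] * W + ij[:, 1]                    # flat cell index per point
        bev = torch.zeros(C, H * W)
        count = torch.zeros(H * W)
        bev.index_add_(1, flat, feats.t())                # sum features per cell
        count.index_add_(0, flat, torch.ones(len(flat)))
        return (bev / count.clamp(min=1)).reshape(C, H, W)

    # 300 radar points with 32-dim features -> (32, 128, 128) dense BEV map
    bev = scatter_points_to_bev(torch.rand(300, 2) * 100 - 50, torch.randn(300, 32))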
arXiv Detail & Related papers (2024-09-08T05:14:27Z)
- EMIFF: Enhanced Multi-scale Image Feature Fusion for Vehicle-Infrastructure Cooperative 3D Object Detection [23.32916754209488]
Two major challenges persist in vehicle-infrastructure cooperative 3D (VIC3D) object detection.
We propose a novel camera-based 3D detection framework for the VIC3D task, Enhanced Multi-scale Image Feature Fusion (EMIFF).
Experiments show that EMIFF achieves SOTA on the DAIR-V2X-C dataset, significantly outperforming previous early-fusion and late-fusion methods with comparable transmission costs.
arXiv Detail & Related papers (2024-02-23T11:35:48Z)
- Multi-target multi-camera vehicle tracking using transformer-based camera link model and spatial-temporal information [29.34298951501007]
Multi-target multi-camera tracking (MTMCT) of vehicles, i.e., tracking vehicles across multiple cameras, is a crucial application for the development of smart cities and intelligent traffic systems.
The main challenges of vehicle MTMCT include the intra-class variability of the same vehicle and the inter-class similarity between different vehicles.
We propose a transformer-based camera link model with spatial and temporal filtering to conduct cross-camera tracking.
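The "spatial and temporal filtering" here typically means discarding cross-camera match candidates whose transition is physically implausible for the road link between the two cameras. A minimal illustration, with an invented travel-time window (a real camera link model estimates these bounds per camera pair):

    from dataclasses import dataclass

    @dataclass
    class Track:
        camera: str
        enter_time: float   # seconds when the vehicle entered this camera's view
        exit_time: float    # seconds when it left

    def temporal_filter(a: Track, b: Track, min_travel=5.0, max_travel=60.0) -> bool:
        """Keep the candidate match (a -> b) only if the time gap between
        leaving camera A and entering camera B fits the link's window.
        The bounds here are illustrative, not from the paper."""
        gap = b.enter_time - a.exit_time
        return min_travel <= gap <= max_travel

    print(temporal_filter(Track("cam1", -10.0, 0.0), Track("cam2", 20.0, 30.0)))  # True
    print(temporal_filter(Track("cam1", -10.0, 0.0), Track("cam2", 2.0, 12.0)))   # False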
arXiv Detail & Related papers (2023-01-18T22:27:08Z)
- DETR4D: Direct Multi-View 3D Object Detection with Sparse Attention [50.11672196146829]
3D object detection with surround-view images is an essential task for autonomous driving.
We propose DETR4D, a Transformer-based framework that explores sparse attention and direct feature query for 3D object detection in multi-view images.
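A toy version of the "direct feature query" pattern: learned 3D reference points are projected into each camera with standard pinhole geometry, and image features are bilinearly sampled at the projections. The feature dimensions and the single sampling point per query are assumptions; real sparse attention typically samples several learned offsets per query.

    import torch
    import torch.nn.functional as F

    def sample_queries(feat_maps, ref_pts, K, E):
        """feat_maps: (N_cam, C, H, W) image features (treated as pixel-scale)
        ref_pts: (Q, 3) learned 3D reference points in ego coordinates
        K: (N_cam, 3, 3) intrinsics, E: (N_cam, 4, 4) ego-to-camera extrinsics
        Returns (Q, C): features averaged over the cameras that see each point."""
        n_cam, C, H, W = feat_maps.shape
        homo = torch.cat([ref_pts, torch.ones(len(ref_pts), 1)], dim=1)   # (Q, 4)
        out = torch.zeros(len(ref_pts), C)
        hits = torch.zeros(len(ref_pts), 1)
        for c in range(n_cam):
            cam = (E[c] @ homo.t()).t()[:, :3]             # points in camera frame
            pix = (K[c] @ cam.t()).t()
            uv = pix[:, :2] / pix[:, 2:3].clamp(min=1e-5)  # perspective divide
            grid = torch.stack([uv[:, 0] / W * 2 - 1,      # normalize to [-1, 1]
                                uv[:, 1] / H * 2 - 1], dim=-1)
            vis = ((cam[:, 2] > 0) & (grid.abs() <= 1).all(dim=-1)).float()
            smp = F.grid_sample(feat_maps[c:c + 1], grid.view(1, -1, 1, 2),
                                align_corners=False)       # (1, C, Q, 1)
            out += smp[0, :, :, 0].t() * vis[:, None]
            hits += vis[:, None]
        return out / hits.clamp(min=1)

    # 900 queries against 6 cameras with placeholder identity calibrations
    feats = sample_queries(torch.randn(6, 256, 32, 88), torch.randn(900, 3) * 20,
                           torch.eye(3).expand(6, 3, 3), torch.eye(4).expand(6, 4, 4))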
arXiv Detail & Related papers (2022-12-15T14:18:47Z)
- MSMDFusion: Fusing LiDAR and Camera at Multiple Scales with Multi-Depth Seeds for 3D Object Detection [89.26380781863665]
Fusing LiDAR and camera information is essential for achieving accurate and reliable 3D object detection in autonomous driving systems.
Recent approaches explore the semantic density of camera features by lifting points in 2D camera images into 3D space for fusion.
We propose a novel framework that focuses on the multi-scale progressive interaction of the multi-granularity LiDAR and camera features.
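The "lifting" step the summary refers to is the standard back-projection p = d * K^(-1) [u, v, 1]^T with a predicted per-pixel depth d. A minimal sketch, using one depth per pixel (the paper's multi-depth seeds and multi-scale interaction are not reproduced here):

    import torch

    def lift_pixels_to_3d(depth, K):
        """Back-project every pixel to a 3D point in the camera frame.
        depth: (H, W) predicted depth map, K: (3, 3) camera intrinsics."""
        H, W = depth.shape
        v, u = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                              torch.arange(W, dtype=torch.float32), indexing="ij")
        pix = torch.stack([u, v, torch.ones_like(u)], dim=-1).reshape(-1, 3)
        rays = pix @ torch.linalg.inv(K).t()      # unit-depth rays, (HW, 3)
        return rays * depth.reshape(-1, 1)        # scale each ray by its depth

    K = torch.tensor([[500., 0., 64.], [0., 500., 32.], [0., 0., 1.]])
    pts = lift_pixels_to_3d(torch.rand(64, 128) * 50 + 1, K)   # (8192, 3) points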
arXiv Detail & Related papers (2022-09-07T12:29:29Z)
- A Simple Baseline for Multi-Camera 3D Object Detection [94.63944826540491]
3D object detection with surrounding cameras has been a promising direction for autonomous driving.
We present SimMOD, a Simple baseline for Multi-camera Object Detection.
We conduct extensive experiments on the 3D object detection benchmark of nuScenes to demonstrate the effectiveness of SimMOD.
arXiv Detail & Related papers (2022-08-22T03:38:01Z)
- Know Your Surroundings: Panoramic Multi-Object Tracking by Multimodality Collaboration [56.01625477187448]
We propose a MultiModality PAnoramic multi-object Tracking framework (MMPAT).
It takes both 2D panorama images and 3D point clouds as input and then infers target trajectories using the multimodality data.
We evaluate the proposed method on the JRDB dataset, where the MMPAT achieves the top performance in both the detection and tracking tasks.
arXiv Detail & Related papers (2021-05-31T03:16:38Z)
- Robust 2D/3D Vehicle Parsing in CVIS [54.825777404511605]
We present a novel approach to robustly detect and perceive vehicles in different camera views as part of a cooperative vehicle-infrastructure system (CVIS).
Our formulation is designed for arbitrary camera views and makes no assumptions about intrinsic or extrinsic parameters.
In practice, our approach outperforms SOTA methods on 2D detection, instance segmentation, and 6-DoF pose estimation.
arXiv Detail & Related papers (2021-03-11T03:35:05Z)
- Fine-Grained Vehicle Perception via 3D Part-Guided Visual Data Augmentation [77.60050239225086]
We propose an effective training data generation process by fitting a 3D car model with dynamic parts to vehicles in real images.
Our approach is fully automatic without any human interaction.
We present a multi-task network for VUS parsing and a multi-stream network for VHI parsing.
arXiv Detail & Related papers (2020-12-15T03:03:38Z)
- Traffic-Aware Multi-Camera Tracking of Vehicles Based on ReID and Camera Link Model [43.850588717944916]
Multi-target multi-camera tracking (MTMCT) is a crucial technique for smart city applications.
We propose an effective and reliable MTMCT framework for vehicles.
Our proposed MTMCT is evaluated on the CityFlow dataset and achieves a new state-of-the-art performance with IDF1 of 74.93%.
arXiv Detail & Related papers (2020-08-22T08:54:47Z)