Related papers: VIMI: Vehicle-Infrastructure Multi-view Intermediate Fusion for Camera-based 3D Object Detection

URL: http://arxiv.org/abs/2303.10975v1
Date: Mon, 20 Mar 2023 09:56:17 GMT
Title: VIMI: Vehicle-Infrastructure Multi-view Intermediate Fusion for Camera-based 3D Object Detection
Authors: Zhe Wang, Siqi Fan, Xiaoliang Huo, Tongda Xu, Yan Wang, Jingjing Liu, Yilun Chen, Ya-Qin Zhang
Abstract summary: Vehicle-Infrastructure Cooperative 3D Object Detection (VIC3D) makes use of multi-view cameras from both vehicles and traffic infrastructure. We propose a novel 3D object detection framework, Vehicles-Infrastructure Multi-view Intermediate fusion (VIMI). VIMI achieves 15.61% overall AP_3D and 21.44% AP_BEV on the new VIC3D dataset, DAIR-V2X-C.
Score: 17.22491199725569
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: In autonomous driving, Vehicle-Infrastructure Cooperative 3D Object Detection (VIC3D) makes use of multi-view cameras from both vehicles and traffic infrastructure, providing a global vantage point with rich semantic context of road conditions beyond a single vehicle viewpoint. Two major challenges prevail in VIC3D: 1) inherent calibration noise when fusing multi-view images, caused by time asynchrony across cameras; 2) information loss when projecting 2D features into 3D space. To address these issues, we propose a novel 3D object detection framework, Vehicles-Infrastructure Multi-view Intermediate fusion (VIMI). First, to fully exploit the holistic perspectives from both vehicles and infrastructure, we propose a Multi-scale Cross Attention (MCA) module that fuses infrastructure and vehicle features on selective multi-scales to correct the calibration noise introduced by camera asynchrony. Then, we design a Camera-aware Channel Masking (CCM) module that uses camera parameters as priors to augment the fused features. We further introduce a Feature Compression (FC) module with channel and spatial compression blocks to reduce the size of transmitted features for enhanced efficiency. Experiments show that VIMI achieves 15.61% overall AP_3D and 21.44% AP_BEV on the new VIC3D dataset, DAIR-V2X-C, significantly outperforming state-of-the-art early fusion and late fusion methods with comparable transmission cost.
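Illustrative sketch (not from the paper): the abstract says only that the CCM module uses camera parameters as priors to augment the fused features. One plausible reading, assumed here, is a squeeze-and-excitation-style gate in which a small MLP maps a flattened camera-parameter vector to one weight per feature channel. The function name, layer sizes, and random weights below are all hypothetical.

```python
import numpy as np


def camera_aware_channel_mask(features, cam_params, w1, b1, w2, b2):
    """Gate feature channels with weights predicted from camera parameters.

    Assumed design, not the authors' implementation.
    features:   (C, H, W) feature map
    cam_params: (P,) flattened camera parameters (e.g. intrinsics and extrinsics)
    w1, b1:     first MLP layer, shapes (P, HID) and (HID,)
    w2, b2:     second MLP layer, shapes (HID, C) and (C,)
    """
    hidden = np.maximum(cam_params @ w1 + b1, 0.0)  # ReLU
    logits = hidden @ w2 + b2                       # one logit per channel
    mask = 1.0 / (1.0 + np.exp(-logits))            # sigmoid, values in (0, 1)
    return features * mask[:, None, None]           # broadcast over H and W


# Toy shapes and random weights, chosen only for demonstration.
rng = np.random.default_rng(0)
C, H, W, P, HID = 8, 4, 4, 12, 16
features = rng.standard_normal((C, H, W))
cam_params = rng.standard_normal(P)
w1 = rng.standard_normal((P, HID)) * 0.1
b1 = np.zeros(HID)
w2 = rng.standard_normal((HID, C)) * 0.1
b2 = np.zeros(C)

masked = camera_aware_channel_mask(features, cam_params, w1, b1, w2, b2)
print(masked.shape)
```

Because each mask value lies strictly between 0 and 1, the gate can only attenuate channels in this sketch; a trained module would learn which channels to suppress for a given camera configuration.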

Related papers

Towards Intelligent Transportation with Pedestrians and Vehicles In-the-Loop: A Surveillance Video-Assisted Federated Digital Twin Framework [62.47416496137193]
We propose a surveillance video assisted federated digital twin (SV-FDT) framework to empower ITSs with pedestrians and vehicles in-the-loop. The architecture consists of three layers: (i) the end layer, which collects traffic surveillance videos from multiple sources; (ii) the edge layer, responsible for semantic segmentation-based visual understanding, twin agent-based interaction modeling, and local digital twin system (LDTS) creation in local regions; and (iii) the cloud layer, which integrates LDTSs across different regions to construct a global DT model in real time.
arXiv Detail & Related papers (2025-03-06T07:36:06Z)
Doracamom: Joint 3D Detection and Occupancy Prediction with Multi-view 4D Radars and Cameras for Omnidirectional Perception [9.76463525667238]
We propose Doracamom, the first framework that fuses multi-view cameras and 4D radar for joint 3D object detection and semantic occupancy prediction. Code and models will be publicly available.
arXiv Detail & Related papers (2025-01-26T04:24:07Z)
CRT-Fusion: Camera, Radar, Temporal Fusion Using Motion Information for 3D Object Detection [9.509625131289429]
We introduce CRT-Fusion, a novel framework that integrates temporal information into radar-camera fusion. CRT-Fusion achieves state-of-the-art performance for radar-camera-based 3D object detection.
arXiv Detail & Related papers (2024-11-05T11:25:19Z)
RCBEVDet++: Toward High-accuracy Radar-Camera Fusion 3D Perception Network [34.45694077040797]
We present a radar-camera fusion 3D object detection framework called RCBEVDet. RadarBEVNet encodes sparse radar points into a dense bird's-eye-view feature. Our method achieves state-of-the-art radar-camera fusion results in 3D object detection, BEV semantic segmentation, and 3D multi-object tracking tasks.
arXiv Detail & Related papers (2024-09-08T05:14:27Z)
EMIFF: Enhanced Multi-scale Image Feature Fusion for Vehicle-Infrastructure Cooperative 3D Object Detection [23.32916754209488]
Two major challenges persist in vehicle-infrastructure cooperative 3D (VIC3D) object detection. We propose a novel camera-based 3D detection framework for the VIC3D task, Enhanced Multi-scale Image Feature Fusion (EMIFF). Experiments show that EMIFF achieves SOTA on the DAIR-V2X-C dataset, significantly outperforming previous early-fusion and late-fusion methods with comparable transmission costs.
arXiv Detail & Related papers (2024-02-23T11:35:48Z)
Multi-target multi-camera vehicle tracking using transformer-based camera link model and spatial-temporal information [29.34298951501007]
Multi-target multi-camera tracking (MTMCT) of vehicles, i.e. tracking vehicles across multiple cameras, is a crucial application for the development of smart cities and intelligent traffic systems. The main challenges of MTMCT of vehicles include the intra-class variability of the same vehicle and inter-class similarity between different vehicles. We propose a transformer-based camera link model with spatial and temporal filtering to conduct cross-camera tracking.
arXiv Detail & Related papers (2023-01-18T22:27:08Z)
DETR4D: Direct Multi-View 3D Object Detection with Sparse Attention [50.11672196146829]
3D object detection with surround-view images is an essential task for autonomous driving. We propose DETR4D, a Transformer-based framework that explores sparse attention and direct feature query for 3D object detection in multi-view images.
arXiv Detail & Related papers (2022-12-15T14:18:47Z)
MSMDFusion: Fusing LiDAR and Camera at Multiple Scales with Multi-Depth Seeds for 3D Object Detection [89.26380781863665]
Fusing LiDAR and camera information is essential for achieving accurate and reliable 3D object detection in autonomous driving systems. Recent approaches aim at exploring the semantic densities of camera features through lifting points in 2D camera images into 3D space for fusion. We propose a novel framework that focuses on the multi-scale progressive interaction of the multi-granularity LiDAR and camera features.
arXiv Detail & Related papers (2022-09-07T12:29:29Z)
A Simple Baseline for Multi-Camera 3D Object Detection [94.63944826540491]
3D object detection with surrounding cameras has been a promising direction for autonomous driving. We present SimMOD, a Simple baseline for Multi-camera Object Detection. We conduct extensive experiments on the 3D object detection benchmark of nuScenes to demonstrate the effectiveness of SimMOD.
arXiv Detail & Related papers (2022-08-22T03:38:01Z)
Know Your Surroundings: Panoramic Multi-Object Tracking by Multimodality Collaboration [56.01625477187448]
We propose a MultiModality PAnoramic multi-object Tracking framework (MMPAT). It takes both 2D panorama images and 3D point clouds as input and then infers target trajectories using the multimodality data. We evaluate the proposed method on the JRDB dataset, where MMPAT achieves the top performance in both the detection and tracking tasks.
arXiv Detail & Related papers (2021-05-31T03:16:38Z)
Robust 2D/3D Vehicle Parsing in CVIS [54.825777404511605]
We present a novel approach to robustly detect and perceive vehicles in different camera views as part of a cooperative vehicle-infrastructure system (CVIS). Our formulation is designed for arbitrary camera views and makes no assumptions about intrinsic or extrinsic parameters. In practice, our approach outperforms SOTA methods on 2D detection, instance segmentation, and 6-DoF pose estimation.
arXiv Detail & Related papers (2021-03-11T03:35:05Z)
Fine-Grained Vehicle Perception via 3D Part-Guided Visual Data Augmentation [77.60050239225086]
We propose an effective training data generation process by fitting a 3D car model with dynamic parts to vehicles in real images. Our approach is fully automatic without any human interaction. We present a multi-task network for VUS parsing and a multi-stream network for VHI parsing.
arXiv Detail & Related papers (2020-12-15T03:03:38Z)
Traffic-Aware Multi-Camera Tracking of Vehicles Based on ReID and Camera Link Model [43.850588717944916]
Multi-target multi-camera tracking (MTMCT) is a crucial technique for smart city applications. We propose an effective and reliable MTMCT framework for vehicles. Our proposed MTMCT is evaluated on the CityFlow dataset and achieves a new state-of-the-art performance with IDF1 of 74.93%.
arXiv Detail & Related papers (2020-08-22T08:54:47Z)

This list is automatically generated from the titles and abstracts of the papers on this site.