EMIFF: Enhanced Multi-scale Image Feature Fusion for
Vehicle-Infrastructure Cooperative 3D Object Detection
- URL: http://arxiv.org/abs/2402.15272v1
- Date: Fri, 23 Feb 2024 11:35:48 GMT
- Title: EMIFF: Enhanced Multi-scale Image Feature Fusion for
Vehicle-Infrastructure Cooperative 3D Object Detection
- Authors: Zhe Wang, Siqi Fan, Xiaoliang Huo, Tongda Xu, Yan Wang, Jingjing Liu,
Yilun Chen, Ya-Qin Zhang
- Abstract summary: Two major challenges persist in vehicle-infrastructure cooperative 3D (VIC3D) object detection.
We propose a novel camera-based 3D detection framework for the VIC3D task, Enhanced Multi-scale Image Feature Fusion (EMIFF).
Experiments show that EMIFF achieves SOTA on the DAIR-V2X-C dataset, significantly outperforming previous early-fusion and late-fusion methods with comparable transmission costs.
- Score: 23.32916754209488
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In autonomous driving, cooperative perception makes use of multi-view cameras
from both vehicles and infrastructure, providing a global vantage point with
rich semantic context of road conditions beyond a single vehicle viewpoint.
Currently, two major challenges persist in vehicle-infrastructure cooperative
3D (VIC3D) object detection: 1) inherent pose errors when fusing multi-view
images, caused by time asynchrony across cameras; 2) information loss in the
transmission process, resulting from limited communication bandwidth. To address
these issues, we propose a novel camera-based 3D detection framework for the
VIC3D task, Enhanced Multi-scale Image Feature Fusion (EMIFF). To fully exploit
holistic perspectives from both vehicles and infrastructure, we propose
Multi-scale Cross Attention (MCA) and Camera-aware Channel Masking (CCM)
modules to enhance infrastructure and vehicle features at the scale, spatial,
and channel levels, correcting the pose error introduced by camera asynchrony. We
also introduce a Feature Compression (FC) module with channel and spatial
compression blocks for transmission efficiency. Experiments show that EMIFF
achieves SOTA on the DAIR-V2X-C dataset, significantly outperforming previous
early-fusion and late-fusion methods with comparable transmission costs.
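The abstract describes the FC module only at a high level; below is a minimal PyTorch sketch of what a channel-plus-spatial compression block of this kind could look like. All layer choices, ratios, and names are illustrative assumptions, not EMIFF's actual implementation.

```python
# Hedged sketch of a feature-compression module with channel and spatial
# blocks; shapes and ratios are assumptions, not the paper's design.
import torch
import torch.nn as nn

class FeatureCompression(nn.Module):
    def __init__(self, channels=256, channel_ratio=4, spatial_stride=2):
        super().__init__()
        mid = channels // channel_ratio
        # Channel compression: 1x1 conv shrinks the channel dimension.
        self.channel_down = nn.Conv2d(channels, mid, kernel_size=1)
        # Spatial compression: strided conv shrinks the feature map.
        self.spatial_down = nn.Conv2d(mid, mid, kernel_size=3,
                                      stride=spatial_stride, padding=1)
        # Decompression on the vehicle side restores both dimensions.
        self.spatial_up = nn.ConvTranspose2d(mid, mid, kernel_size=spatial_stride,
                                             stride=spatial_stride)
        self.channel_up = nn.Conv2d(mid, channels, kernel_size=1)

    def compress(self, feat):
        return self.spatial_down(self.channel_down(feat))  # transmitted tensor

    def decompress(self, feat):
        return self.channel_up(self.spatial_up(feat))

fc = FeatureCompression()
infra_feat = torch.randn(1, 256, 64, 160)  # infrastructure-side feature map
sent = fc.compress(infra_feat)             # 16x fewer elements to transmit
restored = fc.decompress(sent)             # vehicle-side reconstruction
```

With these illustrative settings the transmitted tensor holds 16x fewer elements than the original feature map, which is the kind of bandwidth saving the abstract targets.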
Related papers
- CRT-Fusion: Camera, Radar, Temporal Fusion Using Motion Information for 3D Object Detection [9.509625131289429]
We introduce CRT-Fusion, a novel framework that integrates temporal information into radar-camera fusion.
CRT-Fusion achieves state-of-the-art performance for radar-camera-based 3D object detection.
arXiv Detail & Related papers (2024-11-05T11:25:19Z) - RCBEVDet++: Toward High-accuracy Radar-Camera Fusion 3D Perception Network [34.45694077040797]
We present a radar-camera fusion 3D object detection framework called RCBEVDet++.
RadarBEVNet encodes sparse radar points into a dense bird's-eye-view feature.
Our method achieves state-of-the-art radar-camera fusion results in 3D object detection, BEV semantic segmentation, and 3D multi-object tracking tasks.
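As a hedged illustration of the "sparse radar points to dense BEV feature" step mentioned above, here is a minimal rasterization sketch; the grid extents, the choice of radar cross-section (RCS) as the cell feature, and all names are assumptions rather than RadarBEVNet's design.

```python
# Scatter sparse radar returns into a dense bird's-eye-view grid.
import torch

def radar_points_to_bev(points, grid=128, x_range=(-51.2, 51.2),
                        y_range=(-51.2, 51.2)):
    """points: (N, 3) rows of (x, y, RCS)."""
    bev = torch.zeros(1, grid, grid)
    cell_x = (x_range[1] - x_range[0]) / grid
    cell_y = (y_range[1] - y_range[0]) / grid
    ix = ((points[:, 0] - x_range[0]) / cell_x).long().clamp(0, grid - 1)
    iy = ((points[:, 1] - y_range[0]) / cell_y).long().clamp(0, grid - 1)
    bev[0, iy, ix] = points[:, 2]  # last RCS value written wins per cell
    return bev

radar = torch.tensor([[10.0, -3.0, 0.8], [25.5, 14.2, 0.3]])
bev_map = radar_points_to_bev(radar)  # dense (1, 128, 128) BEV feature
```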
arXiv Detail & Related papers (2024-09-08T05:14:27Z) - Application of 2D Homography for High Resolution Traffic Data Collection
using CCTV Cameras [9.946460710450319]
This study implements a three-stage video analytics framework for extracting high-resolution traffic data from CCTV cameras.
The key components of the framework include object recognition, perspective transformation, and vehicle trajectory reconstruction.
The results of the study showed an error rate of about ±4.5% for directional traffic counts and less than 10% MSE for the speed bias of camera estimates.
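For the perspective-transformation stage, a minimal OpenCV sketch of a 2D homography from pixel coordinates to road-plane coordinates looks like the following; the four correspondences are made-up placeholders, and real ones would come from surveyed reference points in the camera view.

```python
# Map image pixels onto a flat road plane with a 2D homography.
import cv2
import numpy as np

# Four pixel points and their known road-plane positions (placeholders).
pixel_pts = np.float32([[420, 610], [880, 600], [1040, 950], [260, 960]])
road_pts = np.float32([[0, 0], [7.2, 0], [7.2, 30.0], [0, 30.0]])  # metres

H = cv2.getPerspectiveTransform(pixel_pts, road_pts)  # 3x3 homography

# Project a tracked vehicle centroid from the image into road coordinates.
centroid = np.float32([[[650, 780]]])                  # shape (1, 1, 2)
road_xy = cv2.perspectiveTransform(centroid, H)[0, 0]  # (x, y) in metres
```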
arXiv Detail & Related papers (2024-01-14T07:33:14Z) - Mutual Information-driven Triple Interaction Network for Efficient Image
Dehazing [54.168567276280505]
We propose a novel Mutual Information-driven Triple interaction Network (MITNet) for image dehazing.
The first stage, named amplitude-guided haze removal, aims to recover the amplitude spectrum of the hazy images for haze removal.
The second stage, named phase-guided structure refinement, is devoted to learning the transformation and refinement of the phase spectrum.
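A minimal PyTorch sketch of the amplitude/phase decomposition these two stages operate on, assuming a standard FFT split; the identity "processing" here merely stands in for MITNet's actual subnetworks.

```python
# Split an image's spectrum into amplitude and phase, then recombine.
import torch

img = torch.rand(1, 3, 256, 256)             # hazy image, values in [0, 1]
spec = torch.fft.fft2(img)                   # complex spectrum
amplitude, phase = spec.abs(), spec.angle()  # stage 1 edits amplitude,
                                             # stage 2 refines phase
recombined = torch.fft.ifft2(torch.polar(amplitude, phase)).real
assert torch.allclose(recombined, img, atol=1e-4)  # lossless round trip
```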
arXiv Detail & Related papers (2023-08-14T08:23:58Z) - Multi-Modal 3D Object Detection by Box Matching [109.43430123791684]
We propose a novel Fusion network by Box Matching (FBMNet) for multi-modal 3D detection.
With the learned assignments between 3D and 2D object proposals, fusion for detection can be effectively performed by combining their ROI features.
arXiv Detail & Related papers (2023-05-12T18:08:51Z) - VIMI: Vehicle-Infrastructure Multi-view Intermediate Fusion for
Camera-based 3D Object Detection [17.22491199725569]
Vehicle-Infrastructure Cooperative 3D Object Detection (VIC3D) makes use of multi-view cameras from both vehicles and traffic infrastructure.
We propose a novel 3D object detection framework, Vehicle-Infrastructure Multi-view Intermediate Fusion (VIMI).
VIMI achieves 15.61% overall AP_3D and 21.44% AP_BEV on the new VIC3D dataset, DAIR-V2X-C.
arXiv Detail & Related papers (2023-03-20T09:56:17Z) - BEVFusion: Multi-Task Multi-Sensor Fusion with Unified Bird's-Eye View Representation [105.96557764248846]
We introduce BEVFusion, a generic multi-task multi-sensor fusion framework.
It unifies multi-modal features in the shared bird's-eye view representation space.
It achieves 1.3% higher mAP and NDS on 3D object detection and 13.6% higher mIoU on BEV map segmentation, with 1.9x lower computation cost.
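A hedged sketch of fusing two modality features in a shared BEV space, using the common concatenate-and-convolve pattern; this is illustrative and not necessarily BEVFusion's exact fuser.

```python
# Fuse camera and LiDAR features living on the same BEV grid.
import torch
import torch.nn as nn

camera_bev = torch.randn(1, 80, 180, 180)  # camera branch, BEV grid
lidar_bev = torch.randn(1, 80, 180, 180)   # LiDAR branch, same grid
fuser = nn.Sequential(
    nn.Conv2d(160, 80, kernel_size=3, padding=1),
    nn.BatchNorm2d(80),
    nn.ReLU(inplace=True),
)
fused_bev = fuser(torch.cat([camera_bev, lidar_bev], dim=1))  # shared BEV feature
```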
arXiv Detail & Related papers (2022-05-26T17:59:35Z) - DAIR-V2X: A Large-Scale Dataset for Vehicle-Infrastructure Cooperative
3D Object Detection [8.681912341444901]
DAIR-V2X is the first large-scale, multi-modality, multi-view dataset from real scenarios for Vehicle-Infrastructure Cooperative Autonomous Driving.
DAIR-V2X comprises 71,254 LiDAR frames and 71,254 camera frames, all captured from real scenes with 3D annotations.
arXiv Detail & Related papers (2022-04-12T07:13:33Z) - EPNet++: Cascade Bi-directional Fusion for Multi-Modal 3D Object
Detection [56.03081616213012]
We propose EPNet++ for multi-modal 3D object detection by introducing a novel Cascade Bi-directional Fusion (CB-Fusion) module.
The proposed CB-Fusion module enriches the semantic information of point features with image features in a cascade bi-directional interaction fusion manner.
The experiment results on the KITTI, JRDB and SUN-RGBD datasets demonstrate the superiority of EPNet++ over the state-of-the-art methods.
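As a hedged sketch of one bi-directional interaction step in this spirit, points can gather image features at their projected pixels while those pixels absorb the point features in return; the projection indices below are random placeholders, and EPNet++'s CB-Fusion module is considerably richer.

```python
# One bi-directional point/image feature exchange step.
import torch

n_points, c = 1024, 64
point_feat = torch.randn(n_points, c)
image_feat = torch.randn(c, 96, 320)    # (C, H, W) feature map
u = torch.randint(0, 320, (n_points,))  # projected pixel columns
v = torch.randint(0, 96, (n_points,))   # projected pixel rows

# image -> point: each point absorbs the feature at its projected pixel.
point_feat = point_feat + image_feat[:, v, u].t()
# point -> image: each hit pixel absorbs its point's feature (residual add).
image_feat[:, v, u] = image_feat[:, v, u] + point_feat.t()
```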
arXiv Detail & Related papers (2021-12-21T10:48:34Z) - Full-Duplex Strategy for Video Object Segmentation [141.43983376262815]
The Full-Duplex Strategy Network (FSNet) is a novel framework for video object segmentation (VOS).
Our FSNet performs cross-modal feature passing (i.e., transmission and receiving) simultaneously, before the fusion-decoding stage.
We show that our FSNet outperforms other state-of-the-art methods on both the VOS and video salient object detection tasks.
arXiv Detail & Related papers (2021-08-06T14:50:50Z) - EPMF: Efficient Perception-aware Multi-sensor Fusion for 3D Semantic Segmentation [62.210091681352914]
We study multi-sensor fusion for 3D semantic segmentation in many applications, such as autonomous driving and robotics.
In this work, we investigate a collaborative fusion scheme called perception-aware multi-sensor fusion (PMF).
We propose a two-stream network to extract features from the two modalities separately. The extracted features are fused by effective residual-based fusion modules.
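A minimal sketch of a residual-based fusion module in the spirit of this summary: keep one stream and add a learned correction computed from both. Channel counts and names are assumptions, not PMF's implementation.

```python
# Residual fusion: one stream plus a learned correction from both streams.
import torch
import torch.nn as nn

class ResidualFusion(nn.Module):
    def __init__(self, channels=128):
        super().__init__()
        self.mix = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, lidar_feat, camera_feat):
        # Keep the LiDAR stream; add a correction computed from both.
        return lidar_feat + self.mix(torch.cat([lidar_feat, camera_feat], dim=1))

fusion = ResidualFusion()
out = fusion(torch.randn(1, 128, 64, 64), torch.randn(1, 128, 64, 64))
```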
arXiv Detail & Related papers (2021-06-21T10:47:26Z)