Interactive Multi-scale Fusion of 2D and 3D Features for Multi-object
Tracking
- URL: http://arxiv.org/abs/2203.16268v1
- Date: Wed, 30 Mar 2022 13:00:27 GMT
- Title: Interactive Multi-scale Fusion of 2D and 3D Features for Multi-object
Tracking
- Authors: Guangming Wang, Chensheng Peng, Jinpeng Zhang, Hesheng Wang
- Abstract summary: We introduce PointNet++ to obtain multi-scale deep representations of the point cloud, making them adaptive to our proposed Interactive Feature Fusion.
Our method achieves good performance on the KITTI benchmark and outperforms other approaches that do not use multi-scale feature fusion.
- Score: 23.130490413184596
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multiple object tracking (MOT) is a significant task in achieving autonomous
driving. Traditional works attempt to complete this task either based on point
clouds (PC) collected by LiDAR or based on images captured by cameras.
However, relying on a single sensor is not robust enough, because it might
fail during the tracking process. On the other hand, feature fusion from
multiple modalities contributes to the improvement of accuracy. As a result,
new techniques based on different sensors integrating features from multiple
modalities are being developed. Texture information from RGB cameras and 3D
structure information from LiDAR have respective advantages under different
circumstances. However, effective feature fusion is hard to achieve because
the two modalities carry completely distinct information. Previous fusion methods
usually fuse the top-level features after the backbones extract the features
from different modalities. In this paper, we first introduce PointNet++ to
obtain multi-scale deep representations of the point cloud, making them adaptive to
our proposed Interactive Feature Fusion between multi-scale features of images
and point clouds. Specifically, through multi-scale interactive query and
fusion between pixel-level and point-level features, our method can obtain
more distinguishing features to improve the performance of multiple object
tracking. Besides, we explore the effectiveness of pre-training on each single
modality and fine-tuning on the fusion-based model. The experimental results
demonstrate that our method can achieve good performance on the KITTI benchmark
and outperform other approaches that do not use multi-scale feature fusion.
Moreover, the ablation studies indicate the effectiveness of multi-scale
feature fusion and of pre-training on a single modality.
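The query-and-fuse idea described in the abstract can be sketched in a few lines of PyTorch: project each 3D point into the image, bilinearly sample the image feature map at that location at every scale, and merge the sampled pixel feature with the corresponding point feature. The sketch below is not the authors' implementation; the module and argument names (InteractiveFusionSketch, calib, img_feats) and the intrinsics-only projection are illustrative assumptions.

# Illustrative sketch only: a minimal PyTorch module that queries multi-scale
# image features with projected points and fuses them with point features.
# This is NOT the authors' code; all names here are assumptions for demonstration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class InteractiveFusionSketch(nn.Module):
    def __init__(self, point_dims, image_dims, out_dim):
        super().__init__()
        # one small MLP per scale to merge point features with sampled pixel features
        self.fuse = nn.ModuleList(
            nn.Sequential(nn.Linear(p + i, out_dim), nn.ReLU())
            for p, i in zip(point_dims, image_dims)
        )

    def forward(self, xyz, point_feats, img_feats, calib):
        # xyz:         (B, N, 3) points in the camera frame
        # point_feats: list of (B, N, C_p) multi-scale point features (e.g. from PointNet++)
        # img_feats:   list of (B, C_i, H, W) multi-scale image feature maps
        # calib:       (B, 3, 3) camera intrinsics for point-to-pixel projection
        uvw = torch.einsum('bij,bnj->bni', calib, xyz)           # project to image plane
        uv = uvw[..., :2] / uvw[..., 2:].clamp(min=1e-6)         # (B, N, 2) pixel coords
        fused = []
        for pf, im, mlp in zip(point_feats, img_feats, self.fuse):
            H, W = im.shape[-2:]
            # normalize pixel coordinates to [-1, 1] as required by grid_sample
            grid = torch.stack([uv[..., 0] / (W - 1) * 2 - 1,
                                uv[..., 1] / (H - 1) * 2 - 1], dim=-1)
            px = F.grid_sample(im, grid.unsqueeze(2), align_corners=True)  # (B, C_i, N, 1)
            px = px.squeeze(-1).transpose(1, 2)                            # (B, N, C_i)
            fused.append(mlp(torch.cat([pf, px], dim=-1)))                 # (B, N, out_dim)
        return fused  # per-scale fused features for a downstream tracking head

Note that this one-way point-to-pixel query only conveys the multi-scale query-and-fuse idea; the paper's interactive fusion between image and point-cloud branches is richer than this sketch.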
Related papers
- Progressive Multi-Modal Fusion for Robust 3D Object Detection [12.048303829428452]
Existing methods perform sensor fusion in a single view by projecting features from both modalities into either Bird's Eye View (BEV) or Perspective View (PV).
We propose ProFusion3D, a progressive fusion framework that combines features in both BEV and PV at both intermediate and object query levels.
Our architecture hierarchically fuses local and global features, enhancing the robustness of 3D object detection.
arXiv Detail & Related papers (2024-10-09T22:57:47Z)
- PoIFusion: Multi-Modal 3D Object Detection via Fusion at Points of Interest [65.48057241587398]
PoIFusion is a framework to fuse information from RGB images and LiDAR point clouds at the points of interest (PoIs).
Our approach maintains the view of each modality and obtains multi-modal features by computation-friendly projection and computation.
We conducted extensive experiments on nuScenes and Argoverse2 datasets to evaluate our approach.
arXiv Detail & Related papers (2024-03-14T09:28:12Z)
- VoxelNextFusion: A Simple, Unified and Effective Voxel Fusion Framework for Multi-Modal 3D Object Detection [33.46363259200292]
Existing voxel-based methods face challenges when fusing sparse voxel features with dense image features in a one-to-one manner.
We present VoxelNextFusion, a multi-modal 3D object detection framework specifically designed for voxel-based methods.
arXiv Detail & Related papers (2024-01-05T08:10:49Z)
- Bi-directional Adapter for Multi-modal Tracking [67.01179868400229]
We propose a novel multi-modal visual prompt tracking model based on a universal bi-directional adapter.
We develop a simple but effective light feature adapter to transfer modality-specific information from one modality to another.
Our model achieves superior tracking performance in comparison with both the full fine-tuning methods and the prompt learning-based methods.
arXiv Detail & Related papers (2023-12-17T05:27:31Z)
- Single-Shot and Multi-Shot Feature Learning for Multi-Object Tracking [55.13878429987136]
We propose a simple yet effective two-stage feature learning paradigm to jointly learn single-shot and multi-shot features for different targets.
Our method has achieved significant improvements on the MOT17 and MOT20 datasets while reaching state-of-the-art performance on the DanceTrack dataset.
arXiv Detail & Related papers (2023-11-17T08:17:49Z)
- FusionFormer: A Multi-sensory Fusion in Bird's-Eye-View and Temporal Consistent Transformer for 3D Object Detection [14.457844173630667]
We propose a novel end-to-end multi-modal fusion transformer-based framework, dubbed FusionFormer.
By developing a uniform sampling strategy, our method can easily sample from 2D image and 3D voxel features simultaneously.
Our method achieves state-of-the-art single model performance of 72.6% mAP and 75.1% NDS in the 3D object detection task without test time augmentation.
arXiv Detail & Related papers (2023-09-11T06:27:25Z)
- MLF-DET: Multi-Level Fusion for Cross-Modal 3D Object Detection [54.52102265418295]
We propose a novel and effective Multi-Level Fusion network, named as MLF-DET, for high-performance cross-modal 3D object DETection.
For the feature-level fusion, we present the Multi-scale Voxel Image fusion (MVI) module, which densely aligns multi-scale voxel features with image features.
For the decision-level fusion, we propose the lightweight Feature-cued Confidence Rectification (FCR) module, which exploits image semantics to rectify the confidence of detection candidates.
arXiv Detail & Related papers (2023-07-18T11:26:02Z)
- A Generalized Multi-Modal Fusion Detection Framework [7.951044844083936]
LiDAR point clouds have become the most common data source in autonomous driving.
Due to the sparsity of point clouds, accurate and reliable detection cannot be achieved in specific scenarios.
We propose a generic 3D detection framework called MMFusion, using multi-modal features.
arXiv Detail & Related papers (2023-03-13T12:38:07Z)
- BIMS-PU: Bi-Directional and Multi-Scale Point Cloud Upsampling [60.257912103351394]
We develop a new point cloud upsampling pipeline called BIMS-PU.
We decompose the up/downsampling procedure into several up/downsampling sub-steps by breaking the target sampling factor into smaller factors.
We show that our method achieves superior results to state-of-the-art approaches.
arXiv Detail & Related papers (2022-06-25T13:13:37Z)
- EPMF: Efficient Perception-aware Multi-sensor Fusion for 3D Semantic Segmentation [62.210091681352914]
We study multi-sensor fusion for 3D semantic segmentation for many applications, such as autonomous driving and robotics.
In this work, we investigate a collaborative fusion scheme called perception-aware multi-sensor fusion (PMF).
We propose a two-stream network to extract features from the two modalities separately. The extracted features are fused by effective residual-based fusion modules.
arXiv Detail & Related papers (2021-06-21T10:47:26Z)
This list is automatically generated from the titles and abstracts of the papers on this site.