Interactive Multi-scale Fusion of 2D and 3D Features for Multi-object
Tracking
- URL: http://arxiv.org/abs/2203.16268v1
- Date: Wed, 30 Mar 2022 13:00:27 GMT
- Title: Interactive Multi-scale Fusion of 2D and 3D Features for Multi-object
Tracking
- Authors: Guangming Wang, Chensheng Peng, Jinpeng Zhang, Hesheng Wang
- Abstract summary: We introduce PointNet++ to obtain multi-scale deep representations of the point cloud, making it adaptive to our proposed Interactive Feature Fusion.
Our method can achieve good performance on the KITTI benchmark and outperform other approaches that do not use multi-scale feature fusion.
- Score: 23.130490413184596
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multiple object tracking (MOT) is a significant task in achieving autonomous
driving. Traditional works attempt to complete this task, either based on point
clouds (PC) collected by LiDAR, or based on images captured from cameras.
However, relying on one single sensor is not robust enough, because it might
fail during the tracking process. On the other hand, feature fusion from
multiple modalities contributes to the improvement of accuracy. As a result,
new techniques based on different sensors integrating features from multiple
modalities are being developed. Texture information from RGB cameras and 3D
structure information from LiDAR have respective advantages under different
circumstances. However, it's not easy to achieve effective feature fusion
because of completely distinct information modalities. Previous fusion methods
usually fuse the top-level features after the backbones extract the features
from different modalities. In this paper, we first introduce PointNet++ to
obtain multi-scale deep representations of the point cloud, making it adaptive
to our proposed Interactive Feature Fusion between multi-scale features of
images and point clouds. Specifically, through multi-scale interactive query
and fusion between pixel-level and point-level features, our method can obtain
more distinguishing features to improve the performance of multiple object
tracking. Besides, we explore the effectiveness of pre-training on each single
modality and fine-tuning on the fusion-based model. The experimental results
demonstrate that our method can achieve good performance on the KITTI benchmark
and outperform other approaches that do not use multi-scale feature fusion.
Moreover, the ablation studies indicate the effectiveness of multi-scale
feature fusion and pre-training on a single modality.
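The interactive pixel/point fusion described in the abstract can be illustrated with a minimal sketch: project each LiDAR point onto the camera image, query the 2D feature map at every scale of an image feature pyramid, and combine the queried pixel features with the point features. All function names, array shapes, and the nearest-neighbour lookup below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def project_points(points_xyz, K):
    """Project 3D points (already in the camera frame) onto the image plane."""
    # points_xyz: (N, 3); K: (3, 3) camera intrinsic matrix
    uvw = points_xyz @ K.T          # (N, 3) homogeneous pixel coordinates
    return uvw[:, :2] / uvw[:, 2:3] # perspective divide -> (N, 2) pixel coords

def query_pixel_features(feat_map, uv, stride):
    """Nearest-neighbour lookup of a 2D feature map at down-scaled pixel coords."""
    H, W, _ = feat_map.shape
    ij = np.round(uv / stride).astype(int)
    ij[:, 0] = np.clip(ij[:, 0], 0, W - 1)  # u -> column index
    ij[:, 1] = np.clip(ij[:, 1], 0, H - 1)  # v -> row index
    return feat_map[ij[:, 1], ij[:, 0]]     # (N, C) pixel features per point

def fuse_multiscale(point_feats, points_xyz, image_pyramid, K):
    """Concatenate point features with pixel features queried at every scale."""
    uv = project_points(points_xyz, K)
    fused = [point_feats]
    for feat_map, stride in image_pyramid:  # e.g. strides 4, 8, 16
        fused.append(query_pixel_features(feat_map, uv, stride))
    return np.concatenate(fused, axis=1)
```

In the actual method the queried features would feed learned fusion layers at each scale; plain concatenation stands in for that step here.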
Related papers
- VoxelNextFusion: A Simple, Unified and Effective Voxel Fusion Framework for Multi-Modal 3D Object Detection [33.46363259200292]
Existing voxel-based methods face challenges when fusing sparse voxel features with dense image features in a one-to-one manner.
We present VoxelNextFusion, a multi-modal 3D object detection framework specifically designed for voxel-based methods.
arXiv Detail & Related papers (2024-01-05T08:10:49Z)
- Bi-directional Adapter for Multi-modal Tracking [67.01179868400229]
We propose a novel multi-modal visual prompt tracking model based on a universal bi-directional adapter.
We develop a simple but effective light feature adapter to transfer modality-specific information from one modality to another.
Our model achieves superior tracking performance in comparison with both the full fine-tuning methods and the prompt learning-based methods.
arXiv Detail & Related papers (2023-12-17T05:27:31Z)
- Single-Shot and Multi-Shot Feature Learning for Multi-Object Tracking [55.13878429987136]
We propose a simple yet effective two-stage feature learning paradigm to jointly learn single-shot and multi-shot features for different targets.
Our method has achieved significant improvements on MOT17 and MOT20 datasets while reaching state-of-the-art performance on DanceTrack dataset.
arXiv Detail & Related papers (2023-11-17T08:17:49Z)
- FusionFormer: A Multi-sensory Fusion in Bird's-Eye-View and Temporal Consistent Transformer for 3D Object Detection [14.457844173630667]
We propose a novel end-to-end multi-modal fusion transformer-based framework, dubbed FusionFormer.
By developing a uniform sampling strategy, our method can easily sample from 2D image and 3D voxel features simultaneously.
Our method achieves state-of-the-art single model performance of 72.6% mAP and 75.1% NDS in the 3D object detection task without test time augmentation.
arXiv Detail & Related papers (2023-09-11T06:27:25Z)
- MLF-DET: Multi-Level Fusion for Cross-Modal 3D Object Detection [54.52102265418295]
We propose a novel and effective Multi-Level Fusion network, named as MLF-DET, for high-performance cross-modal 3D object DETection.
For the feature-level fusion, we present the Multi-scale Voxel Image fusion (MVI) module, which densely aligns multi-scale voxel features with image features.
For the decision-level fusion, we propose the lightweight Feature-cued Confidence Rectification (FCR) module, which exploits image semantics to rectify the confidence of detection candidates.
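Decision-level fusion of the kind the FCR module performs can be sketched in one plausible form: blend each detection candidate's confidence with an image-semantics cue so that only candidates supported by both modalities stay confident. The function name, the weighted geometric mean, and the `alpha` parameter below are illustrative assumptions, not MLF-DET's actual formulation.

```python
import numpy as np

def rectify_confidence(det_conf, img_sem_score, alpha=0.5):
    """Blend detector confidence with an image-semantics score per candidate.

    A weighted geometric mean: the rectified confidence is high only when
    both the 3D detector and the image semantics agree on the candidate.
    """
    det_conf = np.asarray(det_conf, dtype=float)
    img_sem_score = np.asarray(img_sem_score, dtype=float)
    return det_conf ** (1.0 - alpha) * img_sem_score ** alpha
```

With `alpha=0.5` a candidate the image branch rejects (score 0) is suppressed entirely, while agreeing cues leave the confidence unchanged.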
arXiv Detail & Related papers (2023-07-18T11:26:02Z)
- A Generalized Multi-Modal Fusion Detection Framework [7.951044844083936]
LiDAR point clouds have become the most common data source in autonomous driving.
Due to the sparsity of point clouds, accurate and reliable detection cannot be achieved in specific scenarios.
We propose a generic 3D detection framework called MMFusion, using multi-modal features.
arXiv Detail & Related papers (2023-03-13T12:38:07Z)
- Homogeneous Multi-modal Feature Fusion and Interaction for 3D Object Detection [16.198358858773258]
Multi-modal 3D object detection has been an active research topic in autonomous driving.
It is non-trivial to explore the cross-modal feature fusion between sparse 3D points and dense 2D pixels.
Recent approaches either fuse the image features with the point cloud features that are projected onto the 2D image plane or combine the sparse point cloud with dense image pixels.
arXiv Detail & Related papers (2022-10-18T06:15:56Z)
- BIMS-PU: Bi-Directional and Multi-Scale Point Cloud Upsampling [60.257912103351394]
We develop a new point cloud upsampling pipeline called BIMS-PU.
We decompose the up/downsampling procedure into several up/downsampling sub-steps by breaking the target sampling factor into smaller factors.
We show that our method achieves superior results to state-of-the-art approaches.
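The factor-decomposition idea behind BIMS-PU (breaking a target sampling factor into a chain of smaller sub-steps) can be sketched with a hypothetical helper, not the authors' code:

```python
def decompose_factor(r):
    """Break an integer up/downsampling factor into small prime sub-steps.

    e.g. a 4x target becomes two 2x steps, so the pipeline can apply
    several gentle sampling stages instead of one aggressive one.
    """
    steps, p = [], 2
    while r > 1:
        while r % p == 0:
            steps.append(p)
            r //= p
        p += 1
    return steps
```

Each returned sub-factor would drive one up/downsampling stage; their product recovers the original target factor.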
arXiv Detail & Related papers (2022-06-25T13:13:37Z) - Perception-aware Multi-sensor Fusion for 3D LiDAR Semantic Segmentation [59.42262859654698]
3D semantic segmentation is important in scene understanding for many applications, such as auto-driving and robotics.
Existing fusion-based methods may not achieve promising performance due to the vast difference between the two modalities.
In this work, we investigate a collaborative fusion scheme called perception-aware multi-sensor fusion (PMF) to exploit perceptual information from two modalities.
arXiv Detail & Related papers (2021-06-21T10:47:26Z)
- Exploring Data Augmentation for Multi-Modality 3D Object Detection [82.9988604088494]
It is counter-intuitive that multi-modality methods based on point cloud and images perform only marginally better or sometimes worse than approaches that solely use point cloud.
We propose a pipeline, named transformation flow, to bridge the gap between single and multi-modality data augmentation with transformation reversing and replaying.
Our method also wins the best PKL award in the 3rd nuScenes detection challenge.
arXiv Detail & Related papers (2020-12-23T15:23:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.