MMF-Track: Multi-modal Multi-level Fusion for 3D Single Object Tracking
- URL: http://arxiv.org/abs/2305.06794v2
- Date: Tue, 15 Aug 2023 03:24:57 GMT
- Title: MMF-Track: Multi-modal Multi-level Fusion for 3D Single Object Tracking
- Authors: Zhiheng Li, Yubo Cui, Yu Lin, Zheng Fang
- Abstract summary: 3D single object tracking plays a crucial role in computer vision.
We propose a Multi-modal Multi-level Fusion Tracker (MMF-Track), which exploits image texture and the geometric characteristics of point clouds to track the 3D target.
Experiments show that our method achieves state-of-the-art performance on KITTI (39% Success and 42% Precision gains over the previous multi-modal method) and is also competitive on NuScenes.
- Score: 26.405519771454102
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: 3D single object tracking plays a crucial role in computer vision. Mainstream
methods mainly rely on point clouds to achieve geometry matching between target
template and search area. However, textureless and incomplete point clouds make
it difficult for single-modal trackers to distinguish objects with similar
structures. To overcome the limitations of geometry matching, we propose a
Multi-modal Multi-level Fusion Tracker (MMF-Track), which exploits image
texture and the geometric characteristics of point clouds to track the 3D
target.
Specifically, we first propose a Space Alignment Module (SAM) to align RGB
images with point clouds in 3D space, which is the prerequisite for
constructing inter-modal associations. Then, at the feature interaction level,
we design a Feature Interaction Module (FIM) based on a dual-stream structure,
which enhances intra-modal features in parallel and constructs inter-modal
semantic associations. Meanwhile, to refine the features of each modality, we
introduce a Coarse-to-Fine Interaction Module (CFIM) that realizes hierarchical
feature interaction at different scales. Finally, at the similarity fusion
level, we propose a Similarity Fusion Module (SFM) to aggregate geometry and
texture cues from the target. Experiments show that our method achieves
state-of-the-art performance on KITTI (39% Success and 42% Precision gains
over the previous multi-modal method) and is also competitive on NuScenes.
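The SAM step described above is, at its core, calibration-based 3D-to-2D alignment. As a rough illustration (not the authors' code), the following PyTorch sketch projects LiDAR points into the image with an assumed pinhole model and bilinearly samples RGB features at the projected pixels; the function names and the K / T_cam_from_lidar parameters are hypothetical.

```python
import torch
import torch.nn.functional as F

def project_points_to_image(points, K, T_cam_from_lidar):
    """Project LiDAR points (N, 3) into the image plane.

    K: (3, 3) camera intrinsics; T_cam_from_lidar: (4, 4) extrinsics.
    Returns pixel coordinates (N, 2) and a mask for points in front of
    the camera. Hypothetical helper, not the paper's API.
    """
    n = points.shape[0]
    pts_h = torch.cat([points, points.new_ones(n, 1)], dim=1)  # homogeneous
    pts_cam = (T_cam_from_lidar @ pts_h.T).T[:, :3]            # camera frame
    in_front = pts_cam[:, 2] > 1e-6
    uv = (K @ pts_cam.T).T
    uv = uv[:, :2] / uv[:, 2:3].clamp(min=1e-6)                # perspective divide
    return uv, in_front

def sample_rgb_features(feat_map, uv, height, width):
    """Bilinearly sample a (C, H, W) image feature map at pixel coords."""
    grid = uv.clone()
    grid[:, 0] = grid[:, 0] / (width - 1) * 2 - 1   # normalize to [-1, 1]
    grid[:, 1] = grid[:, 1] / (height - 1) * 2 - 1
    grid = grid.view(1, 1, -1, 2)                   # (1, 1, N, 2)
    out = F.grid_sample(feat_map.unsqueeze(0), grid, align_corners=True)
    return out.squeeze(0).squeeze(1).T              # (N, C) per-point features
```

The sampled per-point image features can then be paired with the point features of the same points, which is the kind of inter-modal association the abstract calls a prerequisite for fusion.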
Related papers
- Fusion-Mamba for Cross-modality Object Detection [63.56296480951342]
Fusing complementary information across modalities effectively improves object detection performance.
We design a Fusion-Mamba block (FMB) to map cross-modal features into a hidden state space for interaction.
Our proposed approach outperforms state-of-the-art methods in mAP by 5.9% on the M3FD dataset and by 4.9% on the FLIR-Aligned dataset.
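As a loose illustration of interaction in a shared hidden state space (a toy gated recurrence, not the paper's Fusion-Mamba block), one might write:

```python
import torch
import torch.nn as nn

class ToyCrossModalSSM(nn.Module):
    """Toy stand-in for hidden-state-space fusion: tokens from two
    modalities are projected into one hidden state, mixed by a simple
    gated linear recurrence, and decoded back. Illustrative only."""

    def __init__(self, dim, hidden):
        super().__init__()
        self.in_rgb = nn.Linear(dim, hidden)
        self.in_ir = nn.Linear(dim, hidden)
        self.decay = nn.Parameter(torch.zeros(hidden))  # learnable forget gate
        self.out = nn.Linear(hidden, dim)

    def forward(self, rgb_seq, ir_seq):
        # rgb_seq, ir_seq: (T, dim) aligned token sequences per modality.
        h = torch.zeros_like(self.decay)
        fused = []
        for x_rgb, x_ir in zip(rgb_seq, ir_seq):
            u = self.in_rgb(x_rgb) + self.in_ir(x_ir)  # joint hidden input
            h = torch.sigmoid(self.decay) * h + u      # linear recurrence
            fused.append(self.out(h))
        return torch.stack(fused)                      # (T, dim) fused tokens
```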
arXiv Detail & Related papers (2024-04-14T05:28:46Z) - PoIFusion: Multi-Modal 3D Object Detection via Fusion at Points of Interest [65.48057241587398]
PoIFusion is a framework that fuses information from RGB images and LiDAR point clouds at points of interest (PoIs).
Our approach maintains the view of each modality and obtains multi-modal features via computation-friendly projection and interpolation.
We conducted extensive experiments on nuScenes and Argoverse2 datasets to evaluate our approach.
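A plausible reading of "fusion at points of interest" is: sample each modality's feature map at the same set of points and combine the samples per point. A hedged sketch follows; the helpers are hypothetical, not the PoIFusion code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def interp2d(feat, xy, extent):
    """Bilinear lookup of a (C, H, W) feature map at metric xy coords.

    extent = (x_min, x_max, y_min, y_max) covered by the map.
    Hypothetical helper illustrating per-PoI interpolation."""
    x0, x1, y0, y1 = extent
    gx = (xy[:, 0] - x0) / (x1 - x0) * 2 - 1
    gy = (xy[:, 1] - y0) / (y1 - y0) * 2 - 1
    grid = torch.stack([gx, gy], dim=-1).view(1, 1, -1, 2)
    out = F.grid_sample(feat.unsqueeze(0), grid, align_corners=True)
    return out.squeeze(0).squeeze(1).T                  # (N, C)

class PoIFuse(nn.Module):
    """Combine per-PoI BEV (LiDAR) and image samples with a small MLP;
    a sketch of point-level fusion, not the paper's module."""

    def __init__(self, c_bev, c_img, c_out):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(c_bev + c_img, c_out), nn.ReLU(),
            nn.Linear(c_out, c_out))

    def forward(self, f_bev, f_img):                    # both (N, C_*)
        return self.mlp(torch.cat([f_bev, f_img], dim=-1))
```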
arXiv Detail & Related papers (2024-03-14T09:28:12Z) - Modeling Continuous Motion for 3D Point Cloud Object Tracking [54.48716096286417]
This paper presents a novel approach that views each tracklet as a continuous stream.
At each timestamp, only the current frame is fed into the network to interact with multi-frame historical features stored in a memory bank.
To enhance the utilization of multi-frame features for robust tracking, a contrastive sequence enhancement strategy is proposed.
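A minimal sketch of the memory-bank pattern this summary describes: a fixed-size FIFO of past-frame features, with cross-attention from the current frame. The class and its sizes are assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class FrameMemoryBank(nn.Module):
    """Current-frame features attend to features of past frames kept in
    a fixed-size FIFO bank. Illustrative only."""

    def __init__(self, dim, max_frames=8, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.max_frames = max_frames
        self.bank = []                  # list of (N, dim) past-frame features

    def forward(self, cur):
        # cur: (N, dim) features of the current frame only.
        if self.bank:
            mem = torch.cat(self.bank, dim=0).unsqueeze(0)  # (1, M, dim)
            out, _ = self.attn(cur.unsqueeze(0), mem, mem)
            cur = cur + out.squeeze(0)                      # residual update
        self.bank.append(cur.detach())                      # store, no grads
        if len(self.bank) > self.max_frames:
            self.bank.pop(0)                                # FIFO eviction
        return cur
```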
arXiv Detail & Related papers (2023-03-14T02:58:27Z) - 3DMODT: Attention-Guided Affinities for Joint Detection & Tracking in 3D
Point Clouds [95.54285993019843]
We propose a method for joint detection and tracking of multiple objects in 3D point clouds.
Our model exploits temporal information by employing multiple frames to detect objects and track them in a single network.
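The paper's contribution is attention-guided refinement of cross-frame affinities; as a plain baseline for what an affinity matrix and its matching step look like (not the paper's method), consider:

```python
import torch
import torch.nn.functional as F

def affinity_matrix(feat_t, feat_t1):
    """Cosine-similarity affinities between detections in two
    consecutive frames; a simple baseline, not the refined version."""
    a = F.normalize(feat_t, dim=-1)
    b = F.normalize(feat_t1, dim=-1)
    return a @ b.T                                   # (N_t, N_t1)

def greedy_match(aff, thresh=0.5):
    """Greedy one-to-one assignment over the affinity matrix."""
    matches = []
    aff = aff.clone()
    while aff.numel() and aff.max() > thresh:
        i, j = divmod(int(aff.argmax()), aff.shape[1])
        matches.append((i, j))
        aff[i, :] = -1.0                             # remove matched row
        aff[:, j] = -1.0                             # remove matched column
    return matches
```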
arXiv Detail & Related papers (2022-11-01T20:59:38Z) - Homogeneous Multi-modal Feature Fusion and Interaction for 3D Object
Detection [16.198358858773258]
Multi-modal 3D object detection has been an active research topic in autonomous driving.
It is non-trivial to explore the cross-modal feature fusion between sparse 3D points and dense 2D pixels.
Recent approaches either fuse the image features with the point cloud features that are projected onto the 2D image plane or combine the sparse point cloud with dense image pixels.
arXiv Detail & Related papers (2022-10-18T06:15:56Z) - FFPA-Net: Efficient Feature Fusion with Projection Awareness for 3D
Object Detection [19.419030878019974]
Unstructured 3D point clouds are filled into the 2D plane, and 3D point cloud features are extracted faster using projection-aware convolution layers.
Corresponding indexes between the different sensor signals are established in advance during data preprocessing.
Two new plug-and-play fusion modules, LiCamFuse and BiLiCamFuse, are proposed.
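The precomputed-index idea can be sketched as follows: during preprocessing, each LiDAR point is mapped once to a flat pixel index, so runtime fusion reduces to a single gather. The projection model and names below are assumptions, not FFPA-Net's code.

```python
import torch

def precompute_point_pixel_index(points, K, T_cam_from_lidar, height, width):
    """Run once per frame in preprocessing: map every LiDAR point (N, 3)
    to a flat pixel index so runtime fusion is one indexed read.
    Hypothetical helper in the spirit of precomputed correspondences."""
    pts_h = torch.cat([points, points.new_ones(points.shape[0], 1)], dim=1)
    pts_cam = (T_cam_from_lidar @ pts_h.T).T[:, :3]
    uv = (K @ pts_cam.T).T
    uv = uv[:, :2] / uv[:, 2:3].clamp(min=1e-6)
    u = uv[:, 0].round().long().clamp(0, width - 1)
    v = uv[:, 1].round().long().clamp(0, height - 1)
    valid = pts_cam[:, 2] > 0                       # in front of the camera
    return v * width + u, valid                     # flat index into (H*W,)

# Runtime: feats_flat is an (H*W, C) flattened image feature map.
# per_point = feats_flat[idx] * valid.unsqueeze(-1).float()
```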
arXiv Detail & Related papers (2022-09-15T16:13:19Z) - Interactive Multi-scale Fusion of 2D and 3D Features for Multi-object
Tracking [23.130490413184596]
We introduce PointNet++ to obtain multi-scale deep representations of the point cloud, making it adaptive to our proposed Interactive Feature Fusion.
Our method achieves good performance on the KITTI benchmark and outperforms other approaches that do not use multi-scale feature fusion.
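One way to picture interactive multi-scale fusion is a per-scale gate that blends point and image features at each PointNet++ scale; a sketch under that assumption (not the paper's module):

```python
import torch
import torch.nn as nn

class MultiScaleGatedFusion(nn.Module):
    """Blend per-point features from several scales with matching image
    features via learned per-scale gates. Illustrative sketch only."""

    def __init__(self, dims):
        super().__init__()
        self.gates = nn.ModuleList(nn.Linear(2 * d, d) for d in dims)

    def forward(self, point_feats, image_feats):
        # point_feats / image_feats: lists of (N_s, d_s) tensors, one per
        # scale, assumed already aligned point-to-pixel.
        fused = []
        for gate, p, i in zip(self.gates, point_feats, image_feats):
            g = torch.sigmoid(gate(torch.cat([p, i], dim=-1)))
            fused.append(g * p + (1 - g) * i)       # gated blend per scale
        return fused
```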
arXiv Detail & Related papers (2022-03-30T13:00:27Z) - Similarity-Aware Fusion Network for 3D Semantic Segmentation [87.51314162700315]
We propose a similarity-aware fusion network (SAFNet) to adaptively fuse 2D images and 3D point clouds for 3D semantic segmentation.
We employ a late fusion strategy where we first learn the geometric and contextual similarities between the input point cloud and the point cloud back-projected from 2D pixels.
We show that SAFNet significantly outperforms existing state-of-the-art fusion-based approaches under varying levels of data integrity.
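The late-fusion idea can be sketched as a learned per-point similarity score that weights back-projected 2D predictions against native 3D predictions; the module below is a hypothetical illustration, not SAFNet itself.

```python
import torch
import torch.nn as nn

class SimilarityAwareLateFusion(nn.Module):
    """A per-point score decides how much to trust back-projected 2D
    predictions versus native 3D predictions. Hypothetical module."""

    def __init__(self, dim):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(),
            nn.Linear(dim, 1))

    def forward(self, feat3d, feat2d, logits3d, logits2d):
        # feat3d/feat2d: (N, dim); logits3d/logits2d: (N, num_classes).
        s = torch.sigmoid(self.score(torch.cat([feat3d, feat2d], dim=-1)))
        return s * logits2d + (1 - s) * logits3d    # similarity-weighted mix
```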
arXiv Detail & Related papers (2021-07-04T09:28:18Z) - M3DeTR: Multi-representation, Multi-scale, Mutual-relation 3D Object
Detection with Transformers [78.48081972698888]
We present M3DeTR, which combines different point cloud representations with different feature scales based on multi-scale feature pyramids.
M3DeTR is the first approach that unifies multiple point cloud representations and feature scales, and simultaneously models mutual relationships between point clouds using transformers.
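A compact way to picture the mutual-relation modeling is to concatenate tokens from each representation and let one transformer encoder attend across all of them; a sketch under that assumption (not the released M3DeTR code):

```python
import torch
import torch.nn as nn

class MultiRepTransformer(nn.Module):
    """Tokens from several point cloud representations (e.g. raw points,
    voxels, BEV) are concatenated so every representation attends to
    every other through a shared encoder. Illustrative only."""

    def __init__(self, dim, heads=4, layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, layers)

    def forward(self, rep_tokens):
        # rep_tokens: list of (B, N_i, dim) token sets, one per representation.
        sizes = [t.shape[1] for t in rep_tokens]
        x = self.encoder(torch.cat(rep_tokens, dim=1))  # joint attention
        return x.split(sizes, dim=1)    # back to per-representation token sets
```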
arXiv Detail & Related papers (2021-04-24T06:48:23Z)