M3SOT: Multi-frame, Multi-field, Multi-space 3D Single Object Tracking
- URL: http://arxiv.org/abs/2312.06117v1
- Date: Mon, 11 Dec 2023 04:49:47 GMT
- Title: M3SOT: Multi-frame, Multi-field, Multi-space 3D Single Object Tracking
- Authors: Jiaming Liu, Yue Wu, Maoguo Gong, Qiguang Miao, Wenping Ma, Can Qin
- Abstract summary: 3D Single Object Tracking (SOT) stands as a forefront task of computer vision, proving essential for applications like autonomous driving.
In this research, we unveil M3SOT, a novel 3D SOT framework, which synergizes multiple input frames (template sets), multiple receptive fields (continuous contexts), and multiple solution spaces (distinct tasks) in ONE model.
- Score: 41.716532647616134
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: 3D Single Object Tracking (SOT) stands as a forefront task of computer
vision, proving essential for applications like autonomous driving. Sparse and occluded
data in scene point clouds introduce variations in the appearance of tracked
objects, adding complexity to the task. In this research, we unveil M3SOT, a
novel 3D SOT framework, which synergizes multiple input frames (template sets),
multiple receptive fields (continuous contexts), and multiple solution spaces
(distinct tasks) in ONE model. Remarkably, M3SOT pioneers the direct modeling of
temporality, contexts, and tasks from point clouds, revisiting the key factors that
influence SOT. To this end, we design a
transformer-based network centered on point cloud targets in the search area,
aggregating diverse contextual representations and propagating target cues by
employing historical frames. As M3SOT spans varied processing perspectives, we have
streamlined the network, trimming its depth and optimizing its structure, to ensure
a lightweight and efficient deployment for SOT applications. We posit that, backed
by this practical construction, M3SOT sidesteps the need for complex frameworks and
auxiliary components while still delivering strong results. Extensive experiments on
benchmarks such as KITTI, nuScenes, and Waymo
Open Dataset demonstrate that M3SOT achieves state-of-the-art performance at 38
FPS. Our code and models are available at
https://github.com/ywu0912/TeamCode.git.
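The abstract describes the core mechanism only at a high level: a transformer propagates target cues from a set of historical template frames into the search-area point features. Below is a minimal, hypothetical PyTorch sketch of that multi-frame cross-attention idea; every module name, shape, and design choice is an illustrative assumption, not the authors' implementation (see the repository above for that).

```python
# Hypothetical sketch of multi-frame target-cue propagation for 3D SOT.
# All names, shapes, and design choices are illustrative assumptions,
# not the M3SOT authors' implementation.
import torch
import torch.nn as nn


class MultiFrameCueAggregator(nn.Module):
    """Cross-attends search-area point features to a set of template frames."""

    def __init__(self, dim: int = 128, num_heads: int = 4, num_layers: int = 2):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.MultiheadAttention(dim, num_heads, batch_first=True)
            for _ in range(num_layers)
        )
        self.norms = nn.ModuleList(nn.LayerNorm(dim) for _ in range(num_layers))

    def forward(self, search_feat, template_feats):
        # search_feat:    (B, N_s, C) features of search-area points
        # template_feats: list of (B, N_t, C), one per historical frame
        templates = torch.cat(template_feats, dim=1)  # (B, T*N_t, C)
        x = search_feat
        for attn, norm in zip(self.layers, self.norms):
            # Each layer propagates target cues from the template set
            # into the search area via cross-attention.
            out, _ = attn(query=x, key=templates, value=templates)
            x = norm(x + out)  # residual + norm, standard transformer style
        return x  # enriched search features, ready for a localization head


if __name__ == "__main__":
    B, N_s, N_t, C = 2, 1024, 512, 128
    agg = MultiFrameCueAggregator(dim=C)
    search = torch.randn(B, N_s, C)
    templates = [torch.randn(B, N_t, C) for _ in range(3)]  # 3 template frames
    print(agg(search, templates).shape)  # torch.Size([2, 1024, 128])
```

Stacking several shallow cross-attention layers, rather than one deep encoder, is one plausible reading of the abstract's "trimmed depth" remark; the real architecture may differ.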
Related papers
- Multimodality Helps Few-Shot 3D Point Cloud Semantic Segmentation [61.91492500828508]
Few-shot 3D point cloud segmentation (FS-PCS) aims at generalizing models to segment novel categories with minimal support samples.
We introduce a cost-free multimodal FS-PCS setup, utilizing textual labels and the potentially available 2D image modality.
We propose a simple yet effective Test-time Adaptive Cross-modal Seg (TACC) technique to mitigate training bias.
arXiv Detail & Related papers (2024-10-29T19:28:41Z) - RepVF: A Unified Vector Fields Representation for Multi-task 3D Perception [64.80760846124858]
This paper proposes a novel unified representation, RepVF, which harmonizes the representation of various perception tasks.
RepVF characterizes the structure of different targets in the scene through a vector field, enabling a single-head, multi-task learning model.
Building upon RepVF, we introduce RFTR, a network designed to exploit the inherent connections between different tasks.
arXiv Detail & Related papers (2024-07-15T16:25:07Z) - Boosting 3D Object Detection with Semantic-Aware Multi-Branch Framework [44.44329455757931]
In autonomous driving, LiDAR sensors are vital for acquiring 3D point clouds, providing reliable geometric information.
Traditional preprocessing sampling methods often ignore semantic features, leading to detail loss and interference from ground points.
We propose a multi-branch two-stage 3D object detection framework using a Semantic-aware Multi-branch Sampling (SMS) module and multi-view constraints.
arXiv Detail & Related papers (2024-07-08T09:25:45Z) - A Point-Based Approach to Efficient LiDAR Multi-Task Perception [49.91741677556553]
PAttFormer is an efficient multi-task architecture for joint semantic segmentation and object detection in point clouds.
Unlike other LiDAR-based multi-task architectures, our proposed PAttFormer does not require separate feature encoders for task-specific point cloud representations.
Our evaluations show substantial gains from multi-task learning, improving LiDAR semantic segmentation by +1.7% in mIoU and 3D object detection by +1.7% in mAP.
arXiv Detail & Related papers (2024-04-19T11:24:34Z) - PoIFusion: Multi-Modal 3D Object Detection via Fusion at Points of Interest [65.48057241587398]
PoIFusion is a framework that fuses information from RGB images and LiDAR point clouds at points of interest (PoIs).
The approach maintains the view of each modality and obtains multi-modal features through computation-friendly projection; a hedged sketch of this projection-and-sampling idea appears after this list.
We conducted extensive experiments on nuScenes and Argoverse2 datasets to evaluate our approach.
arXiv Detail & Related papers (2024-03-14T09:28:12Z) - SCA-PVNet: Self-and-Cross Attention Based Aggregation of Point Cloud and
Multi-View for 3D Object Retrieval [8.74845857766369]
Multi-modality 3D object retrieval has rarely been developed and analyzed on large-scale datasets.
We propose self-and-cross attention based aggregation of point cloud and multi-view images (SCA-PVNet) for 3D object retrieval.
arXiv Detail & Related papers (2023-07-20T05:46:32Z) - MMF-Track: Multi-modal Multi-level Fusion for 3D Single Object Tracking [26.405519771454102]
3D single object tracking plays a crucial role in computer vision.
We propose a Multi-modal Multi-level Fusion Tracker (MMF-Track), which exploits image texture and the geometric characteristics of point clouds to track the 3D target.
Experiments show that our method achieves state-of-the-art performance on KITTI (39% Success and 42% Precision gains over the previous multi-modal method) and is also competitive on nuScenes.
arXiv Detail & Related papers (2023-05-11T13:34:02Z) - Simultaneous Multiple Object Detection and Pose Estimation using 3D
Model Infusion with Monocular Vision [21.710141497071373]
Multiple object detection and pose estimation are vital computer vision tasks.
We propose simultaneous neural modeling of both using monocular vision and 3D model infusion.
Our Simultaneous Multiple Object detection and Pose Estimation network (SMOPE-Net) is an end-to-end trainable multitasking network.
arXiv Detail & Related papers (2022-11-21T05:18:56Z) - M3DeTR: Multi-representation, Multi-scale, Mutual-relation 3D Object
Detection with Transformers [78.48081972698888]
We present M3DeTR, which combines different point cloud representations with different feature scales based on multi-scale feature pyramids.
M3DeTR is the first approach that unifies multiple point cloud representations and feature scales while simultaneously modeling mutual relationships between point clouds using transformers.
arXiv Detail & Related papers (2021-04-24T06:48:23Z)