S3MOT: Monocular 3D Object Tracking with Selective State Space Model
- URL: http://arxiv.org/abs/2504.18068v1
- Date: Fri, 25 Apr 2025 04:45:35 GMT
- Title: S3MOT: Monocular 3D Object Tracking with Selective State Space Model
- Authors: Zhuohao Yan, Shaoquan Feng, Xingxing Li, Yuxuan Zhou, Chunxi Xia, Shengyu Li,
- Abstract summary: Multi-object tracking in 3D space is essential for advancing robotics and computer vision applications. It remains a significant challenge in monocular setups due to the difficulty of mining 3D associations from 2D video streams. We present three innovative techniques to enhance the fusion of heterogeneous cues for monocular 3D MOT.
- Score: 3.5047603107971397
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Accurate and reliable multi-object tracking (MOT) in 3D space is essential for advancing robotics and computer vision applications. However, it remains a significant challenge in monocular setups due to the difficulty of mining 3D spatiotemporal associations from 2D video streams. In this work, we present three innovative techniques to enhance the fusion and exploitation of heterogeneous cues for monocular 3D MOT: (1) we introduce the Hungarian State Space Model (HSSM), a novel data association mechanism that compresses contextual tracking cues across multiple paths, enabling efficient and comprehensive assignment decisions with linear complexity. HSSM features a global receptive field and dynamic weights, in contrast to traditional linear assignment algorithms that rely on hand-crafted association costs. (2) We propose Fully Convolutional One-stage Embedding (FCOE), which eliminates ROI pooling by directly using dense feature maps for contrastive learning, thus improving object re-identification accuracy under challenging conditions such as varying viewpoints and lighting. (3) We enhance 6-DoF pose estimation through VeloSSM, an encoder-decoder architecture that models temporal dependencies in velocity to capture motion dynamics, overcoming the limitations of frame-based 3D inference. Experiments on the KITTI public test benchmark demonstrate the effectiveness of our method, achieving a new state-of-the-art performance of 76.86 HOTA at 31 FPS. Our approach outperforms the previous best by significant margins of +2.63 HOTA and +3.62 AssA, showcasing its robustness and efficiency for monocular 3D MOT tasks. The code and models are available at https://github.com/bytepioneerX/s3mot.
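To make the FCOE idea in the abstract concrete, below is a minimal sketch, not the authors' released code: it assumes a PyTorch setting in which re-identification embeddings are read directly off a dense feature map at object centers (no ROI pooling) and trained with a supervised contrastive (InfoNCE-style) loss that treats detections sharing a track ID across frames as positives. The function names, tensor shapes, and temperature value are illustrative assumptions; the actual architecture and loss details are in the linked repository.

```python
# Hedged sketch of an FCOE-style dense contrastive embedding (assumed names).
import torch
import torch.nn.functional as F


def sample_center_embeddings(feat_map: torch.Tensor,
                             centers: torch.Tensor) -> torch.Tensor:
    """Read per-object embeddings from a dense map, skipping ROI pooling.

    feat_map: (C, H, W) dense embedding map from a one-stage head.
    centers:  (N, 2) object centers as (x, y) indices on that map.
    Returns (N, C) L2-normalized embeddings.
    """
    xs = centers[:, 0].long().clamp(0, feat_map.shape[2] - 1)
    ys = centers[:, 1].long().clamp(0, feat_map.shape[1] - 1)
    emb = feat_map[:, ys, xs].t()  # gather (C, N), then transpose to (N, C)
    return F.normalize(emb, dim=1)


def dense_contrastive_loss(emb: torch.Tensor,
                           track_ids: torch.Tensor,
                           temperature: float = 0.1) -> torch.Tensor:
    """Supervised InfoNCE over embeddings pooled from several frames.

    emb:       (N, C) normalized embeddings.
    track_ids: (N,) identity labels; equal IDs form positive pairs.
    """
    n = emb.shape[0]
    sim = emb @ emb.t() / temperature                 # cosine-similarity logits
    self_mask = torch.eye(n, dtype=torch.bool, device=emb.device)
    sim = sim.masked_fill(self_mask, float("-inf"))   # exclude self-pairs
    pos = (track_ids[:, None] == track_ids[None, :]) & ~self_mask
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos_count = pos.sum(dim=1)
    valid = pos_count > 0                             # anchors with >=1 positive
    loss_per_anchor = -(log_prob.masked_fill(~pos, 0.0).sum(dim=1)[valid]
                        / pos_count[valid])
    return loss_per_anchor.mean()


if __name__ == "__main__":
    # Toy example: three detections, where IDs 1 appears in two frames.
    feat = torch.randn(64, 96, 320)                   # (C, H, W)
    centers = torch.tensor([[10.0, 20.0], [200.0, 50.0], [11.0, 21.0]])
    ids = torch.tensor([1, 2, 1])
    emb = sample_center_embeddings(feat, centers)
    print(dense_contrastive_loss(emb, ids))
```

Sampling at center locations keeps the embedding head fully convolutional and one-stage, which is what lets an FCOE-style design drop ROI pooling; under this reading, the resulting embeddings would serve as one of the heterogeneous cues that HSSM consumes during association.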
Related papers
- Global-Aware Monocular Semantic Scene Completion with State Space Models [25.621011183332094]
Monocular Semantic Scene Completion (MonoSSC) reconstructs and interprets 3D environments from a single image.
Existing methods are often constrained by the local receptive field of Convolutional Neural Networks (CNNs).
We introduce GA-MonoSSC, a hybrid architecture for MonoSSC that effectively captures global context in both the 2D image domain and 3D space.
arXiv Detail & Related papers (2025-03-09T11:55:40Z)
- GEAL: Generalizable 3D Affordance Learning with Cross-Modal Consistency [50.11520458252128]
Existing 3D affordance learning methods struggle with generalization and robustness due to limited annotated data.
We propose GEAL, a novel framework designed to enhance the generalization and robustness of 3D affordance learning by leveraging large-scale pre-trained 2D models.
GEAL consistently outperforms existing methods across seen and novel object categories, as well as corrupted data.
arXiv Detail & Related papers (2024-12-12T17:59:03Z)
- Street Gaussians without 3D Object Tracker [86.62329193275916]
Existing methods rely on labor-intensive manual labeling of object poses to reconstruct dynamic objects in canonical space.
We propose a stable object tracking module by leveraging associations from 2D deep trackers within a 3D object fusion strategy.
We address inevitable tracking errors by further introducing a motion learning strategy in an implicit feature space that autonomously corrects trajectory errors and recovers missed detections.
arXiv Detail & Related papers (2024-12-07T05:49:42Z)
- MetaSSC: Enhancing 3D Semantic Scene Completion for Autonomous Driving through Meta-Learning and Long-sequence Modeling [3.139165705827712]
We introduce MetaSSC, a novel meta-learning-based framework for semantic scene completion (SSC).
Our approach begins with a voxel-based semantic segmentation (SS) pretraining task, aimed at exploring the semantics and geometry of incomplete regions.
Using simulated cooperative perception datasets, we supervise the perception training of a single vehicle using aggregated sensor data.
This meta-knowledge is then adapted to the target domain through a dual-phase training strategy, enabling efficient deployment.
arXiv Detail & Related papers (2024-11-06T05:11:25Z)
- Unleashing the Potential of Mamba: Boosting a LiDAR 3D Sparse Detector by Using Cross-Model Knowledge Distillation [22.653014803666668]
We propose a faster LiDAR 3D object detection framework, called FASD, which implements heterogeneous model distillation by adaptively unifying cross-model voxel features.
We aim to distill the transformer's capacity for high-performance sequence modeling into Mamba models with low FLOPs, achieving a significant improvement in accuracy through knowledge transfer.
We evaluated the framework on large-scale autonomous driving datasets, including nuScenes, achieving a 4x reduction in resource consumption and a 1-2% performance improvement over current SoTA methods.
arXiv Detail & Related papers (2024-09-17T09:30:43Z)
- ShaSTA-Fuse: Camera-LiDAR Sensor Fusion to Model Shape and Spatio-Temporal Affinities for 3D Multi-Object Tracking [26.976216624424385]
3D multi-object tracking (MOT) is essential for an autonomous mobile agent to safely navigate a scene.
We aim to develop a 3D MOT framework that fuses camera and LiDAR sensor information.
arXiv Detail & Related papers (2023-10-04T02:17:59Z)
- EMR-MSF: Self-Supervised Recurrent Monocular Scene Flow Exploiting Ego-Motion Rigidity [13.02735046166494]
Self-supervised monocular scene flow estimation has received increasing attention for its simple and economical sensor setup.
We propose EMR-MSF, a superior model that borrows the advantages of network architecture design from supervised learning.
On the KITTI scene flow benchmark, our approach improves the SF-all metric of the state-of-the-art self-supervised monocular method by 44%.
arXiv Detail & Related papers (2023-09-04T00:30:06Z)
- Realistic Full-Body Tracking from Sparse Observations via Joint-Level Modeling [13.284947022380404]
We propose a two-stage framework that can obtain accurate and smooth full-body motions with three tracking signals of head and hands only.
Our framework explicitly models the joint-level features in the first stage and utilizes them as temporal tokens for alternating spatial and temporal transformer blocks to capture joint-level correlations in the second stage.
With extensive experiments on the AMASS motion dataset and real-captured data, we show our proposed method can achieve more accurate and smooth motion compared to existing approaches.
arXiv Detail & Related papers (2023-08-17T08:27:55Z)
- Modeling Continuous Motion for 3D Point Cloud Object Tracking [54.48716096286417]
This paper presents a novel approach that views each tracklet as a continuous stream.
At each timestamp, only the current frame is fed into the network to interact with multi-frame historical features stored in a memory bank.
To enhance the utilization of multi-frame features for robust tracking, a contrastive sequence enhancement strategy is proposed.
arXiv Detail & Related papers (2023-03-14T02:58:27Z)
- Exploring Optical-Flow-Guided Motion and Detection-Based Appearance for Temporal Sentence Grounding [61.57847727651068]
Temporal sentence grounding aims to localize a target segment in an untrimmed video semantically according to a given sentence query.
Most previous works focus on learning frame-level features of each whole frame in the entire video, and directly match them with the textual information.
We propose a novel Motion- and Appearance-guided 3D Semantic Reasoning Network (MA3SRN), which incorporates optical-flow-guided motion-aware, detection-based appearance-aware, and 3D-aware object-level features.
arXiv Detail & Related papers (2022-03-06T13:57:09Z)
- Learnable Online Graph Representations for 3D Multi-Object Tracking [156.58876381318402]
We propose a unified and learning-based approach to the 3D MOT problem.
We employ a Neural Message Passing network for data association that is fully trainable.
We show the merit of the proposed approach on the publicly available nuScenes dataset by achieving state-of-the-art performance of 65.6% AMOTA and 58% fewer ID-switches.
arXiv Detail & Related papers (2021-04-23T17:59:28Z)
- Monocular Quasi-Dense 3D Object Tracking [99.51683944057191]
A reliable and accurate 3D tracking framework is essential for predicting future locations of surrounding objects and planning the observer's actions in numerous applications such as autonomous driving.
We propose a framework that can effectively associate moving objects over time and estimate their full 3D bounding box information from a sequence of 2D images captured on a moving platform.
arXiv Detail & Related papers (2021-03-12T15:30:02Z)
This list is automatically generated from the titles and abstracts of the papers on this site.