DepTR-MOT: Unveiling the Potential of Depth-Informed Trajectory Refinement for Multi-Object Tracking
- URL: http://arxiv.org/abs/2509.17323v1
- Date: Mon, 22 Sep 2025 02:58:04 GMT
- Title: DepTR-MOT: Unveiling the Potential of Depth-Informed Trajectory Refinement for Multi-Object Tracking
- Authors: Buyin Deng, Lingxin Huang, Kai Luo, Fei Teng, Kailun Yang,
- Abstract summary: We introduce DepTR-MOT, a DETR-based detector enhanced with instance-level depth information.<n>Specifically, we propose two key innovations: (i) foundation model-based instance-level soft depth label supervision, which refines depth prediction, and (ii) the distillation of dense depth maps to maintain global depth consistency.
- Score: 11.870160177470753
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual Multi-Object Tracking (MOT) is a crucial component of robotic perception, yet existing Tracking-By-Detection (TBD) methods often rely on 2D cues, such as bounding boxes and motion modeling, which struggle under occlusions and close-proximity interactions. Trackers relying on these 2D cues are particularly unreliable in robotic environments, where dense targets and frequent occlusions are common. While depth information has the potential to alleviate these issues, most existing MOT datasets lack depth annotations, leading to its underexploited role in the domain. To unveil the potential of depth-informed trajectory refinement, we introduce DepTR-MOT, a DETR-based detector enhanced with instance-level depth information. Specifically, we propose two key innovations: (i) foundation model-based instance-level soft depth label supervision, which refines depth prediction, and (ii) the distillation of dense depth maps to maintain global depth consistency. These strategies enable DepTR-MOT to output instance-level depth during inference, without requiring foundation models and without additional computational cost. By incorporating depth cues, our method enhances the robustness of the TBD paradigm, effectively resolving occlusion and close-proximity challenges. Experiments on both the QuadTrack and DanceTrack datasets demonstrate the effectiveness of our approach, achieving HOTA scores of 27.59 and 44.47, respectively. In particular, results on QuadTrack, a robotic platform MOT dataset, highlight the advantages of our method in handling occlusion and close-proximity challenges in robotic tracking. The source code will be made publicly available at https://github.com/warriordby/DepTR-MOT.
Related papers
- Is This Tracker On? A Benchmark Protocol for Dynamic Tracking [6.23176842962524]
ITTO is a new benchmark suite for evaluating and diagnosing the capabilities and limitations of point tracking methods.<n>We conduct a rigorous analysis of state-of-the-art tracking methods on ITTO, breaking down performance along key axes of motion complexity.
arXiv Detail & Related papers (2025-10-22T17:53:56Z) - GRASPTrack: Geometry-Reasoned Association via Segmentation and Projection for Multi-Object Tracking [11.436294975354556]
GRASPTrack is a novel MOT framework that integrates monocular depth estimation and instance segmentation into a standard TBD pipeline.<n>These 3D point clouds are then voxelized to enable a precise and robust Voxel-Based 3D Intersection-over-Union.
arXiv Detail & Related papers (2025-08-11T15:56:21Z) - Optimizing Cooperative Multi-Object Tracking using Graph Signal Processing [45.68287260385148]
This paper proposes a novel Cooperative MOT framework for tracking objects in 3D LiDAR scene.<n>By exploiting a fully connected graph topology defined by the detected bounding boxes, we employ the Graph Laplacian processing optimization technique.<n>An extensive evaluation study has been conducted, using the real-world V2V4Real dataset.
arXiv Detail & Related papers (2025-06-11T07:21:58Z) - Enhanced Object Tracking by Self-Supervised Auxiliary Depth Estimation Learning [15.364238035194854]
We propose a new method, named MDETrack, which trains a tracking network with an additional capability to understand the depth of scenes.
The results show an improved tracking accuracy even without real depth.
arXiv Detail & Related papers (2024-05-23T05:43:38Z) - SparseTrack: Multi-Object Tracking by Performing Scene Decomposition
based on Pseudo-Depth [84.64121608109087]
We propose a pseudo-depth estimation method for obtaining the relative depth of targets from 2D images.
Secondly, we design a depth cascading matching (DCM) algorithm, which can use the obtained depth information to convert a dense target set into multiple sparse target subsets.
By integrating the pseudo-depth method and the DCM strategy into the data association process, we propose a new tracker, called SparseTrack.
arXiv Detail & Related papers (2023-06-08T14:36:10Z) - Learning Dual-Fused Modality-Aware Representations for RGBD Tracking [67.14537242378988]
Compared with the traditional RGB object tracking, the addition of the depth modality can effectively solve the target and background interference.
Some existing RGBD trackers use the two modalities separately and thus some particularly useful shared information between them is ignored.
We propose a novel Dual-fused Modality-aware Tracker (termed DMTracker) which aims to learn informative and discriminative representations of the target objects for robust RGBD tracking.
arXiv Detail & Related papers (2022-11-06T07:59:07Z) - RLM-Tracking: Online Multi-Pedestrian Tracking Supported by Relative
Location Mapping [5.9669075749248774]
Problem of multi-object tracking is a fundamental computer vision research focus, widely used in public safety, transport, autonomous vehicles, robotics, and other regions involving artificial intelligence.
In this paper, we design a new multi-object tracker for the above issues that contains an object textbfRelative Location Mapping (RLM) model and textbfTarget Region Density (TRD) model.
The new tracker is more sensitive to the differences in position relationships between objects.
It can introduce low-score detection frames into different regions in real-time according to the density of object
arXiv Detail & Related papers (2022-10-19T11:37:14Z) - Minkowski Tracker: A Sparse Spatio-Temporal R-CNN for Joint Object
Detection and Tracking [53.64390261936975]
We present Minkowski Tracker, a sparse-temporal R-CNN that jointly solves object detection and tracking problems.
Inspired by region-based CNN (R-CNN), we propose to track motion as a second stage of the object detector R-CNN.
We show in large-scale experiments that the overall performance gain of our method is due to four factors.
arXiv Detail & Related papers (2022-08-22T04:47:40Z) - Learnable Online Graph Representations for 3D Multi-Object Tracking [156.58876381318402]
We propose a unified and learning based approach to the 3D MOT problem.
We employ a Neural Message Passing network for data association that is fully trainable.
We show the merit of the proposed approach on the publicly available nuScenes dataset by achieving state-of-the-art performance of 65.6% AMOTA and 58% fewer ID-switches.
arXiv Detail & Related papers (2021-04-23T17:59:28Z) - Tracking-by-Counting: Using Network Flows on Crowd Density Maps for
Tracking Multiple Targets [96.98888948518815]
State-of-the-art multi-object tracking(MOT) methods follow the tracking-by-detection paradigm.
We propose a new MOT paradigm, tracking-by-counting, tailored for crowded scenes.
arXiv Detail & Related papers (2020-07-18T19:51:53Z) - DPANet: Depth Potentiality-Aware Gated Attention Network for RGB-D
Salient Object Detection [107.96418568008644]
We propose a novel network named DPANet to explicitly model the potentiality of the depth map and effectively integrate the cross-modal complementarity.
By introducing the depth potentiality perception, the network can perceive the potentiality of depth information in a learning-based manner.
arXiv Detail & Related papers (2020-03-19T07:27:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.