Related papers: DRMOT: A Dataset and Framework for RGBD Referring Multi-Object Tracking

DRMOT: A Dataset and Framework for RGBD Referring Multi-Object Tracking

URL: http://arxiv.org/abs/2602.04692v2
Date: Fri, 06 Feb 2026 04:39:34 GMT
Title: DRMOT: A Dataset and Framework for RGBD Referring Multi-Object Tracking
Authors: Sijia Chen, Lijuan Ma, Yanqiu Yu, En Yu, Liman Liu, Wenbing Tao,
Abstract summary: Referring Multi-Object Tracking (RMOT) aims to track specific targets based on language descriptions.<n>We propose a novel task, RGBD Referring Multi-Object Tracking (DRMOT), which explicitly requires models to fuse RGB, Depth (D), and Language (L) modalities to achieve 3D-aware tracking.
Score: 35.56361594180878
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Referring Multi-Object Tracking (RMOT) aims to track specific targets based on language descriptions and is vital for interactive AI systems such as robotics and autonomous driving. However, existing RMOT models rely solely on 2D RGB data, making it challenging to accurately detect and associate targets characterized by complex spatial semantics (e.g., ``the person closest to the camera'') and to maintain reliable identities under severe occlusion, due to the absence of explicit 3D spatial information. In this work, we propose a novel task, RGBD Referring Multi-Object Tracking (DRMOT), which explicitly requires models to fuse RGB, Depth (D), and Language (L) modalities to achieve 3D-aware tracking. To advance research on the DRMOT task, we construct a tailored RGBD referring multi-object tracking dataset, named DRSet, designed to evaluate models' spatial-semantic grounding and tracking capabilities. Specifically, DRSet contains RGB images and depth maps from 187 scenes, along with 240 language descriptions, among which 56 descriptions incorporate depth-related information. Furthermore, we propose DRTrack, a MLLM-guided depth-referring tracking framework. DRTrack performs depth-aware target grounding from joint RGB-D-L inputs and enforces robust trajectory association by incorporating depth cues. Extensive experiments on the DRSet dataset demonstrate the effectiveness of our framework.

Related papers

Collaborating Vision, Depth, and Thermal Signals for Multi-Modal Tracking: Dataset and Algorithm [103.36490810025752]
Existing multi-modal object tracking approaches primarily focus on dual-modal paradigms, such as RGB-Depth or RGB-Thermal.<n>This work introduces a novel multi-modal tracking task that leverages three complementary modalities, including visible RGB, Depth (D), and Thermal Infrared (TIR)<n>We propose a novel multi-modal tracker, dubbed RDTTrack, which integrates tri-modal information for robust tracking by leveraging a pretrained RGB-only tracking model.
arXiv Detail & Related papers (2025-09-29T13:05:15Z)
Learning Dual-Fused Modality-Aware Representations for RGBD Tracking [67.14537242378988]
Compared with the traditional RGB object tracking, the addition of the depth modality can effectively solve the target and background interference. Some existing RGBD trackers use the two modalities separately and thus some particularly useful shared information between them is ignored. We propose a novel Dual-fused Modality-aware Tracker (termed DMTracker) which aims to learn informative and discriminative representations of the target objects for robust RGBD tracking.
arXiv Detail & Related papers (2022-11-06T07:59:07Z)
RGBD Object Tracking: An In-depth Review [89.96221353160831]
We firstly review RGBD object trackers from different perspectives, including RGBD fusion, depth usage, and tracking framework. We benchmark a representative set of RGBD trackers, and give detailed analyses based on their performances.
arXiv Detail & Related papers (2022-03-26T18:53:51Z)
DepthTrack : Unveiling the Power of RGBD Tracking [29.457114656913944]
This work introduces a new RGBD tracking dataset - Depth-Track. It has twice as many sequences (200) and scene types (40) than in the largest existing dataset. The average length of the sequences (1473), the number of deformable objects (16) and the number of tracking attributes (15) have been increased.
arXiv Detail & Related papers (2021-08-31T16:42:38Z)
Monocular Quasi-Dense 3D Object Tracking [99.51683944057191]
A reliable and accurate 3D tracking framework is essential for predicting future locations of surrounding objects and planning the observer's actions in numerous applications such as autonomous driving. We propose a framework that can effectively associate moving objects over time and estimate their full 3D bounding box information from a sequence of 2D images captured on a moving platform.
arXiv Detail & Related papers (2021-03-12T15:30:02Z)
Relation3DMOT: Exploiting Deep Affinity for 3D Multi-Object Tracking from View Aggregation [8.854112907350624]
3D multi-object tracking plays a vital role in autonomous navigation. Many approaches detect objects in 2D RGB sequences for tracking, which is lack of reliability when localizing objects in 3D space. We propose a novel convolutional operation, named RelationConv, to better exploit the correlation between each pair of objects in the adjacent frames.
arXiv Detail & Related papers (2020-11-25T16:14:40Z)
DMD: A Large-Scale Multi-Modal Driver Monitoring Dataset for Attention and Alertness Analysis [54.198237164152786]
Vision is the richest and most cost-effective technology for Driver Monitoring Systems (DMS) The lack of sufficiently large and comprehensive datasets is currently a bottleneck for the progress of DMS development. In this paper, we introduce the Driver Monitoring dataset (DMD), an extensive dataset which includes real and simulated driving scenarios.
arXiv Detail & Related papers (2020-08-27T12:33:54Z)
Is Depth Really Necessary for Salient Object Detection? [50.10888549190576]
We make the first attempt in realizing an unified depth-aware framework with only RGB information as input for inference. Not only surpasses the state-of-the-art performances on five public RGB SOD benchmarks, but also surpasses the RGBD-based methods on five benchmarks by a large margin.
arXiv Detail & Related papers (2020-05-30T13:40:03Z)
JRMOT: A Real-Time 3D Multi-Object Tracker and a New Large-Scale Dataset [34.609125601292]
We present JRMOT, a novel 3D MOT system that integrates information from RGB images and 3D point clouds to achieve real-time tracking performance. As part of our work, we release the JRDB dataset, a novel large scale 2D+3D dataset and benchmark. The presented 3D MOT system demonstrates state-of-the-art performance against competing methods on the popular 2D tracking KITTI benchmark.
arXiv Detail & Related papers (2020-02-19T19:21:33Z)

This list is automatically generated from the titles and abstracts of the papers in this site.