ORMOT: A Dataset and Framework for Omnidirectional Referring Multi-Object Tracking
- URL: http://arxiv.org/abs/2603.05384v1
- Date: Thu, 05 Mar 2026 17:15:01 GMT
- Title: ORMOT: A Dataset and Framework for Omnidirectional Referring Multi-Object Tracking
- Authors: Sijia Chen, Zihan Zhou, Yanqiu Yu, En Yu, Wenbing Tao
- Abstract summary: Multi-Object Tracking (MOT) is a fundamental task in computer vision, aiming to track targets across video frames. We propose Omnidirectional Referring Multi-Object Tracking (ORMOT), which extends RMOT to omnidirectional imagery. We construct ORSet, an Omnidirectional Referring Multi-Object Tracking dataset, which contains 27 diverse omnidirectional scenes, 848 language descriptions, and 3,401 annotated objects.
- Score: 39.56214494580301
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multi-Object Tracking (MOT) is a fundamental task in computer vision, aiming to track targets across video frames. Existing MOT methods perform well in general visual scenes, but face significant challenges and limitations when extended to visual-language settings. To bridge this gap, the task of Referring Multi-Object Tracking (RMOT) has recently been proposed, which aims to track objects that correspond to language descriptions. However, current RMOT methods are primarily developed on datasets captured by conventional cameras, which suffer from limited field of view. This constraint often causes targets to move out of the frame, leading to fragmented tracking and loss of contextual information. In this work, we propose a novel task, called Omnidirectional Referring Multi-Object Tracking (ORMOT), which extends RMOT to omnidirectional imagery, aiming to overcome the field-of-view (FoV) limitation of conventional datasets and improve the model's ability to understand long-horizon language descriptions. To advance the ORMOT task, we construct ORSet, an Omnidirectional Referring Multi-Object Tracking dataset, which contains 27 diverse omnidirectional scenes, 848 language descriptions, and 3,401 annotated objects, providing rich visual, temporal, and language information. Furthermore, we propose ORTrack, a Large Vision-Language Model (LVLM)-driven framework tailored for Omnidirectional Referring Multi-Object Tracking. Extensive experiments on the ORSet dataset demonstrate the effectiveness of our ORTrack framework. The dataset and code will be open-sourced at https://github.com/chen-si-jia/ORMOT.
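One concrete difficulty that omnidirectional (equirectangular) imagery introduces, and which conventional-camera trackers never face, is that the horizontal axis wraps around: a target leaving the right edge of the panorama re-enters on the left. As a minimal illustration of handling this during box association (our own sketch, not code from ORTrack; the function name and box convention are assumptions), an IoU can be made wrap-aware by also scoring horizontally shifted copies of one box:

```python
def wrap_aware_iou(box_a, box_b, pano_width):
    """IoU between axis-aligned boxes (x1, y1, x2, y2) on an
    equirectangular panorama whose x-axis wraps at pano_width.

    Convention (an assumption for this sketch): a box crossing the
    right edge is stored with x2 > pano_width, so testing box_b at
    shifts of 0 and +/- pano_width covers every seam crossing.
    """
    def iou(a, b):
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        area_a = (a[2] - a[0]) * (a[3] - a[1])
        area_b = (b[2] - b[0]) * (b[3] - b[1])
        return inter / (area_a + area_b - inter + 1e-9)

    return max(
        iou(box_a, (box_b[0] + s, box_b[1], box_b[2] + s, box_b[3]))
        for s in (-pano_width, 0.0, pano_width)
    )
```

A tracker's association step would call this in place of a standard IoU whenever the input video is a full 360° panorama, so that a detection re-entering on the opposite edge can still match its existing track.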
Related papers
- ReaMOT: A Benchmark and Framework for Reasoning-based Multi-Object Tracking [23.76697700853566]
We propose a new task, called Reasoning-based Multi-Object Tracking (ReaMOT). ReaMOT is a more challenging task that requires accurately reasoning about which objects match a language instruction with reasoning characteristics, and tracking those objects' trajectories. We construct ReaMOT Challenge, a reasoning-based multi-object tracking benchmark built upon 12 datasets.
arXiv Detail & Related papers (2025-05-26T17:55:19Z)
- Cognitive Disentanglement for Referring Multi-Object Tracking [28.325814292139686]
We propose a Cognitive Disentanglement for Referring Multi-Object Tracking (CDRMT) framework. CDRMT adapts the "what" and "where" pathways from the human visual processing system to RMOT tasks. Experiments on different benchmark datasets demonstrate that CDRMT achieves substantial improvements over state-of-the-art methods.
arXiv Detail & Related papers (2025-03-14T15:21:54Z)
- MITracker: Multi-View Integration for Visual Object Tracking [15.713725317019321]
We develop a novel MVOT method, Multi-View Integration Tracker (MITracker), to efficiently integrate multi-view object features. MITracker can track any object in video frames of arbitrary length from arbitrary viewpoints. MITracker outperforms existing methods on the MVTrack and GMTD datasets, achieving state-of-the-art performance.
arXiv Detail & Related papers (2025-02-27T14:03:28Z)
- Cross-View Referring Multi-Object Tracking [25.963714973838417]
Referring Multi-Object Tracking (RMOT) is an important topic in the current tracking field. We propose a new task, called Cross-view Referring Multi-Object Tracking (CRMOT). It introduces cross-view information to obtain the appearances of objects from multiple views, avoiding the problem of object appearances being invisible in the RMOT task.
arXiv Detail & Related papers (2024-12-23T18:58:39Z)
- OVTrack: Open-Vocabulary Multiple Object Tracking [64.73379741435255]
OVTrack is an open-vocabulary tracker capable of tracking arbitrary object classes.
It sets a new state-of-the-art on the large-scale, large-vocabulary TAO benchmark (a sketch of the text-embedding classification recipe behind such open-vocabulary trackers appears after this list).
arXiv Detail & Related papers (2023-04-17T16:20:05Z)
- DIVOTrack: A Novel Dataset and Baseline Method for Cross-View Multi-Object Tracking in DIVerse Open Scenes [74.64897845999677]
We introduce a new cross-view multi-object tracking dataset for DIVerse Open scenes with densely tracked pedestrians.
Our DIVOTrack has fifteen distinct scenarios and 953 cross-view tracks, surpassing all cross-view multi-object tracking datasets currently available.
Furthermore, we provide a novel baseline cross-view tracking method with a unified joint detection and cross-view tracking framework named CrossMOT.
arXiv Detail & Related papers (2023-02-15T14:10:42Z)
- Unifying Tracking and Image-Video Object Detection [54.91658924277527]
TrIVD (Tracking and Image-Video Detection) is the first framework that unifies image OD, video OD, and MOT within one end-to-end model.
To handle the discrepancies and semantic overlaps of category labels, TrIVD formulates detection/tracking as grounding and reasons about object categories.
arXiv Detail & Related papers (2022-11-20T20:30:28Z)
- End-to-end Tracking with a Multi-query Transformer [96.13468602635082]
Multiple-object tracking (MOT) is a challenging task that requires simultaneous reasoning about location, appearance, and identity of the objects in the scene over time.
Our aim in this paper is to move beyond tracking-by-detection approaches, to class-agnostic tracking that also performs well for unknown object classes (a minimal track-query sketch follows this list).
arXiv Detail & Related papers (2022-10-26T10:19:37Z)
- TAO: A Large-Scale Benchmark for Tracking Any Object [95.87310116010185]
The Tracking Any Object dataset consists of 2,907 high-resolution videos, each half a minute long on average, captured in diverse environments.
We ask annotators to label objects that move at any point in the video, and give names to them post factum.
Our vocabulary is both significantly larger and qualitatively different from existing tracking datasets.
arXiv Detail & Related papers (2020-05-20T21:07:28Z)
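As referenced in the OVTrack entry above, open-vocabulary trackers typically classify detections by comparing region embeddings against text embeddings of arbitrary class names, instead of using a fixed classification head. A hedged sketch of that general recipe with a CLIP-style model (the checkpoint name and helper function are illustrative, not OVTrack's actual pipeline):

```python
import torch
from transformers import CLIPModel, CLIPProcessor

# Any model with a joint image/text embedding space works here;
# this checkpoint is just a common, publicly available choice.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def classify_regions(region_crops, class_names):
    """Assign each cropped detection (a list of PIL images) the
    best-matching free-form class name from `class_names`."""
    inputs = processor(text=class_names, images=region_crops,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # logits_per_image has shape (num_regions, num_class_names).
    return out.logits_per_image.argmax(dim=-1)  # index into class_names
```

Because the vocabulary is just a list of strings, the same tracker can be pointed at new classes at test time without retraining, which is what "open vocabulary" buys.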
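Similarly, for the multi-query transformer entry: the core mechanism in end-to-end query-based trackers (in the spirit of TrackFormer/MOTR-style designs; the module below is our own simplification with assumed dimensions and heads, not that paper's architecture) is to carry each confident object query forward to the next frame as a "track query", so identity is simply the index of a surviving query:

```python
import torch
import torch.nn as nn

class TrackQueryDecoder(nn.Module):
    """Minimal query-propagation tracker: one decoder pass per frame,
    with confident queries carried over as track queries."""

    def __init__(self, dim=256, num_new_queries=50):
        super().__init__()
        # Learned queries that detect newly appearing objects.
        self.new_queries = nn.Parameter(torch.randn(num_new_queries, dim))
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8,
                                           batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.score_head = nn.Linear(dim, 1)  # objectness per query
        self.box_head = nn.Linear(dim, 4)    # normalized (cx, cy, w, h)

    def forward(self, frame_features, track_queries=None, keep_thresh=0.5):
        # frame_features: (1, num_tokens, dim) encoder features of this frame.
        queries = self.new_queries.unsqueeze(0)
        if track_queries is not None:
            # Queries kept from the previous frame re-detect "their" objects.
            queries = torch.cat([track_queries, queries], dim=1)
        hs = self.decoder(queries, frame_features)          # (1, Q, dim)
        scores = self.score_head(hs).sigmoid().squeeze(-1)  # (1, Q)
        boxes = self.box_head(hs).sigmoid()                 # (1, Q, 4)
        keep = scores[0] > keep_thresh
        # Surviving queries become next frame's track queries.
        return boxes[0, keep], hs[:, keep]
```

At inference the caller loops over frames, feeding the returned track queries back in; no separate data-association step is needed, since each object's identity travels with its query.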