OVTR: End-to-End Open-Vocabulary Multiple Object Tracking with Transformer
- URL: http://arxiv.org/abs/2503.10616v3
- Date: Sun, 30 Mar 2025 17:15:53 GMT
- Title: OVTR: End-to-End Open-Vocabulary Multiple Object Tracking with Transformer
- Authors: Jinyang Li, En Yu, Sijia Chen, Wenbing Tao
- Abstract summary: Open-vocabulary multiple object tracking aims to generalize trackers to unseen categories during training. OVTR is the first end-to-end open-vocabulary tracker that models motion, appearance, and category simultaneously.
- Score: 25.963586473288764
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Open-vocabulary multiple object tracking aims to generalize trackers to unseen categories during training, enabling their application across a variety of real-world scenarios. However, the existing open-vocabulary tracker is constrained by its framework structure, isolated frame-level perception, and insufficient modal interactions, which hinder its performance in open-vocabulary classification and tracking. In this paper, we propose OVTR (End-to-End Open-Vocabulary Multiple Object Tracking with TRansformer), the first end-to-end open-vocabulary tracker that models motion, appearance, and category simultaneously. To achieve stable classification and continuous tracking, we design the CIP (Category Information Propagation) strategy, which establishes multiple high-level category information priors for subsequent frames. Additionally, we introduce a dual-branch structure for generalization capability and deep multimodal interaction, and incorporate protective strategies in the decoder to enhance performance. Experimental results show that our method surpasses previous trackers on the open-vocabulary MOT benchmark while also achieving faster inference speeds and significantly reducing preprocessing requirements. Moreover, the experiment transferring the model to another dataset demonstrates its strong adaptability. Models and code are released at https://github.com/jinyanglii/OVTR.
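The abstract names the CIP strategy but not its form. As a rough illustration only, here is a minimal sketch of what a CIP-style update could look like, assuming each track query carries a per-track category embedding that is blended with the current frame's decoded evidence; the module and parameter names are hypothetical, not from the paper.

```python
import torch
import torch.nn as nn

class CategoryInformationPropagation(nn.Module):
    """Hypothetical sketch (not the paper's code): each track query carries a
    category embedding that is propagated to the next frame as a prior,
    instead of re-classifying every frame in isolation."""

    def __init__(self, dim: int = 256, momentum: float = 0.9):
        super().__init__()
        self.momentum = momentum             # how strongly the prior persists
        self.fuse = nn.Linear(2 * dim, dim)  # mixes prior with current evidence

    def forward(self, prior_cat: torch.Tensor, frame_cat: torch.Tensor) -> torch.Tensor:
        # prior_cat: (num_tracks, dim) category embedding carried from frame t-1
        # frame_cat: (num_tracks, dim) category evidence decoded at frame t
        fused = self.fuse(torch.cat([prior_cat, frame_cat], dim=-1))
        # EMA-style blend keeps per-track classification stable across frames
        return self.momentum * prior_cat + (1 - self.momentum) * fused

# usage: carry `prior` across frames alongside the track queries
cip = CategoryInformationPropagation()
prior = torch.zeros(8, 256)            # 8 active tracks
frame_evidence = torch.randn(8, 256)   # decoder output for the current frame
prior = cip(prior, frame_evidence)
```

The EMA-style mix is one simple way to keep per-track classification stable across frames while still absorbing new evidence; the paper's actual propagation mechanism may differ.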
Related papers
- Attention to Trajectory: Trajectory-Aware Open-Vocabulary Tracking [23.65057966356924]
OV-MOT aims to enable approaches to track objects without being limited to a predefined set of categories. We propose TRACT, an open-vocabulary tracker that leverages trajectory information to improve both object association and classification in OV-MOT.
arXiv Detail & Related papers (2025-03-11T08:03:47Z)
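The TRACT summary above does not spell out how trajectory information aids classification. One plausible reading, sketched below under that assumption only, is to aggregate per-frame open-vocabulary class scores over a whole track; the function and shapes are invented for illustration.

```python
import torch

def trajectory_class_scores(per_frame_logits: torch.Tensor) -> torch.Tensor:
    """Illustrative only: smooth open-vocabulary class scores over one track.

    per_frame_logits: (num_frames, num_classes) similarity scores between a
    track's per-frame embeddings and the text embeddings of candidate classes.
    Averaging distributions over the track suppresses single-frame errors.
    """
    probs = per_frame_logits.softmax(dim=-1)  # per-frame class distributions
    return probs.mean(dim=0)                  # one distribution for the track

track_logits = torch.randn(20, 1000)  # a 20-frame track, large vocabulary
track_label = trajectory_class_scores(track_logits).argmax()
```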
- IP-MOT: Instance Prompt Learning for Cross-Domain Multi-Object Tracking [13.977088329815933]
Multi-Object Tracking (MOT) aims to associate multiple objects across video frames.
Most existing approaches train and track within a single domain, resulting in a lack of cross-domain generalizability.
We develop IP-MOT, an end-to-end transformer model for MOT that operates without concrete textual descriptions.
arXiv Detail & Related papers (2024-10-30T14:24:56Z)
- Frame Order Matters: A Temporal Sequence-Aware Model for Few-Shot Action Recognition [14.97527336050901]
We propose a novel Temporal Sequence-Aware Model (TSAM) for few-shot action recognition (FSAR).
It incorporates a sequential perceiver adapter into the pre-training framework to integrate both spatial information and sequential temporal dynamics into the feature embeddings.
Experimental results on five FSAR datasets demonstrate that our method sets a new benchmark, outperforming the second-best competitors by large margins.
arXiv Detail & Related papers (2024-08-22T15:13:27Z)
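The TSAM entry above hinges on a sequential perceiver adapter. Below is a minimal sketch of what such an adapter might look like, assuming learnable latents cross-attend to temporally position-encoded frame features; all names and sizes are illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn

class SequentialPerceiverAdapter(nn.Module):
    """Illustrative sketch: latents cross-attend to frame features that carry
    a temporal positional encoding, so frame order influences the embedding."""

    def __init__(self, dim: int = 512, num_latents: int = 16, max_frames: int = 64):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim) * 0.02)
        self.time_pos = nn.Parameter(torch.randn(max_frames, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (batch, num_frames, dim) per-frame backbone features
        b, t, _ = frame_feats.shape
        keys = frame_feats + self.time_pos[:t]          # inject frame order
        queries = self.latents.unsqueeze(0).expand(b, -1, -1)
        out, _ = self.cross_attn(queries, keys, keys)   # (batch, num_latents, dim)
        return out.mean(dim=1)                          # one sequence-aware embedding

adapter = SequentialPerceiverAdapter()
video = torch.randn(2, 8, 512)   # 2 clips, 8 frames each
emb = adapter(video)             # (2, 512)
```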
- CromSS: Cross-modal pre-training with noisy labels for remote sensing image segmentation [18.276988929148143]
We explore the potential of large-scale noisily labeled data to enhance feature learning by pretraining semantic segmentation models. Unlike conventional pretraining approaches, CromSS exploits massive amounts of noisy and easy-to-come-by labels for improved feature learning.
arXiv Detail & Related papers (2024-05-02T11:58:06Z)
- Exploiting Modality-Specific Features For Multi-Modal Manipulation Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks.
Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment.
We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
arXiv Detail & Related papers (2023-09-22T06:55:41Z)
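The implicit manipulation query in the entry above is described only at a high level. One plausible reading, sketched below, is a learnable query that cross-attends over one modality's tokens to pool global context; the class and its parameters are assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class ImplicitQueryPooling(nn.Module):
    """Illustrative sketch of an IMQ-style module: a learnable query vector
    attends over all tokens of one modality to aggregate global context."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len, dim) features of one modality (image or text)
        q = self.query.unsqueeze(0).expand(tokens.size(0), -1, -1)
        pooled, _ = self.attn(q, tokens, tokens)  # (batch, 1, dim)
        return pooled.squeeze(1)

pool = ImplicitQueryPooling()
image_tokens = torch.randn(4, 196, 256)  # e.g. 14x14 patch tokens
context = pool(image_tokens)             # (4, 256) global cue per sample
```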
- Dynamic Perceiver for Efficient Visual Recognition [87.08210214417309]
We propose Dynamic Perceiver (Dyn-Perceiver) to decouple the feature extraction procedure and the early classification task.
A feature branch serves to extract image features, while a classification branch processes a latent code assigned for classification tasks.
Early exits are placed exclusively within the classification branch, thus eliminating the need for linear separability in low-level features.
arXiv Detail & Related papers (2023-06-20T03:00:22Z)
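The two-branch, early-exit design in the Dyn-Perceiver entry above lends itself to a sketch. Below is a deliberately minimal, hypothetical version: a classification branch refines a latent code and hosts the exits, while the feature branch only extracts features. Layer counts, dimensions, and the exit rule are assumptions.

```python
import torch
import torch.nn as nn

class TwoBranchEarlyExit(nn.Module):
    """Illustrative sketch: a feature branch refines image features while a
    classification branch refines a latent code; exits sit only on the
    classification branch, so low-level features need not be linearly separable."""

    def __init__(self, dim: int = 256, num_stages: int = 3, num_classes: int = 1000):
        super().__init__()
        self.latent = nn.Parameter(torch.zeros(1, dim))
        self.feature_stages = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_stages))
        self.latent_stages = nn.ModuleList(nn.Linear(2 * dim, dim) for _ in range(num_stages))
        self.exits = nn.ModuleList(nn.Linear(dim, num_classes) for _ in range(num_stages))

    def forward(self, x: torch.Tensor, threshold: float = 0.9) -> torch.Tensor:
        # x: (batch, dim) stand-in for pooled image features
        z = self.latent.expand(x.size(0), -1)
        for feat, lat, exit_head in zip(self.feature_stages, self.latent_stages, self.exits):
            x = torch.relu(feat(x))                     # feature branch
            z = torch.relu(lat(torch.cat([x, z], -1)))  # latent absorbs features
            logits = exit_head(z)
            # simplification: a single batch-level confidence check
            if logits.softmax(-1).max() > threshold:
                return logits
        return logits

model = TwoBranchEarlyExit()
out = model(torch.randn(1, 256))  # exits early once any stage is confident
```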
- OVTrack: Open-Vocabulary Multiple Object Tracking [64.73379741435255]
OVTrack is an open-vocabulary tracker capable of tracking arbitrary object classes.
It sets a new state-of-the-art on the large-scale, large-vocabulary TAO benchmark.
arXiv Detail & Related papers (2023-04-17T16:20:05Z)
- Self-Supervised Representation Learning from Temporal Ordering of Automated Driving Sequences [49.91741677556553]
We propose TempO, a temporal ordering pretext task for pre-training region-level feature representations for perception tasks.
We embed each frame as an unordered set of proposal feature vectors, a representation that is natural for object detection or tracking systems.
Extensive evaluations on the BDD100K, nuImages, and MOT17 datasets show that our TempO pre-training approach outperforms single-frame self-supervised learning methods.
arXiv Detail & Related papers (2023-02-17T18:18:27Z)
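The TempO entry above describes frames embedded as unordered proposal sets plus a temporal-ordering pretext task. A toy sketch under those assumptions follows; treating ordering as classification over permutations is a guess at the task's shape, not the paper's exact formulation.

```python
import math
import torch
import torch.nn as nn

class TemporalOrderingHead(nn.Module):
    """Illustrative sketch: each frame is an unordered set of proposal
    features; the head scores candidate frame orders from permutation-
    invariant (set-pooled) frame embeddings."""

    def __init__(self, dim: int = 256, num_frames: int = 4):
        super().__init__()
        # one logit per possible ordering of the shuffled frames
        self.score = nn.Linear(dim * num_frames, math.factorial(num_frames))

    def forward(self, proposals: torch.Tensor) -> torch.Tensor:
        # proposals: (batch, num_frames, num_proposals, dim), frames shuffled
        frame_emb = proposals.mean(dim=2)  # set-pool: proposal order is irrelevant
        return self.score(frame_emb.flatten(1))

head = TemporalOrderingHead()
x = torch.randn(2, 4, 10, 256)  # 2 clips, 4 shuffled frames, 10 proposals each
logits = head(x)                # (2, 24); target is the true order's index
```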
- Compositional Prompt Tuning with Motion Cues for Open-vocabulary Video Relation Detection [67.64272825961395]
We present Relation Prompt (RePro) for Open-vocabulary Video Visual Relation Detection (Open-VidVRD).
RePro addresses the two technical challenges of Open-VidVRD: 1) the prompt tokens should respect the two different semantic roles of subject and object, and 2) the tuning should account for the diverse predicate-temporal motion patterns of the subject-object compositions.
arXiv Detail & Related papers (2023-02-01T06:20:54Z)
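RePro's two stated challenges suggest prompts with distinct subject and object roles, selected per motion pattern. The sketch below is a hypothetical assembly of such a compositional prompt; token counts and the pattern-selection interface are invented for illustration.

```python
import torch
import torch.nn as nn

class CompositionalPrompt(nn.Module):
    """Illustrative sketch: separate learnable prompt tokens for the subject
    and object roles, with one prompt set per coarse motion pattern."""

    def __init__(self, dim: int = 512, num_tokens: int = 4, num_patterns: int = 3):
        super().__init__()
        # one (subject, object) prompt pair per motion pattern
        self.subj = nn.Parameter(torch.randn(num_patterns, num_tokens, dim) * 0.02)
        self.obj = nn.Parameter(torch.randn(num_patterns, num_tokens, dim) * 0.02)

    def forward(self, subj_emb: torch.Tensor, obj_emb: torch.Tensor, pattern_id: int) -> torch.Tensor:
        # subj_emb / obj_emb: (dim,) word embeddings of the two category names
        return torch.cat([
            self.subj[pattern_id], subj_emb.unsqueeze(0),
            self.obj[pattern_id], obj_emb.unsqueeze(0),
        ], dim=0)  # token sequence for a frozen text encoder to score the predicate

prompt = CompositionalPrompt()
seq = prompt(torch.randn(512), torch.randn(512), pattern_id=1)  # (10, 512)
```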
- Backbone is All Your Need: A Simplified Architecture for Visual Object Tracking [69.08903927311283]
Existing tracking approaches rely on customized sub-modules and need prior knowledge for architecture selection.
This paper presents a simplified tracking architecture (SimTrack) by leveraging a transformer backbone for joint feature extraction and interaction.
Our SimTrack improves the baseline by 2.5%/2.6% AUC on LaSOT/TNL2K and achieves results competitive with specialized tracking algorithms without bells and whistles.
arXiv Detail & Related papers (2022-03-10T12:20:58Z)
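SimTrack's core idea, joint feature extraction and interaction in a single transformer backbone, can be sketched by concatenating template and search-region tokens before one shared encoder; the dimensions and toy box head below are placeholders, not the paper's architecture.

```python
import torch
import torch.nn as nn

class JointBackboneTracker(nn.Module):
    """Illustrative sketch: template and search tokens pass through one shared
    transformer, so feature extraction and interaction happen jointly."""

    def __init__(self, dim: int = 384, depth: int = 4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, nhead=6, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, 4)  # toy box head over the search tokens

    def forward(self, template_tokens: torch.Tensor, search_tokens: torch.Tensor) -> torch.Tensor:
        # template_tokens: (batch, n_t, dim); search_tokens: (batch, n_s, dim)
        n_t = template_tokens.size(1)
        joint = torch.cat([template_tokens, search_tokens], dim=1)
        feats = self.encoder(joint)                # attention mixes both token sets
        return self.head(feats[:, n_t:].mean(1))   # predict a box from search part

tracker = JointBackboneTracker()
box = tracker(torch.randn(1, 64, 384), torch.randn(1, 256, 384))  # (1, 4)
```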
This list is automatically generated from the titles and abstracts of the papers on this site.