TP-GMOT: Tracking Generic Multiple Object by Textual Prompt with Motion-Appearance Cost (MAC) SORT
- URL: http://arxiv.org/abs/2409.02490v1
- Date: Wed, 4 Sep 2024 07:33:09 GMT
- Title: TP-GMOT: Tracking Generic Multiple Object by Textual Prompt with Motion-Appearance Cost (MAC) SORT
- Authors: Duy Le Dinh Anh, Kim Hoang Tran, Ngan Hoang Le
- Abstract summary: Multi-Object Tracking (MOT) has made substantial advancements, but it is limited by heavy reliance on prior knowledge.
Generic Multiple Object Tracking (GMOT), tracking multiple objects with similar appearance, requires less prior information about the targets.
We introduce a novel text prompt-based open-vocabulary GMOT framework, called TP-GMOT.
Our contributions are benchmarked on the Refer-GMOT dataset for the GMOT task.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While Multi-Object Tracking (MOT) has made substantial advancements, it is limited by its heavy reliance on prior knowledge and its confinement to predefined categories. In contrast, Generic Multiple Object Tracking (GMOT), tracking multiple objects with similar appearance, requires less prior information about the targets but faces challenges with variations in viewpoint, lighting, occlusion, and resolution. Our contributions commence with the introduction of the Refer-GMOT dataset, a collection of videos, each accompanied by fine-grained textual descriptions of its attributes. Subsequently, we introduce a novel text prompt-based open-vocabulary GMOT framework, called TP-GMOT, which can track never-seen object categories with zero training examples. Within the TP-GMOT framework, we introduce two novel components: (i) TP-OD, object detection by textual prompt, for accurately detecting unseen objects with specific characteristics; and (ii) Motion-Appearance Cost SORT (MAC-SORT), a novel object association approach that adeptly integrates motion- and appearance-based matching strategies to tackle the complex task of tracking multiple generic objects with high similarity. Our contributions are benchmarked on the Refer-GMOT dataset for the GMOT task. Additionally, to assess the generalizability of the proposed TP-GMOT framework and the effectiveness of the MAC-SORT tracker, we conduct ablation studies on the DanceTrack and MOT20 datasets for the MOT task. Our dataset, code, and models will be publicly available at: https://fsoft-aic.github.io/TP-GMOT
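The abstract does not spell out how MAC-SORT combines its two cues, so the sketch below is only an illustrative rendering of a motion-appearance cost fusion for SORT-style association, not the authors' implementation: an IoU-based motion cost, a cosine appearance cost over re-identification embeddings, a fixed mixing weight `alpha` (an assumed hyperparameter), and Hungarian matching.

```python
# Minimal sketch of motion-appearance cost fusion for SORT-style association.
# NOT the authors' MAC-SORT: the IoU motion cost, cosine appearance cost,
# fixed weight `alpha`, and gating threshold `max_cost` are illustrative assumptions.
import numpy as np
from scipy.optimize import linear_sum_assignment


def iou_matrix(tracks, detections):
    """Pairwise IoU between track and detection boxes in (x1, y1, x2, y2) format."""
    ious = np.zeros((len(tracks), len(detections)))
    for i, t in enumerate(tracks):
        for j, d in enumerate(detections):
            x1, y1 = max(t[0], d[0]), max(t[1], d[1])
            x2, y2 = min(t[2], d[2]), min(t[3], d[3])
            inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
            area_t = (t[2] - t[0]) * (t[3] - t[1])
            area_d = (d[2] - d[0]) * (d[3] - d[1])
            ious[i, j] = inter / (area_t + area_d - inter + 1e-9)
    return ious


def associate(track_boxes, det_boxes, track_feats, det_feats, alpha=0.5, max_cost=0.8):
    """Match tracks to detections by a weighted sum of motion and appearance costs."""
    motion_cost = 1.0 - iou_matrix(track_boxes, det_boxes)      # low when boxes overlap
    t = track_feats / np.linalg.norm(track_feats, axis=1, keepdims=True)
    d = det_feats / np.linalg.norm(det_feats, axis=1, keepdims=True)
    appearance_cost = 1.0 - t @ d.T                             # cosine distance of embeddings
    cost = alpha * motion_cost + (1.0 - alpha) * appearance_cost
    rows, cols = linear_sum_assignment(cost)                    # Hungarian matching
    return [(r, c) for r, c in zip(rows, cols) if cost[r, c] <= max_cost]
```

For objects with near-identical appearance (the GMOT setting the abstract describes), a lower `alpha` would lean on appearance less and on motion more; the balance between the two cues is the design question such a tracker has to answer.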
Related papers
- Enhanced Kalman with Adaptive Appearance Motion SORT for Grounded Generic Multiple Object Tracking [0.08333024746293495]
Grounded-GMOT is an innovative tracking paradigm that enables users to track multiple generic objects in videos through natural language descriptors.
Our contributions begin with the introduction of the G2MOT dataset, which includes a collection of videos featuring a wide variety of generic objects.
Following this, we propose a novel tracking method, KAM-SORT, which not only effectively integrates visual appearance with motion cues but also enhances the Kalman filter.
arXiv Detail & Related papers (2024-10-11T20:38:17Z)
- OmniParser: A Unified Framework for Text Spotting, Key Information Extraction and Table Recognition [79.852642726105]
We propose a unified paradigm for parsing visually-situated text across diverse scenarios.
Specifically, we devise a universal model, called Omni, which can simultaneously handle three typical visually-situated text parsing tasks.
In Omni, all tasks share the unified encoder-decoder architecture, the unified objective point-conditioned text generation, and the unified input representation.
arXiv Detail & Related papers (2024-03-28T03:51:14Z)
- Siamese-DETR for Generic Multi-Object Tracking [16.853363984562602]
Traditional Multi-Object Tracking (MOT) is limited to tracking objects belonging to the pre-defined closed-set categories.
Siamese-DETR is proposed to track objects beyond pre-defined categories with the given text prompt and template image.
Siamese-DETR surpasses existing MOT methods on GMOT-40 dataset by a large margin.
arXiv Detail & Related papers (2023-10-27T03:32:05Z)
- Follow Anything: Open-set detection, tracking, and following in real-time [89.83421771766682]
We present a robotic system to detect, track, and follow any object in real-time.
Our approach, dubbed "follow anything" (FAn), is an open-vocabulary and multimodal model.
FAn can be deployed on a laptop with a lightweight (6-8 GB) graphics card, achieving a throughput of 6-20 frames per second.
arXiv Detail & Related papers (2023-08-10T17:57:06Z)
- TextFormer: A Query-based End-to-End Text Spotter with Mixed Supervision [61.186488081379]
We propose TextFormer, a query-based end-to-end text spotter with Transformer architecture.
TextFormer builds upon an image encoder and a text decoder to learn a joint semantic understanding for multi-task modeling.
It allows for mutual training and optimization of classification, segmentation, and recognition branches, resulting in deeper feature sharing.
arXiv Detail & Related papers (2023-06-06T03:37:41Z)
- Z-GMOT: Zero-shot Generic Multiple Object Tracking [8.878331472995498]
Multi-Object Tracking (MOT) faces limitations such as reliance on prior knowledge and predefined categories.
To address these issues, Generic Multiple Object Tracking (GMOT) has emerged as an alternative approach.
We propose Z-GMOT, a cutting-edge tracking solution capable of tracking objects from never-seen categories without the need for initial bounding boxes or predefined categories.
arXiv Detail & Related papers (2023-05-28T06:44:33Z)
- Type-to-Track: Retrieve Any Object via Prompt-based Tracking [34.859061177766016]
This paper introduces a novel paradigm for Multiple Object Tracking called Type-to-Track.
Type-to-Track allows users to track objects in videos by typing natural language descriptions.
We present a new dataset for the Grounded Multiple Object Tracking task, called GroOT.
arXiv Detail & Related papers (2023-05-22T21:25:27Z)
- OVTrack: Open-Vocabulary Multiple Object Tracking [64.73379741435255]
OVTrack is an open-vocabulary tracker capable of tracking arbitrary object classes.
It sets a new state-of-the-art on the large-scale, large-vocabulary TAO benchmark.
arXiv Detail & Related papers (2023-04-17T16:20:05Z)
- End-to-end Tracking with a Multi-query Transformer [96.13468602635082]
Multiple-object tracking (MOT) is a challenging task that requires simultaneous reasoning about location, appearance, and identity of the objects in the scene over time.
Our aim in this paper is to move beyond tracking-by-detection approaches, to class-agnostic tracking that performs well also for unknown object classes.
arXiv Detail & Related papers (2022-10-26T10:19:37Z)
- Contextual Text Block Detection towards Scene Text Understanding [85.40898487745272]
This paper presents contextual text detection, a new setup that detects contextual text blocks (CTBs) for better understanding of texts in scenes.
We formulate the new setup by a dual detection task which first detects integral text units and then groups them into a CTB.
To this end, we design a novel scene text clustering technique that treats integral text units as tokens and groups them (belonging to the same CTB) into an ordered token sequence.
arXiv Detail & Related papers (2022-07-26T14:59:25Z)
- MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding [40.24656027709833]
We propose MDETR, an end-to-end modulated detector that detects objects in an image conditioned on a raw text query.
We use a transformer-based architecture to reason jointly over text and image by fusing the two modalities at an early stage of the model.
Our approach can be easily extended for visual question answering, achieving competitive performance on GQA and CLEVR.
arXiv Detail & Related papers (2021-04-26T17:55:33Z)