VOVTrack: Exploring the Potentiality in Videos for Open-Vocabulary Object Tracking
- URL: http://arxiv.org/abs/2410.08529v1
- Date: Fri, 11 Oct 2024 05:01:49 GMT
- Title: VOVTrack: Exploring the Potentiality in Videos for Open-Vocabulary Object Tracking
- Authors: Zekun Qian, Ruize Han, Junhui Hou, Linqi Song, Wei Feng
- Abstract summary: Open-vocabulary multi-object tracking (OVMOT) combines the complexities of open-vocabulary object detection (OVD) and multi-object tracking (MOT).
Existing approaches to OVMOT often merge OVD and MOT methodologies as separate modules, predominantly focusing on the problem through an image-centric lens.
We propose VOVTrack, a novel method that integrates object states relevant to MOT and video-centric training to address this challenge from a video object tracking standpoint.
- Score: 61.56592503861093
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Open-vocabulary multi-object tracking (OVMOT) represents a critical new challenge involving the detection and tracking of diverse object categories in videos, encompassing both seen categories (base classes) and unseen categories (novel classes). This issue amalgamates the complexities of open-vocabulary object detection (OVD) and multi-object tracking (MOT). Existing approaches to OVMOT often merge OVD and MOT methodologies as separate modules, predominantly focusing on the problem through an image-centric lens. In this paper, we propose VOVTrack, a novel method that integrates object states relevant to MOT and video-centric training to address this challenge from a video object tracking standpoint. First, we consider the tracking-related state of the objects during tracking and propose a new prompt-guided attention mechanism for more accurate localization and classification (detection) of the time-varying objects. Subsequently, we leverage raw video data without annotations for training by formulating a self-supervised object similarity learning technique to facilitate temporal object association (tracking). Experimental results underscore that VOVTrack outperforms existing methods, establishing itself as a state-of-the-art solution for the open-vocabulary tracking task.
Related papers
- Zero-Shot Open-Vocabulary Tracking with Large Pre-Trained Models [28.304047711166056]
Large-scale pre-trained models have shown promising advances in detecting and segmenting objects in 2D static images in the wild.
This begs the question: can we re-purpose these large-scale pre-trained static image models for open-vocabulary video tracking?
In this paper, we re-purpose an open-vocabulary detector, segmenter, and dense optical flow estimator, into a model that tracks and segments objects of any category in 2D videos.
arXiv Detail & Related papers (2023-10-10T20:25:30Z)
- OVTrack: Open-Vocabulary Multiple Object Tracking [64.73379741435255]
OVTrack is an open-vocabulary tracker capable of tracking arbitrary object classes.
It sets a new state-of-the-art on the large-scale, large-vocabulary TAO benchmark.
arXiv Detail & Related papers (2023-04-17T16:20:05Z)
- Unifying Tracking and Image-Video Object Detection [54.91658924277527]
TrIVD (Tracking and Image-Video Detection) is the first framework that unifies image OD, video OD, and MOT within one end-to-end model.
To handle the discrepancies and semantic overlaps of category labels, TrIVD formulates detection/tracking as grounding and reasons about object categories.
arXiv Detail & Related papers (2022-11-20T20:30:28Z)
- End-to-end Tracking with a Multi-query Transformer [96.13468602635082]
Multiple-object tracking (MOT) is a challenging task that requires simultaneous reasoning about location, appearance, and identity of the objects in the scene over time.
Our aim in this paper is to move beyond tracking-by-detection approaches, to class-agnostic tracking that performs well also for unknown object classes.
arXiv Detail & Related papers (2022-10-26T10:19:37Z)
- Multi-modal Transformers Excel at Class-agnostic Object Detection [105.10403103027306]
We argue that existing methods lack a top-down supervision signal governed by human-understandable semantics.
We develop an efficient and flexible MViT architecture using multi-scale feature processing and deformable self-attention.
We show the significance of MViT proposals in a diverse range of applications.
arXiv Detail & Related papers (2021-11-22T18:59:29Z)
- Learning to Track Object Position through Occlusion [32.458623495840904]
Occlusion is one of the most significant challenges encountered by object detectors and trackers.
We propose a tracking-by-detection approach that builds upon the success of region based video object detectors.
Our approach achieves superior results on a dataset of furniture assembly videos collected from the internet.
arXiv Detail & Related papers (2021-06-20T22:29:46Z)
- TrackFormer: Multi-Object Tracking with Transformers [92.25832593088421]
TrackFormer is an end-to-end multi-object tracking and segmentation model based on an encoder-decoder Transformer architecture.
New track queries are spawned by the DETR object detector and embed the position of their corresponding object over time.
TrackFormer achieves a seamless data association between frames in a new tracking-by-attention paradigm.
arXiv Detail & Related papers (2021-01-07T18:59:29Z)
- IA-MOT: Instance-Aware Multi-Object Tracking with Motion Consistency [40.354708148590696]
"instance-aware MOT" (IA-MOT) can track multiple objects in either static or moving cameras.
Our proposed method won first place in Track 3 of the BMTT Challenge in the CVPR 2020 workshops.
arXiv Detail & Related papers (2020-06-24T03:53:36Z)