Type-to-Track: Retrieve Any Object via Prompt-based Tracking
- URL: http://arxiv.org/abs/2305.13495v3
- Date: Sat, 30 Sep 2023 18:58:41 GMT
- Title: Type-to-Track: Retrieve Any Object via Prompt-based Tracking
- Authors: Pha Nguyen, Kha Gia Quach, Kris Kitani, Khoa Luu
- Abstract summary: This paper introduces a novel paradigm for Multiple Object Tracking called Type-to-Track.
Type-to-Track allows users to track objects in videos by typing natural language descriptions.
We present a new dataset for this Grounded Multiple Object Tracking task, called GroOT.
- Score: 34.859061177766016
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: One of the recent trends in vision problems is to use natural language
captions to describe the objects of interest. This approach can overcome some
limitations of traditional methods that rely on bounding boxes or category
annotations. This paper introduces a novel paradigm for Multiple Object
Tracking called Type-to-Track, which allows users to track objects in videos by
typing natural language descriptions. We present a new dataset for this
Grounded Multiple Object Tracking task, called GroOT, that contains videos with
various types of objects and their corresponding textual captions describing
their appearance and action in detail. Additionally, we introduce two new
evaluation protocols and formulate evaluation metrics specifically for this
task. We develop a new efficient method that models a transformer-based
eMbed-ENcoDE-extRact framework (MENDER) using the third-order tensor
decomposition. Experiments in five scenarios show that our MENDER approach
outperforms an alternative two-stage design in both accuracy and efficiency,
by up to 14.7% in accuracy and up to 4× in speed.
Related papers
- InTraGen: Trajectory-controlled Video Generation for Object Interactions [100.79494904451246]
InTraGen is a pipeline for improved trajectory-based generation of object interaction scenarios.
Our results demonstrate improvements in both visual fidelity and quantitative performance.
arXiv Detail & Related papers (2024-11-25T14:27:50Z)
- VOVTrack: Exploring the Potentiality in Videos for Open-Vocabulary Object Tracking [61.56592503861093]
This issue amalgamates the complexities of open-vocabulary object detection (OVD) and multi-object tracking (MOT).
Existing approaches to OVMOT often merge OVD and MOT methodologies as separate modules, predominantly focusing on the problem through an image-centric lens.
We propose VOVTrack, a novel method that integrates object states relevant to MOT and video-centric training to address this challenge from a video object tracking standpoint.
arXiv Detail & Related papers (2024-10-11T05:01:49Z)
- TP-GMOT: Tracking Generic Multiple Object by Textual Prompt with Motion-Appearance Cost (MAC) SORT [0.0]
Multi-Object Tracking (MOT) has made substantial advancements, but it is limited by heavy reliance on prior knowledge.
Generic Multiple Object Tracking (GMOT), tracking multiple objects with similar appearance, requires less prior information about the targets.
We introduce a novel text prompt-based open-vocabulary GMOT framework, called TP-GMOT.
Our contributions are benchmarked on the Refer-GMOT dataset for the GMOT task.
arXiv Detail & Related papers (2024-09-04T07:33:09Z)
- Multi-Granularity Language-Guided Multi-Object Tracking [95.91263758294154]
We propose a new multi-object tracking framework, named LG-MOT, that explicitly leverages language information at different levels of granularity.
At inference, our LG-MOT uses the standard visual features without relying on annotated language descriptions.
Our LG-MOT achieves an absolute gain of 2.2% in terms of target object association (IDF1 score) compared to the baseline using only visual features.
arXiv Detail & Related papers (2024-06-07T11:18:40Z)
- Exploring Robust Features for Few-Shot Object Detection in Satellite Imagery [17.156864650143678]
We develop a few-shot object detector based on a traditional two-stage architecture.
A large-scale pre-trained model is used to build class-reference embeddings or prototypes.
We perform evaluations on two remote sensing datasets containing challenging and rare objects.
arXiv Detail & Related papers (2024-03-08T15:20:27Z)
- OVTrack: Open-Vocabulary Multiple Object Tracking [64.73379741435255]
OVTrack is an open-vocabulary tracker capable of tracking arbitrary object classes.
It sets a new state-of-the-art on the large-scale, large-vocabulary TAO benchmark.
arXiv Detail & Related papers (2023-04-17T16:20:05Z)
- Open-Vocabulary Object Detection using Pseudo Caption Labels [3.260777306556596]
We argue that more fine-grained labels are necessary to extract richer knowledge about novel objects.
Our best model trained on the de-duplicated VisualGenome dataset achieves an AP of 34.5 and an APr of 30.6, comparable to the state-of-the-art performance.
arXiv Detail & Related papers (2023-03-23T05:10:22Z)
- Referring Multi-Object Tracking [78.63827591797124]
We propose a new and general referring understanding task, termed referring multi-object tracking (RMOT).
Its core idea is to employ a language expression as a semantic cue to guide the prediction of multi-object tracking.
To the best of our knowledge, it is the first work to achieve an arbitrary number of referent object predictions in videos.
arXiv Detail & Related papers (2023-03-06T18:50:06Z)
- Segmenting Moving Objects via an Object-Centric Layered Representation [100.26138772664811]
We introduce an object-centric segmentation model with a depth-ordered layer representation.
We introduce a scalable pipeline for generating synthetic training data with multiple objects.
We evaluate the model on standard video segmentation benchmarks.
arXiv Detail & Related papers (2022-07-05T17:59:43Z)
- Interactive Multi-Class Tiny-Object Detection [11.243831167773678]
We propose a novel interactive annotation method for multiple instances of tiny objects from multiple classes.
Our approach, C3Det, relates the full image context with annotator inputs in a local and global manner.
Our approach outperforms existing approaches in interactive annotation, achieving higher mAP with fewer clicks.
arXiv Detail & Related papers (2022-03-29T06:27:29Z)
This list is automatically generated from the titles and abstracts of the papers on this site.