Type-to-Track: Retrieve Any Object via Prompt-based Tracking
- URL: http://arxiv.org/abs/2305.13495v3
- Date: Sat, 30 Sep 2023 18:58:41 GMT
- Title: Type-to-Track: Retrieve Any Object via Prompt-based Tracking
- Authors: Pha Nguyen, Kha Gia Quach, Kris Kitani, Khoa Luu
- Abstract summary: This paper introduces a novel paradigm for Multiple Object Tracking called Type-to-Track.
Type-to-Track allows users to track objects in videos by typing natural language descriptions.
We present a new dataset for this Grounded Multiple Object Tracking task, called GroOT.
- Score: 34.859061177766016
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: One of the recent trends in vision problems is to use natural language
captions to describe the objects of interest. This approach can overcome some
limitations of traditional methods that rely on bounding boxes or category
annotations. This paper introduces a novel paradigm for Multiple Object
Tracking called Type-to-Track, which allows users to track objects in videos by
typing natural language descriptions. We present a new dataset for this
Grounded Multiple Object Tracking task, called GroOT, which contains videos with
various types of objects and their corresponding textual captions describing
their appearance and action in detail. Additionally, we introduce two new
evaluation protocols and formulate evaluation metrics specifically for this
task. We develop a new, efficient method that models a transformer-based
eMbed-ENcoDE-extRact framework (MENDER) using third-order tensor
decomposition. Experiments in five scenarios show that our MENDER approach
outperforms an alternative two-stage design in both accuracy and efficiency,
with up to 14.7% higher accuracy and a 4$\times$ speed-up.
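The abstract names the core mechanism, third-order tensor decomposition inside a transformer-based embed-encode-extract pipeline, without giving details. As a rough illustration only, the Python sketch below shows how a third-order similarity tensor over prompt tokens, detections, and tracklets could be factorized into pairwise terms instead of being materialized; every variable name, shape, and the mean-pooling of prompt tokens is an assumption made for illustration, not the authors' MENDER implementation.

```python
# A minimal sketch (assumptions, not the authors' MENDER code) of replacing a
# materialized third-order prompt/detection/tracklet similarity tensor with
# its pairwise factors. All names and shapes are illustrative.
import numpy as np

rng = np.random.default_rng(0)
d = 256              # shared embedding dimension (assumed)
W, O, T = 8, 5, 4    # prompt tokens, detections, tracklets (assumed)

words = rng.standard_normal((W, d))    # prompt token embeddings
dets = rng.standard_normal((O, d))     # per-frame detection embeddings
tracks = rng.standard_normal((T, d))   # existing tracklet embeddings

# Naive route: build the full W x O x T tensor of joint similarities.
full = (words @ dets.T)[:, :, None] * (dets @ tracks.T)[None, :, :]

# Decomposed route: keep only the two pairwise factor matrices and contract
# over the prompt axis on demand, never storing the cubic tensor.
word_det = words @ dets.T              # (W, O) prompt-to-detection relevance
det_track = dets @ tracks.T            # (O, T) detection-to-tracklet affinity
prompt_weights = np.full(W, 1.0 / W)   # e.g. mean-pool the prompt tokens

affinity = (prompt_weights @ word_det)[:, None] * det_track   # (O, T)

# Sanity check: the contraction matches collapsing the full tensor.
assert np.allclose(affinity, np.einsum('w,wot->ot', prompt_weights, full))

# `affinity` could then drive a standard assignment step (e.g. Hungarian
# matching) to link prompt-relevant detections to tracklets.
print(affinity.shape)   # (5, 4)
```

The point of the factorization is that the cubic tensor never has to be stored; only the (W, O) and (O, T) factors are kept, which is the kind of saving the abstract's efficiency claim suggests, though the paper's actual decomposition may differ.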
Related papers
- VOVTrack: Exploring the Potentiality in Videos for Open-Vocabulary Object Tracking [61.56592503861093]
Open-vocabulary multi-object tracking (OVMOT) amalgamates the complexities of open-vocabulary object detection (OVD) and multi-object tracking (MOT).
Existing approaches to OVMOT often merge OVD and MOT methodologies as separate modules, predominantly focusing on the problem through an image-centric lens.
We propose VOVTrack, a novel method that integrates object states relevant to MOT and video-centric training to address this challenge from a video object tracking standpoint.
arXiv Detail & Related papers (2024-10-11T05:01:49Z)
- TP-GMOT: Tracking Generic Multiple Object by Textual Prompt with Motion-Appearance Cost (MAC) SORT [0.0]
Multi-Object Tracking (MOT) has made substantial advancements, but it is limited by heavy reliance on prior knowledge.
Generic Multiple Object Tracking (GMOT), which tracks multiple objects with similar appearance, requires less prior information about the targets.
We introduce a novel text prompt-based open-vocabulary GMOT framework, called TP-GMOT.
Our contributions are benchmarked on the Refer-GMOT dataset for the GMOT task.
arXiv Detail & Related papers (2024-09-04T07:33:09Z)
- Multi-Granularity Language-Guided Multi-Object Tracking [95.91263758294154]
We propose a new multi-object tracking framework, named LG-MOT, that explicitly leverages language information at different levels of granularity.
At inference, our LG-MOT uses the standard visual features without relying on annotated language descriptions.
Our LG-MOT achieves an absolute gain of 2.2% in terms of target object association (IDF1 score) compared to the baseline using only visual features.
arXiv Detail & Related papers (2024-06-07T11:18:40Z)
- Exploring Robust Features for Few-Shot Object Detection in Satellite Imagery [17.156864650143678]
We develop a few-shot object detector based on a traditional two-stage architecture.
A large-scale pre-trained model is used to build class-reference embeddings or prototypes.
We perform evaluations on two remote sensing datasets containing challenging and rare objects.
arXiv Detail & Related papers (2024-03-08T15:20:27Z)
- OVTrack: Open-Vocabulary Multiple Object Tracking [64.73379741435255]
OVTrack is an open-vocabulary tracker capable of tracking arbitrary object classes.
It sets a new state-of-the-art on the large-scale, large-vocabulary TAO benchmark.
arXiv Detail & Related papers (2023-04-17T16:20:05Z)
- Open-Vocabulary Object Detection using Pseudo Caption Labels [3.260777306556596]
We argue that more fine-grained labels are necessary to extract richer knowledge about novel objects.
Our best model trained on the de-duplicated VisualGenome dataset achieves an AP of 34.5 and an APr of 30.6, comparable to the state-of-the-art performance.
arXiv Detail & Related papers (2023-03-23T05:10:22Z)
- Referring Multi-Object Tracking [78.63827591797124]
We propose a new and general referring understanding task, termed referring multi-object tracking (RMOT).
Its core idea is to employ a language expression as a semantic cue to guide the prediction of multi-object tracking.
To the best of our knowledge, it is the first work to achieve an arbitrary number of referent object predictions in videos.
arXiv Detail & Related papers (2023-03-06T18:50:06Z)
- Learning Object-Language Alignments for Open-Vocabulary Object Detection [83.09560814244524]
We propose a novel open-vocabulary object detection framework that learns directly from image-text pair data.
It enables us to train an open-vocabulary object detector on image-text pairs in a simple and effective way.
arXiv Detail & Related papers (2022-11-27T14:47:31Z)
- Segmenting Moving Objects via an Object-Centric Layered Representation [100.26138772664811]
We introduce an object-centric segmentation model with a depth-ordered layer representation.
We introduce a scalable pipeline for generating synthetic training data with multiple objects.
We evaluate the model on standard video segmentation benchmarks.
arXiv Detail & Related papers (2022-07-05T17:59:43Z)
- Interactive Multi-Class Tiny-Object Detection [11.243831167773678]
We propose a novel interactive annotation method for multiple instances of tiny objects from multiple classes.
Our approach, C3Det, relates the full image context with annotator inputs in a local and global manner.
Our approach outperforms existing approaches in interactive annotation, achieving higher mAP with fewer clicks.
arXiv Detail & Related papers (2022-03-29T06:27:29Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.