Towards More Flexible and Accurate Object Tracking with Natural
Language: Algorithms and Benchmark
- URL: http://arxiv.org/abs/2103.16746v1
- Date: Wed, 31 Mar 2021 00:57:32 GMT
- Title: Towards More Flexible and Accurate Object Tracking with Natural
Language: Algorithms and Benchmark
- Authors: Xiao Wang, Xiujun Shu, Zhipeng Zhang, Bo Jiang, Yaowei Wang, Yonghong
Tian, Feng Wu
- Abstract summary: Tracking by natural language specification is an emerging research topic that aims to locate the target object in the video sequence based on its language description.
We propose a new benchmark specifically dedicated to tracking-by-language, including a large-scale dataset.
We also introduce two new challenges into TNL2K for the object tracking task, i.e., adversarial samples and modality switch.
- Score: 46.691218019908746
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Tracking by natural language specification is an emerging research
topic that aims to locate the target object in a video sequence based on its
language description. Compared with traditional bounding box (BBox) based
tracking, this setting guides object tracking with high-level semantic
information, addresses the ambiguity of the BBox, and links local and global
search together organically. These benefits may bring more flexible, robust,
and accurate tracking performance in practical scenarios. However, existing
natural language initialized trackers are developed and compared on benchmark
datasets proposed for tracking-by-BBox, which cannot reflect the true power of
tracking-by-language. In this work, we propose a new benchmark specifically
dedicated to tracking-by-language, including a large-scale dataset and strong,
diverse baseline methods. Specifically, we collect 2k video sequences
(containing a total of 1,244,340 frames and 663 words) and split them into
1,300/700 for training/testing, respectively. For each video, we densely
annotate one sentence in English and the corresponding bounding boxes of the
target object. We also introduce two new challenges into TNL2K for the object
tracking task, i.e., adversarial samples and modality switch. A strong
baseline method based on an adaptive local-global-search scheme is proposed
for future works to compare against. We believe this benchmark will greatly
boost related research on natural language guided tracking.
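The adaptive local-global-search baseline can be read as a switching policy: follow the target with a local tracker while its confidence stays high, and fall back to language-guided global grounding when confidence drops (e.g., under occlusion or a modality switch). Below is a minimal Python sketch of that loop; the names track_sequence, local_tracker.update, grounder, and CONF_THRESHOLD are hypothetical placeholders, not the paper's actual implementation.

```python
# Minimal sketch of an adaptive local-global-search tracking loop.
# All names and the confidence threshold are assumptions for illustration,
# not the paper's actual implementation.

CONF_THRESHOLD = 0.5  # assumed cutoff for switching from local to global search

def track_sequence(frames, language, init_bbox, local_tracker, grounder):
    """Track a target through `frames` given its natural language description.

    local_tracker: object with update(frame, bbox) -> (bbox, confidence),
                   i.e., any confidence-reporting local tracker.
    grounder:      callable (frame, language) -> bbox, a visual grounding
                   model that locates the described target in a full frame.
    """
    bbox = init_bbox
    results = [bbox]
    for frame in frames[1:]:
        new_bbox, conf = local_tracker.update(frame, bbox)
        if conf < CONF_THRESHOLD:
            # Local search is unreliable (drift, occlusion, modality switch):
            # re-locate the target globally using the language description.
            new_bbox = grounder(frame, language)
        bbox = new_bbox
        results.append(bbox)
    return results
```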
Related papers
- DTVLT: A Multi-modal Diverse Text Benchmark for Visual Language Tracking Based on LLM [23.551036494221222]
We propose a new visual language tracking benchmark with diverse texts, named DTVLT, based on five prominent VLT and SOT benchmarks.
We offer four granularities of text in our benchmark, considering the extent and density of semantic information.
We conduct comprehensive experimental analyses on DTVLT, evaluating the impact of diverse text on tracking performance.
arXiv Detail & Related papers (2024-10-03T13:57:07Z)
- Bootstrapping Referring Multi-Object Tracking [14.46285727127232]
Referring multi-object tracking (RMOT) aims at detecting and tracking multiple objects following human instruction represented by a natural language expression.
Our key idea is to bootstrap the task of referring multi-object tracking by introducing discriminative language words.
arXiv Detail & Related papers (2024-06-07T16:02:10Z)
- Unifying Visual and Vision-Language Tracking via Contrastive Learning [34.49865598433915]
Single object tracking aims to locate the target object in a video sequence according to different modal references.
Due to the gap between different modalities, most existing trackers are designed for only one, or a subset, of these reference settings.
We present a unified tracker called UVLTrack, which can simultaneously handle all three reference settings.
arXiv Detail & Related papers (2024-01-20T13:20:54Z)
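The contrastive learning named in the UVLTrack title is, in its generic form, an InfoNCE-style objective that pulls matched visual and language embeddings together in a shared space. The sketch below illustrates only that general technique; the loss form, temperature, and shapes are assumptions, not the paper's actual objective.

```python
import torch
import torch.nn.functional as F

def multimodal_contrastive_loss(visual_feats, text_feats, temperature=0.07):
    """Generic InfoNCE-style loss aligning visual and language embeddings.

    An illustration of the contrastive-alignment idea only; the actual
    objective used by UVLTrack may differ. Inputs are (batch, dim) tensors
    of matched visual/text pairs.
    """
    v = F.normalize(visual_feats, dim=-1)
    t = F.normalize(text_feats, dim=-1)
    logits = v @ t.T / temperature                      # pairwise similarities
    labels = torch.arange(v.shape[0], device=v.device)  # matches on diagonal
    # Symmetric cross-entropy: vision-to-text and text-to-vision directions.
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.T, labels)) / 2
```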
- Tracking with Human-Intent Reasoning [64.69229729784008]
This work proposes a new tracking task, Instruction Tracking.
It involves providing implicit tracking instructions that require the trackers to perform tracking automatically in video frames.
TrackGPT is capable of performing complex reasoning-based tracking.
arXiv Detail & Related papers (2023-12-29T03:22:18Z)
- Joint Visual Grounding and Tracking with Natural Language Specification [6.695284124073918]
Tracking by natural language specification aims to locate the referred target in a sequence based on the natural language description.
We propose a joint visual grounding and tracking framework, which reformulates grounding and tracking as a unified task.
Our method performs favorably against state-of-the-art algorithms for both tracking and grounding.
arXiv Detail & Related papers (2023-03-21T17:09:03Z)
- Referring Multi-Object Tracking [78.63827591797124]
We propose a new and general referring understanding task, termed referring multi-object tracking (RMOT).
Its core idea is to employ a language expression as a semantic cue to guide the prediction of multi-object tracking.
To the best of our knowledge, it is the first work to achieve an arbitrary number of referent object predictions in videos.
arXiv Detail & Related papers (2023-03-06T18:50:06Z)
- Tracking by Joint Local and Global Search: A Target-aware Attention based Approach [63.50045332644818]
We propose a novel target-aware attention mechanism (termed TANet) to conduct joint local and global search for robust tracking.
Specifically, we extract the features of the target object patch and continuous video frames, then feed them into a decoder network to generate target-aware global attention maps.
In the tracking procedure, we integrate the target-aware attention with multiple trackers by exploring candidate search regions for robust tracking.
arXiv Detail & Related papers (2021-06-09T06:54:15Z)
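As summarized above, the TANet pipeline encodes the target patch and incoming frames and decodes them into a target-aware global attention map whose high-response regions serve as candidate search regions for the trackers. A rough PyTorch-style sketch of such a module follows; every layer choice and shape here is an assumption for illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn

class TargetAwareAttention(nn.Module):
    """Rough sketch of a target-aware attention module.

    Layer choices and shapes are assumptions for illustration,
    not the architecture from the TANet paper.
    """

    def __init__(self, channels=256):
        super().__init__()
        self.encoder = nn.Conv2d(3, channels, kernel_size=3, padding=1)
        # Decoder fuses frame features with the target descriptor and emits
        # a single-channel attention map over the frame.
        self.decoder = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(channels, 1, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, target_patch, frame):
        # Encode the target patch and pool it to a global descriptor.
        t = self.encoder(target_patch).mean(dim=(2, 3), keepdim=True)
        f = self.encoder(frame)
        # Broadcast the target descriptor over the frame features and decode
        # a target-aware global attention map; its peaks would define the
        # candidate search regions handed to the local trackers.
        t = t.expand(-1, -1, f.shape[2], f.shape[3])
        return self.decoder(torch.cat([f, t], dim=1))
```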
- LaSOT: A High-quality Large-scale Single Object Tracking Benchmark [67.96196486540497]
We present LaSOT, a high-quality Large-scale Single Object Tracking benchmark.
LaSOT contains a diverse selection of 85 object classes and offers 1,550 videos totaling more than 3.87 million frames.
Each video frame is carefully and manually annotated with a bounding box. This makes LaSOT, to our knowledge, the largest densely annotated tracking benchmark.
arXiv Detail & Related papers (2020-09-08T00:31:56Z)
- TAO: A Large-Scale Benchmark for Tracking Any Object [95.87310116010185]
The Tracking Any Object (TAO) dataset consists of 2,907 high-resolution videos, captured in diverse environments, which are half a minute long on average.
We ask annotators to label objects that move at any point in the video, and give names to them post factum.
Our vocabulary is both significantly larger and qualitatively different from existing tracking datasets.
arXiv Detail & Related papers (2020-05-20T21:07:28Z)