Tracking with Human-Intent Reasoning
- URL: http://arxiv.org/abs/2312.17448v1
- Date: Fri, 29 Dec 2023 03:22:18 GMT
- Title: Tracking with Human-Intent Reasoning
- Authors: Jiawen Zhu, Zhi-Qi Cheng, Jun-Yan He, Chenyang Li, Bin Luo, Huchuan
Lu, Yifeng Geng, Xuansong Xie
- Abstract summary: This work proposes a new tracking task -- Instruction Tracking.
It involves providing implicit tracking instructions that require the trackers to perform tracking automatically in video frames.
TrackGPT is capable of performing complex reasoning-based tracking.
- Score: 64.69229729784008
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Advances in perception modeling have significantly improved the performance
of object tracking. However, the current methods for specifying the target
object in the initial frame are either by 1) using a box or mask template, or
by 2) providing an explicit language description. These approaches are cumbersome
and do not allow the tracker to have self-reasoning ability. Therefore, this
work proposes a new tracking task -- Instruction Tracking, which involves
providing implicit tracking instructions that require the trackers to perform
tracking automatically in video frames. To achieve this, we investigate the
integration of knowledge and reasoning capabilities from a Large
Vision-Language Model (LVLM) for object tracking. Specifically, we propose a
tracker called TrackGPT, which is capable of performing complex reasoning-based
tracking. TrackGPT first uses LVLM to understand tracking instructions and
condense the cues of what target to track into referring embeddings. The
perception component then generates the tracking results based on the
embeddings. To evaluate the performance of TrackGPT, we construct an
instruction tracking benchmark called InsTrack, which contains over one
thousand instruction-video pairs for instruction tuning and evaluation.
Experiments show that TrackGPT achieves competitive performance on referring
video object segmentation benchmarks, such as achieving a new state-of-the-art
performance of 66.5 $\mathcal{J}\&\mathcal{F}$ on Refer-DAVIS. It also
demonstrates a superior performance of instruction tracking under new
evaluation protocols. The code and models are available at
\href{https://github.com/jiawen-zhu/TrackGPT}{https://github.com/jiawen-zhu/TrackGPT}.
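The pipeline described in the abstract has two stages: an LVLM condenses the implicit instruction into a referring embedding, and a perception component then segments the target in each frame conditioned on that embedding. The following is a minimal sketch of this two-stage flow; the module names, dimensions, and conditioning scheme are hypothetical placeholders for illustration, not the actual TrackGPT implementation.

```python
import torch
import torch.nn as nn

class InstructionEncoder(nn.Module):
    """Stands in for the LVLM that reads an implicit instruction and
    condenses it into a referring embedding (hypothetical, not TrackGPT's API)."""
    def __init__(self, vocab_size=32000, dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, instruction_tokens):            # (B, L) token ids
        x = self.encoder(self.embed(instruction_tokens))
        return x.mean(dim=1)                           # (B, dim) referring embedding


class PerceptionHead(nn.Module):
    """Stands in for the perception component that predicts a target mask
    for each frame, conditioned on the referring embedding."""
    def __init__(self, dim=256):
        super().__init__()
        self.backbone = nn.Conv2d(3, dim, kernel_size=3, padding=1)
        self.mask_head = nn.Conv2d(dim, 1, kernel_size=1)

    def forward(self, frame, referring_embedding):     # frame: (B, 3, H, W)
        feats = self.backbone(frame)
        # Condition visual features on the instruction-derived embedding.
        feats = feats * referring_embedding[:, :, None, None]
        return self.mask_head(feats).sigmoid()         # (B, 1, H, W) target mask


# Usage: encode the instruction once, then track frame by frame.
encoder, head = InstructionEncoder(), PerceptionHead()
tokens = torch.randint(0, 32000, (1, 12))              # placeholder instruction tokens
referring = encoder(tokens)
for frame in torch.rand(4, 1, 3, 64, 64):              # four dummy RGB frames
    mask = head(frame, referring)
```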
Related papers
- HSTrack: Bootstrap End-to-End Multi-Camera 3D Multi-object Tracking with Hybrid Supervision [34.7347336548199]
In camera-based 3D multi-object tracking (MOT), the prevailing methods follow the tracking-by-query-propagation paradigm.
We present HSTrack, a novel plug-and-play method designed to co-facilitate multi-task learning for detection and tracking.
arXiv Detail & Related papers (2024-11-11T08:18:49Z)
- ChatTracker: Enhancing Visual Tracking Performance via Chatting with Multimodal Large Language Model [29.702895846058265]
Vision-Language (VL) trackers have been proposed to utilize additional natural language descriptions to enhance versatility in various applications.
However, VL trackers are still inferior to state-of-the-art (SoTA) visual trackers in terms of tracking performance.
We propose ChatTracker to leverage the wealth of world knowledge in the Multimodal Large Language Model (MLLM) to generate high-quality language descriptions.
arXiv Detail & Related papers (2024-11-04T02:43:55Z)
- OneTracker: Unifying Visual Object Tracking with Foundation Models and Efficient Tuning [33.521077115333696]
We present a general framework to unify various tracking tasks, termed OneTracker.
OneTracker first performs large-scale pre-training of an RGB tracker called Foundation Tracker.
We then regard other modality information as prompts and build a Prompt Tracker upon the Foundation Tracker.
arXiv Detail & Related papers (2024-03-14T17:59:13Z)
- CoTracker: It is Better to Track Together [70.63040730154984]
CoTracker is a transformer-based model that tracks a large number of 2D points in long video sequences.
We show that joint tracking significantly improves tracking accuracy and robustness, and allows CoTracker to track occluded points and points outside of the camera view.
arXiv Detail & Related papers (2023-07-14T21:13:04Z)
- OmniTracker: Unifying Object Tracking by Tracking-with-Detection [119.51012668709502]
OmniTracker is presented to resolve all the tracking tasks with a fully shared network architecture, model weights, and inference pipeline.
Experiments on 7 tracking datasets, including LaSOT, TrackingNet, DAVIS16-17, MOT17, MOTS20, and YTVIS19, demonstrate that OmniTracker achieves on-par or even better results than both task-specific and unified tracking models.
arXiv Detail & Related papers (2023-03-21T17:59:57Z)
- Context-aware Visual Tracking with Joint Meta-updating [11.226947525556813]
We propose a context-aware tracking model that optimizes the tracker over the representation space, jointly meta-updating both branches by exploiting information along the whole sequence.
The proposed tracking method achieves an EAO score of 0.514 on VOT2018 at 40 FPS, demonstrating its capability to improve the accuracy and robustness of the underlying tracker with little speed drop.
arXiv Detail & Related papers (2022-04-04T14:16:00Z)
- Learning Dynamic Compact Memory Embedding for Deformable Visual Object Tracking [82.34356879078955]
We propose a compact memory embedding to enhance the discrimination of the segmentation-based deformable visual tracking method.
Our method outperforms the excellent segmentation-based trackers D3S and SiamMask on the DAVIS 2017 benchmark.
arXiv Detail & Related papers (2021-11-23T03:07:12Z)
- Unsupervised Deep Representation Learning for Real-Time Tracking [137.69689503237893]
We propose an unsupervised learning method for visual tracking.
The motivation of our unsupervised learning is that a robust tracker should be effective in bidirectional tracking.
We build our framework on a Siamese correlation filter network, and propose a multi-frame validation scheme and a cost-sensitive loss to facilitate unsupervised learning.
arXiv Detail & Related papers (2020-07-22T08:23:12Z)
- Robust Visual Object Tracking with Two-Stream Residual Convolutional Networks [62.836429958476735]
We propose a Two-Stream Residual Convolutional Network (TS-RCN) for visual tracking.
Our TS-RCN can be integrated with existing deep learning based visual trackers.
To further improve the tracking performance, we adopt a "wider" residual network ResNeXt as its feature extraction backbone.
arXiv Detail & Related papers (2020-05-13T19:05:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.