Towards Unified Token Learning for Vision-Language Tracking
- URL: http://arxiv.org/abs/2308.14103v1
- Date: Sun, 27 Aug 2023 13:17:34 GMT
- Title: Towards Unified Token Learning for Vision-Language Tracking
- Authors: Yaozong Zheng and Bineng Zhong and Qihua Liang and Guorong Li and
Rongrong Ji and Xianxian Li
- Abstract summary: We present a vision-language (VL) tracking pipeline, termed MMTrack, which casts VL tracking as a token generation task.
Our proposed framework serializes the language description and bounding box into a sequence of discrete tokens.
In this new design paradigm, all token queries are required to perceive the desired target and directly predict spatial coordinates of the target.
- Score: 65.96561538356315
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we present a simple, flexible and effective vision-language
(VL) tracking pipeline, termed \textbf{MMTrack}, which casts VL tracking as a
token generation task. Traditional paradigms address the VL tracking task
indirectly with sophisticated prior designs, making them over-specialized to
the features of specific architectures or mechanisms. In contrast, our proposed
framework serializes the language description and bounding box into a sequence
of discrete tokens. In this new design paradigm, all token queries are required
to perceive the desired target and directly predict its spatial coordinates in
an auto-regressive manner. The design, free of other prior modules, avoids
learning multiple sub-tasks and hand-designing loss functions, significantly
reducing the complexity of VL tracking modeling and allowing our tracker to use
a simple cross-entropy loss as the unified optimization objective for the VL
tracking task. Extensive experiments on the TNL2K, LaSOT, LaSOT$_{\rm{ext}}$
and OTB99-Lang benchmarks show that our approach achieves promising results
compared to other state-of-the-art trackers.
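To make the token-generation idea concrete, here is a minimal sketch of how a bounding box can be serialized into discrete coordinate tokens and supervised with a single cross-entropy loss. The bin count, helper names, and shapes are illustrative assumptions, not MMTrack's released implementation:

```python
import torch

# Hypothetical setting: the real bin count / vocabulary layout are MMTrack
# design choices not reproduced here.
NUM_BINS = 1000  # quantization bins per coordinate (assumption)

def box_to_tokens(box_xyxy, img_w, img_h, num_bins=NUM_BINS):
    """Serialize a bounding box into discrete coordinate tokens.

    Each of (x1, y1, x2, y2) is normalized to [0, 1] and mapped to one of
    `num_bins` integer tokens, so the box becomes a short token sequence.
    """
    x1, y1, x2, y2 = box_xyxy
    norm = [x1 / img_w, y1 / img_h, x2 / img_w, y2 / img_h]
    return [min(int(v * num_bins), num_bins - 1) for v in norm]

def tokens_to_box(tokens, img_w, img_h, num_bins=NUM_BINS):
    """Invert the quantization (up to binning error)."""
    c = [(t + 0.5) / num_bins for t in tokens]
    return (c[0] * img_w, c[1] * img_h, c[2] * img_w, c[3] * img_h)

# One cross-entropy objective over the coordinate tokens, matching the
# unified-loss claim. In the full model, `logits` would come from an
# auto-regressive decoder conditioned on vision + language tokens (not shown).
logits = torch.randn(4, NUM_BINS)  # one decoding step per coordinate
target = torch.tensor(box_to_tokens((48, 32, 240, 200), 320, 240))
loss = torch.nn.functional.cross_entropy(logits, target)
print(loss.item())
```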
Related papers
- Hierarchical IoU Tracking based on Interval [21.555469501789577]
Multi-Object Tracking (MOT) aims to detect and associate all targets of given classes across frames.
We propose the Hierarchical IoU Tracking framework, dubbed HIT, which achieves unified hierarchical tracking by utilizing tracklet intervals as priors.
Our method achieves promising performance on four datasets, i.e., MOT17, KITTI, DanceTrack and VisDrone.
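As a point of reference, the generic IoU-matching step that IoU-based trackers build on can be sketched as below; HIT's hierarchical, interval-based machinery is more involved, and the helper names and threshold here are our own assumptions:

```python
def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def greedy_iou_match(tracks, detections, thresh=0.5):
    """Greedy IoU association: assign each detection to the best free track."""
    pairs = sorted(((iou(t, d), ti, di) for ti, t in enumerate(tracks)
                    for di, d in enumerate(detections)), reverse=True)
    used_t, used_d, matches = set(), set(), []
    for score, ti, di in pairs:
        if score < thresh:
            break
        if ti not in used_t and di not in used_d:
            matches.append((ti, di))
            used_t.add(ti)
            used_d.add(di)
    return matches
```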
arXiv Detail & Related papers (2024-06-19T07:03:18Z) - Beyond Visual Cues: Synchronously Exploring Target-Centric Semantics for
Vision-Language Tracking [3.416427651955299]
Single object tracking aims to locate one specific target in video sequences, given its initial state. Vision-Language (VL) tracking has emerged as a promising approach.
We present a novel tracker that progressively explores target-centric semantics for VL tracking.
arXiv Detail & Related papers (2023-11-28T02:28:12Z) - Single-Shot and Multi-Shot Feature Learning for Multi-Object Tracking [55.13878429987136]
We propose a simple yet effective two-stage feature learning paradigm to jointly learn single-shot and multi-shot features for different targets.
Our method has achieved significant improvements on MOT17 and MOT20 datasets while reaching state-of-the-art performance on DanceTrack dataset.
arXiv Detail & Related papers (2023-11-17T08:17:49Z) - All in One: Exploring Unified Vision-Language Tracking with Multi-Modal
Alignment [23.486297020327257]
A typical vision-language (VL) tracking framework consists of three parts, i.e., a visual feature extractor, a language feature extractor, and a fusion model.
We propose an All-in-One framework, which learns joint feature extraction and interaction by adopting a unified transformer backbone.
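A minimal sketch of the unified-backbone idea, with illustrative dimensions and layer counts rather than the paper's configuration: visual patch tokens and language tokens are concatenated and processed by a single transformer, so feature extraction and cross-modal interaction happen jointly.

```python
import torch
import torch.nn as nn

dim = 256  # embedding width (assumption)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
    num_layers=4,
)

patch_tokens = torch.randn(1, 196, dim)  # e.g. 14x14 image patches, already embedded
text_tokens = torch.randn(1, 12, dim)    # embedded language description
joint = torch.cat([patch_tokens, text_tokens], dim=1)
fused = encoder(joint)                   # one backbone does extraction + interaction
print(fused.shape)                       # torch.Size([1, 208, 256])
```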
arXiv Detail & Related papers (2023-07-07T03:51:21Z) - End-to-end Tracking with a Multi-query Transformer [96.13468602635082]
Multiple-object tracking (MOT) is a challenging task that requires simultaneous reasoning about location, appearance, and identity of the objects in the scene over time.
Our aim in this paper is to move beyond tracking-by-detection approaches, towards class-agnostic tracking that also performs well for unknown object classes.
arXiv Detail & Related papers (2022-10-26T10:19:37Z) - Towards Sequence-Level Training for Visual Tracking [60.95799261482857]
This work introduces a sequence-level training strategy for visual tracking based on reinforcement learning.
Four representative tracking models, SiamRPN++, SiamAttn, TransT, and TrDiMP, consistently improve when the proposed method is incorporated into training.
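A schematic of sequence-level training with a policy-gradient (REINFORCE-style) surrogate loss; the sampling scheme, reward, and baseline below are placeholder assumptions, not the paper's exact strategy:

```python
import torch

# Placeholder inputs: in practice, log_probs come from sampling boxes during
# tracking rollouts and rewards from a sequence-level metric such as IoU.
log_probs = torch.randn(8, requires_grad=True)    # log-prob of sampled action per frame
rewards = torch.rand(8)                           # e.g. per-frame IoU of sampled boxes
baseline = rewards.mean()                         # simple variance-reduction baseline
loss = -((rewards - baseline) * log_probs).sum()  # policy-gradient surrogate loss
loss.backward()
```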
arXiv Detail & Related papers (2022-08-11T13:15:36Z) - Transformer-based assignment decision network for multiple object
tracking [0.0]
We introduce the Transformer-based Assignment Decision Network (TADN), which tackles data association without the need for explicit optimization during inference.
Our proposed approach outperforms the state-of-the-art in most evaluation metrics despite its simple nature as a tracker.
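The core idea can be sketched as follows: a learned network emits track-detection affinities and the assignment is read off directly, with no Hungarian solver at inference. Shapes and the acceptance threshold are illustrative assumptions:

```python
import torch

num_tracks, num_dets = 5, 6
affinity = torch.randn(num_tracks, num_dets)  # would come from a transformer decoder

best = affinity.argmax(dim=1)                     # each track picks its best detection
scores = affinity.gather(1, best[:, None]).squeeze(1)
matches = [(t, d.item())                          # keep only confident assignments
           for t, (d, s) in enumerate(zip(best, scores)) if s > 0.0]
print(matches)
```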
arXiv Detail & Related papers (2022-08-06T19:47:32Z) - X2Parser: Cross-Lingual and Cross-Domain Framework for Task-Oriented
Compositional Semantic Parsing [51.81533991497547]
Task-oriented compositional semantic parsing (TCSP) handles complex nested user queries.
We present X2Parser, a transferable Cross-lingual and Cross-domain Parser for TCSP.
We propose to predict flattened intents and slots representations separately and cast both prediction tasks into sequence labeling problems.
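A minimal sketch of casting both predictions as token-level sequence labeling, with two heads over a shared encoder; the label sets and sizes below are made up for illustration:

```python
import torch
import torch.nn as nn

hidden, num_slot_tags, num_intent_tags, seq_len = 128, 9, 7, 10

slot_head = nn.Linear(hidden, num_slot_tags)      # e.g. BIO-style slot labels
intent_head = nn.Linear(hidden, num_intent_tags)  # flattened intent labels

token_states = torch.randn(1, seq_len, hidden)    # from any shared encoder
slot_logits = slot_head(token_states)             # (1, seq_len, num_slot_tags)
intent_logits = intent_head(token_states)         # (1, seq_len, num_intent_tags)

# Each task is trained with its own token-level cross-entropy.
slot_loss = nn.functional.cross_entropy(
    slot_logits.view(-1, num_slot_tags),
    torch.randint(num_slot_tags, (seq_len,)))
print(slot_loss.item())
```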
arXiv Detail & Related papers (2021-06-07T16:40:05Z) - A Unified Object Motion and Affinity Model for Online Multi-Object
Tracking [127.5229859255719]
We propose a novel MOT framework, named UMA, that unifies object motion and affinity modeling in a single network.
UMA integrates single object tracking and metric learning into a unified triplet network by means of multi-task learning.
We equip our model with a task-specific attention module, which is used to boost task-aware feature learning.
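As a sketch of the metric-learning component, a standard triplet loss pulls a target embedding toward its true match and away from distractors; the margin and embedding size are illustrative assumptions, not UMA's exact setup:

```python
import torch
import torch.nn as nn

triplet = nn.TripletMarginLoss(margin=1.0)
anchor = torch.randn(4, 128)    # embeddings of tracked targets
positive = torch.randn(4, 128)  # same identities in another frame
negative = torch.randn(4, 128)  # different identities / distractors
loss = triplet(anchor, positive, negative)
print(loss.item())
```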
arXiv Detail & Related papers (2020-03-25T09:36:43Z)