Towards Unified Token Learning for Vision-Language Tracking
- URL: http://arxiv.org/abs/2308.14103v1
- Date: Sun, 27 Aug 2023 13:17:34 GMT
- Title: Towards Unified Token Learning for Vision-Language Tracking
- Authors: Yaozong Zheng and Bineng Zhong and Qihua Liang and Guorong Li and Rongrong Ji and Xianxian Li
- Abstract summary: We present a vision-language (VL) tracking pipeline, termed MMTrack, which casts VL tracking as a token generation task.
Our proposed framework serializes language description and bounding box into a sequence of discrete tokens.
In this new design paradigm, all token queries are required to perceive the desired target and directly predict spatial coordinates of the target.
- Score: 65.96561538356315
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we present a simple, flexible and effective vision-language
(VL) tracking pipeline, termed \textbf{MMTrack}, which casts VL tracking as a
token generation task. Traditional paradigms address the VL tracking task
indirectly with sophisticated prior designs, making them over-specialized to the
features of specific architectures or mechanisms. In contrast, our proposed
framework serializes the language description and bounding box into a sequence of
discrete tokens. In this new design paradigm, all token queries are required to
perceive the desired target and directly predict its spatial coordinates in an
auto-regressive manner. Dispensing with other prior modules, this design avoids
learning multiple sub-tasks and hand-designed loss functions, significantly
reducing the complexity of VL tracking modeling and allowing our tracker to use
a simple cross-entropy loss as the unified optimization objective for the VL
tracking task. Extensive experiments on the TNL2K, LaSOT, LaSOT$_{\rm{ext}}$
and OTB99-Lang benchmarks show that our approach achieves promising results
compared with other state-of-the-art trackers.
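To make the token-generation formulation concrete, the sketch below shows how a bounding box can be serialized into discrete coordinate tokens and supervised with a single cross-entropy loss. This is a minimal illustration based only on the abstract, not the authors' released code; the bin count and all function names are assumptions.

```python
import torch
import torch.nn.functional as F

# Assumed vocabulary size; the abstract does not state the actual bin count.
NUM_BINS = 1000  # discrete bins per coordinate

def box_to_tokens(box_xyxy, img_w, img_h, num_bins=NUM_BINS):
    """Quantize an (x1, y1, x2, y2) box into 4 discrete coordinate tokens."""
    x1, y1, x2, y2 = box_xyxy
    norm = torch.tensor([x1 / img_w, y1 / img_h, x2 / img_w, y2 / img_h])
    return (norm.clamp(0.0, 1.0) * (num_bins - 1)).round().long()

def tokens_to_box(tokens, img_w, img_h, num_bins=NUM_BINS):
    """De-quantize 4 coordinate tokens back to continuous box coordinates."""
    norm = tokens.float() / (num_bins - 1)
    return norm * torch.tensor([float(img_w), float(img_h), float(img_w), float(img_h)])

def token_loss(logits, target_tokens):
    """Unified objective: cross-entropy over the predicted token sequence,
    in place of hand-designed regression/IoU losses.
    logits: (seq_len, vocab_size); target_tokens: (seq_len,)."""
    return F.cross_entropy(logits, target_tokens)

# Round trip: the recovered box matches the input up to one bin width.
tokens = box_to_tokens((100.0, 120.0, 260.0, 280.0), img_w=640, img_h=480)
print(tokens)                                    # tensor([156, 250, 406, 583])
print(tokens_to_box(tokens, img_w=640, img_h=480))
```

At inference, such coordinate tokens would be emitted auto-regressively from the vision-language features and de-quantized into the final box, which is how a plain token-level cross-entropy can serve as the only training objective.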
Related papers
- Less is More: Token Context-aware Learning for Object Tracking [20.222950380244377]
LMTrack is a token context-aware tracking pipeline.
It automatically learns high-quality reference tokens for efficient visual tracking.
It achieves state-of-the-art results on tracking benchmarks such as GOT-10K, TrackingNet, and LaSOT.
arXiv Detail & Related papers (2025-01-01T07:05:31Z)
- SynCL: A Synergistic Training Strategy with Instance-Aware Contrastive Learning for End-to-End Multi-Camera 3D Tracking [34.90147791481045]
SynCL is a novel plug-and-play synergistic training strategy designed to co-facilitate multi-task learning for detection and tracking.
We show that SynCL consistently delivers improvements when integrated with the training stage of various query-based 3D MOT trackers.
Without additional inference costs, SynCL improves the state-of-the-art PF-Track method by $+3.9\%$ AMOTA and $+2.0\%$ NDS on the nuScenes dataset.
arXiv Detail & Related papers (2024-11-11T08:18:49Z)
- Hierarchical IoU Tracking based on Interval [21.555469501789577]
Multi-Object Tracking (MOT) aims to detect and associate all targets of given classes across frames.
We propose the Hierarchical IoU Tracking framework, dubbed HIT, which achieves unified hierarchical tracking by utilizing tracklet intervals as priors.
Our method achieves promising performance on four datasets, i.e., MOT17, KITTI, DanceTrack and VisDrone.
arXiv Detail & Related papers (2024-06-19T07:03:18Z)
- Beyond Visual Cues: Synchronously Exploring Target-Centric Semantics for Vision-Language Tracking [3.416427651955299]
Single object tracking aims to locate one specific target in video sequences, given its initial state. Vision-Language (VL) tracking has emerged as a promising approach.
We present a novel tracker that progressively explores target-centric semantics for VL tracking.
arXiv Detail & Related papers (2023-11-28T02:28:12Z)
- Single-Shot and Multi-Shot Feature Learning for Multi-Object Tracking [55.13878429987136]
We propose a simple yet effective two-stage feature learning paradigm to jointly learn single-shot and multi-shot features for different targets.
Our method achieves significant improvements on the MOT17 and MOT20 datasets and reaches state-of-the-art performance on the DanceTrack dataset.
arXiv Detail & Related papers (2023-11-17T08:17:49Z)
- All in One: Exploring Unified Vision-Language Tracking with Multi-Modal Alignment [23.486297020327257]
The current vision-language (VL) tracking framework consists of three parts, i.e., a visual feature extractor, a language feature extractor, and a fusion model.
We propose an All-in-One framework, which learns joint feature extraction and interaction by adopting a unified transformer backbone.
arXiv Detail & Related papers (2023-07-07T03:51:21Z)
- End-to-end Tracking with a Multi-query Transformer [96.13468602635082]
Multiple-object tracking (MOT) is a challenging task that requires simultaneous reasoning about location, appearance, and identity of the objects in the scene over time.
Our aim in this paper is to move beyond tracking-by-detection approaches, to class-agnostic tracking that performs well also for unknown object classes.
arXiv Detail & Related papers (2022-10-26T10:19:37Z)
- Towards Sequence-Level Training for Visual Tracking [60.95799261482857]
This work introduces a sequence-level training strategy for visual tracking based on reinforcement learning.
Four representative tracking models, SiamRPN++, SiamAttn, TransT, and TrDiMP, consistently improve when the proposed method is incorporated into training.
arXiv Detail & Related papers (2022-08-11T13:15:36Z)
- X2Parser: Cross-Lingual and Cross-Domain Framework for Task-Oriented Compositional Semantic Parsing [51.81533991497547]
Task-oriented compositional semantic parsing (TCSP) handles complex nested user queries.
We present X2Parser, a transferable Cross-lingual and Cross-domain Parser for TCSP.
We propose to predict flattened intents and slots representations separately and cast both prediction tasks into sequence labeling problems.
arXiv Detail & Related papers (2021-06-07T16:40:05Z)
- A Unified Object Motion and Affinity Model for Online Multi-Object Tracking [127.5229859255719]
We propose a novel MOT framework, named UMA, that unifies the object motion and affinity models into a single network.
UMA integrates single object tracking and metric learning into a unified triplet network by means of multi-task learning.
We equip our model with a task-specific attention module, which is used to boost task-aware feature learning.
arXiv Detail & Related papers (2020-03-25T09:36:43Z)
This list is automatically generated from the titles and abstracts of the papers on this site.