Towards Unified Token Learning for Vision-Language Tracking
- URL: http://arxiv.org/abs/2308.14103v1
- Date: Sun, 27 Aug 2023 13:17:34 GMT
- Title: Towards Unified Token Learning for Vision-Language Tracking
- Authors: Yaozong Zheng and Bineng Zhong and Qihua Liang and Guorong Li and Rongrong Ji and Xianxian Li
- Abstract summary: We present a vision-language (VL) tracking pipeline, termed MMTrack, which casts VL tracking as a token generation task.
Our proposed framework serializes language description and bounding box into a sequence of discrete tokens.
In this new design paradigm, all token queries are required to perceive the desired target and directly predict spatial coordinates of the target.
- Score: 65.96561538356315
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we present a simple, flexible and effective vision-language
(VL) tracking pipeline, termed \textbf{MMTrack}, which casts VL tracking as a
token generation task. Traditional paradigms address the VL tracking task
indirectly with sophisticated prior designs, making them over-specialized to the
features of specific architectures or mechanisms. In contrast, our proposed
framework serializes the language description and bounding box into a sequence of
discrete tokens. In this new design paradigm, all token queries are required to
perceive the desired target and directly predict its spatial coordinates in an
auto-regressive manner. Dispensing with other prior modules, this design avoids
learning multiple sub-tasks and hand-designed loss functions, significantly
reducing the complexity of VL tracking modeling and allowing our tracker to use
a simple cross-entropy loss as the unified optimization objective for the VL
tracking task. Extensive experiments on the TNL2K, LaSOT, LaSOT$_{\rm{ext}}$
and OTB99-Lang benchmarks show that our approach achieves promising results
compared with other state-of-the-art trackers.
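To make the token-generation formulation concrete, the sketch below shows how a bounding box can be serialized into discrete coordinate tokens and supervised with a single cross-entropy loss. This is a minimal illustration based only on the abstract, not the authors' released code; the bin count and all function names are assumptions.

```python
import torch
import torch.nn.functional as F

# Assumed vocabulary size; the abstract does not state the actual bin count.
NUM_BINS = 1000  # discrete bins per coordinate

def box_to_tokens(box_xyxy, img_w, img_h, num_bins=NUM_BINS):
    """Quantize an (x1, y1, x2, y2) box into 4 discrete coordinate tokens."""
    x1, y1, x2, y2 = box_xyxy
    norm = torch.tensor([x1 / img_w, y1 / img_h, x2 / img_w, y2 / img_h])
    return (norm.clamp(0.0, 1.0) * (num_bins - 1)).round().long()

def tokens_to_box(tokens, img_w, img_h, num_bins=NUM_BINS):
    """De-quantize 4 coordinate tokens back to continuous box coordinates."""
    norm = tokens.float() / (num_bins - 1)
    return norm * torch.tensor([float(img_w), float(img_h), float(img_w), float(img_h)])

def token_loss(logits, target_tokens):
    """Unified objective: cross-entropy over the predicted token sequence,
    in place of hand-designed regression/IoU losses.
    logits: (seq_len, vocab_size); target_tokens: (seq_len,)."""
    return F.cross_entropy(logits, target_tokens)

# Round trip: the recovered box matches the input up to one bin width.
tokens = box_to_tokens((100.0, 120.0, 260.0, 280.0), img_w=640, img_h=480)
print(tokens)                                    # tensor([156, 250, 406, 583])
print(tokens_to_box(tokens, img_w=640, img_h=480))
```

At inference, such coordinate tokens would be emitted auto-regressively from the vision-language features and de-quantized into the final box, which is how a plain token-level cross-entropy can serve as the only training objective.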
Related papers
- Less is More: Token Context-aware Learning for Object Tracking [20.222950380244377]
LMTrack is a token context-aware tracking pipeline.
It automatically learns high-quality reference tokens for efficient visual tracking.
It achieves state-of-the-art results on tracking benchmarks such as GOT-10K, TrackingNet, and LaSOT.
arXiv Detail & Related papers (2025-01-01T07:05:31Z)
- SynCL: A Synergistic Training Strategy with Instance-Aware Contrastive Learning for End-to-End Multi-Camera 3D Tracking [34.90147791481045]
SynCL is a novel plug-and-play synergistic training strategy designed to co-facilitate multi-task learning for detection and tracking.
We show that SynCL consistently delivers improvements when integrated with the training stage of various query-based 3D MOT trackers.
Without additional inference costs, SynCL improves the state-of-the-art PF-Track method by $+3.9\%$ AMOTA and $+2.0\%$ NDS on the nuScenes dataset.
arXiv Detail & Related papers (2024-11-11T08:18:49Z)
- Hierarchical IoU Tracking based on Interval [21.555469501789577]
Multi-Object Tracking (MOT) aims to detect and associate all targets of given classes across frames.
We propose the Hierarchical IoU Tracking framework, dubbed HIT, which achieves unified hierarchical tracking by utilizing tracklet intervals as priors.
Our method achieves promising performance on four datasets, i.e., MOT17, KITTI, DanceTrack and VisDrone.
arXiv Detail & Related papers (2024-06-19T07:03:18Z)
- Beyond Visual Cues: Synchronously Exploring Target-Centric Semantics for Vision-Language Tracking [3.416427651955299]
Single object tracking aims to locate one specific target in video sequences, given its initial state. Vision-Language (VL) tracking has emerged as a promising approach.
We present a novel tracker that progressively explores target-centric semantics for VL tracking.
arXiv Detail & Related papers (2023-11-28T02:28:12Z)
- Single-Shot and Multi-Shot Feature Learning for Multi-Object Tracking [55.13878429987136]
We propose a simple yet effective two-stage feature learning paradigm to jointly learn single-shot and multi-shot features for different targets.
Our method achieves significant improvements on the MOT17 and MOT20 datasets and reaches state-of-the-art performance on the DanceTrack dataset.
arXiv Detail & Related papers (2023-11-17T08:17:49Z)
- All in One: Exploring Unified Vision-Language Tracking with Multi-Modal Alignment [23.486297020327257]
The current vision-language (VL) tracking framework consists of three parts, i.e., a visual feature extractor, a language feature extractor, and a fusion model.
We propose an All-in-One framework, which learns joint feature extraction and interaction by adopting a unified transformer backbone.
arXiv Detail & Related papers (2023-07-07T03:51:21Z)
- End-to-end Tracking with a Multi-query Transformer [96.13468602635082]
Multiple-object tracking (MOT) is a challenging task that requires simultaneous reasoning about location, appearance, and identity of the objects in the scene over time.
Our aim in this paper is to move beyond tracking-by-detection approaches, to class-agnostic tracking that performs well also for unknown object classes.
arXiv Detail & Related papers (2022-10-26T10:19:37Z)
- Towards Sequence-Level Training for Visual Tracking [60.95799261482857]
This work introduces a sequence-level training strategy for visual tracking based on reinforcement learning.
Four representative tracking models, SiamRPN++, SiamAttn, TransT, and TrDiMP, consistently improve when the proposed method is incorporated into training.
arXiv Detail & Related papers (2022-08-11T13:15:36Z)
- X2Parser: Cross-Lingual and Cross-Domain Framework for Task-Oriented Compositional Semantic Parsing [51.81533991497547]
Task-oriented compositional semantic parsing (TCSP) handles complex nested user queries.
We present X2Parser, a transferable Cross-lingual and Cross-domain Parser for TCSP.
We propose to predict flattened intents and slots representations separately and cast both prediction tasks into sequence labeling problems.
arXiv Detail & Related papers (2021-06-07T16:40:05Z)
- A Unified Object Motion and Affinity Model for Online Multi-Object Tracking [127.5229859255719]
We propose a novel MOT framework, named UMA, that unifies the object motion and affinity models into a single network.
UMA integrates single object tracking and metric learning into a unified triplet network by means of multi-task learning.
We equip our model with a task-specific attention module, which is used to boost task-aware feature learning.
arXiv Detail & Related papers (2020-03-25T09:36:43Z)
This list is automatically generated from the titles and abstracts of the papers on this site.