ATCTrack: Aligning Target-Context Cues with Dynamic Target States for Robust Vision-Language Tracking
- URL: http://arxiv.org/abs/2507.19875v1
- Date: Sat, 26 Jul 2025 09:05:12 GMT
- Title: ATCTrack: Aligning Target-Context Cues with Dynamic Target States for Robust Vision-Language Tracking
- Authors: X. Feng, S. Hu, X. Li, D. Zhang, M. Wu, J. Zhang, X. Chen, K. Huang
- Abstract summary: Vision-language tracking aims to locate the target object in the video sequence using a template patch and a language description provided in the initial frame. To achieve robust tracking, it is essential not only to characterize the target features but also to utilize the context features related to the target. We present a novel tracker named ATCTrack, which can obtain multimodal cues Aligned with the dynamic target states.
- Score: 0.6143225301480709
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-language tracking aims to locate the target object in the video sequence using a template patch and a language description provided in the initial frame. To achieve robust tracking, especially in complex long-term scenarios that reflect real-world conditions as recently highlighted by MGIT, it is essential not only to characterize the target features but also to utilize the context features related to the target. However, the visual and textual target-context cues derived from the initial prompts generally align only with the initial target state. Due to their dynamic nature, target states are constantly changing, particularly in complex long-term sequences. It is intractable for these cues to continuously guide Vision-Language Trackers (VLTs). Furthermore, for text prompts with diverse expressions, our experiments reveal that existing VLTs struggle to discern which words pertain to the target or the context, complicating the utilization of textual cues. In this work, we present a novel tracker named ATCTrack, which can obtain multimodal cues Aligned with the dynamic target states through comprehensive Target-Context feature modeling, thereby achieving robust tracking. Specifically, (1) for the visual modality, we propose an effective temporal visual target-context modeling approach that provides the tracker with timely visual cues. (2) For the textual modality, we achieve precise target word identification based solely on textual content, and design an innovative context words calibration method to adaptively utilize auxiliary context words. (3) We conduct extensive experiments on mainstream benchmarks, and ATCTrack achieves new state-of-the-art (SOTA) performance. The code and models will be released at: https://github.com/XiaokunFeng/ATCTrack.
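The abstract describes a textual branch that first identifies which words of the prompt refer to the target and then calibrates the remaining context words against the current target state. ATCTrack's actual code is not included here; the snippet below is only a minimal PyTorch sketch of that general idea, where the module names, feature dimensions, and the gating form are illustrative assumptions rather than the authors' design.

```python
# Minimal sketch (not ATCTrack's code): a target-word classifier driven only by
# textual content, plus a context-word gate conditioned on the current target state.
import torch
import torch.nn as nn

class TextualTargetContextCues(nn.Module):
    def __init__(self, txt_dim=256, vis_dim=256):
        super().__init__()
        # Per-word score: does this word describe the target itself?
        self.target_word_head = nn.Linear(txt_dim, 1)
        # Per-word usefulness of a context word for the *current* target state.
        self.context_gate = nn.Sequential(
            nn.Linear(txt_dim + vis_dim, txt_dim), nn.ReLU(),
            nn.Linear(txt_dim, 1), nn.Sigmoid(),
        )

    def forward(self, word_feats, target_state):
        # word_feats:   (B, L, txt_dim)  word embeddings of the language prompt
        # target_state: (B, vis_dim)     pooled visual feature of the current target
        target_prob = torch.sigmoid(self.target_word_head(word_feats))    # (B, L, 1)
        state = target_state.unsqueeze(1).expand(-1, word_feats.size(1), -1)
        gate = self.context_gate(torch.cat([word_feats, state], dim=-1))  # (B, L, 1)
        # Target words are kept; context words are re-weighted so that stale
        # descriptions contribute less as the target state changes over time.
        calibrated = word_feats * (target_prob + (1 - target_prob) * gate)
        return calibrated, target_prob, gate

# Usage: cues, p_tgt, g_ctx = TextualTargetContextCues()(torch.randn(2, 12, 256),
#                                                        torch.randn(2, 256))
```

The key design point the sketch tries to capture is that the target/context split depends only on the text, while the calibration weight is conditioned on the evolving visual target state.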
Related papers
- ReasoningTrack: Chain-of-Thought Reasoning for Long-term Vision-Language Tracking [18.491855733401742]
This paper proposes a novel reasoning-based vision-language tracking framework, named ReasoningTrack, based on a pre-trained vision-language model Qwen2.5-VL. We embed the updated language descriptions and feed them into a unified tracking backbone network together with vision features. In addition, we propose a large-scale long-term vision-language tracking benchmark dataset, termed TNLLT, which contains 200 video sequences.
arXiv Detail & Related papers (2025-08-07T10:02:07Z) - CLDTracker: A Comprehensive Language Description for Visual Tracking [17.858934583542325]
We propose CLDTracker, a novel Comprehensive Language Description framework for robust visual tracking. Our tracker introduces a dual-branch architecture consisting of a textual and a visual branch. Experiments on six standard VOT benchmarks demonstrate that CLDTracker achieves SOTA performance.
arXiv Detail & Related papers (2025-05-29T17:39:30Z) - Dynamic Updates for Language Adaptation in Visual-Language Tracking [10.64409248365897]
We propose a vision-language tracking framework, named DUTrack, which captures the latest state of the target by dynamically updating multi-modal references to maintain consistency. DUTrack achieves new state-of-the-art performance on four mainstream vision-language and two vision-only tracking benchmarks, including LaSOT, LaSOT$_{\rm ext}$, TNL2K, OTB99-Lang, GOT-10K, and UAV123.
arXiv Detail & Related papers (2025-03-09T13:47:19Z) - Teaching VLMs to Localize Specific Objects from In-context Examples [56.797110842152]
We find that present-day Vision-Language Models (VLMs) lack a fundamental cognitive ability: learning to localize specific objects in a scene by taking into account the context. This work is the first to explore and benchmark personalized few-shot localization for VLMs.
arXiv Detail & Related papers (2024-11-20T13:34:22Z) - Context-Aware Integration of Language and Visual References for Natural Language Tracking [27.3884348078998]
Tracking by natural language specification (TNL) aims to consistently localize a target in a video sequence given a linguistic description in the initial frame.
We propose a joint multi-modal tracking framework with a prompt module that leverages the complementarity between temporal visual templates and language expressions, providing precise and context-aware appearance and linguistic cues.
This design ensures spatio-temporal consistency by leveraging historical visual information and provides an integrated solution, generating predictions in a single step.
arXiv Detail & Related papers (2024-03-29T04:58:33Z) - Integrating Self-supervised Speech Model with Pseudo Word-level Targets
from Visually-grounded Speech Model [57.78191634042409]
We propose Pseudo-Word HuBERT (PW-HuBERT), a framework that integrates pseudo word-level targets into the training process.
Our experimental results on four spoken language understanding (SLU) benchmarks suggest the superiority of our model in capturing semantic information.
arXiv Detail & Related papers (2024-02-08T16:55:21Z) - Beyond Visual Cues: Synchronously Exploring Target-Centric Semantics for
Vision-Language Tracking [3.416427651955299]
Single object tracking aims to locate one specific target in video sequences, given its initial state. Vision-Language (VL) tracking has emerged as a promising approach.
We present a novel tracker that progressively explores target-centric semantics for VL tracking.
arXiv Detail & Related papers (2023-11-28T02:28:12Z) - VGSG: Vision-Guided Semantic-Group Network for Text-based Person Search [51.9899504535878]
We propose a Vision-Guided Semantic-Group Network (VGSG) for text-based person search.
In VGSG, a vision-guided attention is employed to extract visual-related textual features.
With the help of relational knowledge transfer, VGKT is capable of aligning semantic-group textual features with corresponding visual features.
arXiv Detail & Related papers (2023-11-13T17:56:54Z) - Towards Unified Token Learning for Vision-Language Tracking [65.96561538356315]
We present a vision-language (VL) tracking pipeline, termed MMTrack, which casts VL tracking as a token generation task.
Our proposed framework serializes the language description and bounding box into a sequence of discrete tokens (a generic coordinate-token sketch is given after this list).
In this new design paradigm, all token queries are required to perceive the desired target and directly predict spatial coordinates of the target.
arXiv Detail & Related papers (2023-08-27T13:17:34Z) - CiteTracker: Correlating Image and Text for Visual Tracking [114.48653709286629]
We propose the CiteTracker to enhance target modeling and inference in visual tracking by connecting images and text.
Specifically, we develop a text generation module to convert the target image patch into a descriptive text.
We then associate the target description and the search image using an attention-based correlation module to generate the correlated features for target state reference (a generic cross-attention sketch is given after this list).
arXiv Detail & Related papers (2023-08-22T09:53:12Z) - Tracking Objects and Activities with Attention for Temporal Sentence
Grounding [51.416914256782505]
Temporal sentence grounding (TSG) aims to localize the temporal segment that is semantically aligned with a natural language query in an untrimmed video.
We propose a novel Temporal Sentence Tracking Network (TSTNet), which contains (A) a Cross-modal Targets Generator to generate multi-modal targets and the search space, and (B) a Temporal Sentence Tracker to track the multi-modal targets' behavior and predict the query-related segment.
arXiv Detail & Related papers (2023-02-21T16:42:52Z)
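For the MMTrack entry above, which casts VL tracking as token generation by serializing the bounding box into discrete tokens, the following is a minimal, generic sketch of coordinate tokenization: normalized box corners are quantized into a fixed number of bins. The bin count, coordinate order, and decoding rule are assumptions for illustration, not MMTrack's actual tokenizer.

```python
# Minimal sketch (assumed details): quantize a box's normalized corners into
# discrete bin indices so the box can join a token sequence, and decode back.
def box_to_tokens(box_xyxy, img_w, img_h, n_bins=1000):
    x1, y1, x2, y2 = box_xyxy
    coords = [x1 / img_w, y1 / img_h, x2 / img_w, y2 / img_h]
    return [min(int(c * n_bins), n_bins - 1) for c in coords]

def tokens_to_box(tokens, img_w, img_h, n_bins=1000):
    # Map bin indices back to approximate pixel coordinates via bin centers.
    cx1, cy1, cx2, cy2 = [(t + 0.5) / n_bins for t in tokens]
    return [cx1 * img_w, cy1 * img_h, cx2 * img_w, cy2 * img_h]

# Example: a 1000-bin vocabulary over a 1280x720 frame.
# box_to_tokens([100, 50, 400, 300], 1280, 720)  ->  [78, 69, 312, 416]
```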
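Similarly, for the CiteTracker entry, associating a generated target description with the search image via an attention-based correlation module can be sketched as a single cross-attention step. The layer choices and dimensions below are assumptions; the actual module design is not specified in the summary.

```python
# Minimal sketch (assumed shapes): search-region tokens attend to the embedded
# target description to produce correlated features for state inference.
import torch
import torch.nn as nn

class TextImageCorrelation(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, search_feats, text_feats):
        # search_feats: (B, N, dim) flattened search-image tokens
        # text_feats:   (B, T, dim) embedded words of the target description
        corr, _ = self.attn(query=search_feats, key=text_feats, value=text_feats)
        return self.norm(search_feats + corr)  # residual fusion

# corr = TextImageCorrelation()(torch.randn(2, 256, 256), torch.randn(2, 12, 256))
```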