ReasoningTrack: Chain-of-Thought Reasoning for Long-term Vision-Language Tracking
- URL: http://arxiv.org/abs/2508.05221v1
- Date: Thu, 07 Aug 2025 10:02:07 GMT
- Title: ReasoningTrack: Chain-of-Thought Reasoning for Long-term Vision-Language Tracking
- Authors: Xiao Wang, Liye Jin, Xufeng Lou, Shiao Wang, Lan Chen, Bo Jiang, Zhipeng Zhang,
- Abstract summary: This paper proposes a novel reasoning-based vision-language tracking framework, named ReasoningTrack, based on a pre-trained vision-language model Qwen2.5-VL. We embed the updated language descriptions and feed them into a unified tracking backbone network together with vision features. In addition, we propose a large-scale long-term vision-language tracking benchmark dataset, termed TNLLT, which contains 200 video sequences.
- Score: 18.491855733401742
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-language tracking has received increasing attention in recent years, as textual information can effectively address the inflexibility and inaccuracy associated with specifying the target object to be tracked. Existing works either directly fuse a fixed language description with vision features or simply modify it using attention; however, their performance is still limited. Recently, some researchers have explored using text generation to adapt to variations in the target during tracking, but these works fail to provide insight into the model's reasoning process and do not fully leverage the advantages of large models, which further limits their overall performance. To address these issues, this paper proposes a novel reasoning-based vision-language tracking framework, named ReasoningTrack, built on the pre-trained vision-language model Qwen2.5-VL. Both SFT (Supervised Fine-Tuning) and GRPO-based reinforcement learning are used to optimize reasoning and language generation. We embed the updated language descriptions and feed them into a unified tracking backbone network together with vision features. A tracking head then predicts the specific location of the target object. In addition, we propose a large-scale long-term vision-language tracking benchmark dataset, termed TNLLT, which contains 200 video sequences. Twenty baseline visual trackers are re-trained and evaluated on this dataset, building a solid foundation for the vision-language tracking task. Extensive experiments on multiple vision-language tracking benchmark datasets fully validate the effectiveness of our proposed reasoning-based natural language generation strategy. The source code of this paper will be released at https://github.com/Event-AHU/Open_VLTrack
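The abstract outlines a pipeline in which a reasoning-capable VLM refreshes the language description, the embedded description is fused with vision features in a unified backbone, and a tracking head predicts the target box. Below is a minimal sketch of that flow; the module names, dimensions, and fusion scheme are illustrative assumptions, not the authors' implementation (which builds on Qwen2.5-VL fine-tuned with SFT and GRPO).

```python
# Minimal sketch of the described pipeline (assumed names/dims; not the authors' code).
import torch
import torch.nn as nn

class UnifiedTrackingBackbone(nn.Module):
    """Jointly encodes vision patch tokens and embedded text tokens."""
    def __init__(self, dim=256, depth=4, heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, vision_tokens, text_tokens):
        # Concatenate both modalities and let self-attention fuse them.
        return self.encoder(torch.cat([vision_tokens, text_tokens], dim=1))

class TrackingHead(nn.Module):
    """Predicts a normalized (cx, cy, w, h) box from the fused sequence."""
    def __init__(self, dim=256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 4))

    def forward(self, fused_tokens):
        return self.mlp(fused_tokens.mean(dim=1)).sigmoid()  # box in [0, 1]

# Toy usage: in the real framework the text tokens would come from the
# reasoning-updated description generated by the fine-tuned VLM.
vision_tokens = torch.randn(1, 196, 256)   # e.g. 14x14 search-region patches
text_tokens = torch.randn(1, 32, 256)      # embedded language description
box = TrackingHead()(UnifiedTrackingBackbone()(vision_tokens, text_tokens))
print(box.shape)  # torch.Size([1, 4])
```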
Related papers
- ATCTrack: Aligning Target-Context Cues with Dynamic Target States for Robust Vision-Language Tracking [0.6143225301480709]
Vision-language tracking aims to locate the target object in the video sequence using a template patch and a language description provided in the initial frame. To achieve robust tracking, it is essential not only to characterize the target features but also to utilize the context features related to the target. We present a novel tracker named ATCTrack, which can obtain multimodal cues Aligned with the dynamic target states.
arXiv Detail & Related papers (2025-07-26T09:05:12Z) - CLDTracker: A Comprehensive Language Description for Visual Tracking [17.858934583542325]
We propose CLDTracker, a novel Comprehensive Language Description framework for robust visual tracking. Our tracker introduces a dual-branch architecture consisting of a textual and a visual branch. Experiments on six standard VOT benchmarks demonstrate that CLDTracker achieves SOTA performance.
arXiv Detail & Related papers (2025-05-29T17:39:30Z) - OTTER: A Vision-Language-Action Model with Text-Aware Visual Feature Extraction [95.6266030753644]
Vision-Language-Action (VLA) models aim to predict robotic actions based on visual observations and language instructions. Existing approaches require fine-tuning pre-trained vision-language models (VLMs), as visual and language features are independently fed into downstream policies. We propose OTTER, a novel VLA architecture that leverages existing alignments through explicit, text-aware visual feature extraction.
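One plausible reading of text-aware visual feature extraction is that instruction tokens cross-attend to frozen patch features, so only text-relevant visual content reaches the policy. The sketch below illustrates that mechanism under stated assumptions; the names and dimensions are hypothetical and it is not OTTER's actual code.

```python
# Hedged sketch of text-aware visual feature extraction (assumed mechanism).
import torch
import torch.nn as nn

class TextAwarePooling(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, text_tokens, patch_tokens):
        # Queries from the instruction; keys/values from frozen visual patches.
        pooled, _ = self.attn(text_tokens, patch_tokens, patch_tokens)
        return pooled

extractor = TextAwarePooling()
patches = torch.randn(1, 196, 512)              # frozen VLM patch features
instruction = torch.randn(1, 16, 512)           # embedded language instruction
policy_input = extractor(instruction, patches)  # shape (1, 16, 512)
```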
arXiv Detail & Related papers (2025-03-05T18:44:48Z) - Teaching VLMs to Localize Specific Objects from In-context Examples [56.797110842152]
We find that present-day Vision-Language Models (VLMs) lack a fundamental cognitive ability: learning to localize specific objects in a scene by taking into account the context. This work is the first to explore and benchmark personalized few-shot localization for VLMs.
arXiv Detail & Related papers (2024-11-20T13:34:22Z) - ChatTracker: Enhancing Visual Tracking Performance via Chatting with Multimodal Large Language Model [29.702895846058265]
Vision-Language (VL) trackers have been proposed to utilize additional natural language descriptions to enhance versatility in various applications. However, VL trackers are still inferior to state-of-the-art (SOTA) visual trackers in terms of tracking performance. We propose ChatTracker to leverage the wealth of world knowledge in Multimodal Large Language Models (MLLMs) to generate high-quality language descriptions.
arXiv Detail & Related papers (2024-11-04T02:43:55Z) - Multi-Granularity Language-Guided Training for Multi-Object Tracking [95.91263758294154]
We propose a new multi-object tracking framework, named LG-MOT, that explicitly leverages language information at different levels of granularity. At inference, our LG-MOT uses the standard visual features without relying on annotated language descriptions. Our LG-MOT achieves an absolute gain of 2.2% in terms of target object association (IDF1 score) compared to the baseline using only visual features.
arXiv Detail & Related papers (2024-06-07T11:18:40Z) - LOGO: Video Text Spotting with Language Collaboration and Glyph Perception Model [20.007650672107566]
Video text spotting (VTS) aims to simultaneously localize, recognize and track text instances in videos.
Recent methods track the zero-shot results of state-of-the-art image text spotters directly.
Fine-tuning transformer-based text spotters on specific datasets could yield performance enhancements.
arXiv Detail & Related papers (2024-05-29T15:35:09Z) - Towards Unified Token Learning for Vision-Language Tracking [65.96561538356315]
We present a vision-language (VL) tracking pipeline, termed MMTrack, which casts VL tracking as a token generation task.
Our proposed framework serializes language description and bounding box into a sequence of discrete tokens.
In this new design paradigm, all token queries are required to perceive the desired target and directly predict spatial coordinates of the target.
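Since MMTrack serializes the bounding box into discrete tokens, a small coordinate-quantization helper (Pix2Seq-style) illustrates the idea; the bin count and vocabulary layout below are assumptions, not MMTrack's exact configuration.

```python
# Sketch of casting a bounding box as discrete tokens via uniform quantization.
def box_to_tokens(box, num_bins=1000):
    """box: (x1, y1, x2, y2) normalized to [0, 1] -> four integer token ids."""
    return [min(int(c * num_bins), num_bins - 1) for c in box]

def tokens_to_box(tokens, num_bins=1000):
    """Inverse mapping back to (approximate) normalized coordinates."""
    return [(t + 0.5) / num_bins for t in tokens]

tokens = box_to_tokens((0.25, 0.5, 0.75, 0.875))
print(tokens)                 # [250, 500, 750, 875]
print(tokens_to_box(tokens))  # [0.2505, 0.5005, 0.7505, 0.8755]
```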
arXiv Detail & Related papers (2023-08-27T13:17:34Z) - Divert More Attention to Vision-Language Object Tracking [87.31882921111048]
We argue that the lack of large-scale vision-language annotated videos and ineffective vision-language interaction learning motivate us to design more effective vision-language representation for tracking.
Particularly, in this paper, we first propose a general attribute annotation strategy to decorate videos in six popular tracking benchmarks, which contributes a large-scale vision-language tracking database with more than 23,000 videos.
We then introduce a novel framework to improve tracking by learning a unified-adaptive VL representation, whose core components are the proposed asymmetric architecture search and modality mixer (ModaMixer).
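As a rough illustration of a modality mixer, the sketch below turns a language embedding into per-channel weights that modulate the visual feature map; the gating formulation and dimensions are assumptions for illustration, not the released ModaMixer code.

```python
# Hedged sketch of language-conditioned channel re-weighting (assumed design).
import torch
import torch.nn as nn

class ModalityMixer(nn.Module):
    def __init__(self, text_dim=768, vis_channels=256):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(text_dim, vis_channels),
            nn.Sigmoid(),  # channel weights in (0, 1)
        )

    def forward(self, vis_feat, text_feat):
        # vis_feat: (B, C, H, W) feature map; text_feat: (B, text_dim) sentence embedding
        w = self.gate(text_feat)               # (B, C)
        return vis_feat * w[:, :, None, None]  # channel-wise re-weighting

mixer = ModalityMixer()
mixed = mixer(torch.randn(2, 256, 16, 16), torch.randn(2, 768))
print(mixed.shape)  # torch.Size([2, 256, 16, 16])
```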
arXiv Detail & Related papers (2023-07-19T15:22:06Z) - Towards More Flexible and Accurate Object Tracking with Natural Language: Algorithms and Benchmark [46.691218019908746]
Tracking by natural language specification is an emerging research topic that aims at locating the target object in the video sequence based on its language description.
We propose a new benchmark specifically dedicated to tracking by language, including a large-scale dataset.
We also introduce two new challenges into TNL2K for the object tracking task, i.e., adversarial samples and modality switch.
arXiv Detail & Related papers (2021-03-31T00:57:32Z)