Dynamic Updates for Language Adaptation in Visual-Language Tracking
- URL: http://arxiv.org/abs/2503.06621v1
- Date: Sun, 09 Mar 2025 13:47:19 GMT
- Title: Dynamic Updates for Language Adaptation in Visual-Language Tracking
- Authors: Xiaohai Li, Bineng Zhong, Qihua Liang, Zhiyi Mo, Jian Nong, Shuxiang Song
- Abstract summary: We propose a vision-language tracking framework, named DUTrack, which captures the latest state of the target by dynamically updating multi-modal references to maintain consistency. DUTrack achieves new state-of-the-art performance on four mainstream vision-language and two vision-only tracking benchmarks, including LaSOT, LaSOT$_{\rm{ext}}$, TNL2K, OTB99-Lang, GOT-10K, and UAV123.
- Score: 10.64409248365897
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: The consistency between the semantic information provided by the multi-modal reference and the tracked object is crucial for visual-language (VL) tracking. However, existing VL tracking frameworks rely on static multi-modal references to locate dynamic objects, which can lead to semantic discrepancies and reduce the robustness of the tracker. To address this issue, we propose a novel vision-language tracking framework, named DUTrack, which captures the latest state of the target by dynamically updating multi-modal references to maintain consistency. Specifically, we introduce a Dynamic Language Update Module, which leverages a large language model to generate dynamic language descriptions for the object based on visual features and object category information. Then, we design a Dynamic Template Capture Module, which captures the regions in the image that highly match the dynamic language descriptions. Furthermore, to ensure the efficiency of description generation, we design an update strategy that assesses changes in target displacement, scale, and other factors to decide on updates. Finally, the dynamic template and language descriptions that record the latest state of the target are used to update the multi-modal references, providing more accurate reference information for subsequent inference and enhancing the robustness of the tracker. DUTrack achieves new state-of-the-art performance on four mainstream vision-language and two vision-only tracking benchmarks, including LaSOT, LaSOT$_{\rm{ext}}$, TNL2K, OTB99-Lang, GOT-10K, and UAV123. Code and models are available at https://github.com/GXNU-ZhongLab/DUTrack.
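A minimal sketch of the update strategy described above (refresh the multi-modal reference once the target's displacement or scale change crosses a threshold) is given below. The function names, thresholds, and the `tracker`/`describe`/`crop` hooks are hypothetical placeholders standing in for DUTrack's Dynamic Template Capture and Dynamic Language Update modules, not the authors' actual interface.

```python
# Illustrative sketch only: names, thresholds, and hooks are assumptions,
# not DUTrack's implementation.
from dataclasses import dataclass


@dataclass
class Box:
    cx: float  # center x
    cy: float  # center y
    w: float   # width
    h: float   # height


def should_update(ref: Box, cur: Box,
                  disp_thresh: float = 0.3,
                  scale_thresh: float = 0.25) -> bool:
    """Decide whether the multi-modal reference should be refreshed.

    Triggers when the target has moved far relative to its own size or
    when its area has changed noticeably (thresholds are made up).
    """
    disp = ((cur.cx - ref.cx) ** 2 + (cur.cy - ref.cy) ** 2) ** 0.5
    rel_disp = disp / max(ref.w, ref.h, 1e-6)
    rel_scale = abs((cur.w * cur.h) / max(ref.w * ref.h, 1e-6) - 1.0)
    return rel_disp > disp_thresh or rel_scale > scale_thresh


def track_step(frame, ref_box, ref_text, tracker, describe, crop):
    """One tracking step with a conditional reference refresh.

    `tracker`, `describe` (e.g., an LLM-based captioner), and `crop` are
    stand-in callables for the modules named in the abstract.
    """
    cur_box = tracker(frame, ref_box, ref_text)    # locate the target
    if should_update(ref_box, cur_box):
        ref_text = describe(crop(frame, cur_box))  # regenerate the description
        ref_box = cur_box                          # refresh the template region
    return cur_box, ref_box, ref_text
```

In the actual framework the refresh replaces both the dynamic template and the language description that make up the multi-modal reference; the point of the sketch is only the gating logic that keeps description generation infrequent.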
Related papers
- Re-Aligning Language to Visual Objects with an Agentic Workflow [73.73778652260911]
Language-based object detection aims to align visual objects with language expressions.
Recent studies leverage vision-language models (VLMs) to automatically generate human-like expressions for visual objects.
We propose an agentic workflow controlled by an LLM to re-align language to visual objects by adaptively adjusting image and text prompts.
arXiv Detail & Related papers (2025-03-30T16:41:12Z)
- Teaching VLMs to Localize Specific Objects from In-context Examples [56.797110842152]
Vision-Language Models (VLMs) have shown remarkable capabilities across diverse visual tasks.
Current VLMs lack a fundamental cognitive ability: learning to localize objects in a scene by taking into account the context.
This work is the first to explore and benchmark personalized few-shot localization for VLMs.
arXiv Detail & Related papers (2024-11-20T13:34:22Z)
- ChatTracker: Enhancing Visual Tracking Performance via Chatting with Multimodal Large Language Model [29.702895846058265]
Vision-Language (VL) trackers have been proposed to utilize additional natural language descriptions to enhance versatility in various applications.
However, VL trackers are still inferior to state-of-the-art (SoTA) visual trackers in terms of tracking performance.
We propose ChatTracker to leverage the wealth of world knowledge in the Multimodal Large Language Model (MLLM) to generate high-quality language descriptions.
arXiv Detail & Related papers (2024-11-04T02:43:55Z)
- Multi-Granularity Language-Guided Multi-Object Tracking [95.91263758294154]
We propose a new multi-object tracking framework, named LG-MOT, that explicitly leverages language information at different levels of granularity.
At inference, our LG-MOT uses the standard visual features without relying on annotated language descriptions.
Our LG-MOT achieves an absolute gain of 2.2% in terms of target object association (IDF1 score) compared to the baseline using only visual features.
arXiv Detail & Related papers (2024-06-07T11:18:40Z)
- MLS-Track: Multilevel Semantic Interaction in RMOT [31.153018571396206]
We propose a high-quality yet low-cost data generation method based on Unreal Engine 5.
We construct a brand-new benchmark dataset, named Refer-UE-City, which primarily includes scenes from intersection surveillance videos.
We also propose a multi-level semantic-guided multi-object framework called MLS-Track, where the interaction between the model and text is enhanced layer by layer.
arXiv Detail & Related papers (2024-04-18T09:31:03Z)
- Tracking with Human-Intent Reasoning [64.69229729784008]
This work proposes a new tracking task -- Instruction Tracking.
It involves providing implicit tracking instructions that require the trackers to perform tracking automatically in video frames.
TrackGPT is capable of performing complex reasoning-based tracking.
arXiv Detail & Related papers (2023-12-29T03:22:18Z)
- CiteTracker: Correlating Image and Text for Visual Tracking [114.48653709286629]
We propose the CiteTracker to enhance target modeling and inference in visual tracking by connecting images and text.
Specifically, we develop a text generation module to convert the target image patch into a descriptive text.
We then associate the target description and the search image using an attention-based correlation module to generate the correlated features for target state reference (a generic sketch follows this entry).
arXiv Detail & Related papers (2023-08-22T09:53:12Z)
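The attention-based correlation step in the CiteTracker summary can be pictured as generic cross-attention: a pooled embedding of the generated target description reweights the search-image feature tokens. The tensor shapes and the residual reweighting rule below are assumptions used for illustration, not CiteTracker's implementation.

```python
import torch
import torch.nn.functional as F


def correlate_text_and_search(text_emb: torch.Tensor,
                              search_feats: torch.Tensor) -> torch.Tensor:
    """Cross-attention-style correlation of a description with search features.

    text_emb:     (B, D)    pooled embedding of the generated target description
    search_feats: (B, N, D) flattened search-image feature tokens
    returns:      (B, N, D) search features reweighted by text relevance
    """
    d = text_emb.shape[-1]
    # Similarity of every search token to the text description.
    attn = torch.einsum('bd,bnd->bn', text_emb, search_feats) / d ** 0.5
    weights = F.softmax(attn, dim=-1)              # (B, N)
    # Emphasize tokens that match the description (residual reweighting).
    return search_feats * (1.0 + weights.unsqueeze(-1))


# Toy usage with random tensors; all dimensions are assumptions.
text_emb = torch.randn(2, 256)
search_feats = torch.randn(2, 1024, 256)
print(correlate_text_and_search(text_emb, search_feats).shape)  # (2, 1024, 256)
```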
- Type-to-Track: Retrieve Any Object via Prompt-based Tracking [34.859061177766016]
This paper introduces a novel paradigm for Multiple Object Tracking called Type-to-Track.
Type-to-Track allows users to track objects in videos by typing natural language descriptions.
We present a new dataset for the Grounded Multiple Object Tracking task, called GroOT.
arXiv Detail & Related papers (2023-05-22T21:25:27Z)
- OVTrack: Open-Vocabulary Multiple Object Tracking [64.73379741435255]
OVTrack is an open-vocabulary tracker capable of tracking arbitrary object classes.
It sets a new state-of-the-art on the large-scale, large-vocabulary TAO benchmark.
arXiv Detail & Related papers (2023-04-17T16:20:05Z)
- Generalizing Multiple Object Tracking to Unseen Domains by Introducing Natural Language Representation [33.03600813115465]
We propose to introduce natural language representation into visual MOT models to boost the domain generalization ability.
To tackle this problem, we design two modules, namely visual context prompting (VCP) and visual-language mixing (VLM).
VLM joins the information in the generated visual prompts and the textual prompts from a pre-defined Trackbook to obtain instance-level pseudo textual descriptions (one possible reading is sketched after this entry).
Through training models on MOT17 and validating them on MOT20, we observe that the pseudo textual descriptions generated by our proposed modules improve the generalization performance of query-based trackers by large margins.
arXiv Detail & Related papers (2022-12-03T07:57:31Z)
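One possible reading of the visual-language mixing (VLM) step above is retrieval plus fusion: each instance's generated visual prompt selects its closest textual prompt from the pre-defined Trackbook, and the two embeddings are blended into an instance-level pseudo description. This sketch is speculative; the Trackbook embedding table and the convex mixing rule are assumptions, not the paper's module.

```python
import torch
import torch.nn.functional as F


def pseudo_text_descriptions(visual_prompts: torch.Tensor,
                             trackbook: torch.Tensor,
                             alpha: float = 0.5) -> torch.Tensor:
    """Fuse per-instance visual prompts with their nearest Trackbook text prompts.

    visual_prompts: (M, D) one embedding per tracked instance
    trackbook:      (K, D) embeddings of pre-defined textual prompts
    returns:        (M, D) instance-level pseudo textual descriptions
    """
    sim = F.normalize(visual_prompts, dim=-1) @ F.normalize(trackbook, dim=-1).T  # (M, K)
    nearest = trackbook[sim.argmax(dim=-1)]   # best-matching text prompt per instance
    return alpha * visual_prompts + (1.0 - alpha) * nearest  # simple convex mixing


# Toy usage; dimensions are assumptions.
desc = pseudo_text_descriptions(torch.randn(5, 128), torch.randn(50, 128))
print(desc.shape)  # torch.Size([5, 128])
```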
- End-to-end Tracking with a Multi-query Transformer [96.13468602635082]
Multiple-object tracking (MOT) is a challenging task that requires simultaneous reasoning about location, appearance, and identity of the objects in the scene over time.
Our aim in this paper is to move beyond tracking-by-detection approaches, to class-agnostic tracking that also performs well for unknown object classes.
arXiv Detail & Related papers (2022-10-26T10:19:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.