EPIPTrack: Rethinking Prompt Modeling with Explicit and Implicit Prompts for Multi-Object Tracking
- URL: http://arxiv.org/abs/2510.13235v1
- Date: Wed, 15 Oct 2025 07:39:30 GMT
- Title: EPIPTrack: Rethinking Prompt Modeling with Explicit and Implicit Prompts for Multi-Object Tracking
- Authors: Yukuan Zhang, Jiarui Zhao, Shangqing Nie, Jin Kuang, Shengsheng Wang,
- Abstract summary: We propose a unified vision-language tracking framework, named EPIPTrack. EPIPTrack leverages explicit and implicit prompts for dynamic target modeling and semantic alignment. Experiments on MOT17, MOT20, and DanceTrack demonstrate that EPIPTrack outperforms existing trackers in diverse scenarios.
- Score: 10.065921746316642
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multimodal semantic cues, such as textual descriptions, have shown strong potential in enhancing target perception for tracking. However, existing methods rely on static textual descriptions from large language models, which lack adaptability to real-time target state changes and are prone to hallucinations. To address these challenges, we propose a unified multimodal vision-language tracking framework, named EPIPTrack, which leverages explicit and implicit prompts for dynamic target modeling and semantic alignment. Specifically, explicit prompts transform spatial motion information into natural language descriptions to provide spatiotemporal guidance. Implicit prompts combine pseudo-words with learnable descriptors to construct individualized knowledge representations capturing appearance attributes. Both prompts undergo dynamic adjustment via the CLIP text encoder to respond to changes in target state. Furthermore, we design a Discriminative Feature Augmentor to enhance visual and cross-modal representations. Extensive experiments on MOT17, MOT20, and DanceTrack demonstrate that EPIPTrack outperforms existing trackers in diverse scenarios, exhibiting robust adaptability and superior performance.
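The explicit-prompt idea in the abstract — turning a target's spatial motion into a natural-language description that a CLIP text encoder could consume — can be sketched as follows. The box format, thresholds, and template wording are illustrative assumptions, not the paper's actual implementation:

```python
# Hypothetical sketch of an "explicit prompt" generator: it converts the
# motion of a tracked bounding box between two consecutive frames into a
# short natural-language description. All thresholds and phrasings here
# are illustrative assumptions.

def explicit_prompt(prev_box, curr_box, label="person"):
    """prev_box / curr_box: (cx, cy, w, h) box in pixels for consecutive frames."""
    (px, py, pw, ph), (cx, cy, cw, ch) = prev_box, curr_box
    dx, dy = cx - px, cy - py

    parts = [f"a {label}"]
    if abs(dx) < 2 and abs(dy) < 2:                 # nearly stationary
        parts.append("standing still")
    elif abs(dx) >= abs(dy):                        # dominant horizontal motion
        parts.append("moving rightward" if dx > 0 else "moving leftward")
    else:                                           # dominant vertical motion
        parts.append("moving downward" if dy > 0 else "moving upward")

    scale = (cw * ch) / max(pw * ph, 1e-6)          # area ratio as a depth cue
    if scale > 1.1:
        parts.append("approaching the camera")
    elif scale < 0.9:
        parts.append("moving away from the camera")

    return ", ".join(parts)
```

The resulting string (e.g. "a person, moving rightward, approaching the camera") would then be encoded alongside the implicit prompts; in the paper both prompt types pass through the CLIP text encoder and are re-generated per frame as the target state changes.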
Related papers
- Chatting with Images for Introspective Visual Thinking [50.7747647794877]
"Chatting with images" is a new framework that reframes visual manipulation as language-guided feature modulation. Under the guidance of expressive language prompts, the model dynamically performs joint re-encoding over multiple image regions. ViLaVT achieves strong and consistent improvements on complex multi-image and video-based spatial reasoning tasks.
arXiv Detail & Related papers (2026-02-11T17:42:37Z)
- Constrained Prompt Enhancement for Improving Zero-Shot Generalization of Vision-Language Models [57.357091028792325]
Vision-language models (VLMs) pre-trained on web-scale data exhibit promising zero-shot generalization but often suffer from semantic misalignment. We propose a novel constrained prompt enhancement (CPE) method to improve visual-textual alignment. Our approach consists of two key components: Topology-Guided Synonymous Semantic Generation (TGSSG) and Category-Agnostic Discriminative Region Selection (CADRS).
arXiv Detail & Related papers (2025-08-24T15:45:22Z)
- ATCTrack: Aligning Target-Context Cues with Dynamic Target States for Robust Vision-Language Tracking [0.6143225301480709]
Vision-language tracking aims to locate the target object in the video sequence using a template patch and a language description provided in the initial frame. To achieve robust tracking, it is essential not only to characterize the target features but also to utilize the context features related to the target. We present a novel tracker named ATCTrack, which can obtain multimodal cues aligned with the dynamic target states.
arXiv Detail & Related papers (2025-07-26T09:05:12Z)
- Multimodal Prompt Alignment for Facial Expression Recognition [24.470095812039286]
MPA-FER provides fine-grained semantic guidance to the learning process of prompted visual features. Our framework outperforms state-of-the-art methods on three FER benchmark datasets.
arXiv Detail & Related papers (2025-06-26T05:28:57Z)
- CLDTracker: A Comprehensive Language Description for Visual Tracking [17.858934583542325]
We propose CLDTracker, a novel Comprehensive Language Description framework for robust visual tracking. Our tracker introduces a dual-branch architecture consisting of a textual and a visual branch. Experiments on six standard VOT benchmarks demonstrate that CLDTracker achieves SOTA performance.
arXiv Detail & Related papers (2025-05-29T17:39:30Z)
- Multi-Granularity Language-Guided Training for Multi-Object Tracking [95.91263758294154]
We propose a new multi-object tracking framework, named LG-MOT, that explicitly leverages language information at different levels of granularity. At inference, our LG-MOT uses the standard visual features without relying on annotated language descriptions. Our LG-MOT achieves an absolute gain of 2.2% in terms of target object association (IDF1 score) compared to the baseline using only visual features.
arXiv Detail & Related papers (2024-06-07T11:18:40Z)
- Multi-modal Instruction Tuned LLMs with Fine-grained Visual Perception [63.03288425612792]
We propose AnyRef, a general MLLM that can generate pixel-wise object perceptions and natural language descriptions from multi-modality references.
Our model achieves state-of-the-art results across multiple benchmarks, including diverse modality referring segmentation and region-level referring expression generation.
arXiv Detail & Related papers (2024-03-05T13:45:46Z)
- DialCLIP: Empowering CLIP as Multi-Modal Dialog Retriever [83.33209603041013]
We propose a parameter-efficient prompt-tuning method named DialCLIP for multi-modal dialog retrieval.
Our approach introduces a multi-modal context generator to learn context features which are distilled into prompts within the pre-trained vision-language model CLIP.
To facilitate various types of retrieval, we also design multiple experts to learn mappings from CLIP outputs to multi-modal representation space.
arXiv Detail & Related papers (2024-01-02T07:40:12Z)
- MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action [96.33509740612486]
MM-REACT is a system paradigm that integrates ChatGPT with a pool of vision experts to achieve multimodal reasoning and action.
MM-REACT's prompt design allows language models to accept, associate, and process multimodal information.
arXiv Detail & Related papers (2023-03-20T18:31:47Z)
- Dynamic Prompting: A Unified Framework for Prompt Tuning [33.175097465669374]
We present a unified dynamic prompt (DP) tuning strategy that dynamically determines different factors of prompts based on specific tasks and instances.
Experimental results underscore the significant performance improvement achieved by dynamic prompt tuning across a wide range of tasks.
We establish the universal applicability of our approach under full-data, few-shot, and multitask scenarios.
arXiv Detail & Related papers (2023-03-06T06:04:46Z)
- Generalizing Multiple Object Tracking to Unseen Domains by Introducing Natural Language Representation [33.03600813115465]
We propose to introduce natural language representation into visual MOT models for boosting the domain generalization ability.
To tackle this problem, we design two modules, namely visual context prompting (VCP) and visual-language mixing (VLM).
VLM joins the information in the generated visual prompts with the textual prompts from a pre-defined Trackbook to obtain instance-level pseudo textual descriptions.
Through training models on MOT17 and validating them on MOT20, we observe that the pseudo textual descriptions generated by our proposed modules improve the generalization performance of query-based trackers by large margins.
arXiv Detail & Related papers (2022-12-03T07:57:31Z)
- CLIP-Adapter: Better Vision-Language Models with Feature Adapters [84.88106370842883]
We show that there is an alternative path to achieve better vision-language models other than prompt tuning. CLIP-Adapter adopts an additional bottleneck layer to learn new features and performs residual-style feature blending. Experiments and extensive ablation studies on various visual classification tasks demonstrate the effectiveness of our approach.
arXiv Detail & Related papers (2021-10-09T11:39:30Z)
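The bottleneck-plus-residual-blending design that the CLIP-Adapter summary describes can be sketched in a few lines of numpy. Layer sizes and the residual ratio alpha are illustrative placeholders; in the actual method the adapter weights are trained while the CLIP backbone stays frozen:

```python
import numpy as np

# Minimal sketch of the CLIP-Adapter idea: a small bottleneck MLP refines
# a frozen CLIP feature, and a residual ratio alpha blends the adapted
# feature back with the original. Weight shapes and alpha are illustrative.

def clip_adapter(feat, w_down, w_up, alpha=0.2):
    """feat: (d,) frozen CLIP feature; w_down: (d, d//4); w_up: (d//4, d)."""
    hidden = np.maximum(feat @ w_down, 0.0)          # down-project + ReLU
    adapted = np.maximum(hidden @ w_up, 0.0)         # up-project back to d dims
    return alpha * adapted + (1.0 - alpha) * feat    # residual-style blending
```

Setting alpha = 0 recovers the original frozen feature unchanged, which is why the residual formulation lets the adapter start from (and never catastrophically overwrite) the pre-trained representation.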
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.