Improving Visual Object Tracking through Visual Prompting
- URL: http://arxiv.org/abs/2409.18901v1
- Date: Fri, 27 Sep 2024 16:39:50 GMT
- Title: Improving Visual Object Tracking through Visual Prompting
- Authors: Shih-Fang Chen, Jun-Cheng Chen, I-Hong Jhuo, Yen-Yu Lin
- Abstract summary: Dynamic target representation adaptation against distractors is challenging due to the limited discriminative capabilities of prevailing trackers.
We present a new visual Prompting mechanism for generic Visual Object Tracking (PiVOT) to address this issue.
PiVOT introduces a prompt generation network that works with the pre-trained foundation model CLIP to automatically generate and refine visual prompts.
- Score: 24.436237938873695
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Learning a discriminative model to distinguish a target from its surrounding distractors is essential to generic visual object tracking. Dynamic target representation adaptation against distractors is challenging due to the limited discriminative capabilities of prevailing trackers. We present a new visual Prompting mechanism for generic Visual Object Tracking (PiVOT) to address this issue. PiVOT introduces a prompt generation network that works with the pre-trained foundation model CLIP to automatically generate and refine visual prompts, enabling the transfer of foundation model knowledge for tracking. While CLIP offers broad category-level knowledge, the tracker, trained on instance-specific data, excels at recognizing unique object instances. Thus, PiVOT first compiles a visual prompt highlighting potential target locations. To transfer the knowledge of CLIP to the tracker, PiVOT leverages CLIP to refine the visual prompt based on the similarities between candidate objects and the reference templates across potential targets. Once the visual prompt is refined, it can better highlight potential target locations, thereby reducing irrelevant prompt information. With the proposed prompting mechanism, the tracker can generate improved instance-aware feature maps through the guidance of the visual prompt, thus effectively reducing distractors. The proposed method does not involve CLIP during training, thereby keeping the same training complexity and preserving the generalization capability of the pre-trained foundation model. Extensive experiments across multiple benchmarks indicate that PiVOT, using the proposed prompting method, can suppress distracting objects and enhance the tracker.
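The abstract describes the refinement step only in prose. Below is a minimal sketch of one way such CLIP-based prompt refinement could work, assuming a CLIP image encoder (e.g., from the openai/CLIP package); the crop tensors, score vector, and reweighting rule are illustrative assumptions, not the authors' implementation.
```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def refine_prompt(clip_model, template, candidates, prompt_scores):
    # template: (1, 3, H, W) reference crop; candidates: (N, 3, H, W) candidate crops
    # prompt_scores: (N,) initial scores over potential target locations
    t = F.normalize(clip_model.encode_image(template).float(), dim=-1)    # (1, D)
    c = F.normalize(clip_model.encode_image(candidates).float(), dim=-1)  # (N, D)
    sim = (c @ t.T).squeeze(-1)  # cosine similarity of each candidate to the template, (N,)
    # Suppress candidates that CLIP judges dissimilar to the reference (likely distractors).
    return prompt_scores * sim.clamp(min=0)
```
Under this sketch, candidates that look unlike the reference template are down-weighted, so the refined prompt highlights the true target rather than distractors before it guides the tracker's feature maps.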
Related papers
- Referencing Where to Focus: Improving Visual Grounding with Referential Query [30.33315985826623]
We propose a novel visual grounding method called RefFormer.
It consists of a query adaption module that can be seamlessly integrated into CLIP.
Our proposed query adaption module can also act as an adapter, preserving the rich knowledge within CLIP without the need to tune the parameters of the backbone network.
arXiv Detail & Related papers (2024-12-26T10:19:20Z)
- Learning Object-Centric Representation via Reverse Hierarchy Guidance [73.05170419085796]
Object-Centric Learning (OCL) seeks to enable Neural Networks to identify individual objects in visual scenes.
RHGNet introduces a top-down pathway that works in different ways in the training and inference processes.
Our model achieves state-of-the-art performance on several commonly used datasets.
arXiv Detail & Related papers (2024-05-17T07:48:27Z)
- Prototypical Contrastive Learning-based CLIP Fine-tuning for Object Re-identification [13.090873217313732]
This work aims to adapt large-scale pre-trained vision-language models, such as contrastive language-image pretraining (CLIP), to enhance the performance of object re-identification (Re-ID).
We first analyze the role of prompt learning in CLIP-ReID and identify its limitations.
Our approach directly fine-tunes the image encoder of CLIP using a prototypical contrastive learning (PCL) loss, eliminating the need for prompt learning.
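As a rough illustration of what a prototypical contrastive objective can look like (a sketch with assumed shapes and temperature, not the paper's exact loss):
```python
import torch
import torch.nn.functional as F

def pcl_loss(feats, labels, prototypes, tau=0.07):
    # feats: (B, D) image features; labels: (B,) identity indices
    # prototypes: (C, D) one learned prototype per identity
    feats = F.normalize(feats, dim=-1)
    protos = F.normalize(prototypes, dim=-1)
    logits = feats @ protos.T / tau  # (B, C) temperature-scaled cosine similarities
    # Pull each feature toward its identity prototype, push it from the rest.
    return F.cross_entropy(logits, labels)
```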
arXiv Detail & Related papers (2023-10-26T08:12:53Z)
- OVTrack: Open-Vocabulary Multiple Object Tracking [64.73379741435255]
OVTrack is an open-vocabulary tracker capable of tracking arbitrary object classes.
It sets a new state-of-the-art on the large-scale, large-vocabulary TAO benchmark.
arXiv Detail & Related papers (2023-04-17T16:20:05Z)
- APPLeNet: Visual Attention Parameterized Prompt Learning for Few-Shot Remote Sensing Image Generalization using CLIP [12.73827827842155]
We propose a novel image-conditioned prompt learning strategy called the Visual Attention conditioned Prompts Learning Network (APPLeNet).
APPLeNet emphasizes the importance of multi-scale feature learning in RS scene classification and disentangles visual style and content primitives for domain generalization tasks.
Our results consistently outperform those reported in the relevant literature, and the code is available at https://github.com/mainaksingha01/APPLeNet.
arXiv Detail & Related papers (2023-04-12T17:20:37Z)
- HOICLIP: Efficient Knowledge Transfer for HOI Detection with Vision-Language Models [30.279621764192843]
Human-Object Interaction (HOI) detection aims to localize human-object pairs and recognize their interactions.
Contrastive Language-Image Pre-training (CLIP) has shown great potential in providing interaction prior for HOI detectors.
We propose a novel HOI detection framework that efficiently extracts prior knowledge from CLIP and achieves better generalization.
arXiv Detail & Related papers (2023-03-28T07:54:54Z)
- Fine-grained Visual-Text Prompt-Driven Self-Training for Open-Vocabulary Object Detection [87.39089806069707]
We propose a fine-grained Visual-Text Prompt-driven self-training paradigm for Open-Vocabulary Detection (VTP-OVD).
During the adapting stage, we enable the VLM to obtain fine-grained alignment by using learnable text prompts to solve an auxiliary dense pixel-wise prediction task.
Experiments show that our method achieves state-of-the-art performance for open-vocabulary object detection, e.g., 31.5% mAP on unseen classes of COCO.
arXiv Detail & Related papers (2022-11-02T03:38:02Z)
- Explicitly Modeling the Discriminability for Instance-Aware Visual Object Tracking [13.311777431243296]
We propose a novel Instance-Aware Tracker (IAT) to excavate the discriminability of feature representations.
We implement two variants of the proposed IAT, including a video-level one and an object-level one.
Both versions achieve leading results against state-of-the-art methods while running at 30 FPS.
arXiv Detail & Related papers (2021-10-28T11:24:01Z)
- CLIP-Adapter: Better Vision-Language Models with Feature Adapters [79.52844563138493]
We show that there is an alternative path to achieve better vision-language models other than prompt tuning.
In this paper, we propose CLIP-Adapter to conduct fine-tuning with feature adapters on either the visual or the language branch.
Experiments and extensive ablation studies on various visual classification tasks demonstrate the effectiveness of our approach.
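For context, a feature adapter in this spirit is typically a small bottleneck MLP blended residually with the frozen CLIP feature; the sketch below assumes the dimensions and residual ratio:
```python
import torch.nn as nn

class ResidualAdapter(nn.Module):
    """Sketch of a CLIP-Adapter-style module; sizes and alpha are assumptions."""
    def __init__(self, dim=512, reduction=4, alpha=0.2):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(dim, dim // reduction), nn.ReLU(inplace=True),
            nn.Linear(dim // reduction, dim), nn.ReLU(inplace=True),
        )
        self.alpha = alpha  # residual ratio: how much adapted signal to mix in

    def forward(self, x):
        # Blend the adapted feature with the original frozen CLIP feature.
        return self.alpha * self.fc(x) + (1 - self.alpha) * x
```
Only the adapter's few parameters are trained, which is what preserves the frozen backbone's pre-trained knowledge.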
arXiv Detail & Related papers (2021-10-09T11:39:30Z)
- Learning Target Candidate Association to Keep Track of What Not to Track [100.80610986625693]
We propose to keep track of distractor objects in order to continue tracking the target.
To tackle the problem of lacking ground-truth correspondences between distractor objects in visual tracking, we propose a training strategy that combines partial annotations with self-supervision.
Our tracker sets a new state-of-the-art on six benchmarks, achieving an AUC score of 67.2% on LaSOT and a +6.1% absolute gain on the OxUvA long-term dataset.
arXiv Detail & Related papers (2021-03-30T17:58:02Z)
- Unsupervised Deep Representation Learning for Real-Time Tracking [137.69689503237893]
We propose an unsupervised learning method for visual tracking.
Our unsupervised learning is motivated by the observation that a robust tracker should be effective in bidirectional tracking.
We build our framework on a Siamese correlation filter network, and propose a multi-frame validation scheme and a cost-sensitive loss to facilitate unsupervised learning.
arXiv Detail & Related papers (2020-07-22T08:23:12Z)
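A minimal sketch of the bidirectional (forward-backward) consistency idea from the last entry, with a hypothetical track_step callable standing in for the paper's Siamese correlation filter:
```python
import torch

def cycle_loss(track_step, frames, init_box):
    # track_step: hypothetical callable (box, frame_a, frame_b) -> box in frame_b
    # frames: list of frame tensors; init_box: (4,) box in the first frame
    box = init_box
    for a, b in zip(frames[:-1], frames[1:]):        # track forward in time
        box = track_step(box, a, b)
    for a, b in zip(frames[:0:-1], frames[-2::-1]):  # then track backward
        box = track_step(box, a, b)
    # A robust tracker should land back on the initial box; the gap between the
    # round-trip prediction and the start supplies the self-supervised signal.
    return torch.mean((box - init_box) ** 2)
```
The paper's multi-frame validation and cost-sensitive loss refine this basic cycle-consistency signal; they are not reproduced here.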