Explicit Visual Prompts for Visual Object Tracking
- URL: http://arxiv.org/abs/2401.03142v1
- Date: Sat, 6 Jan 2024 07:12:07 GMT
- Title: Explicit Visual Prompts for Visual Object Tracking
- Authors: Liangtao Shi, Bineng Zhong, Qihua Liang, Ning Li, Shengping Zhang,
Xianxian Li
- Abstract summary: EVPTrack is a visual tracking framework that exploits explicit visual prompts between consecutive frames.
We show that our framework can achieve competitive performance at real-time speed by exploiting both explicit visual prompts and multi-scale information.
- Score: 23.561539973210248
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: How to effectively exploit spatio-temporal information is crucial to capture
target appearance changes in visual tracking. However, most deep learning-based
trackers mainly focus on designing a complicated appearance model or template
updating strategy, while lacking the exploitation of context between
consecutive frames and thus entailing the \textit{when-and-how-to-update}
dilemma. To address these issues, we propose a novel explicit visual prompts
framework for visual tracking, dubbed \textbf{EVPTrack}. Specifically, we
utilize spatio-temporal tokens to propagate information between consecutive
frames without focusing on updating templates. As a result, we can not only
alleviate the challenge of \textit{when-to-update}, but also avoid the
hyper-parameters associated with updating strategies. Then, we utilize the
spatio-temporal tokens to generate explicit visual prompts that facilitate
inference in the current frame. The prompts are fed into a transformer encoder
together with the image tokens without additional processing. Consequently, the
efficiency of our model is improved by avoiding \textit{how-to-update}. In
addition, we consider multi-scale information as explicit visual prompts,
providing multiscale template features to enhance the EVPTrack's ability to
handle target scale changes. Extensive experimental results on six benchmarks
(i.e., LaSOT, LaSOT$_{ext}$, GOT-10k, UAV123, TrackingNet, and TNL2K)
validate that our EVPTrack can achieve competitive performance at a real-time
speed by effectively exploiting both spatio-temporal and multi-scale
information. Code and models are available at
https://github.com/GXNU-ZhongLab/EVPTrack.
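To make the prompt-propagation idea concrete, below is a minimal PyTorch-style sketch of how spatio-temporal tokens could be concatenated with the template and search-region image tokens, passed through a transformer encoder, and carried to the next frame. All class, function, and parameter names (PromptPropagationTracker, num_prompt_tokens, etc.) are hypothetical illustrations of the idea described in the abstract, not the official EVPTrack implementation; see the linked repository for the authors' code.
```python
# Hedged sketch of explicit prompt propagation, in the spirit of the abstract above.
import torch
import torch.nn as nn


class PromptPropagationTracker(nn.Module):
    def __init__(self, dim=256, num_prompt_tokens=8, depth=4, num_heads=8):
        super().__init__()
        # Learnable spatio-temporal tokens used in place of a template-update rule.
        self.init_prompts = nn.Parameter(torch.zeros(1, num_prompt_tokens, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, 4)  # toy box head for illustration only

    def forward(self, template_tokens, search_tokens, prev_prompts=None):
        """template_tokens / search_tokens: (B, N, dim) patch embeddings
        (multi-scale template features can simply be appended as extra tokens).
        prev_prompts: spatio-temporal tokens propagated from the previous frame."""
        B = search_tokens.size(0)
        prompts = prev_prompts if prev_prompts is not None else self.init_prompts.expand(B, -1, -1)
        # Prompts are fed into the encoder together with the image tokens,
        # without any additional update module.
        tokens = self.encoder(torch.cat([prompts, template_tokens, search_tokens], dim=1))
        new_prompts = tokens[:, :prompts.size(1)]        # carried to the next frame
        search_out = tokens[:, -search_tokens.size(1):]  # used for prediction
        return self.head(search_out.mean(dim=1)), new_prompts
```
In this sketch, the `new_prompts` returned for frame t are simply passed as `prev_prompts` for frame t+1, so no update interval or confidence threshold needs to be tuned, which is the sense in which the when-to-update dilemma is side-stepped.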
Related papers
- KNN Transformer with Pyramid Prompts for Few-Shot Learning [52.735070934075736]
Few-Shot Learning aims to recognize new classes with limited labeled data.
Recent studies have attempted to address the challenge of rare samples with textual prompts to modulate visual features.
arXiv Detail & Related papers (2024-10-14T07:39:30Z) - Autoregressive Queries for Adaptive Tracking with Spatio-Temporal Transformers [55.46413719810273]
Rich spatio-temporal information is crucial to capture complicated target appearance variations in visual tracking.
Our method improves the tracker's performance on six popular tracking benchmarks.
arXiv Detail & Related papers (2024-03-15T02:39:26Z) - Spatio-temporal Prompting Network for Robust Video Feature Extraction [74.54597668310707]
Frame quality deterioration is one of the main challenges in the field of video understanding.
Recent approaches exploit transformer-based integration modules to obtain spatio-temporal information.
We present a neat and unified framework called Spatio-Temporal Prompting Network (STPN).
It can efficiently extract video features by adjusting the input features in the network backbone.
arXiv Detail & Related papers (2024-02-04T17:52:04Z) - ODTrack: Online Dense Temporal Token Learning for Visual Tracking [22.628561792412686]
ODTrack is a video-level tracking pipeline that densely associates contextual relationships of video frames in an online token propagation manner.
It achieves a new SOTA performance on seven benchmarks, while running at real-time speed.
arXiv Detail & Related papers (2024-01-03T11:44:09Z) - Tracking Objects and Activities with Attention for Temporal Sentence
Grounding [51.416914256782505]
Temporal sentence grounding (TSG) aims to localize the temporal segment which is semantically aligned with a natural language query in an untrimmed video.
We propose a novel Temporal Sentence Tracking Network (TSTNet), which contains (A) a Cross-modal Targets Generator to generate multi-modal targets and search space, and (B) a Temporal Sentence Tracker to track the multi-modal targets' behavior and to predict the query-related segment.
arXiv Detail & Related papers (2023-02-21T16:42:52Z) - ProContEXT: Exploring Progressive Context Transformer for Tracking [20.35886416084831]
Existing Visual Object Tracking (VOT) methods only take the target area in the first frame as a template.
This causes tracking to inevitably fail in fast-changing and crowded scenes, as it cannot account for changes in object appearance between frames.
We revamp the framework with the Progressive Context Encoding Transformer Tracker (ProContEXT), which coherently exploits spatial and temporal contexts to predict object motion trajectories.
arXiv Detail & Related papers (2022-10-27T14:47:19Z) - Patch-level Representation Learning for Self-supervised Vision
Transformers [68.8862419248863]
Vision Transformers (ViTs) have gained much attention recently as a better architectural choice, often outperforming convolutional networks for various visual tasks.
Inspired by this, we design a simple yet effective visual pretext task, coined SelfPatch, for learning better patch-level representations.
We demonstrate that SelfPatch can significantly improve the performance of existing SSL methods for various visual tasks.
arXiv Detail & Related papers (2022-06-16T08:01:19Z) - Context-aware Visual Tracking with Joint Meta-updating [11.226947525556813]
We propose a context-aware tracking model to optimize the tracker over the representation space, which jointly meta-updates both branches by exploiting information along the whole sequence.
The proposed tracking method achieves an EAO score of 0.514 on VOT2018 at a speed of 40 FPS, demonstrating its capability of improving the accuracy and robustness of the underlying tracker with little speed drop.
arXiv Detail & Related papers (2022-04-04T14:16:00Z) - STMTrack: Template-free Visual Tracking with Space-time Memory Networks [42.06375415765325]
Existing trackers with template updating mechanisms rely on time-consuming numerical optimization and complex hand-designed strategies to achieve competitive performance.
We propose a novel tracking framework built on top of a space-time memory network that is competent to make full use of historical information related to the target.
Specifically, a novel memory mechanism is introduced, which stores the historical information of the target to guide the tracker to focus on the most informative regions in the current frame (a minimal memory-read sketch is given after this list).
arXiv Detail & Related papers (2021-04-01T08:10:56Z) - Learning Spatio-Appearance Memory Network for High-Performance Visual
Tracking [79.80401607146987]
Existing object tracking methods usually learn a bounding-box based template to match visual targets across frames, which cannot accurately learn a pixel-wise representation.
This paper presents a novel segmentation-based tracking architecture, which is equipped with a spatio-appearance memory network to learn accurate spatio-temporal correspondence.
arXiv Detail & Related papers (2020-09-21T08:12:02Z)
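As referenced in the STMTrack entry above, a space-time memory read can be sketched in a few lines: features from past frames are stored as key/value memory and the current frame attends to them. This is a hedged illustration only; `memory_read`, the tensor shapes, and the toy loop are hypothetical and are not taken from any of the listed papers' code.
```python
import torch


def memory_read(query_feat, memory_keys, memory_values):
    """query_feat: (B, Nq, C) current-frame features.
    memory_keys / memory_values: (B, Nm, C) features stored from past frames."""
    scale = query_feat.size(-1) ** 0.5
    attn = torch.softmax(query_feat @ memory_keys.transpose(1, 2) / scale, dim=-1)
    # Each current-frame location aggregates historical target information,
    # which is what lets the tracker focus on the most informative regions.
    return attn @ memory_values


# Toy usage: grow the memory frame by frame and fuse the read into current features.
B, C = 1, 256
mem_k, mem_v = [], []
for t in range(3):
    frame_feat = torch.randn(B, 64, C)  # stand-in for backbone features of frame t
    if mem_k:
        frame_feat = frame_feat + memory_read(
            frame_feat, torch.cat(mem_k, dim=1), torch.cat(mem_v, dim=1)
        )
    mem_k.append(frame_feat)
    mem_v.append(frame_feat)
```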