Unifying Visual and Vision-Language Tracking via Contrastive Learning
- URL: http://arxiv.org/abs/2401.11228v1
- Date: Sat, 20 Jan 2024 13:20:54 GMT
- Title: Unifying Visual and Vision-Language Tracking via Contrastive Learning
- Authors: Yinchao Ma, Yuyang Tang, Wenfei Yang, Tianzhu Zhang, Jinpeng Zhang,
Mengxue Kang
- Abstract summary: Single object tracking aims to locate the target object in a video sequence according to different modal references.
Due to the gap between different modalities, most existing trackers are designed for single or partial of these reference settings.
We present a unified tracker called UVLTrack, which can simultaneously handle all three reference settings.
- Score: 34.49865598433915
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Single object tracking aims to locate the target object in a video sequence
according to the state specified by different modal references, including the
initial bounding box (BBOX), natural language (NL), or both (NL+BBOX). Due to
the gap between different modalities, most existing trackers are designed for
single or partial of these reference settings and overspecialize on the
specific modality. Differently, we present a unified tracker called UVLTrack,
which can simultaneously handle all three reference settings (BBOX, NL,
NL+BBOX) with the same parameters. The proposed UVLTrack enjoys several merits.
First, we design a modality-unified feature extractor for joint visual and
language feature learning and propose a multi-modal contrastive loss to align
the visual and language features into a unified semantic space. Second, a
modality-adaptive box head is proposed, which makes full use of the target
reference to mine ever-changing scenario features dynamically from video
contexts and distinguish the target in a contrastive way, enabling robust
performance in different reference settings. Extensive experimental results
demonstrate that UVLTrack achieves promising performance on seven visual
tracking datasets, three vision-language tracking datasets, and three visual
grounding datasets. Codes and models will be open-sourced at
https://github.com/OpenSpaceAI/UVLTrack.
Related papers
- Teaching VLMs to Localize Specific Objects from In-context Examples [56.797110842152]
Vision-Language Models (VLMs) have shown remarkable capabilities across diverse visual tasks.
Current VLMs lack a fundamental cognitive ability: learning to localize objects in a scene by taking into account the context.
This work is the first to explore and benchmark personalized few-shot localization for VLMs.
arXiv Detail & Related papers (2024-11-20T13:34:22Z) - ChatTracker: Enhancing Visual Tracking Performance via Chatting with Multimodal Large Language Model [29.702895846058265]
Vision-Language(VL) trackers have proposed to utilize additional natural language descriptions to enhance versatility in various applications.
VL trackers are still inferior to State-of-The-Art (SoTA) visual trackers in terms of tracking performance.
We propose ChatTracker to leverage the wealth of world knowledge in the Multimodal Large Language Model (MLLM) to generate high-quality language descriptions.
arXiv Detail & Related papers (2024-11-04T02:43:55Z) - VOVTrack: Exploring the Potentiality in Videos for Open-Vocabulary Object Tracking [61.56592503861093]
This issue amalgamates the complexities of open-vocabulary object detection (OVD) and multi-object tracking (MOT)
Existing approaches to OVMOT often merge OVD and MOT methodologies as separate modules, predominantly focusing on the problem through an image-centric lens.
We propose VOVTrack, a novel method that integrates object states relevant to MOT and video-centric training to address this challenge from a video object tracking standpoint.
arXiv Detail & Related papers (2024-10-11T05:01:49Z) - Multi-Granularity Language-Guided Multi-Object Tracking [95.91263758294154]
We propose a new multi-object tracking framework, named LG-MOT, that explicitly leverages language information at different levels of granularity.
At inference, our LG-MOT uses the standard visual features without relying on annotated language descriptions.
Our LG-MOT achieves an absolute gain of 2.2% in terms of target object association (IDF1 score) compared to the baseline using only visual features.
arXiv Detail & Related papers (2024-06-07T11:18:40Z) - Beyond Visual Cues: Synchronously Exploring Target-Centric Semantics for
Vision-Language Tracking [3.416427651955299]
Single object tracking aims to locate one specific target in video sequences, given its initial state. Vision-Language (VL) tracking has emerged as a promising approach.
We present a novel tracker that progressively explores target-centric semantics for VL tracking.
arXiv Detail & Related papers (2023-11-28T02:28:12Z) - Towards Unified Token Learning for Vision-Language Tracking [65.96561538356315]
We present a vision-language (VL) tracking pipeline, termed textbfMMTrack, which casts VL tracking as a token generation task.
Our proposed framework serializes language description and bounding box into a sequence of discrete tokens.
In this new design paradigm, all token queries are required to perceive the desired target and directly predict spatial coordinates of the target.
arXiv Detail & Related papers (2023-08-27T13:17:34Z) - Divert More Attention to Vision-Language Object Tracking [87.31882921111048]
We argue that the lack of large-scale vision-language annotated videos and ineffective vision-language interaction learning motivate us to design more effective vision-language representation for tracking.
Particularly, in this paper, we first propose a general attribute annotation strategy to decorate videos in six popular tracking benchmarks, which contributes a large-scale vision-language tracking database with more than 23,000 videos.
We then introduce a novel framework to improve tracking by learning a unified-adaptive VL representation, where the cores are the proposed asymmetric architecture search and modality mixer (ModaMixer)
arXiv Detail & Related papers (2023-07-19T15:22:06Z) - Joint Visual Grounding and Tracking with Natural Language Specification [6.695284124073918]
Tracking by natural language specification aims to locate the referred target in a sequence based on the natural language description.
We propose a joint visual grounding and tracking framework, which reformulates grounding and tracking as a unified task.
Our method performs favorably against state-of-the-art algorithms for both tracking and grounding.
arXiv Detail & Related papers (2023-03-21T17:09:03Z) - Visual Tracking by TridentAlign and Context Embedding [71.60159881028432]
We propose novel TridentAlign and context embedding modules for Siamese network-based visual tracking methods.
The performance of the proposed tracker is comparable to that of state-of-the-art trackers, while the proposed tracker runs at real-time speed.
arXiv Detail & Related papers (2020-07-14T08:00:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.