Unifying Visual and Vision-Language Tracking via Contrastive Learning
- URL: http://arxiv.org/abs/2401.11228v1
- Date: Sat, 20 Jan 2024 13:20:54 GMT
- Title: Unifying Visual and Vision-Language Tracking via Contrastive Learning
- Authors: Yinchao Ma, Yuyang Tang, Wenfei Yang, Tianzhu Zhang, Jinpeng Zhang,
Mengxue Kang
- Abstract summary: Single object tracking aims to locate the target object in a video sequence according to different modal references.
Due to the gap between different modalities, most existing trackers are designed for only one or a subset of these reference settings.
We present a unified tracker called UVLTrack, which can simultaneously handle all three reference settings.
- Score: 34.49865598433915
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Single object tracking aims to locate the target object in a video sequence
according to the state specified by different modal references, including the
initial bounding box (BBOX), natural language (NL), or both (NL+BBOX). Due to
the gap between different modalities, most existing trackers are designed for
only one or a subset of these reference settings and overspecialize in the
corresponding modality. In contrast, we present a unified tracker called UVLTrack,
which can simultaneously handle all three reference settings (BBOX, NL,
NL+BBOX) with the same parameters. The proposed UVLTrack enjoys several merits.
First, we design a modality-unified feature extractor for joint visual and
language feature learning and propose a multi-modal contrastive loss to align
the visual and language features into a unified semantic space. Second, a
modality-adaptive box head is proposed, which makes full use of the target
reference to mine ever-changing scenario features dynamically from video
contexts and distinguish the target in a contrastive way, enabling robust
performance in different reference settings. Extensive experimental results
demonstrate that UVLTrack achieves promising performance on seven visual
tracking datasets, three vision-language tracking datasets, and three visual
grounding datasets. Codes and models will be open-sourced at
https://github.com/OpenSpaceAI/UVLTrack.
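The abstract describes a multi-modal contrastive loss that aligns visual and language features in a unified semantic space. Below is a minimal sketch of what such an objective could look like, assuming a symmetric InfoNCE (CLIP-style) formulation over pooled per-sample features; the function name, feature shapes, and temperature are illustrative assumptions, not UVLTrack's released code.

```python
# Minimal sketch of a CLIP-style multi-modal contrastive loss that pulls
# visual and language features of the same target into a shared semantic
# space. Names, shapes, and the temperature are illustrative assumptions.
import torch
import torch.nn.functional as F


def multimodal_contrastive_loss(visual_feat, language_feat, temperature=0.07):
    """visual_feat, language_feat: (B, D) pooled features from a joint extractor."""
    v = F.normalize(visual_feat, dim=-1)
    l = F.normalize(language_feat, dim=-1)
    logits = v @ l.t() / temperature                 # (B, B) pairwise similarities
    targets = torch.arange(v.size(0), device=v.device)
    # Symmetric InfoNCE: matched vision-language pairs are positives,
    # all other pairs in the batch serve as negatives.
    loss_v2l = F.cross_entropy(logits, targets)
    loss_l2v = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_v2l + loss_l2v)


# Toy usage with random pooled features.
loss = multimodal_contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
```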
Related papers
- ChatTracker: Enhancing Visual Tracking Performance via Chatting with Multimodal Large Language Model [29.702895846058265]
Vision-Language (VL) trackers have been proposed to utilize additional natural language descriptions to enhance versatility in various applications.
However, VL trackers are still inferior to state-of-the-art (SoTA) visual trackers in terms of tracking performance.
We propose ChatTracker to leverage the wealth of world knowledge in the Multimodal Large Language Model (MLLM) to generate high-quality language descriptions.
arXiv Detail & Related papers (2024-11-04T02:43:55Z)
- VOVTrack: Exploring the Potentiality in Videos for Open-Vocabulary Object Tracking [61.56592503861093]
Open-vocabulary multi-object tracking (OVMOT) amalgamates the complexities of open-vocabulary object detection (OVD) and multi-object tracking (MOT).
Existing approaches to OVMOT often merge OVD and MOT methodologies as separate modules, predominantly focusing on the problem through an image-centric lens.
We propose VOVTrack, a novel method that integrates object states relevant to MOT and video-centric training to address this challenge from a video object tracking standpoint.
arXiv Detail & Related papers (2024-10-11T05:01:49Z)
- Open3DTrack: Towards Open-Vocabulary 3D Multi-Object Tracking [73.05477052645885]
We introduce open-vocabulary 3D tracking, which extends the scope of 3D tracking to include objects beyond predefined categories.
We propose a novel approach that integrates open-vocabulary capabilities into a 3D tracking framework, allowing for generalization to unseen object classes.
arXiv Detail & Related papers (2024-10-02T15:48:42Z)
- Multi-Granularity Language-Guided Multi-Object Tracking [95.91263758294154]
We propose a new multi-object tracking framework, named LG-MOT, that explicitly leverages language information at different levels of granularity.
At inference, our LG-MOT uses the standard visual features without relying on annotated language descriptions.
Our LG-MOT achieves an absolute gain of 2.2% in terms of target object association (IDF1 score) compared to the baseline using only visual features.
arXiv Detail & Related papers (2024-06-07T11:18:40Z)
- DTLLM-VLT: Diverse Text Generation for Visual Language Tracking Based on LLM [23.551036494221222]
Visual Language Tracking (VLT) enhances single object tracking (SOT) by integrating natural language descriptions with a video for the precise tracking of a specified object.
Most VLT benchmarks are annotated in a single granularity and lack a coherent semantic framework to provide scientific guidance.
We introduce DTLLM-VLT, which automatically generates extensive and multi-granularity text to enhance environmental diversity.
arXiv Detail & Related papers (2024-05-20T16:01:01Z)
- Beyond Visual Cues: Synchronously Exploring Target-Centric Semantics for Vision-Language Tracking [3.416427651955299]
Single object tracking aims to locate one specific target in video sequences, given its initial state. Vision-Language (VL) tracking has emerged as a promising approach.
We present a novel tracker that progressively explores target-centric semantics for VL tracking.
arXiv Detail & Related papers (2023-11-28T02:28:12Z)
- Towards Unified Token Learning for Vision-Language Tracking [65.96561538356315]
We present a vision-language (VL) tracking pipeline, termed MMTrack, which casts VL tracking as a token generation task.
Our proposed framework serializes language description and bounding box into a sequence of discrete tokens.
In this new design paradigm, all token queries are required to perceive the desired target and directly predict its spatial coordinates; a toy coordinate-tokenization sketch is given after this list.
arXiv Detail & Related papers (2023-08-27T13:17:34Z)
- Divert More Attention to Vision-Language Object Tracking [87.31882921111048]
We argue that the lack of large-scale vision-language annotated videos and ineffective vision-language interaction learning motivate the design of a more effective vision-language representation for tracking.
Particularly, in this paper, we first propose a general attribute annotation strategy to decorate videos in six popular tracking benchmarks, which contributes a large-scale vision-language tracking database with more than 23,000 videos.
We then introduce a novel framework to improve tracking by learning a unified-adaptive VL representation, where the core components are the proposed asymmetric architecture search and modality mixer (ModaMixer).
arXiv Detail & Related papers (2023-07-19T15:22:06Z)
- Joint Visual Grounding and Tracking with Natural Language Specification [6.695284124073918]
Tracking by natural language specification aims to locate the referred target in a sequence based on the natural language description.
We propose a joint visual grounding and tracking framework, which reformulates grounding and tracking as a unified task.
Our method performs favorably against state-of-the-art algorithms for both tracking and grounding.
arXiv Detail & Related papers (2023-03-21T17:09:03Z)
- Visual Tracking by TridentAlign and Context Embedding [71.60159881028432]
We propose novel TridentAlign and context embedding modules for Siamese network-based visual tracking methods.
The performance of the proposed tracker is comparable to that of state-of-the-art trackers, while the proposed tracker runs at real-time speed.
arXiv Detail & Related papers (2020-07-14T08:00:26Z)
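One of the related papers above (Towards Unified Token Learning for Vision-Language Tracking) casts VL tracking as token generation by serializing the language description and bounding box into discrete tokens. The sketch below illustrates only the coordinate-quantization step of that general idea; the bin count and helper names are assumptions made for illustration, not the paper's actual implementation.

```python
# Minimal sketch of bounding-box tokenization for sequence-style tracking:
# normalized coordinates are quantized into discrete bins so a box can be
# predicted as four tokens. Bin count and function names are illustrative.
import torch


def box_to_tokens(box_xyxy, num_bins=1000):
    """Quantize normalized (x1, y1, x2, y2) coordinates into integer tokens."""
    box = torch.as_tensor(box_xyxy, dtype=torch.float32).clamp(0.0, 1.0)
    return (box * (num_bins - 1)).round().long()     # 4 tokens in [0, num_bins)


def tokens_to_box(tokens, num_bins=1000):
    """Invert the quantization back to normalized coordinates."""
    return tokens.float() / (num_bins - 1)


# Example: a target box is serialized into tokens (which a token-generation
# tracker would predict autoregressively) and then decoded back.
tokens = box_to_tokens([0.12, 0.30, 0.58, 0.91])
print(tokens, tokens_to_box(tokens))
```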