Generalizing Multiple Object Tracking to Unseen Domains by Introducing
Natural Language Representation
- URL: http://arxiv.org/abs/2212.01568v1
- Date: Sat, 3 Dec 2022 07:57:31 GMT
- Title: Generalizing Multiple Object Tracking to Unseen Domains by Introducing
Natural Language Representation
- Authors: En Yu, Songtao Liu, Zhuoling Li, Jinrong Yang, Zeming Li, Shoudong
Han, Wenbing Tao
- Abstract summary: We propose to introduce natural language representation into visual MOT models for boosting the domain generalization ability.
Because labeling every tracking target with a textual description is infeasible, we design two modules, namely visual context prompting (VCP) and visual-language mixing (VLM).
VLM joins the information in the generated visual prompts with the textual prompts from a pre-defined Trackbook to obtain instance-level pseudo textual descriptions.
Through training models on MOT17 and validating them on MOT20, we observe that the pseudo textual descriptions generated by our proposed modules improve the generalization performance of query-based trackers by large margins.
- Score: 33.03600813115465
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Although existing multi-object tracking (MOT) algorithms have obtained
competitive performance on various benchmarks, almost all of them train and
validate models on the same domain. The domain generalization problem of MOT has
hardly been studied. To bridge this gap, we first observe that the
high-level information contained in natural language is domain invariant to
different tracking domains. Based on this observation, we propose to introduce
natural language representation into visual MOT models for boosting the domain
generalization ability. However, it is infeasible to label every tracking
target with a textual description. To tackle this problem, we design two
modules, namely visual context prompting (VCP) and visual-language mixing
(VLM). Specifically, VCP generates visual prompts based on the input frames.
VLM joins the information in the generated visual prompts with the textual
prompts from a pre-defined Trackbook to obtain instance-level pseudo textual
descriptions, which are domain invariant across tracking scenes. Through
training models on MOT17 and validating them on MOT20, we observe that the
pseudo textual descriptions generated by our proposed modules improve the
generalization performance of query-based trackers by large margins.
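The abstract does not spell out how VLM mixes the two kinds of prompts. As a rough illustration only, the sketch below assumes a cross-attention-style mixing in which each visual prompt attends over a small set of Trackbook text embeddings and the attention-weighted sum serves as the instance-level pseudo textual description. The function names, the toy vectors, and the attention formulation are all hypothetical and are not taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mix_visual_and_text(visual_prompts, trackbook_embeddings):
    """Hypothetical mixing step (an assumption, not the paper's exact VLM).

    Each visual prompt attends over the Trackbook text embeddings via scaled
    dot-product attention; the weighted sum of text embeddings is returned
    as that instance's pseudo textual description.
    """
    dim = len(trackbook_embeddings[0])
    scale = math.sqrt(dim)
    pseudo_text = []
    for v in visual_prompts:
        # Similarity between this visual prompt and every text embedding.
        scores = [
            sum(a * b for a, b in zip(v, t)) / scale
            for t in trackbook_embeddings
        ]
        weights = softmax(scores)
        # Weighted sum of text embeddings, one value per dimension.
        mixed = [
            sum(w * t[k] for w, t in zip(weights, trackbook_embeddings))
            for k in range(dim)
        ]
        pseudo_text.append(mixed)
    return pseudo_text


# Toy data: two Trackbook entries and two per-instance visual prompts.
trackbook = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
visual_prompts = [[4.0, 0.0, 0.0], [0.0, 4.0, 0.0]]
pseudo = mix_visual_and_text(visual_prompts, trackbook)
```

In this toy setup each visual prompt pulls its pseudo description toward the Trackbook entry it most resembles. Because the output is built only from text embeddings, it stays in the textual space, which is the property the abstract argues is domain invariant.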
Related papers
- ChatTracker: Enhancing Visual Tracking Performance via Chatting with Multimodal Large Language Model [29.702895846058265]
Vision-Language (VL) trackers have been proposed that utilize additional natural language descriptions to enhance versatility in various applications.
VL trackers are still inferior to State-of-The-Art (SoTA) visual trackers in terms of tracking performance.
We propose ChatTracker to leverage the wealth of world knowledge in the Multimodal Large Language Model (MLLM) to generate high-quality language descriptions.
arXiv Detail & Related papers (2024-11-04T02:43:55Z)
- IP-MOT: Instance Prompt Learning for Cross-Domain Multi-Object Tracking [13.977088329815933]
Multi-Object Tracking (MOT) aims to associate multiple objects across video frames.
Most existing approaches train and track within a single domain, resulting in a lack of cross-domain generalizability.
We develop IP-MOT, an end-to-end transformer model for MOT that operates without concrete textual descriptions.
arXiv Detail & Related papers (2024-10-30T14:24:56Z)
- Multi-Granularity Language-Guided Multi-Object Tracking [95.91263758294154]
We propose a new multi-object tracking framework, named LG-MOT, that explicitly leverages language information at different levels of granularity.
At inference, our LG-MOT uses the standard visual features without relying on annotated language descriptions.
Our LG-MOT achieves an absolute gain of 2.2% in terms of target object association (IDF1 score) compared to the baseline using only visual features.
arXiv Detail & Related papers (2024-06-07T11:18:40Z)
- WIDIn: Wording Image for Domain-Invariant Representation in Single-Source Domain Generalization [63.98650220772378]
We present WIDIn, Wording Images for Domain-Invariant representation, to disentangle discriminative visual representation.
We first estimate the language embedding with fine-grained alignment, which can be used to adaptively identify and then remove the domain-specific counterpart.
We show that WIDIn can be applied to both pretrained vision-language models like CLIP, and separately trained uni-modal models like MoCo and BERT.
arXiv Detail & Related papers (2024-05-28T17:46:27Z)
- Domain-Controlled Prompt Learning [49.45309818782329]
Existing prompt learning methods often lack domain-awareness or domain-transfer mechanisms.
We propose Domain-Controlled Prompt Learning for specific domains.
Our method achieves state-of-the-art performance in specific domain image recognition datasets.
arXiv Detail & Related papers (2023-09-30T02:59:49Z)
- Joint Visual Grounding and Tracking with Natural Language Specification [6.695284124073918]
Tracking by natural language specification aims to locate the referred target in a sequence based on the natural language description.
We propose a joint visual grounding and tracking framework, which reformulates grounding and tracking as a unified task.
Our method performs favorably against state-of-the-art algorithms for both tracking and grounding.
arXiv Detail & Related papers (2023-03-21T17:09:03Z)
- Position-Aware Contrastive Alignment for Referring Image Segmentation [65.16214741785633]
We present a position-aware contrastive alignment network (PCAN) to enhance the alignment of multi-modal features.
Our PCAN consists of two modules: 1) Position Aware Module (PAM), which provides position information of all objects related to natural language descriptions, and 2) Contrastive Language Understanding Module (CLUM), which enhances multi-modal alignment.
arXiv Detail & Related papers (2022-12-27T09:13:19Z)
- End-to-end Tracking with a Multi-query Transformer [96.13468602635082]
Multiple-object tracking (MOT) is a challenging task that requires simultaneous reasoning about location, appearance, and identity of the objects in the scene over time.
Our aim in this paper is to move beyond tracking-by-detection approaches, to class-agnostic tracking that performs well also for unknown object classes.
arXiv Detail & Related papers (2022-10-26T10:19:37Z)
- Multi-Object Tracking and Segmentation via Neural Message Passing [0.0]
Graphs offer a natural way to formulate Multiple Object Tracking (MOT) and Multiple Object Tracking and Segmentation (MOTS).
We exploit the classical network flow formulation of MOT to define a fully differentiable framework based on Message Passing Networks (MPNs).
We achieve state-of-the-art results for both tracking and segmentation in several publicly available datasets.
arXiv Detail & Related papers (2022-07-15T13:03:47Z)
- Multi-modal Transformers Excel at Class-agnostic Object Detection [105.10403103027306]
We argue that existing methods lack a top-down supervision signal governed by human-understandable semantics.
We develop an efficient and flexible MViT architecture using multi-scale feature processing and deformable self-attention.
We show the significance of MViT proposals in a diverse range of applications.
arXiv Detail & Related papers (2021-11-22T18:59:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.