Generalizing Multiple Object Tracking to Unseen Domains by Introducing
Natural Language Representation
- URL: http://arxiv.org/abs/2212.01568v1
- Date: Sat, 3 Dec 2022 07:57:31 GMT
- Title: Generalizing Multiple Object Tracking to Unseen Domains by Introducing
Natural Language Representation
- Authors: En Yu, Songtao Liu, Zhuoling Li, Jinrong Yang, Zeming Li, Shoudong
Han, Wenbing Tao
- Abstract summary: We propose to introduce natural language representation into visual MOT models for boosting the domain generalization ability.
To tackle this problem, we design two modules, namely visual context prompting (VCP) and visual-language mixing (VLM).
VLM joins the information in the generated visual prompts and the textual prompts from a pre-defined Trackbook to obtain instance-level pseudo textual descriptions.
Through training models on MOT17 and validating them on MOT20, we observe that the pseudo textual descriptions generated by our proposed modules improve the generalization performance of query-based trackers by large margins.
- Score: 33.03600813115465
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Although existing multi-object tracking (MOT) algorithms have obtained
competitive performance on various benchmarks, almost all of them train and
validate models on the same domain. The domain generalization problem of MOT is
hardly studied. To bridge this gap, we first make the observation that the
high-level information contained in natural language is domain invariant to
different tracking domains. Based on this observation, we propose to introduce
natural language representation into visual MOT models for boosting the domain
generalization ability. However, it is infeasible to label every tracking
target with a textual description. To tackle this problem, we design two
modules, namely visual context prompting (VCP) and visual-language mixing
(VLM). Specifically, VCP generates visual prompts based on the input frames.
VLM joins the information in the generated visual prompts and the textual
prompts from a pre-defined Trackbook to obtain instance-level pseudo textual
descriptions, which are domain invariant to different tracking scenes. Through
training models on MOT17 and validating them on MOT20, we observe that the
pseudo textual descriptions generated by our proposed modules improve the
generalization performance of query-based trackers by large margins.
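The two-module pipeline described above can be sketched as follows. This is a minimal, illustrative rendering only, not the authors' implementation: the function names, feature shapes, the mixing rule, and the `trackbook` contents are all assumptions made for the sketch.

```python
# Illustrative sketch of the abstract's two modules (all details assumed):
# - VCP: derive a per-instance "visual prompt" from an instance feature.
# - VLM: mix that visual prompt with a category-level textual prompt drawn
#   from a pre-defined Trackbook, yielding an instance-level pseudo textual
#   description intended to be domain invariant.
from typing import List


def visual_context_prompt(instance_feature: List[float]) -> List[float]:
    """VCP (sketch): L2-normalize an instance's visual feature into a prompt."""
    norm = sum(x * x for x in instance_feature) ** 0.5 or 1.0
    return [x / norm for x in instance_feature]


def visual_language_mixing(visual_prompt: List[float],
                           textual_prompt: List[float],
                           alpha: float = 0.5) -> List[float]:
    """VLM (sketch): blend visual and textual prompts elementwise into a
    pseudo textual description (the blend weight alpha is an assumption)."""
    return [alpha * v + (1 - alpha) * t
            for v, t in zip(visual_prompt, textual_prompt)]


# Trackbook (assumed): fixed textual prompt embeddings per category.
trackbook = {"pedestrian": [1.0, 0.0, 0.0, 0.0]}

feat = [3.0, 4.0, 0.0, 0.0]                   # toy instance feature
vp = visual_context_prompt(feat)              # unit-norm visual prompt
desc = visual_language_mixing(vp, trackbook["pedestrian"])
print([round(x, 2) for x in desc])            # -> [0.8, 0.4, 0.0, 0.0]
```

In the paper the resulting pseudo descriptions supervise or condition a query-based tracker; here the output is just a blended vector to show the data flow from frames to Trackbook-grounded descriptions.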
Related papers
- Multi-Granularity Language-Guided Multi-Object Tracking [95.91263758294154]
We propose a new multi-object tracking framework, named LG-MOT, that explicitly leverages language information at different levels of granularity.
At inference, our LG-MOT uses the standard visual features without relying on annotated language descriptions.
Our LG-MOT achieves an absolute gain of 2.2% in terms of target object association (IDF1 score) compared to the baseline using only visual features.
arXiv Detail & Related papers (2024-06-07T11:18:40Z) - Context-Aware Integration of Language and Visual References for Natural Language Tracking [27.3884348078998]
Tracking by natural language specification (TNL) aims to consistently localize a target in a video sequence given a linguistic description in the initial frame.
We propose a joint multi-modal tracking framework with 1) a prompt module to leverage the complement between temporal visual templates and language expressions, enabling precise and context-aware appearance and linguistic cues.
This design ensures temporal consistency by leveraging historical visual information, and provides an integrated solution, generating predictions in a single step.
arXiv Detail & Related papers (2024-03-29T04:58:33Z) - Domain-Controlled Prompt Learning [49.45309818782329]
Existing prompt learning methods often lack domain-awareness or domain-transfer mechanisms.
We propose Domain-Controlled Prompt Learning for specific domains.
Our method achieves state-of-the-art performance in specific domain image recognition datasets.
arXiv Detail & Related papers (2023-09-30T02:59:49Z) - Joint Visual Grounding and Tracking with Natural Language Specification [6.695284124073918]
Tracking by natural language specification aims to locate the referred target in a sequence based on the natural language description.
We propose a joint visual grounding and tracking framework, which reformulates grounding and tracking as a unified task.
Our method performs favorably against state-of-the-art algorithms for both tracking and grounding.
arXiv Detail & Related papers (2023-03-21T17:09:03Z) - Position-Aware Contrastive Alignment for Referring Image Segmentation [65.16214741785633]
We present a position-aware contrastive alignment network (PCAN) to enhance the alignment of multi-modal features.
Our PCAN consists of two modules: 1) Position Aware Module (PAM), which provides position information of all objects related to natural language descriptions, and 2) Contrastive Language Understanding Module (CLUM), which enhances multi-modal alignment.
arXiv Detail & Related papers (2022-12-27T09:13:19Z) - Learning Domain Invariant Prompt for Vision-Language Models [31.581652862478965]
We propose a novel prompt learning paradigm, called MetaPrompt, that directly generates domain-invariant prompts which can be generalized to unseen domains.
Our method consistently and significantly outperforms existing methods.
arXiv Detail & Related papers (2022-12-08T11:23:24Z) - End-to-end Tracking with a Multi-query Transformer [96.13468602635082]
Multiple-object tracking (MOT) is a challenging task that requires simultaneous reasoning about location, appearance, and identity of the objects in the scene over time.
Our aim in this paper is to move beyond tracking-by-detection approaches, to class-agnostic tracking that performs well also for unknown object classes.
arXiv Detail & Related papers (2022-10-26T10:19:37Z) - Vision Transformers: From Semantic Segmentation to Dense Prediction [144.38869017091199]
Vision transformers (ViTs) in image classification have shifted the methodologies for visual representation learning.
In this work, we explore the global context learning potentials of ViTs for dense visual prediction.
Our motivation is that through learning global context at full receptive field layer by layer, ViTs may capture stronger long-range dependency information.
arXiv Detail & Related papers (2022-07-19T15:49:35Z) - Multi-Object Tracking and Segmentation via Neural Message Passing [0.0]
Graphs offer a natural way to formulate Multiple Object Tracking (MOT) and Multiple Object Tracking and Segmentation (MOTS).
We exploit the classical network flow formulation of MOT to define a fully differentiable framework based on Message Passing Networks (MPNs).
We achieve state-of-the-art results for both tracking and segmentation in several publicly available datasets.
arXiv Detail & Related papers (2022-07-15T13:03:47Z) - Multi-modal Transformers Excel at Class-agnostic Object Detection [105.10403103027306]
We argue that existing methods lack a top-down supervision signal governed by human-understandable semantics.
We develop an efficient and flexible MViT architecture using multi-scale feature processing and deformable self-attention.
We show the significance of MViT proposals in a diverse range of applications.
arXiv Detail & Related papers (2021-11-22T18:59:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.