Generalizing Multiple Object Tracking to Unseen Domains by Introducing
Natural Language Representation
- URL: http://arxiv.org/abs/2212.01568v1
- Date: Sat, 3 Dec 2022 07:57:31 GMT
- Title: Generalizing Multiple Object Tracking to Unseen Domains by Introducing
Natural Language Representation
- Authors: En Yu, Songtao Liu, Zhuoling Li, Jinrong Yang, Zeming Li, Shoudong
Han, Wenbing Tao
- Abstract summary: We propose to introduce natural language representation into visual MOT models for boosting the domain generalization ability.
To tackle this problem, we design two modules, namely visual context prompting (VCP) and visual-language mixing (VLM).
VLM joins the information in the generated visual prompts and the textual prompts from a pre-defined Trackbook to obtain instance-level pseudo textual descriptions.
Through training models on MOT17 and validating them on MOT20, we observe that the pseudo textual descriptions generated by our proposed modules improve the generalization performance of query-based trackers by large margins.
- Score: 33.03600813115465
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Although existing multi-object tracking (MOT) algorithms have obtained
competitive performance on various benchmarks, almost all of them train and
validate models on the same domain. The domain generalization problem of MOT is
hardly studied. To bridge this gap, we first make the observation that the
high-level information contained in natural language is domain invariant to
different tracking domains. Based on this observation, we propose to introduce
natural language representation into visual MOT models for boosting the domain
generalization ability. However, it is infeasible to label every tracking
target with a textual description. To tackle this problem, we design two
modules, namely visual context prompting (VCP) and visual-language mixing
(VLM). Specifically, VCP generates visual prompts based on the input frames.
VLM joins the information in the generated visual prompts and the textual
prompts from a pre-defined Trackbook to obtain instance-level pseudo textual
descriptions, which are domain invariant to different tracking scenes. Through
training models on MOT17 and validating them on MOT20, we observe that the
pseudo textual descriptions generated by our proposed modules improve the
generalization performance of query-based trackers by large margins.
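The two-module pipeline described above can be sketched as follows. This is a minimal, illustrative rendering only, not the authors' implementation: the function names, feature shapes, the mixing rule, and the `trackbook` contents are all assumptions made for the sketch.

```python
# Illustrative sketch of the abstract's two modules (all details assumed):
# - VCP: derive a per-instance "visual prompt" from an instance feature.
# - VLM: mix that visual prompt with a category-level textual prompt drawn
#   from a pre-defined Trackbook, yielding an instance-level pseudo textual
#   description intended to be domain invariant.
from typing import List


def visual_context_prompt(instance_feature: List[float]) -> List[float]:
    """VCP (sketch): L2-normalize an instance's visual feature into a prompt."""
    norm = sum(x * x for x in instance_feature) ** 0.5 or 1.0
    return [x / norm for x in instance_feature]


def visual_language_mixing(visual_prompt: List[float],
                           textual_prompt: List[float],
                           alpha: float = 0.5) -> List[float]:
    """VLM (sketch): blend visual and textual prompts elementwise into a
    pseudo textual description (the blend weight alpha is an assumption)."""
    return [alpha * v + (1 - alpha) * t
            for v, t in zip(visual_prompt, textual_prompt)]


# Trackbook (assumed): fixed textual prompt embeddings per category.
trackbook = {"pedestrian": [1.0, 0.0, 0.0, 0.0]}

feat = [3.0, 4.0, 0.0, 0.0]                   # toy instance feature
vp = visual_context_prompt(feat)              # unit-norm visual prompt
desc = visual_language_mixing(vp, trackbook["pedestrian"])
print([round(x, 2) for x in desc])            # -> [0.8, 0.4, 0.0, 0.0]
```

In the paper the resulting pseudo descriptions supervise or condition a query-based tracker; here the output is just a blended vector to show the data flow from frames to Trackbook-grounded descriptions.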
Related papers
- Multi-Granularity Language-Guided Multi-Object Tracking [95.91263758294154]
We propose a new multi-object tracking framework, named LG-MOT, that explicitly leverages language information at different levels of granularity.
At inference, our LG-MOT uses the standard visual features without relying on annotated language descriptions.
Our LG-MOT achieves an absolute gain of 2.2% in terms of target object association (IDF1 score) compared to the baseline using only visual features.
arXiv Detail & Related papers (2024-06-07T11:18:40Z) - Context-Aware Integration of Language and Visual References for Natural Language Tracking [27.3884348078998]
Tracking by natural language specification (TNL) aims to consistently localize a target in a video sequence given a linguistic description in the initial frame.
We propose a joint multi-modal tracking framework with 1) a prompt module to leverage the complement between temporal visual templates and language expressions, enabling precise and context-aware appearance and linguistic cues.
This design ensures temporal consistency by leveraging historical visual information, and provides an integrated solution, generating predictions in a single step.
arXiv Detail & Related papers (2024-03-29T04:58:33Z) - Domain-Controlled Prompt Learning [49.45309818782329]
Existing prompt learning methods often lack domain-awareness or domain-transfer mechanisms.
We propose Domain-Controlled Prompt Learning for specific domains.
Our method achieves state-of-the-art performance in specific domain image recognition datasets.
arXiv Detail & Related papers (2023-09-30T02:59:49Z) - Joint Visual Grounding and Tracking with Natural Language Specification [6.695284124073918]
Tracking by natural language specification aims to locate the referred target in a sequence based on the natural language description.
We propose a joint visual grounding and tracking framework, which reformulates grounding and tracking as a unified task.
Our method performs favorably against state-of-the-art algorithms for both tracking and grounding.
arXiv Detail & Related papers (2023-03-21T17:09:03Z) - Position-Aware Contrastive Alignment for Referring Image Segmentation [65.16214741785633]
We present a position-aware contrastive alignment network (PCAN) to enhance the alignment of multi-modal features.
Our PCAN consists of two modules: 1) Position Aware Module (PAM), which provides position information of all objects related to natural language descriptions, and 2) Contrastive Language Understanding Module (CLUM), which enhances multi-modal alignment.
arXiv Detail & Related papers (2022-12-27T09:13:19Z) - Learning Domain Invariant Prompt for Vision-Language Models [31.581652862478965]
We propose a novel prompt learning paradigm, called MetaPrompt, that directly generates domain-invariant prompts which can be generalized to unseen domains.
Our method consistently and significantly outperforms existing methods.
arXiv Detail & Related papers (2022-12-08T11:23:24Z) - End-to-end Tracking with a Multi-query Transformer [96.13468602635082]
Multiple-object tracking (MOT) is a challenging task that requires simultaneous reasoning about location, appearance, and identity of the objects in the scene over time.
Our aim in this paper is to move beyond tracking-by-detection approaches, to class-agnostic tracking that performs well also for unknown object classes.
arXiv Detail & Related papers (2022-10-26T10:19:37Z) - Vision Transformers: From Semantic Segmentation to Dense Prediction [144.38869017091199]
Vision transformers (ViTs) in image classification have shifted the methodologies for visual representation learning.
In this work, we explore the global context learning potentials of ViTs for dense visual prediction.
Our motivation is that through learning global context at full receptive field layer by layer, ViTs may capture stronger long-range dependency information.
arXiv Detail & Related papers (2022-07-19T15:49:35Z) - Multi-Object Tracking and Segmentation via Neural Message Passing [0.0]
Graphs offer a natural way to formulate Multiple Object Tracking (MOT) and Multiple Object Tracking and Segmentation (MOTS).
We exploit the classical network flow formulation of MOT to define a fully differentiable framework based on Message Passing Networks (MPNs).
We achieve state-of-the-art results for both tracking and segmentation in several publicly available datasets.
arXiv Detail & Related papers (2022-07-15T13:03:47Z) - Multi-modal Transformers Excel at Class-agnostic Object Detection [105.10403103027306]
We argue that existing methods lack a top-down supervision signal governed by human-understandable semantics.
We develop an efficient and flexible MViT architecture using multi-scale feature processing and deformable self-attention.
We show the significance of MViT proposals in a diverse range of applications.
arXiv Detail & Related papers (2021-11-22T18:59:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.