Related papers: Multi-Granularity Language-Guided Multi-Object Tracking

URL: http://arxiv.org/abs/2406.04844v1
Date: Fri, 7 Jun 2024 11:18:40 GMT
Title: Multi-Granularity Language-Guided Multi-Object Tracking
Authors: Yuhao Li, Muzammal Naseer, Jiale Cao, Yu Zhu, Jinqiu Sun, Yanning Zhang, Fahad Shahbaz Khan
Abstract summary: We propose a new multi-object tracking framework, named LG-MOT, that explicitly leverages language information at different levels of granularity. At inference, our LG-MOT uses the standard visual features without relying on annotated language descriptions. Our LG-MOT achieves an absolute gain of 2.2% in terms of target object association (IDF1 score) compared to the baseline using only visual features.
Score: 95.91263758294154
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Most existing multi-object tracking methods learn visual tracking features by maximizing dissimilarities between different instances and minimizing dissimilarities within the same instance. While such a feature learning scheme achieves promising performance, learning discriminative features solely from visual information is challenging, especially in cases of environmental interference such as occlusion, blur and domain variance. In this work, we argue that multi-modal language-driven features provide complementary information to classical visual features, thereby improving robustness to such environmental interference. To this end, we propose a new multi-object tracking framework, named LG-MOT, that explicitly leverages language information at different levels of granularity (scene- and instance-level) and combines it with standard visual features to obtain discriminative representations. To develop LG-MOT, we annotate existing MOT datasets with scene- and instance-level language descriptions. We then encode both instance- and scene-level language information into high-dimensional embeddings, which are used to guide the visual features during training. At inference, our LG-MOT uses the standard visual features without relying on annotated language descriptions. Extensive experiments on three benchmarks, MOT17, DanceTrack and SportsMOT, reveal the merits of the proposed contributions, leading to state-of-the-art performance. On the DanceTrack test set, our LG-MOT achieves an absolute gain of 2.2% in terms of target object association (IDF1 score), compared to the baseline using only visual features. Further, our LG-MOT exhibits strong cross-domain generalizability. The dataset and code will be available at https://github.com/WesLee88524/LG-MOT.
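For readers unfamiliar with the IDF1 score mentioned in the abstract, the minimal Python sketch below computes it from identity-level counts using the standard definition IDF1 = 2*IDTP / (2*IDTP + IDFP + IDFN). The function name and all counts are hypothetical and chosen only to illustrate what an absolute gain means; they are not results from the paper.

```python
def idf1(idtp: int, idfp: int, idfn: int) -> float:
    """Return IDF1 from identity true positives, false positives and false negatives."""
    denom = 2 * idtp + idfp + idfn
    if denom == 0:
        # No detections and no ground truth: define the score as 0.0.
        return 0.0
    return 2 * idtp / denom


# Hypothetical counts for illustration only (not taken from the paper).
baseline = idf1(idtp=600, idfp=200, idfn=200)
improved = idf1(idtp=620, idfp=190, idfn=190)

# An absolute gain is the plain difference of the two scores, in percentage points.
absolute_gain = (improved - baseline) * 100
print(f"baseline={baseline:.4f} improved={improved:.4f} gain={absolute_gain:.2f} points")
```

An absolute gain is reported as a difference in percentage points, not as a ratio relative to the baseline score.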

Related papers

Cognitive Disentanglement for Referring Multi-Object Tracking [28.325814292139686]
We propose a Cognitive Disentanglement for Referring Multi-Object Tracking (CDRMT) framework. CDRMT adapts the "what" and "where" pathways from the human visual processing system to RMOT tasks. Experiments on different benchmark datasets demonstrate that CDRMT achieves substantial improvements over state-of-the-art methods.
arXiv Detail & Related papers (2025-03-14T15:21:54Z)
Instruction-Guided Fusion of Multi-Layer Visual Features in Large Vision-Language Models [50.98559225639266]
We investigate the contributions of visual features from different encoder layers using 18 benchmarks spanning 6 task categories. Our findings reveal that multilayer features provide complementary strengths with varying task dependencies, and uniform fusion leads to suboptimal performance. We propose the instruction-guided vision aggregator, a module that dynamically integrates multi-layer visual features based on textual instructions.
arXiv Detail & Related papers (2024-12-26T05:41:31Z)
Teaching VLMs to Localize Specific Objects from In-context Examples [56.797110842152]
We find that present-day Vision-Language Models (VLMs) lack a fundamental cognitive ability: learning to localize specific objects in a scene by taking into account the context. This work is the first to explore and benchmark personalized few-shot localization for VLMs.
arXiv Detail & Related papers (2024-11-20T13:34:22Z)
ChatTracker: Enhancing Visual Tracking Performance via Chatting with Multimodal Large Language Model [29.702895846058265]
Vision-Language (VL) trackers have been proposed that utilize additional natural language descriptions to enhance versatility in various applications. However, VL trackers are still inferior to State-of-The-Art (SoTA) visual trackers in terms of tracking performance. We propose ChatTracker to leverage the wealth of world knowledge in the Multimodal Large Language Model (MLLM) to generate high-quality language descriptions.
arXiv Detail & Related papers (2024-11-04T02:43:55Z)
IP-MOT: Instance Prompt Learning for Cross-Domain Multi-Object Tracking [13.977088329815933]
Multi-Object Tracking (MOT) aims to associate multiple objects across video frames. Most existing approaches train and track within a single domain, resulting in a lack of cross-domain generalizability. We develop IP-MOT, an end-to-end transformer model for MOT that operates without concrete textual descriptions.
arXiv Detail & Related papers (2024-10-30T14:24:56Z)
DTLLM-VLT: Diverse Text Generation for Visual Language Tracking Based on LLM [23.551036494221222]
Visual Language Tracking (VLT) enhances single object tracking (SOT) by integrating natural language descriptions from a video, for the precise tracking of a specified object. Most VLT benchmarks are annotated in a single granularity and lack a coherent semantic framework to provide scientific guidance. We introduce DTLLM-VLT, which automatically generates extensive and multi-granularity text to enhance environmental diversity.
arXiv Detail & Related papers (2024-05-20T16:01:01Z)
Multi-modal Instruction Tuned LLMs with Fine-grained Visual Perception [63.03288425612792]
We propose AnyRef, a general MLLM model that can generate pixel-wise object perceptions and natural language descriptions from multi-modality references. Our model achieves state-of-the-art results across multiple benchmarks, including diverse modality referring segmentation and region-level referring expression generation.
arXiv Detail & Related papers (2024-03-05T13:45:46Z)
Unifying Visual and Vision-Language Tracking via Contrastive Learning [34.49865598433915]
Single object tracking aims to locate the target object in a video sequence according to different modal references. Due to the gap between different modalities, most existing trackers are designed for only one or a subset of these reference settings. We present a unified tracker called UVLTrack, which can simultaneously handle all three reference settings.
arXiv Detail & Related papers (2024-01-20T13:20:54Z)
Divert More Attention to Vision-Language Object Tracking [87.31882921111048]
The lack of large-scale vision-language annotated videos and ineffective vision-language interaction learning motivate us to design a more effective vision-language representation for tracking. In particular, we first propose a general attribute annotation strategy to decorate videos in six popular tracking benchmarks, which contributes a large-scale vision-language tracking database with more than 23,000 videos. We then introduce a novel framework to improve tracking by learning a unified-adaptive VL representation, whose core components are the proposed asymmetric architecture search and modality mixer (ModaMixer).
arXiv Detail & Related papers (2023-07-19T15:22:06Z)
OVTrack: Open-Vocabulary Multiple Object Tracking [64.73379741435255]
OVTrack is an open-vocabulary tracker capable of tracking arbitrary object classes. It sets a new state-of-the-art on the large-scale, large-vocabulary TAO benchmark.
arXiv Detail & Related papers (2023-04-17T16:20:05Z)
Position-Aware Contrastive Alignment for Referring Image Segmentation [65.16214741785633]
We present a position-aware contrastive alignment network (PCAN) to enhance the alignment of multi-modal features. Our PCAN consists of two modules: 1) Position Aware Module (PAM), which provides position information of all objects related to natural language descriptions, and 2) Contrastive Language Understanding Module (CLUM), which enhances multi-modal alignment.
arXiv Detail & Related papers (2022-12-27T09:13:19Z)
Generalizing Multiple Object Tracking to Unseen Domains by Introducing Natural Language Representation [33.03600813115465]
We propose to introduce natural language representation into visual MOT models to boost their domain generalization ability. To tackle this problem, we design two modules, namely visual context prompting (VCP) and visual-language mixing (VLM). VLM joins the information in the generated visual prompts and the textual prompts from a pre-defined Trackbook to obtain instance-level pseudo textual descriptions. By training models on MOT17 and validating them on MOT20, we observe that the pseudo textual descriptions generated by our proposed modules improve the generalization performance of query-based trackers by large margins.
arXiv Detail & Related papers (2022-12-03T07:57:31Z)
Multi-modal Transformers Excel at Class-agnostic Object Detection [105.10403103027306]
We argue that existing methods lack a top-down supervision signal governed by human-understandable semantics. We develop an efficient and flexible MViT architecture using multi-scale feature processing and deformable self-attention. We show the significance of MViT proposals in a diverse range of applications.
arXiv Detail & Related papers (2021-11-22T18:59:29Z)

This list is automatically generated from the titles and abstracts of the papers on this site.