Divert More Attention to Vision-Language Object Tracking
- URL: http://arxiv.org/abs/2307.10046v1
- Date: Wed, 19 Jul 2023 15:22:06 GMT
- Title: Divert More Attention to Vision-Language Object Tracking
- Authors: Mingzhe Guo, Zhipeng Zhang, Liping Jing, Haibin Ling, Heng Fan
- Abstract summary: We argue that the lack of large-scale vision-language annotated videos and ineffective vision-language interaction learning motivate us to design a more effective vision-language representation for tracking.
Particularly, in this paper, we first propose a general attribute annotation strategy to decorate videos in six popular tracking benchmarks, which contributes a large-scale vision-language tracking database with more than 23,000 videos.
We then introduce a novel framework to improve tracking by learning a unified-adaptive VL representation, whose cores are the proposed asymmetric architecture search and modality mixer (ModaMixer).
- Score: 87.31882921111048
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal vision-language (VL) learning has noticeably pushed the trend
toward generic intelligence owing to emerging large foundation models. However,
tracking, as a fundamental vision problem, has surprisingly benefited little from
the recent flourishing of VL learning. We argue that the reasons are two-fold: the
lack of large-scale vision-language annotated videos and the ineffective
vision-language interaction learning of current works. These issues motivate us
to design a more effective vision-language representation for tracking and,
meanwhile, to construct a large database with language annotations for model
learning. Particularly, in this paper, we first propose a general attribute
annotation strategy to decorate videos in six popular tracking benchmarks,
which contributes a large-scale vision-language tracking database with more
than 23,000 videos. We then introduce a novel framework to improve tracking by
learning a unified-adaptive VL representation, whose cores are the proposed
asymmetric architecture search and modality mixer (ModaMixer). To further
improve the VL representation, we introduce a contrastive loss to align the
different modalities. To thoroughly demonstrate the effectiveness of our method,
we integrate the proposed framework into three tracking methods with different
designs, i.e., the CNN-based SiamCAR, the Transformer-based OSTrack, and the
hybrid-structure TransT. The experiments demonstrate that our framework
significantly improves all baselines on six benchmarks. Beyond the empirical
results, we theoretically analyze our approach to show its rationality. By
revealing the potential of VL representation, we expect the community to divert
more attention to VL tracking and hope to open more possibilities for future
tracking with diversified multimodal messages.
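As a concrete illustration of the two components named in the abstract, the sketch below gives one plausible reading of a ModaMixer-style module and a modality-alignment contrastive loss in PyTorch: a pooled language embedding is mapped to channel-wise gates that reselect vision feature channels, and a symmetric InfoNCE-style loss pulls paired vision/language embeddings together. The class and function names, pooling choice, residual connection, and temperature are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch (not the paper's code): a ModaMixer-style channel selector
# driven by language features, plus a symmetric InfoNCE-style alignment loss.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ModaMixerSketch(nn.Module):
    """Re-weights vision feature channels with language-derived gates (assumed design)."""

    def __init__(self, vis_channels: int, lang_dim: int):
        super().__init__()
        self.selector = nn.Sequential(
            nn.Linear(lang_dim, vis_channels),
            nn.Sigmoid(),  # per-channel gates in [0, 1]
        )

    def forward(self, vis_feat: torch.Tensor, lang_feat: torch.Tensor) -> torch.Tensor:
        # vis_feat: (B, C, H, W) feature map; lang_feat: (B, L, D) token embeddings.
        lang_vec = lang_feat.mean(dim=1)            # (B, D) pooled language embedding
        gates = self.selector(lang_vec)             # (B, C) channel weights
        mixed = vis_feat * gates[:, :, None, None]  # channel-wise reselection
        return mixed + vis_feat                     # residual keeps the pure-vision signal


def vl_contrastive_loss(vis_emb: torch.Tensor, lang_emb: torch.Tensor,
                        temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE-style loss aligning paired (B, D) vision/language embeddings."""
    v = F.normalize(vis_emb, dim=-1)
    t = F.normalize(lang_emb, dim=-1)
    logits = v @ t.t() / temperature                    # (B, B) cosine-similarity logits
    targets = torch.arange(v.size(0), device=v.device)  # matched pairs sit on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


if __name__ == "__main__":
    mixer = ModaMixerSketch(vis_channels=256, lang_dim=768)
    vis = torch.randn(2, 256, 16, 16)     # toy visual features
    lang = torch.randn(2, 12, 768)        # toy language token embeddings
    print(mixer(vis, lang).shape)         # torch.Size([2, 256, 16, 16])
    print(vl_contrastive_loss(torch.randn(2, 512), torch.randn(2, 512)).item())
```

The residual connection in the sketch is one way to retain the pure-vision signal when the language cue is uninformative; the paper's actual design may differ.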
Related papers
- ChatTracker: Enhancing Visual Tracking Performance via Chatting with Multimodal Large Language Model [29.702895846058265]
Vision-Language (VL) trackers have been proposed to utilize additional natural language descriptions to enhance versatility in various applications.
However, VL trackers are still inferior to state-of-the-art (SoTA) visual trackers in terms of tracking performance.
We propose ChatTracker to leverage the wealth of world knowledge in the Multimodal Large Language Model (MLLM) to generate high-quality language descriptions.
arXiv Detail & Related papers (2024-11-04T02:43:55Z) - Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs [56.391404083287235]
We introduce Cambrian-1, a family of multimodal LLMs (MLLMs) designed with a vision-centric approach.
Our study uses LLMs and visual instruction tuning as an interface to evaluate various visual representations.
We provide model weights, code, supporting tools, datasets, and detailed instruction-tuning and evaluation recipes.
arXiv Detail & Related papers (2024-06-24T17:59:42Z) - Multi-Granularity Language-Guided Multi-Object Tracking [95.91263758294154]
We propose a new multi-object tracking framework, named LG-MOT, that explicitly leverages language information at different levels of granularity.
At inference, our LG-MOT uses the standard visual features without relying on annotated language descriptions.
Our LG-MOT achieves an absolute gain of 2.2% in terms of target object association (IDF1 score) compared to the baseline using only visual features.
arXiv Detail & Related papers (2024-06-07T11:18:40Z) - DeepSeek-VL: Towards Real-World Vision-Language Understanding [24.57011093316788]
We present DeepSeek-VL, an open-source Vision-Language (VL) Model for real-world vision and language understanding applications.
Our approach is structured around three key dimensions: we strive to ensure that our data is diverse and scalable and that it extensively covers real-world scenarios.
We create a use case taxonomy from real user scenarios and construct an instruction tuning dataset.
arXiv Detail & Related papers (2024-03-08T18:46:00Z) - PaLM2-VAdapter: Progressively Aligned Language Model Makes a Strong Vision-language Adapter [21.45490901191175]
PaLM2-VAdapter employs a progressively aligned language model as the vision-language adapter.
Our method achieves these advancements with 30-70% fewer parameters than the state-of-the-art large vision-language models.
arXiv Detail & Related papers (2024-02-16T18:54:47Z) - All in One: Exploring Unified Vision-Language Tracking with Multi-Modal
Alignment [23.486297020327257]
Current vision-language (VL) tracking frameworks consist of three parts, i.e., a visual feature extractor, a language feature extractor, and a fusion model.
We propose an All-in-One framework, which learns joint feature extraction and interaction by adopting a unified transformer backbone.
arXiv Detail & Related papers (2023-07-07T03:51:21Z) - DiMBERT: Learning Vision-Language Grounded Representations with
Disentangled Multimodal-Attention [101.99313208598569]
Vision-and-language (V-L) tasks require the system to understand both vision content and natural language.
We propose DiMBERT (short for Disentangled Multimodal-Attention BERT), which applies separated attention spaces for vision and language.
We show that DiMBERT sets new state-of-the-art performance on three tasks.
arXiv Detail & Related papers (2022-10-28T23:00:40Z) - XDBERT: Distilling Visual Information to BERT from Cross-Modal Systems
to Improve Language Understanding [73.24847320536813]
This study explores distilling visual information from pretrained multimodal transformers to pretrained language encoders.
Our framework is inspired by the success of cross-modal encoders in visual-language tasks, while we alter the learning objective to cater to the language-heavy characteristics of NLU.
arXiv Detail & Related papers (2022-04-15T03:44:00Z) - Object Relational Graph with Teacher-Recommended Learning for Video
Captioning [92.48299156867664]
We propose a complete video captioning system including both a novel model and an effective training strategy.
Specifically, we propose an object relational graph (ORG) based encoder, which captures more detailed interaction features to enrich visual representation.
Meanwhile, we design a teacher-recommended learning (TRL) method to make full use of the successful external language model (ELM) to integrate the abundant linguistic knowledge into the caption model.
arXiv Detail & Related papers (2020-02-26T15:34:52Z)