All in One: Exploring Unified Vision-Language Tracking with Multi-Modal
Alignment
- URL: http://arxiv.org/abs/2307.03373v1
- Date: Fri, 7 Jul 2023 03:51:21 GMT
- Title: All in One: Exploring Unified Vision-Language Tracking with Multi-Modal
Alignment
- Authors: Chunhui Zhang, Xin Sun, Li Liu, Yiqian Yang, Qiong Liu, Xi Zhou, and Yanfeng Wang
- Abstract summary: The current vision-language (VL) tracking framework consists of three parts, i.e., a visual feature extractor, a language feature extractor, and a fusion model.
We propose an All-in-One framework, which learns joint feature extraction and interaction by adopting a unified transformer backbone.
- Score: 23.486297020327257
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The current mainstream vision-language (VL) tracking framework consists of three
parts, i.e., a visual feature extractor, a language feature extractor, and a
fusion model. To pursue better performance, a natural modus operandi for VL
tracking is to employ customized, heavier unimodal encoders and multi-modal
fusion models. Albeit effective, existing VL trackers separate feature
extraction and feature integration, resulting in extracted features that lack
semantic guidance and have limited target-aware capability in complex
scenarios, e.g., similar distractors and extreme illumination. In this work,
inspired by the recent success of exploring foundation models with unified
architecture for both natural language and computer vision tasks, we propose an
All-in-One framework, which learns joint feature extraction and interaction by
adopting a unified transformer backbone. Specifically, we mix raw vision and
language signals to generate language-injected vision tokens, which we then
concatenate before feeding into the unified backbone architecture. This
approach achieves feature integration in a unified backbone, removing the need
for carefully-designed fusion modules and resulting in a more effective and
efficient VL tracking framework. To further improve the learning efficiency, we
introduce a multi-modal alignment module based on cross-modal and intra-modal
contrastive objectives, providing more reasonable representations for the
unified All-in-One transformer backbone. Extensive experiments on five
benchmarks, i.e., OTB99-L, TNL2K, LaSOT, LaSOT_Ext, and WebUAV-3M,
demonstrate the superiority of the proposed tracker over existing
state-of-the-art VL trackers. Code will be made publicly available.
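The abstract gives no code, but the central idea, injecting raw language signals into vision tokens and then running the concatenated tokens through one shared transformer so that feature extraction and cross-modal interaction happen jointly, can be illustrated with a short PyTorch sketch. Everything below (module names, dimensions, and the use of cross-attention for the injection step) is an illustrative assumption, not the authors' implementation.

```python
# Minimal sketch of a unified "All-in-One"-style VL tracking backbone.
# Assumptions: patch-embedding for images, cross-attention for language
# injection, and a plain TransformerEncoder as the shared backbone.
import torch
import torch.nn as nn

class AllInOneSketch(nn.Module):
    def __init__(self, dim=256, depth=6, heads=8, patch=16, vocab=30522):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)  # image -> vision tokens
        self.word_embed = nn.Embedding(vocab, dim)                             # token ids -> language tokens
        self.inject = nn.MultiheadAttention(dim, heads, batch_first=True)      # language-to-vision injection
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, depth)                    # single shared backbone

    def forward(self, template, search, text_ids):
        # tokenize the template and search images
        z = self.patch_embed(template).flatten(2).transpose(1, 2)  # (B, Nz, C)
        x = self.patch_embed(search).flatten(2).transpose(1, 2)    # (B, Nx, C)
        w = self.word_embed(text_ids)                              # (B, Nw, C)
        # "language-injected" vision tokens: vision queries attend to word tokens
        z = z + self.inject(z, w, w)[0]
        x = x + self.inject(x, w, w)[0]
        # concatenate and let the unified backbone do joint extraction + interaction
        tokens = torch.cat([z, x, w], dim=1)
        return self.backbone(tokens)

# usage (shapes only)
model = AllInOneSketch()
out = model(torch.randn(2, 3, 128, 128), torch.randn(2, 3, 256, 256),
            torch.randint(0, 30522, (2, 32)))
print(out.shape)  # (2, 64 + 256 + 32, 256)
```

The multi-modal alignment module is likewise described only as cross-modal and intra-modal contrastive objectives. A common way to realize such objectives is a symmetric InfoNCE loss over paired embeddings; the sketch below assumes that formulation, with the temperature and the intra-modal view pairing chosen arbitrarily rather than taken from the paper.

```python
# Hedged sketch of cross-modal + intra-modal contrastive alignment.
import torch
import torch.nn.functional as F

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired embeddings a[i] <-> b[i]."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature                       # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)     # positives on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def alignment_loss(vision_emb, text_emb, vision_emb_aug, text_emb_aug):
    cross = info_nce(vision_emb, text_emb)                                 # cross-modal term
    intra = info_nce(vision_emb, vision_emb_aug) + info_nce(text_emb, text_emb_aug)  # intra-modal terms
    return cross + intra

# usage with random pooled token embeddings
B, C = 8, 256
loss = alignment_loss(torch.randn(B, C), torch.randn(B, C),
                      torch.randn(B, C), torch.randn(B, C))
print(float(loss))
```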
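In practice the alignment loss would be added to the tracking loss during training; how the two objectives are weighted is not stated in the abstract, so any weighting in a reimplementation is a design choice.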
Related papers
- ADEM-VL: Adaptive and Embedded Fusion for Efficient Vision-Language Tuning [38.26304604660713]
ADEM-VL is an efficient vision-language method that tunes models based on pretrained large language models.
Our framework surpasses existing methods by 0.77% average accuracy on the ScienceQA dataset.
arXiv Detail & Related papers (2024-10-23T11:31:06Z)
- EMMA: Efficient Visual Alignment in Multi-Modal LLMs [56.03417732498859]
EMMA is a lightweight cross-modality module designed to efficiently fuse visual and textual encodings.
EMMA boosts performance across multiple tasks by up to 9.3% while significantly improving robustness against hallucinations.
arXiv Detail & Related papers (2024-10-02T23:00:31Z)
- NVLM: Open Frontier-Class Multimodal LLMs [64.00053046838225]
We introduce NVLM 1.0, a family of frontier-class multimodal large language models (LLMs) that achieve state-of-the-art results on vision-language tasks.
We propose a novel architecture that enhances both training efficiency and multimodal reasoning capabilities.
We develop production-grade multimodality for the NVLM-1.0 models, enabling them to excel in vision-language tasks.
arXiv Detail & Related papers (2024-09-17T17:59:06Z)
- Jack of All Tasks, Master of Many: Designing General-purpose Coarse-to-Fine Vision-Language Model [83.85856356798531]
VistaLLM is a visual system that addresses coarse- and fine-grained vision-language tasks.
It employs a gradient-aware adaptive sampling technique to represent binary segmentation masks as sequences.
We also introduce a novel task, AttCoSeg, which boosts the model's reasoning and grounding capability over multiple input images.
arXiv Detail & Related papers (2023-12-19T18:53:01Z)
- u-LLaVA: Unifying Multi-Modal Tasks via Large Language Model [17.3535277338312]
u-LLaVA is an innovative unifying multi-task framework that integrates pixel, regional, and global features to refine the perceptual faculties of MLLMs.
This work contributes a novel mask-based multi-task dataset comprising 277K samples, crafted to challenge and assess the fine-grained perception capabilities of MLLMs.
arXiv Detail & Related papers (2023-11-09T13:18:27Z)
- Towards Unified Token Learning for Vision-Language Tracking [65.96561538356315]
We present a vision-language (VL) tracking pipeline, termed MMTrack, which casts VL tracking as a token generation task.
Our proposed framework serializes language description and bounding box into a sequence of discrete tokens.
In this new design paradigm, all token queries are required to perceive the desired target and directly predict spatial coordinates of the target.
arXiv Detail & Related papers (2023-08-27T13:17:34Z)
- Divert More Attention to Vision-Language Object Tracking [87.31882921111048]
We argue that the lack of large-scale vision-language annotated videos and ineffective vision-language interaction learning motivate the design of a more effective vision-language representation for tracking.
Particularly, in this paper, we first propose a general attribute annotation strategy to decorate videos in six popular tracking benchmarks, which contributes a large-scale vision-language tracking database with more than 23,000 videos.
We then introduce a novel framework to improve tracking by learning a unified-adaptive VL representation, whose core components are the proposed asymmetric architecture search and modality mixer (ModaMixer).
arXiv Detail & Related papers (2023-07-19T15:22:06Z)
- i-Code: An Integrative and Composable Multimodal Learning Framework [99.56065789066027]
i-Code is a self-supervised pretraining framework where users may flexibly combine the modalities of vision, speech, and language into unified and general-purpose vector representations.
The entire system is pretrained end-to-end with new objectives including masked modality unit modeling and cross-modality contrastive learning.
Experimental results demonstrate how i-Code can outperform state-of-the-art techniques on five video understanding tasks and the GLUE NLP benchmark, improving by as much as 11%.
arXiv Detail & Related papers (2022-05-03T23:38:50Z)