Enhancing Vision-Language Tracking by Effectively Converting Textual Cues into Visual Cues
- URL: http://arxiv.org/abs/2412.19648v1
- Date: Fri, 27 Dec 2024 13:54:32 GMT
- Title: Enhancing Vision-Language Tracking by Effectively Converting Textual Cues into Visual Cues
- Authors: X. Feng, D. Zhang, S. Hu, X. Li, M. Wu, J. Zhang, X. Chen, K. Huang,
- Abstract summary: Vision-Language Tracking aims to localize a target in video sequences using a visual template and language description.
Current datasets contain much more image data than text, limiting the ability of VLT methods to align the two modalities effectively.
We propose a novel plug-and-play method named CTVLT that leverages the strong text-image alignment capabilities of foundation grounding models.
- Score: 0.6143225301480709
- License:
- Abstract: Vision-Language Tracking (VLT) aims to localize a target in video sequences using a visual template and language description. While textual cues enhance tracking potential, current datasets typically contain much more image data than text, limiting the ability of VLT methods to align the two modalities effectively. To address this imbalance, we propose a novel plug-and-play method named CTVLT that leverages the strong text-image alignment capabilities of foundation grounding models. CTVLT converts textual cues into interpretable visual heatmaps, which are easier for trackers to process. Specifically, we design a textual cue mapping module that transforms textual cues into target distribution heatmaps, visually representing the location described by the text. Additionally, the heatmap guidance module fuses these heatmaps with the search image to guide tracking more effectively. Extensive experiments on mainstream benchmarks demonstrate the effectiveness of our approach, achieving state-of-the-art performance and validating the utility of our method for enhanced VLT.
Related papers
- AlignVLM: Bridging Vision and Language Latent Spaces for Multimodal Understanding [63.09928907734156]
AlignVLM is a vision-text alignment method that maps visual features to a weighted average of text embeddings.
Our experiments show that AlignVLM achieves state-of-the-art performance compared to prior alignment methods.
arXiv Detail & Related papers (2025-02-03T13:34:51Z) - ChatTracker: Enhancing Visual Tracking Performance via Chatting with Multimodal Large Language Model [29.702895846058265]
Vision-Language(VL) trackers have proposed to utilize additional natural language descriptions to enhance versatility in various applications.
VL trackers are still inferior to State-of-The-Art (SoTA) visual trackers in terms of tracking performance.
We propose ChatTracker to leverage the wealth of world knowledge in the Multimodal Large Language Model (MLLM) to generate high-quality language descriptions.
arXiv Detail & Related papers (2024-11-04T02:43:55Z) - Attention Prompting on Image for Large Vision-Language Models [63.794304207664176]
We propose a new prompting technique named Attention Prompting on Image.
We generate an attention heatmap for the input image dependent on the text query with an auxiliary model like CLIP.
Experiments on various vison-language benchmarks verify the effectiveness of our technique.
arXiv Detail & Related papers (2024-09-25T17:59:13Z) - LOGO: Video Text Spotting with Language Collaboration and Glyph Perception Model [20.007650672107566]
Video text spotting (VTS) aims to simultaneously localize, recognize and track text instances in videos.
Recent methods track the zero-shot results of state-of-the-art image text spotters directly.
Fine-tuning transformer-based text spotters on specific datasets could yield performance enhancements.
arXiv Detail & Related papers (2024-05-29T15:35:09Z) - GoMatching: A Simple Baseline for Video Text Spotting via Long and Short Term Matching [77.0306273129475]
Video text spotting presents an augmented challenge with the inclusion of tracking.
GoMatching focuses the training efforts on tracking while maintaining strong recognition performance.
GoMatching delivers new records on ICDAR15-video, DSText, BOVText, and our proposed novel test with arbitrary-shaped text termed ArTVideo.
arXiv Detail & Related papers (2024-01-13T13:59:15Z) - Towards Improving Document Understanding: An Exploration on
Text-Grounding via MLLMs [96.54224331778195]
We present a text-grounding document understanding model, termed TGDoc, which enhances MLLMs with the ability to discern the spatial positioning of text within images.
We formulate instruction tuning tasks including text detection, recognition, and spotting to facilitate the cohesive alignment between the visual encoder and large language model.
Our method achieves state-of-the-art performance across multiple text-rich benchmarks, validating the effectiveness of our method.
arXiv Detail & Related papers (2023-11-22T06:46:37Z) - CiteTracker: Correlating Image and Text for Visual Tracking [114.48653709286629]
We propose the CiteTracker to enhance target modeling and inference in visual tracking by connecting images and text.
Specifically, we develop a text generation module to convert the target image patch into a descriptive text.
We then associate the target description and the search image using an attention-based correlation module to generate the correlated features for target state reference.
arXiv Detail & Related papers (2023-08-22T09:53:12Z) - CLIP-Count: Towards Text-Guided Zero-Shot Object Counting [32.07271723717184]
We propose CLIP-Count, the first end-to-end pipeline that estimates density maps for open-vocabulary objects with text guidance in a zero-shot manner.
To align the text embedding with dense visual features, we introduce a patch-text contrastive loss that guides the model to learn informative patch-level visual representations for dense prediction.
Our method effectively generates high-quality density maps for objects-of-interest.
arXiv Detail & Related papers (2023-05-12T08:19:39Z) - VT-CLIP: Enhancing Vision-Language Models with Visual-guided Texts [2.0434814235659555]
Contrastive Language-Image Pre-training (CLIP) has drawn increasing attention recently for its transferable visual representation learning.
We propose to enhance CLIP via Visual-guided Texts, named VT-CLIP.
In few-shot settings, we evaluate our VT-CLIP on 11 well-known classification datasets to demonstrate its effectiveness.
arXiv Detail & Related papers (2021-12-04T18:34:24Z) - Enhanced Modality Transition for Image Captioning [51.72997126838352]
We build a Modality Transition Module (MTM) to transfer visual features into semantic representations before forwarding them to the language model.
During the training phase, the modality transition network is optimised by the proposed modality loss.
Experiments have been conducted on the MS-COCO dataset demonstrating the effectiveness of the proposed framework.
arXiv Detail & Related papers (2021-02-23T07:20:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.