Related papers: DTVLT: A Multi-modal Diverse Text Benchmark for Visual Language Tracking Based on LLM

DTVLT: A Multi-modal Diverse Text Benchmark for Visual Language Tracking Based on LLM

URL: http://arxiv.org/abs/2410.02492v2
Date: Wed, 9 Oct 2024 14:07:15 GMT
Title: DTVLT: A Multi-modal Diverse Text Benchmark for Visual Language Tracking Based on LLM
Authors: Xuchen Li, Shiyu Hu, Xiaokun Feng, Dailing Zhang, Meiqi Wu, Jing Zhang, Kaiqi Huang,
Abstract summary: We propose a new visual language tracking benchmark with diverse texts, named DTVLT, based on five prominent VLT and SOT benchmarks. We offer four texts in our benchmark, considering the extent and density of semantic information. We conduct comprehensive experimental analyses on DTVLT, evaluating the impact of diverse text on tracking performance.
Score: 23.551036494221222
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Visual language tracking (VLT) has emerged as a cutting-edge research area, harnessing linguistic data to enhance algorithms with multi-modal inputs and broadening the scope of traditional single object tracking (SOT) to encompass video understanding applications. Despite this, most VLT benchmarks still depend on succinct, human-annotated text descriptions for each video. These descriptions often fall short in capturing the nuances of video content dynamics and lack stylistic variety in language, constrained by their uniform level of detail and a fixed annotation frequency. As a result, algorithms tend to default to a "memorize the answer" strategy, diverging from the core objective of achieving a deeper understanding of video content. Fortunately, the emergence of large language models (LLMs) has enabled the generation of diverse text. This work utilizes LLMs to generate varied semantic annotations (in terms of text lengths and granularities) for representative SOT benchmarks, thereby establishing a novel multi-modal benchmark. Specifically, we (1) propose a new visual language tracking benchmark with diverse texts, named DTVLT, based on five prominent VLT and SOT benchmarks, including three sub-tasks: short-term tracking, long-term tracking, and global instance tracking. (2) We offer four granularity texts in our benchmark, considering the extent and density of semantic information. We expect this multi-granular generation strategy to foster a favorable environment for VLT and video understanding research. (3) We conduct comprehensive experimental analyses on DTVLT, evaluating the impact of diverse text on tracking performance and hope the identified performance bottlenecks of existing algorithms can support further research in VLT and video understanding. The proposed benchmark, experimental results and toolkit will be released gradually on http://videocube.aitestunion.com/.

Related papers

How Texts Help? A Fine-grained Evaluation to Reveal the Role of Language in Vision-Language Tracking [23.551036494221222]
Vision-language tracking (VLT) extends traditional single object tracking by incorporating textual information. Current VLT trackers often underperform compared to single-modality methods on multiple benchmarks. We propose VLTVerse, the first fine-grained evaluation framework for VLT trackers.
arXiv Detail & Related papers (2024-11-23T16:31:40Z)
ChatTracker: Enhancing Visual Tracking Performance via Chatting with Multimodal Large Language Model [29.702895846058265]
Vision-Language(VL) trackers have proposed to utilize additional natural language descriptions to enhance versatility in various applications. VL trackers are still inferior to State-of-The-Art (SoTA) visual trackers in terms of tracking performance. We propose ChatTracker to leverage the wealth of world knowledge in the Multimodal Large Language Model (MLLM) to generate high-quality language descriptions.
arXiv Detail & Related papers (2024-11-04T02:43:55Z)
MultiVENT 2.0: A Massive Multilingual Benchmark for Event-Centric Video Retrieval [57.891157692501345]
$textbfMultiVENT 2.0$ is a large-scale, multilingual event-centric video retrieval benchmark. It features a collection of more than 218,000 news videos and 3,906 queries targeting specific world events. Preliminary results show that state-of-the-art vision-language models struggle significantly with this task.
arXiv Detail & Related papers (2024-10-15T13:56:34Z)
Visual Language Tracking with Multi-modal Interaction: A Robust Benchmark [23.551036494221222]
Visual Language Tracking (VLT) enhances tracking by mitigating the limitations of relying solely on the visual modality. Current VLT benchmarks do not account for multi-round interactions during tracking. We propose a novel and robust benchmark, VLT-MI, which introduces multi-round interaction into the VLT task for the first time.
arXiv Detail & Related papers (2024-09-13T14:54:37Z)
T2V-CompBench: A Comprehensive Benchmark for Compositional Text-to-video Generation [55.57459883629706]
We conduct the first systematic study on compositional text-to-video generation. We propose T2V-CompBench, the first benchmark tailored for compositional text-to-video generation.
arXiv Detail & Related papers (2024-07-19T17:58:36Z)
TRINS: Towards Multimodal Language Models that Can Read [61.17806538631744]
TRINS is a Text-Rich image INStruction dataset. It contains 39,153 text-rich images, captions, and 102,437 questions. We introduce a Language-vision Reading Assistant (LaRA) which is good at understanding textual content within images.
arXiv Detail & Related papers (2024-06-10T18:52:37Z)
Multi-Granularity Language-Guided Multi-Object Tracking [95.91263758294154]
We propose a new multi-object tracking framework, named LG-MOT, that explicitly leverages language information at different levels of granularity. At inference, our LG-MOT uses the standard visual features without relying on annotated language descriptions. Our LG-MOT achieves an absolute gain of 2.2% in terms of target object association (IDF1 score) compared to the baseline using only visual features.
arXiv Detail & Related papers (2024-06-07T11:18:40Z)
Text-Video Retrieval with Global-Local Semantic Consistent Learning [122.15339128463715]
We propose a simple yet effective method, Global-Local Semantic Consistent Learning (GLSCL) GLSCL capitalizes on latent shared semantics across modalities for text-video retrieval. Our method achieves comparable performance with SOTA as well as being nearly 220 times faster in terms of computational cost.
arXiv Detail & Related papers (2024-05-21T11:59:36Z)
DTLLM-VLT: Diverse Text Generation for Visual Language Tracking Based on LLM [23.551036494221222]
Visual Language Tracking (VLT) enhances single object tracking (SOT) by integrating natural language descriptions from a video, for the precise tracking of a specified object. Most VLT benchmarks are annotated in a single granularity and lack a coherent semantic framework to provide scientific guidance. We introduce DTLLM-VLT, which automatically generates extensive and multi-granularity text to enhance environmental diversity.
arXiv Detail & Related papers (2024-05-20T16:01:01Z)
Learning Grounded Vision-Language Representation for Versatile Understanding in Untrimmed Videos [57.830865926459914]
We propose a vision-language learning framework for untrimmed videos, which automatically detects informative events. Instead of coarse-level video-language alignments, we present two dual pretext tasks to encourage fine-grained segment-level alignments. Our framework is easily to tasks covering visually-grounded language understanding and generation.
arXiv Detail & Related papers (2023-03-11T11:00:16Z)
VALUE: A Multi-Task Benchmark for Video-and-Language Understanding Evaluation [124.02278735049235]
VALUE benchmark aims to cover a broad range of video genres, video lengths, data volumes, and task difficulty levels. We evaluate various baseline methods with and without large-scale VidL pre-training. The significant gap between our best model and human performance calls for future study for advanced VidL models.
arXiv Detail & Related papers (2021-06-08T18:34:21Z)
See, Hear, Read: Leveraging Multimodality with Guided Attention for Abstractive Text Summarization [14.881597737762316]
We introduce the first large-scale dataset for abstractive text summarization with videos of diverse duration, compiled from presentations in well-known academic conferences like NDSS, ICML, NeurIPS, etc. We then propose name, a factorized multi-modal Transformer based decoder-only language model, which inherently captures the intra-modal and inter-modal dynamics within various input modalities for the text summarization task.
arXiv Detail & Related papers (2021-05-20T08:56:33Z)

This list is automatically generated from the titles and abstracts of the papers in this site.