VITATECS: A Diagnostic Dataset for Temporal Concept Understanding of
Video-Language Models
- URL: http://arxiv.org/abs/2311.17404v1
- Date: Wed, 29 Nov 2023 07:15:34 GMT
- Title: VITATECS: A Diagnostic Dataset for Temporal Concept Understanding of
Video-Language Models
- Authors: Shicheng Li, Lei Li, Shuhuai Ren, Yuanxin Liu, Yi Liu, Rundong Gao, Xu
Sun, Lu Hou
- Abstract summary: We present VITATECS, a diagnostic VIdeo-Text dAtaset for the evaluation of TEmporal Concept underStanding.
We first introduce a fine-grained taxonomy of temporal concepts in natural language in order to diagnose the capability of VidLMs to comprehend different temporal aspects.
We generate counterfactual video descriptions that differ from the original one only in the specified temporal aspect.
- Score: 28.455280591607686
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The ability to perceive how objects change over time is a crucial ingredient
in human intelligence. However, current benchmarks cannot faithfully reflect
the temporal understanding abilities of video-language models (VidLMs) due to
the existence of static visual shortcuts. To remedy this issue, we present
VITATECS, a diagnostic VIdeo-Text dAtaset for the evaluation of TEmporal
Concept underStanding. Specifically, we first introduce a fine-grained taxonomy
of temporal concepts in natural language in order to diagnose the capability of
VidLMs to comprehend different temporal aspects. Furthermore, to disentangle
the correlation between static and temporal information, we generate
counterfactual video descriptions that differ from the original one only in the
specified temporal aspect. We employ a semi-automatic data collection framework
using large language models and human-in-the-loop annotation to obtain
high-quality counterfactual descriptions efficiently. Evaluation of
representative video-language understanding models confirms their deficiency in
temporal understanding, revealing the need for greater emphasis on the temporal
elements in video-language research.
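A minimal sketch of the diagnostic protocol the abstract describes: score each video against its original caption and its temporal counterfactual, and report, per aspect of the temporal taxonomy, how often the model prefers the original. The example tuples and the `score_fn` callable are hypothetical stand-ins for the VITATECS data and an actual VidLM, not the paper's released evaluation code.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Tuple

def per_aspect_accuracy(
    examples: Iterable[Tuple[str, str, str, str]],  # (video, original, counterfactual, aspect)
    score_fn: Callable[[str, str], float],          # video-text matching score from a VidLM
) -> Dict[str, float]:
    """Fraction of pairs, per temporal aspect, where the model prefers the original caption."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for video, original, counterfactual, aspect in examples:
        total[aspect] += 1
        # The two captions differ only in the stated temporal aspect, so a static
        # visual shortcut cannot separate them; only temporal understanding can.
        correct[aspect] += score_fn(video, original) > score_fn(video, counterfactual)
    return {aspect: correct[aspect] / total[aspect] for aspect in total}
```

Chance level is 50% per aspect, so per-aspect scores make visible which temporal concepts a model actually fails on rather than averaging the failures away.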
Related papers
- MLLM as Video Narrator: Mitigating Modality Imbalance in Video Moment Retrieval [53.417646562344906]
Video Moment Retrieval (VMR) aims to localize a specific temporal segment within an untrimmed long video given a natural language query.
Existing methods often suffer from inadequate training annotations, i.e., each sentence typically matches only a fraction of the prominent foreground content and shows limited wording diversity.
This intrinsic modality imbalance leaves a considerable portion of the visual information unaligned with text.
In this work, we take an MLLM as a video narrator to generate plausible textual descriptions of the video, thereby mitigating the modality imbalance and boosting temporal localization (a rough sketch of this augmentation idea follows below).
arXiv Detail & Related papers (2024-06-25T18:39:43Z)
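A rough sketch, under assumptions, of the augmentation idea in the entry above: let an MLLM narrate unannotated segments and turn the generated captions into extra (query, moment) training pairs. `mllm_describe` is a hypothetical stand-in for the narration call; the paper's actual pipeline is more involved.

```python
from typing import Callable, Dict, List, Tuple

def augment_vmr_annotations(
    segments: List[Tuple[str, Tuple[float, float]]],           # (video_path, (start_s, end_s))
    mllm_describe: Callable[[str, Tuple[float, float]], str],  # hypothetical MLLM narration call
) -> List[Dict]:
    """Turn unannotated video segments into extra (query, moment) pairs for VMR training."""
    annotations = []
    for video, (start, end) in segments:
        caption = mllm_describe(video, (start, end))  # narrate this specific segment
        annotations.append({"video": video, "query": caption, "span": (start, end)})
    return annotations
```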
- Contrastive Language Video Time Pre-training [12.876308881183371]
We introduce LAVITI, a novel approach to learning language, video, and temporal representations in long-form videos via contrastive learning.
Our model employs a set of learnable moment queries to decode clip-level visual, language, and temporal features.
We validate our method on CharadesEgo action recognition, achieving state-of-the-art results (a rough sketch of the moment-query design follows below).
arXiv Detail & Related papers (2024-06-04T02:48:59Z)
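A minimal PyTorch sketch of the two ingredients the LAVITI entry names: learnable moment queries that cross-attend to frame features, and a symmetric InfoNCE loss between pooled video features and text features. Dimensions, mean-pooling, and the loss pairing are assumptions for illustration, not LAVITI's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MomentQueryDecoder(nn.Module):
    """Learnable moment queries cross-attend to frame features."""
    def __init__(self, dim: int = 256, num_queries: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (batch, num_frames, dim) from a video encoder
        q = self.queries.unsqueeze(0).expand(frame_feats.size(0), -1, -1)
        moments, _ = self.attn(q, frame_feats, frame_feats)
        return moments  # (batch, num_queries, dim)

def video_text_infonce(moment_feats, text_feats, temperature: float = 0.07):
    # Pool moment queries into one video vector, then symmetric InfoNCE against text.
    v = F.normalize(moment_feats.mean(dim=1), dim=-1)  # (batch, dim)
    t = F.normalize(text_feats, dim=-1)                # (batch, dim)
    logits = v @ t.t() / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2
```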
- Towards Effective Time-Aware Language Representation: Exploring Enhanced Temporal Understanding in Language Models [24.784375155633427]
BiTimeBERT 2.0 is a novel language model pre-trained on a temporal news article collection.
It is trained with multiple pre-training objectives, each targeting a unique aspect of temporal information.
Results consistently show that BiTimeBERT 2.0 outperforms BERT and other existing pre-trained models.
arXiv Detail & Related papers (2024-06-04T00:30:37Z)
- LSTP: Language-guided Spatial-Temporal Prompt Learning for Long-form Video-Text Understanding [48.83009641950664]
We introduce a novel approach called Language-guided Spatial-Temporal Prompt Learning (LSTP).
This approach features two key components: a Temporal Prompt Sampler (TPS) with an optical-flow prior that leverages temporal information to efficiently extract relevant video content, and a Spatial Prompt Solver (SPS) that adeptly captures the intricate spatial relationships between visual and textual elements.
By harmonizing TPS and SPS with a cohesive training strategy, our framework significantly enhances computational efficiency, temporal understanding, and spatial-temporal alignment (a minimal sketch of the TPS idea follows below).
arXiv Detail & Related papers (2024-02-25T10:27:46Z)
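One plausible reading of the TPS component above, sketched minimally: treat per-frame optical-flow magnitude (from any off-the-shelf flow estimator) as a motion prior and keep only the most dynamic frames in chronological order. The mechanism is an assumption for illustration, not LSTP's exact sampler.

```python
import numpy as np

def temporal_prompt_sample(frames: np.ndarray, flow_mag: np.ndarray, k: int = 8) -> np.ndarray:
    """Keep the k frames with the highest motion, preserving temporal order.

    frames:   (T, H, W, 3) decoded video frames
    flow_mag: (T,) mean optical-flow magnitude per frame (the motion prior)
    """
    top_k = np.argsort(flow_mag)[-k:]  # indices of the k most dynamic frames
    return frames[np.sort(top_k)]      # restore chronological order
```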
- Structured Video-Language Modeling with Temporal Grouping and Spatial Grounding [117.23208392452693]
We propose a simple yet effective video-language modeling framework, S-ViLM.
It includes two novel designs, inter-clip spatial grounding and intra-clip temporal grouping, to promote region-object alignment and temporally aware feature learning.
S-ViLM surpasses the state-of-the-art methods substantially on four representative downstream tasks.
arXiv Detail & Related papers (2023-03-28T22:45:07Z)
- Test of Time: Instilling Video-Language Models with a Sense of Time [42.290970800790184]
Seven existing video-language models struggle to understand simple temporal relations.
We propose a temporal adaptation recipe on top of one such model, VideoCLIP, based on post-pretraining on a small amount of video-text data.
We observe encouraging performance gains, especially when the task demands higher time awareness (a toy sketch of the order-contrast idea follows below).
arXiv Detail & Related papers (2023-01-05T14:14:36Z)
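A toy sketch of the kind of contrast such a temporal adaptation recipe can build on: for two events with a known order, construct a caption pair that differs only in event order, then train the video-text model to rank the true-order caption higher. The pairing template and margin loss are illustrative assumptions, not the paper's exact recipe.

```python
import torch.nn.functional as F

def order_contrast_pair(event_a: str, event_b: str) -> tuple[str, str]:
    # The two captions share all static content and differ only in event order.
    return f"{event_a} before {event_b}", f"{event_b} before {event_a}"

def temporal_margin_loss(video_emb, pos_text_emb, neg_text_emb, margin: float = 0.2):
    """Rank the true-order caption above the swapped one by at least a margin."""
    v = F.normalize(video_emb, dim=-1)
    pos = (v * F.normalize(pos_text_emb, dim=-1)).sum(-1)
    neg = (v * F.normalize(neg_text_emb, dim=-1)).sum(-1)
    return F.relu(margin - pos + neg).mean()
```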
- HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training [49.52679453475878]
We propose a Temporal-Aware video-language pre-training framework, HiTeA, for modeling cross-modal alignment between moments and texts.
We achieve state-of-the-art results on 15 well-established video-language understanding and generation tasks.
arXiv Detail & Related papers (2022-12-30T04:27:01Z)
- Revisiting the "Video" in Video-Language Understanding [56.15777956496518]
We propose the atemporal probe (ATP), a new model for video-language analysis.
We characterize the limitations and potential of current video-language benchmarks.
We show that effectively integrating ATP into full video-level temporal models can improve efficiency and state-of-the-art accuracy (a simplified sketch of the probe follows below).
arXiv Detail & Related papers (2022-06-03T17:57:33Z)
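A deliberately simplified stand-in for the atemporal probe: score each frame independently with a frozen image-language model and keep the best frame, so no temporal information is used at all. The real ATP learns its frame selection; this max-over-frames oracle only illustrates the diagnostic idea, and `image_text_score` is a hypothetical CLIP-style scoring call.

```python
from typing import Callable, Sequence

def atemporal_probe_score(
    frames: Sequence,                                  # individual frames of a video
    caption: str,
    image_text_score: Callable[[object, str], float],  # frozen image-language model
) -> float:
    """Best score any single, temporally blind frame can reach for this caption."""
    return max(image_text_score(frame, caption) for frame in frames)
```

If this single-frame score already matches a full video-level model on a benchmark, that benchmark is likely solvable through static shortcuts, which is precisely the concern VITATECS is built to expose.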
- ST-HOI: A Spatial-Temporal Baseline for Human-Object Interaction Detection in Videos [91.29436920371003]
We propose a simple yet effective architecture named Spatial-Temporal HOI Detection (ST-HOI).
We use temporal information such as human and object trajectories, correctly localized visual features, and spatial-temporal masking pose features (a rough fusion sketch follows below).
We construct a new video HOI benchmark dubbed VidHOI where our proposed approach serves as a solid baseline.
arXiv Detail & Related papers (2021-05-25T07:54:35Z)
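A rough sketch of the fusion step suggested by the entry above: concatenate trajectory, visual, and masking-pose features per human-object pair and classify interactions. Feature dimensions and fusion-by-concatenation are assumptions for illustration, not ST-HOI's published architecture.

```python
import torch
import torch.nn as nn

class STHOIFusionHead(nn.Module):
    """Concatenate trajectory, visual, and masking-pose features, then classify interactions."""
    def __init__(self, traj_dim: int = 128, vis_dim: int = 1024, pose_dim: int = 256,
                 num_classes: int = 50):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(traj_dim + vis_dim + pose_dim, 512),
            nn.ReLU(),
            nn.Linear(512, num_classes),
        )

    def forward(self, traj_feat, vis_feat, pose_feat):
        fused = torch.cat([traj_feat, vis_feat, pose_feat], dim=-1)
        return self.classifier(fused)  # interaction logits per human-object pair
```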
This list is automatically generated from the titles and abstracts of the papers on this site.