VITATECS: A Diagnostic Dataset for Temporal Concept Understanding of Video-Language Models
- URL: http://arxiv.org/abs/2311.17404v2
- Date: Sat, 21 Sep 2024 14:59:07 GMT
- Authors: Shicheng Li, Lei Li, Shuhuai Ren, Yuanxin Liu, Yi Liu, Rundong Gao, Xu Sun, Lu Hou
- Abstract summary: We present VITATECS, a diagnostic VIdeo-Text dAtaset for the evaluation of TEmporal Concept underStanding.
We first introduce a fine-grained taxonomy of temporal concepts in natural language in order to diagnose the capability of VidLMs to comprehend different temporal aspects.
We generate counterfactual video descriptions that differ from the original one only in the specified temporal aspect.
- Score: 27.280311932711847
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The ability to perceive how objects change over time is a crucial ingredient in human intelligence. However, current benchmarks cannot faithfully reflect the temporal understanding abilities of video-language models (VidLMs) due to the existence of static visual shortcuts. To remedy this issue, we present VITATECS, a diagnostic VIdeo-Text dAtaset for the evaluation of TEmporal Concept underStanding. Specifically, we first introduce a fine-grained taxonomy of temporal concepts in natural language in order to diagnose the capability of VidLMs to comprehend different temporal aspects. Furthermore, to disentangle the correlation between static and temporal information, we generate counterfactual video descriptions that differ from the original one only in the specified temporal aspect. We employ a semi-automatic data collection framework using large language models and human-in-the-loop annotation to obtain high-quality counterfactual descriptions efficiently. Evaluation of representative video-language understanding models confirms their deficiency in temporal understanding, revealing the need for greater emphasis on the temporal elements in video-language research.
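The evaluation idea in the abstract, checking whether a model prefers the original description over a counterfactual that differs only in one temporal aspect, can be illustrated with a small sketch. This is not the authors' evaluation code: the `CounterfactualPair` record, its field names, the `score` callable, and the aspect labels are all hypothetical stand-ins, and any real VidLM would need its own video-text similarity function plugged in.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Tuple


@dataclass(frozen=True)
class CounterfactualPair:
    """One evaluation item (hypothetical schema, not the dataset's actual format)."""
    video_id: str
    aspect: str          # label of the temporal aspect that was altered
    caption: str         # original description of the video
    counterfactual: str  # description differing only in the given aspect


def pairwise_accuracy(
    pairs: Iterable[CounterfactualPair],
    score: Callable[[str, str], float],
) -> Dict[str, float]:
    """Return, per aspect, the fraction of pairs where the model scores the
    original caption strictly higher than the counterfactual one.

    `score(video_id, text)` is an assumed interface for a video-text
    similarity model. Ties count as errors.
    """
    counts: Dict[str, Tuple[int, int]] = {}
    for pair in pairs:
        correct = score(pair.video_id, pair.caption) > score(
            pair.video_id, pair.counterfactual
        )
        hits, total = counts.get(pair.aspect, (0, 0))
        counts[pair.aspect] = (hits + int(correct), total + 1)
    return {aspect: hits / total for aspect, (hits, total) in counts.items()}
```

Reporting accuracy per aspect rather than as a single number keeps the diagnostic purpose visible: a model can look adequate on average while failing on one kind of temporal change. Under this pairwise setup, a scorer that ignores temporal information would be expected to land near 50%.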
Related papers
- On the Consistency of Video Large Language Models in Temporal Comprehension [57.985769348320616]
Video large language models (Video-LLMs) can temporally ground language queries and retrieve video moments.
We conduct a study on prediction consistency -- a key indicator for robustness and trustworthiness of temporal grounding.
(2024-11-20T00:47:17Z)
- TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models [75.42002690128486]
TemporalBench is a new benchmark dedicated to evaluating fine-grained temporal understanding in videos.
It consists of 10K video question-answer pairs, derived from 2K high-quality human annotations detailing the temporal dynamics in video clips.
Results show that state-of-the-art models like GPT-4o achieve only 38.5% question answering accuracy on TemporalBench.
(2024-10-14T17:59:58Z)
- Temporal Reasoning Transfer from Text to Video [51.68487044397409]
Video Large Language Models (Video LLMs) struggle with tracking temporal changes and reasoning about temporal relationships.
We introduce the Textual Temporal reasoning Transfer (T3) to transfer temporal reasoning abilities from text to video domains.
The LongVA-7B model achieves competitive performance on comprehensive video benchmarks.
(2024-10-08T16:10:29Z)
- MLLM as Video Narrator: Mitigating Modality Imbalance in Video Moment Retrieval [53.417646562344906]
Video Moment Retrieval (VMR) aims to localize a specific temporal segment within an untrimmed long video given a natural language query.
Existing methods often suffer from inadequate training annotations, i.e., the sentence typically matches with a fraction of the prominent video content in the foreground with limited wording diversity.
This intrinsic modality imbalance leaves a considerable portion of visual information remaining unaligned with text.
In this work, we take an MLLM as a video narrator to generate plausible textual descriptions of the video, thereby mitigating the modality imbalance and boosting the temporal localization.
(2024-06-25T18:39:43Z)
- Contrastive Language Video Time Pre-training [12.876308881183371]
We introduce LAVITI, a novel approach to learning language, video, and temporal representations in long-form videos via contrastive learning.
Our model employs a set of learnable moment queries to decode clip-level visual, language, and temporal features.
We validated our method on CharadesEgo action recognition, achieving state-of-the-art results.
(2024-06-04T02:48:59Z)
- Test of Time: Instilling Video-Language Models with a Sense of Time [42.290970800790184]
Seven existing video-language models struggle to understand simple temporal relations.
We propose a temporal adaptation recipe on top of one such model, VideoCLIP, based on post-pretraining on a small amount of video-text data.
We observe encouraging performance gains especially when the task needs higher time awareness.
(2023-01-05T14:14:36Z)
- HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training [49.52679453475878]
We propose a Temporal-Aware video-language pre-training framework, HiTeA, for modeling cross-modal alignment between moments and texts.
We achieve state-of-the-art results on 15 well-established video-language understanding and generation tasks.
(2022-12-30T04:27:01Z)
- Revisiting the "Video" in Video-Language Understanding [56.15777956496518]
We propose the atemporal probe (ATP), a new model for video-language analysis.
We characterize the limitations and potential of current video-language benchmarks.
We show that effectively integrating ATP into full video-level temporal models can improve efficiency and state-of-the-art accuracy.
(2022-06-03T17:57:33Z)
This list is automatically generated from the titles and abstracts of the papers on this site.