On the Consistency of Video Large Language Models in Temporal Comprehension
- URL: http://arxiv.org/abs/2411.12951v1
- Date: Wed, 20 Nov 2024 00:47:17 GMT
- Title: On the Consistency of Video Large Language Models in Temporal Comprehension
- Authors: Minjoon Jung, Junbin Xiao, Byoung-Tak Zhang, Angela Yao
- Abstract summary: Video large language models (Video-LLMs) can temporally ground language queries and retrieve video moments.
We conduct a study on prediction consistency -- a key indicator of the robustness and trustworthiness of temporal grounding.
- Score: 57.985769348320616
- Abstract: Video large language models (Video-LLMs) can temporally ground language queries and retrieve video moments. Yet, such temporal comprehension capabilities are neither well-studied nor understood. We therefore conduct a study on prediction consistency -- a key indicator of the robustness and trustworthiness of temporal grounding. After the model identifies an initial moment within the video content, we apply a series of probes to check whether the model's responses align with this initial grounding, as an indicator of reliable comprehension. Our results reveal that current Video-LLMs are sensitive to variations in video contents, language queries, and task settings, unveiling severe deficiencies in maintaining consistency. We further explore common prompting and instruction-tuning methods as potential solutions, but find that their improvements are often unstable. To address this, we propose event temporal verification tuning, which explicitly accounts for consistency, and demonstrate significant improvements in both grounding and consistency. Our data and code will be available at https://github.com/minjoong507/Consistency-of-Video-LLM.
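To make the probing protocol concrete, below is a minimal, hypothetical sketch of how prediction consistency could be scored against an initial grounding. The `ground_moment` callable and the `consistency_rate` helper are illustrative assumptions, not the authors' released code or interface; temporal IoU against the initial prediction is assumed as the alignment criterion.

```python
# Minimal, hypothetical sketch of consistency probing for temporal grounding.
# `ground_moment` stands in for any Video-LLM grounding call that maps a
# (video, query) pair to a predicted (start, end) moment in seconds; it is an
# assumed placeholder, not the paper's actual interface.

from typing import Callable, List, Tuple

Moment = Tuple[float, float]  # (start_sec, end_sec)


def temporal_iou(a: Moment, b: Moment) -> float:
    """Temporal intersection-over-union of two (start, end) segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def consistency_rate(
    ground_moment: Callable[[str, str], Moment],
    video: str,
    query: str,
    probe_queries: List[str],
    iou_threshold: float = 0.5,
) -> float:
    """Fraction of probes whose prediction stays aligned (IoU >= threshold)
    with the moment grounded for the original query."""
    initial = ground_moment(video, query)
    if not probe_queries:
        return 0.0
    hits = sum(
        temporal_iou(initial, ground_moment(video, probe)) >= iou_threshold
        for probe in probe_queries
    )
    return hits / len(probe_queries)


if __name__ == "__main__":
    # Toy stand-in model: pretends most queries ground to roughly the same moment.
    def fake_ground(video: str, query: str) -> Moment:
        return (12.0, 18.5) if "person" in query else (11.5, 19.0)

    probes = ["the person opens the door", "someone enters the room"]
    print(consistency_rate(fake_ground, "demo.mp4", "a person walks in", probes))
```

In the paper's setting, the probes would presumably involve variations such as rephrased queries or perturbed video content; the sketch above only illustrates the scoring side under those assumptions.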
Related papers
- TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models [75.42002690128486]
TemporalBench is a new benchmark dedicated to evaluating fine-grained temporal understanding in videos.
It consists of 10K video question-answer pairs, derived from 2K high-quality human annotations detailing the temporal dynamics in video clips.
Results show that state-of-the-art models like GPT-4o achieve only 38.5% question answering accuracy on TemporalBench.
arXiv Detail & Related papers (2024-10-14T17:59:58Z)
- Temporal Reasoning Transfer from Text to Video [51.68487044397409]
Video Large Language Models (Video LLMs) struggle with tracking temporal changes and reasoning about temporal relationships.
We introduce the Textual Temporal reasoning Transfer (T3) to transfer temporal reasoning abilities from text to video domains.
The LongVA-7B model achieves competitive performance on comprehensive video benchmarks.
arXiv Detail & Related papers (2024-10-08T16:10:29Z)
- VideoQA in the Era of LLMs: An Empirical Study [108.37456450182054]
Video Large Language Models (Video-LLMs) are flourishing and have advanced many video-intuitive tasks.
This work conducts a timely and comprehensive study of Video-LLMs' behavior in VideoQA.
Our analyses demonstrate that Video-LLMs excel in VideoQA; they can correlate contextual cues and generate plausible responses to questions about varied video contents.
However, models falter in handling video temporality, both in reasoning about temporal content ordering and grounding QA-relevant temporal moments.
arXiv Detail & Related papers (2024-08-08T05:14:07Z)
- SpikeMba: Multi-Modal Spiking Saliency Mamba for Temporal Video Grounding [50.337896542603524]
We introduce SpikeMba: a multi-modal spiking saliency mamba for temporal video grounding.
Our approach integrates Spiking Neural Networks (SNNs) with state space models (SSMs) to leverage their unique advantages.
Our experiments demonstrate the effectiveness of SpikeMba, which consistently outperforms state-of-the-art methods.
arXiv Detail & Related papers (2024-04-01T15:26:44Z)
- Test of Time: Instilling Video-Language Models with a Sense of Time [42.290970800790184]
Seven existing video-language models struggle to understand simple temporal relations.
We propose a temporal adaptation recipe on top of one such model, VideoCLIP, based on post-pretraining on a small amount of video-text data.
We observe encouraging performance gains especially when the task needs higher time awareness.
arXiv Detail & Related papers (2023-01-05T14:14:36Z)
- Revisiting the "Video" in Video-Language Understanding [56.15777956496518]
We propose the atemporal probe (ATP), a new model for video-language analysis.
We characterize the limitations and potential of current video-language benchmarks.
We show that effectively integrating ATP into full video-level temporal models can improve efficiency and state-of-the-art accuracy.
arXiv Detail & Related papers (2022-06-03T17:57:33Z)
This list is automatically generated from the titles and abstracts of the papers on this site.