TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models
- URL: http://arxiv.org/abs/2410.10818v2
- Date: Tue, 15 Oct 2024 17:55:46 GMT
- Authors: Mu Cai, Reuben Tan, Jianrui Zhang, Bocheng Zou, Kai Zhang, Feng Yao, Fangrui Zhu, Jing Gu, Yiwu Zhong, Yuzhang Shang, Yao Dou, Jaden Park, Jianfeng Gao, Yong Jae Lee, Jianwei Yang
- Abstract summary: TemporalBench is a new benchmark dedicated to evaluating fine-grained temporal understanding in videos.
It consists of 10K video question-answer pairs, derived from 2K high-quality human annotations detailing the temporal dynamics in video clips.
Results show that state-of-the-art models like GPT-4o achieve only 38.5% question answering accuracy on TemporalBench.
- Score: 75.42002690128486
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Understanding fine-grained temporal dynamics is crucial for multimodal video comprehension and generation. Due to the lack of fine-grained temporal annotations, existing video benchmarks mostly resemble static image benchmarks and are ill-suited to evaluating models' temporal understanding. In this paper, we introduce TemporalBench, a new benchmark dedicated to evaluating fine-grained temporal understanding in videos. TemporalBench consists of ~10K video question-answer pairs, derived from ~2K high-quality human annotations detailing the temporal dynamics in video clips. As a result, our benchmark provides a unique testbed for evaluating various temporal understanding and reasoning abilities such as action frequency, motion magnitude, and event order. Moreover, it enables evaluation across tasks (both video question answering and captioning, and both short and long video understanding) and across model types, such as multimodal video embedding models and text generation models. Results show that state-of-the-art models like GPT-4o achieve only 38.5% question answering accuracy on TemporalBench, demonstrating a significant gap (~30%) between humans and AI in temporal understanding. Furthermore, we identify a critical pitfall in multi-choice QA: LLMs can detect the subtle changes in negative captions and exploit a centralized description as a cue for their predictions. To correct this bias, we propose Multiple Binary Accuracy (MBA). We hope that TemporalBench can foster research on improving models' temporal reasoning capabilities. Both the dataset and evaluation code will be made available.
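As a concrete illustration of the MBA metric mentioned above, here is a minimal sketch, assuming MBA decomposes each multi-choice item into binary (positive vs. negative caption) comparisons and counts an item correct only when every comparison is answered correctly; the function names are hypothetical stand-ins, and the exact formulation is given in the paper:

```python
# Hedged sketch of Multiple Binary Accuracy (MBA), assuming each item is
# decomposed into binary positive-vs-negative caption comparisons and an
# item counts as correct only if ALL comparisons are answered correctly.

def multiple_binary_accuracy(items, model_prefers_positive):
    """items: list of (positive_caption, negative_captions, video_id).
    model_prefers_positive(video_id, pos, neg) -> bool is a hypothetical
    wrapper around any model call that picks the better of two captions."""
    items = list(items)
    correct = 0
    for pos, negatives, vid in items:
        # The item is correct only if the positive caption beats every
        # negative caption in its own binary comparison.
        if all(model_prefers_positive(vid, pos, neg) for neg in negatives):
            correct += 1
    return correct / len(items) if items else 0.0
```

Because a model must win every pairwise comparison for an item to count, MBA removes the shortcut of picking a "centralized" option among multiple choices.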
Related papers
- TOMATO: Assessing Visual Temporal Reasoning Capabilities in Multimodal Foundation Models [28.883607056108605]
TOMATO is a novel benchmark crafted to rigorously assess MFMs' temporal reasoning capabilities in video understanding.
TOMATO comprises 1,484 carefully curated, human-annotated questions spanning six tasks.
Our comprehensive evaluation reveals a human-model performance gap of 57.3% with the best-performing model.
arXiv Detail & Related papers (2024-10-30T17:50:23Z)
- TVBench: Redesigning Video-Language Evaluation [48.71203934876828]
We show that the currently most used video-language benchmarks can be solved without requiring much temporal reasoning.
We propose TVBench, a novel open-source video multiple-choice question-answering benchmark.
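One way to test whether a benchmark truly requires temporal reasoning, in the spirit of TVBench's analysis, is a frame-shuffle probe: if accuracy barely drops when frames are shuffled, ordered temporal information was not needed. A toy sketch, where `answer_question(frames, question)` is a hypothetical wrapper around any video QA model:

```python
import random

# Toy temporal-sensitivity probe: compare accuracy on ordered frames
# against accuracy on randomly shuffled frames.

def temporal_sensitivity(dataset, answer_question, seed=0):
    """dataset: list of (frames, question, gold_answer) triples."""
    rng = random.Random(seed)
    ordered_correct = shuffled_correct = 0
    for frames, question, gold in dataset:
        if answer_question(frames, question) == gold:
            ordered_correct += 1
        shuffled = frames[:]  # copy so the original order is preserved
        rng.shuffle(shuffled)
        if answer_question(shuffled, question) == gold:
            shuffled_correct += 1
    n = len(dataset)
    return ordered_correct / n, shuffled_correct / n
```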
arXiv Detail & Related papers (2024-10-10T09:28:36Z)
- Temporal Reasoning Transfer from Text to Video [51.68487044397409]
Video Large Language Models (Video LLMs) struggle with tracking temporal changes and reasoning about temporal relationships.
We introduce Textual Temporal reasoning Transfer (T3), which transfers temporal reasoning abilities from text to video domains.
The LongVA-7B model achieves competitive performance on comprehensive video benchmarks.
arXiv Detail & Related papers (2024-10-08T16:10:29Z)
- ChronoMagic-Bench: A Benchmark for Metamorphic Evaluation of Text-to-Time-lapse Video Generation [57.651809298512276]
ChronoMagic-Bench is a text-to-video (T2V) generation benchmark.
It focuses on the model's ability to generate time-lapse videos with significant metamorphic amplitude and temporal coherence.
We conduct manual evaluations of ten representative T2V models, revealing their strengths and weaknesses.
We create a large-scale ChronoMagic-Pro dataset, containing 460K high-quality pairs of 720p time-lapse videos and detailed captions.
arXiv Detail & Related papers (2024-06-26T17:50:47Z)
- Needle In A Video Haystack: A Scalable Synthetic Evaluator for Video MLLMs [20.168429351519055]
Video understanding is a crucial next step for multimodal large language models (MLLMs).
We propose VideoNIAH (Video Needle In A Haystack), a benchmark construction framework through synthetic video generation.
We conduct a comprehensive evaluation of both proprietary and open-source models, uncovering significant differences in their video understanding capabilities.
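The needle-in-a-haystack construction can be illustrated with a toy sketch: splice a short synthetic "needle" clip into an unrelated "haystack" video and pair it with a question that can only be answered by finding the needle. The helper below is hypothetical; the actual VideoNIAH pipeline is described in the paper:

```python
import random

# Toy needle-in-a-haystack sample builder: frame lists stand in for real
# decoded video frames. The needle is inserted at a random position, and
# its span is recorded so retrieval accuracy can be checked later.

def build_niah_sample(haystack_frames, needle_frames, question, answer, seed=0):
    rng = random.Random(seed)
    pos = rng.randint(0, len(haystack_frames))
    frames = haystack_frames[:pos] + needle_frames + haystack_frames[pos:]
    return {
        "frames": frames,
        "needle_span": (pos, pos + len(needle_frames)),
        "question": question,
        "answer": answer,
    }
```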
arXiv Detail & Related papers (2024-06-13T17:50:05Z)
- MVBench: A Comprehensive Multi-modal Video Understanding Benchmark [63.14000659130736]
We introduce a comprehensive Multi-modal Video understanding Benchmark, namely MVBench.
We first introduce a novel static-to-dynamic method to define these temporal-related tasks.
Then, guided by the task definition, we automatically convert public video annotations into multiple-choice QA to evaluate each task.
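A hedged sketch of what such an annotation-to-MCQ conversion can look like, using labels sampled from the same annotation pool as distractors (MVBench's actual pipeline may differ):

```python
import random

# Toy conversion of (video, label) annotations into multiple-choice QA:
# the ground-truth label becomes the correct option, and distractors are
# sampled from the other labels in the pool.

def annotation_to_mcq(annotations, question_template, n_options=4, seed=0):
    """annotations: list of (video_id, label) pairs."""
    rng = random.Random(seed)
    label_pool = sorted({label for _, label in annotations})
    items = []
    for video_id, label in annotations:
        distractors = rng.sample(
            [l for l in label_pool if l != label], n_options - 1)
        options = distractors + [label]
        rng.shuffle(options)
        items.append({
            "video": video_id,
            "question": question_template,
            "options": options,
            "answer": options.index(label),
        })
    return items
```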
arXiv Detail & Related papers (2023-11-28T17:59:04Z)
- MERLOT: Multimodal Neural Script Knowledge Models [74.05631672657452]
We introduce MERLOT, a model that learns multimodal script knowledge by watching millions of YouTube videos with transcribed speech.
MERLOT exhibits strong out-of-the-box representations of temporal commonsense, and achieves state-of-the-art performance on 12 different video QA datasets.
On Visual Commonsense Reasoning, MERLOT answers questions correctly with 80.6% accuracy, outperforming state-of-the-art models of similar size by over 3%.
arXiv Detail & Related papers (2021-06-04T17:57:39Z)
- Dense-Caption Matching and Frame-Selection Gating for Temporal Localization in VideoQA [96.10612095576333]
We propose a video question answering model which effectively integrates multi-modal input sources and finds the temporally relevant information to answer questions.
Our model also comprises dual-level attention (word/object and frame level), multi-head self/cross-integration for different sources (video and dense captions), and gates that pass the more relevant information.
We evaluate our model on the challenging TVQA dataset, where each of our model components provides significant gains, and our overall model outperforms the state-of-the-art by a large margin.
arXiv Detail & Related papers (2020-05-13T16:35:27Z)