TimeBlind: A Spatio-Temporal Compositionality Benchmark for Video LLMs
- URL: http://arxiv.org/abs/2602.00288v1
- Date: Fri, 30 Jan 2026 20:21:46 GMT
- Title: TimeBlind: A Spatio-Temporal Compositionality Benchmark for Video LLMs
- Authors: Baiqi Li, Kangyi Zhao, Ce Zhang, Chancharik Mitra, Jean de Dieu Nyandwi, Gedas Bertasius,
- Abstract summary: TimeBlind is a diagnostic benchmark for fine-grained temporal understanding. We evaluate over 20 state-of-the-art MLLMs on 600 instances. The Instance Accuracy of the best-performing MLLM is only 48.2%, far below human performance (98.2%).
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Fine-grained spatio-temporal understanding is essential for video reasoning and embodied AI. Yet, while Multimodal Large Language Models (MLLMs) master static semantics, their grasp of temporal dynamics remains brittle. We present TimeBlind, a diagnostic benchmark for compositional spatio-temporal understanding. Inspired by cognitive science, TimeBlind categorizes fine-grained temporal understanding into three levels: recognizing atomic events, characterizing event properties, and reasoning about event interdependencies. Unlike benchmarks that conflate recognition with temporal reasoning, TimeBlind leverages a minimal-pairs paradigm: video pairs share identical static visual content but differ solely in temporal structure, with complementary questions used to neutralize language priors. Evaluating over 20 state-of-the-art MLLMs (e.g., GPT-5, Gemini 3 Pro) on 600 curated instances (2400 video-question pairs) reveals that the Instance Accuracy (correctly distinguishing both videos in a pair) of the best-performing MLLM is only 48.2%, far below human performance (98.2%). These results demonstrate that even frontier models rely heavily on static visual shortcuts rather than genuine temporal logic, positioning TimeBlind as a vital diagnostic tool for next-generation video understanding. Dataset and code are available at https://baiqi-li.github.io/timeblind_project/.
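The headline metric above, Instance Accuracy, credits a model only when it answers correctly on both videos of a minimal pair, which prevents scoring by static appearance alone, since both videos share it. Below is a minimal sketch of how such a pairwise score can be computed; the record fields (`pair_id`, `prediction`, `answer`) are hypothetical placeholders, not the released evaluation code from the project page.

```python
from collections import defaultdict

def instance_accuracy(results):
    """Pairwise Instance Accuracy for a minimal-pairs benchmark.

    Each record in `results` corresponds to one video-question pair;
    an instance (a minimal pair of videos) counts as correct only if
    every question grouped under its pair_id is answered correctly.
    """
    per_pair = defaultdict(list)
    for r in results:
        per_pair[r["pair_id"]].append(r["prediction"] == r["answer"])
    if not per_pair:
        return 0.0
    correct = sum(all(flags) for flags in per_pair.values())
    return correct / len(per_pair)

# Toy usage: two instances; only the first is answered correctly on both videos.
demo = [
    {"pair_id": "p1", "prediction": "A", "answer": "A"},
    {"pair_id": "p1", "prediction": "B", "answer": "B"},
    {"pair_id": "p2", "prediction": "A", "answer": "A"},
    {"pair_id": "p2", "prediction": "A", "answer": "B"},
]
print(instance_accuracy(demo))  # 0.5
```

Dropping the `all(...)` aggregation recovers an ordinary per-question accuracy, which is why the pairwise score is the stricter test of temporal sensitivity.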
Related papers
- Harnessing Synthetic Preference Data for Enhancing Temporal Understanding of Video-LLMs [54.502280390499756]
We propose TimeWarp, which creates a targeted synthetic temporal dataset for fine-tuning, encouraging the model to focus its responses on the given input video. We demonstrate that when our method is applied to existing models, it significantly improves performance on temporal understanding benchmarks.
arXiv Detail & Related papers (2025-10-04T21:48:40Z) - DATE: Dynamic Absolute Time Enhancement for Long Video Understanding [8.720269393713451]
Long video understanding remains a fundamental challenge for multimodal large language models (MLLMs). We propose Dynamic Absolute Time Enhancement (DATE) that enhances temporal awareness in MLLMs. We introduce a two-stage algorithm to ensure both semantic relevance and temporal coverage.
arXiv Detail & Related papers (2025-09-11T08:49:22Z) - HumanVideo-MME: Benchmarking MLLMs for Human-Centric Video Understanding [120.84817886550765]
Multimodal Large Language Models (MLLMs) have demonstrated significant advances in visual understanding tasks involving both images and videos. Existing human-centric benchmarks predominantly emphasize video generation quality and action recognition, while overlooking essential perceptual and cognitive abilities required in human-centered scenarios. We propose a rigorously curated benchmark designed to provide a more holistic evaluation of MLLMs in human-centric video understanding.
arXiv Detail & Related papers (2025-07-07T11:52:24Z) - Iterative Zoom-In: Temporal Interval Exploration for Long Video Understanding [18.027290155746112]
Temporal Search is a training-free framework that enables MLLMs to iteratively explore temporal regions for improved long video understanding. It is based on a key observation: the model's generation confidence across different temporal intervals is highly correlated with prediction accuracy. It refines the model's focus by iteratively shifting attention to more fine-grained temporal intervals, improving its understanding of long videos.
arXiv Detail & Related papers (2025-06-28T15:24:05Z) - VideoMolmo: Spatio-Temporal Grounding Meets Pointing [66.19964563104385]
VideoMolmo is a model tailored for fine-grained pointing in video sequences. A novel temporal mask fusion employs SAM2 for bidirectional point propagation. To evaluate the generalization of VideoMolmo, we introduce VPoMolS-temporal, a challenging out-of-distribution benchmark spanning five real-world scenarios.
arXiv Detail & Related papers (2025-06-05T17:59:29Z) - RTime-QA: A Benchmark for Atomic Temporal Event Understanding in Large Multi-modal Models [85.59909303288921]
We introduce RTime-QA, a novel benchmark designed to assess the atomic temporal event understanding ability of Large Multi-modal Models (LMMs). RTime-QA comprises 822 high-quality, carefully curated video-text questions, each meticulously annotated by human experts. To advance LMMs' temporal event understanding ability, we further introduce RTime-IT, a 14k instruction-tuning dataset that employs a similar annotation process as RTime-QA.
arXiv Detail & Related papers (2025-05-25T12:44:12Z) - Can Multimodal LLMs do Visual Temporal Understanding and Reasoning? The answer is No! [22.75945626401567]
We propose a challenging evaluation benchmark named TemporalVQA. The first part requires MLLMs to determine the sequence of events by analyzing temporally consecutive video frames. The second part presents image pairs with varying time differences, framed as multiple-choice questions, asking MLLMs to estimate the time-lapse between images with options ranging from seconds to years. Our evaluations of advanced MLLMs, including models like GPT-4o and Gemini-1.5-Pro, reveal significant challenges.
arXiv Detail & Related papers (2025-01-18T06:41:48Z) - TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models [75.42002690128486]
TemporalBench is a new benchmark dedicated to evaluating fine-grained temporal understanding in videos.
It consists of 10K video question-answer pairs, derived from 2K high-quality human annotations detailing the temporal dynamics in video clips.
Results show that state-of-the-art models like GPT-4o achieve only 38.5% question answering accuracy on TemporalBench.
arXiv Detail & Related papers (2024-10-14T17:59:58Z) - Temporal Reasoning Transfer from Text to Video [51.68487044397409]
Video Large Language Models (Video LLMs) struggle with tracking temporal changes and reasoning about temporal relationships.
We introduce the Textual Temporal reasoning Transfer (T3) to transfer temporal reasoning abilities from text to video domains.
The LongVA-7B model achieves competitive performance on comprehensive video benchmarks.
arXiv Detail & Related papers (2024-10-08T16:10:29Z) - Coarse Correspondences Boost Spatial-Temporal Reasoning in Multimodal Language Model [51.83436609094658]
We introduce Coarse Correspondences, a simple lightweight method that enhances MLLMs' spatial-temporal reasoning with 2D images as input.
Our method uses a lightweight tracking model to identify primary object correspondences between frames in a video or across different image viewpoints.
We demonstrate that this simple training-free approach brings substantial gains to GPT4-V/O consistently on four benchmarks.
arXiv Detail & Related papers (2024-08-01T17:57:12Z)