Which Way Does Time Flow? A Psychophysics-Grounded Evaluation for Vision-Language Models
- URL: http://arxiv.org/abs/2510.26241v2
- Date: Wed, 05 Nov 2025 05:49:17 GMT
- Title: Which Way Does Time Flow? A Psychophysics-Grounded Evaluation for Vision-Language Models
- Authors: Shiho Matta, Lis Kanashiro Pereira, Peitao Han, Fei Cheng, Shigeru Kitazawa,
- Abstract summary: Modern vision-language models (VLMs) excel at many multimodal tasks, yet their grasp of temporal information in video remains weak and, crucially, under-evaluated. We probe this gap with a deceptively simple but revealing challenge: judging the arrow of time (AoT), i.e., whether a short clip is played forward or backward. We introduce AoT-PsyPhyBENCH, a psychophysically validated benchmark that tests whether VLMs can infer temporal direction in natural videos using the same stimuli and behavioral baselines established for humans.
- Score: 3.701776503593477
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Modern vision-language models (VLMs) excel at many multimodal tasks, yet their grasp of temporal information in video remains weak and, crucially, under-evaluated. We probe this gap with a deceptively simple but revealing challenge: judging the arrow of time (AoT), i.e., whether a short clip is played forward or backward. We introduce AoT-PsyPhyBENCH, a psychophysically validated benchmark that tests whether VLMs can infer temporal direction in natural videos using the same stimuli and behavioral baselines established for humans. Our comprehensive evaluation of open-weight and proprietary, reasoning and non-reasoning VLMs reveals that most models perform near chance, and even the best lag far behind human accuracy on physically irreversible processes (e.g., free fall, diffusion/explosion) and causal manual actions (division/addition) that humans recognize almost instantly. These results highlight a fundamental gap in current multimodal systems: while they capture rich visual-semantic correlations, they lack the inductive biases required for temporal continuity and causal understanding. We release the code and data for AoT-PsyPhyBENCH to encourage further progress in the physical and temporal reasoning capabilities of VLMs.
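The task behind AoT-PsyPhyBENCH is a two-alternative forced choice: show a short clip either as recorded or reversed, ask the model which way time flows, and score it against the 50% chance level and the human behavioral baselines. The sketch below only illustrates that protocol; the helper names (`load_clip`, `query_vlm`), the prompt wording, and the random-guess stub are assumptions for this example, not the released benchmark code.

```python
# Illustrative sketch of an arrow-of-time (AoT) evaluation loop.
# Assumptions: `load_clip` and `query_vlm` are hypothetical placeholders for a
# video decoder and a VLM client; the stub model guesses at random, so its
# accuracy should land near the 50% chance level the paper uses as reference.

import random
from typing import Dict, List, Sequence

PROMPT = (
    "Is this clip played forward or backward? "
    "Answer with exactly one word: forward or backward."
)

def load_clip(path: str, reverse: bool) -> List[str]:
    """Hypothetical loader: stands in for decoding frames and optionally reversing them."""
    frames = [f"{path}#frame{i}" for i in range(16)]  # dummy frame identifiers
    return frames[::-1] if reverse else frames

def query_vlm(frames: Sequence[str], prompt: str) -> str:
    """Stub model call: replace with a real VLM; here it guesses at random."""
    return random.choice(["forward", "backward"])

def evaluate_aot(items: List[Dict[str, str]]) -> float:
    """Return accuracy on the forward/backward judgment over all items."""
    correct = 0
    for item in items:
        # Randomize presentation so each direction is the correct answer ~50% of the time.
        reverse = random.random() < 0.5
        frames = load_clip(item["video_path"], reverse=reverse)
        answer = query_vlm(frames, PROMPT).strip().lower()
        prediction = "backward" if "backward" in answer else "forward"
        correct += prediction == ("backward" if reverse else "forward")
    return correct / len(items)

if __name__ == "__main__":
    dummy_items = [{"video_path": f"clip_{i}.mp4"} for i in range(200)]
    print(f"accuracy: {evaluate_aot(dummy_items):.3f}  (chance level: 0.500)")
```

Forcing a single-word answer keeps scoring deterministic; a real harness would also need to handle refusals or verbose replies before mapping them to forward/backward.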
Related papers
- Video-R2: Reinforcing Consistent and Grounded Reasoning in Multimodal Language Models [56.851611990473174]
Reasoning over dynamic visual content remains a central challenge for large language models. We propose a reinforcement learning approach that enhances both temporal precision and reasoning consistency. The resulting model, Video-R2, achieves consistently higher TAC, VAS, and accuracy across multiple benchmarks.
arXiv Detail & Related papers (2025-11-28T18:59:58Z) - TRAVL: A Recipe for Making Video-Language Models Better Judges of Physics Implausibility [70.24211591214528]
Video generative models produce sequences that violate intuitive physical laws, such as objects floating, teleporting, or morphing. Existing Video-Language Models (VLMs) struggle to identify physics violations, exposing fundamental limitations in their temporal and causal reasoning. We introduce TRAVL, a fine-tuning recipe that combines a balanced training dataset with a trajectory-aware attention module to improve motion encoding. We propose ImplausiBench, a benchmark of 300 videos (150 real, 150 generated) that removes linguistic biases and isolates visual-temporal understanding.
arXiv Detail & Related papers (2025-10-08T21:03:46Z) - Seeing the Arrow of Time in Large Multimodal Models [60.56280929030237]
Current large multimodal models (LMMs) struggle to perceive and utilize temporal directionality in video when responding to language queries. We introduce ArrowRL, a reinforcement learning (RL)-based training strategy with an innovative reverse reward that instills AoT awareness (an illustrative sketch of one possible reward of this kind appears after this list). For rigorous evaluation, we additionally develop AoTBench, a new multi-faceted benchmark probing temporally challenging questions.
arXiv Detail & Related papers (2025-06-03T19:32:07Z) - Time Blindness: Why Video-Language Models Can't See What Humans Can? [48.653937503646375]
We introduce SpookyBench, a benchmark where information is encoded solely in temporal sequences of noise-like frames. While humans can recognize shapes, text, and patterns in these sequences with over 98% accuracy, state-of-the-art vision-language models achieve 0% accuracy. This performance gap highlights a critical limitation: an over-reliance on frame-level spatial features and an inability to extract meaning from temporal cues.
arXiv Detail & Related papers (2025-05-30T17:59:12Z) - TimeCausality: Evaluating the Causal Ability in Time Dimension for Vision Language Models [13.018267909897014]
Reasoning about temporal causality, particularly irreversible transformations of objects governed by real-world knowledge, is a fundamental aspect of human visual understanding. We introduce TimeCausality, a novel benchmark designed to evaluate the causal reasoning ability of Vision-Language Models (VLMs) in the temporal dimension. We find that while current SOTA open-source VLMs have reached performance levels comparable to closed-source models like GPT-4o on standard benchmarks, they fall significantly behind their closed-source competitors on TimeCausality.
arXiv Detail & Related papers (2025-05-21T12:18:02Z) - TOMATO: Assessing Visual Temporal Reasoning Capabilities in Multimodal Foundation Models [55.48403691519395]
TOMATO is a novel benchmark crafted to rigorously assess MFMs' temporal reasoning capabilities in video understanding. TOMATO comprises 1,484 carefully curated, human-annotated questions spanning six tasks. Our comprehensive evaluation reveals a human-model performance gap of 57.3% with the best-performing model.
arXiv Detail & Related papers (2024-10-30T17:50:23Z) - TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models [75.42002690128486]
TemporalBench is a new benchmark dedicated to evaluating fine-grained temporal understanding in videos.
It consists of 10K video question-answer pairs, derived from 2K high-quality human annotations detailing the temporal dynamics in video clips.
Results show that state-of-the-art models like GPT-4o achieve only 38.5% question answering accuracy on TemporalBench.
arXiv Detail & Related papers (2024-10-14T17:59:58Z)
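The ArrowRL entry above names a "reverse reward" but does not spell it out. The sketch below shows one plausible shape such a reward could take, purely as an illustration: the reward values, the mean baseline, and the REINFORCE-style loss are assumptions for this example, not ArrowRL's published formulation.

```python
# Illustrative guess at a "reverse reward" for arrow-of-time awareness.
# Assumption: the policy is rewarded when its stated temporal direction matches
# whether the clip was actually reversed; this is NOT ArrowRL's actual method.

from typing import List

def reverse_reward(predicted: str, clip_was_reversed: bool) -> float:
    """+1 for a correct forward/backward call, -1 otherwise."""
    target = "backward" if clip_was_reversed else "forward"
    return 1.0 if predicted == target else -1.0

def reinforce_loss(log_probs: List[float], rewards: List[float]) -> float:
    """REINFORCE-style loss with a mean-reward baseline to reduce variance."""
    baseline = sum(rewards) / len(rewards)
    return -sum(lp * (r - baseline) for lp, r in zip(log_probs, rewards))

# Example: two sampled answers on a reversed clip, one correct and one wrong.
rewards = [reverse_reward("backward", True), reverse_reward("forward", True)]
loss = reinforce_loss(log_probs=[-0.4, -0.7], rewards=rewards)
print(f"loss: {loss:.3f}")
```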