Related papers: Time Blindness: Why Video-Language Models Can't See What Humans Can?

Time Blindness: Why Video-Language Models Can't See What Humans Can?

URL: http://arxiv.org/abs/2505.24867v1
Date: Fri, 30 May 2025 17:59:12 GMT
Title: Time Blindness: Why Video-Language Models Can't See What Humans Can?
Authors: Ujjwal Upadhyay, Mukul Ranjan, Zhiqiang Shen, Mohamed Elhoseiny,
Abstract summary: We introduce $bfSpookyBench, a benchmark where information is solely in temporal sequences of noise-like frames.<n>While humans can recognize shapes, text, and patterns in these sequences with over 98% accuracy, state-of-the-art vision-language models achieve 0% accuracy.<n>This performance gap highlights a critical limitation: an over-reliance on frame-level spatial features and an inability to extract meaning from temporal cues.
Score: 48.653937503646375
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent advances in vision-language models (VLMs) have made impressive strides in understanding spatio-temporal relationships in videos. However, when spatial information is obscured, these models struggle to capture purely temporal patterns. We introduce $\textbf{SpookyBench}$, a benchmark where information is encoded solely in temporal sequences of noise-like frames, mirroring natural phenomena from biological signaling to covert communication. Interestingly, while humans can recognize shapes, text, and patterns in these sequences with over 98% accuracy, state-of-the-art VLMs achieve 0% accuracy. This performance gap highlights a critical limitation: an over-reliance on frame-level spatial features and an inability to extract meaning from temporal cues. Furthermore, when trained in data sets with low spatial signal-to-noise ratios (SNR), temporal understanding of models degrades more rapidly than human perception, especially in tasks requiring fine-grained temporal reasoning. Overcoming this limitation will require novel architectures or training paradigms that decouple spatial dependencies from temporal processing. Our systematic analysis shows that this issue persists across model scales and architectures. We release SpookyBench to catalyze research in temporal pattern recognition and bridge the gap between human and machine video understanding. Dataset and code has been made available on our project website: https://timeblindness.github.io/.

Related papers

Harnessing Synthetic Preference Data for Enhancing Temporal Understanding of Video-LLMs [54.502280390499756]
We propose TimeWarp to create a targeted synthetic temporal dataset to fine-tune the model's responses to encourage it to focus on the given input video.<n>We demonstrate that when our method is applied to existing models, it significantly improves performance on temporal understanding benchmarks.
arXiv Detail & Related papers (2025-10-04T21:48:40Z)
TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models [75.42002690128486]
TemporalBench is a new benchmark dedicated to evaluating fine-grained temporal understanding in videos. It consists of 10K video question-answer pairs, derived from 2K high-quality human annotations detailing the temporal dynamics in video clips. Results show that state-of-the-art models like GPT-4o achieve only 38.5% question answering accuracy on TemporalBench.
arXiv Detail & Related papers (2024-10-14T17:59:58Z)
Disentangling Spatial and Temporal Learning for Efficient Image-to-Video Transfer Learning [59.26623999209235]
We present DiST, which disentangles the learning of spatial and temporal aspects of videos. The disentangled learning in DiST is highly efficient because it avoids the back-propagation of massive pre-trained parameters. Extensive experiments on five benchmarks show that DiST delivers better performance than existing state-of-the-art methods by convincing gaps.
arXiv Detail & Related papers (2023-09-14T17:58:33Z)
Test of Time: Instilling Video-Language Models with a Sense of Time [42.290970800790184]
Seven existing video-language models struggle to understand simple temporal relations. We propose a temporal adaptation recipe on top of one such model, VideoCLIP, based on post-pretraining on a small amount of video-text data. We observe encouraging performance gains especially when the task needs higher time awareness.
arXiv Detail & Related papers (2023-01-05T14:14:36Z)
Learning Fine-Grained Visual Understanding for Video Question Answering via Decoupling Spatial-Temporal Modeling [28.530765643908083]
We decouple spatial-temporal modeling and integrate an image- and a video-language to learn fine-grained visual understanding. We propose a novel pre-training objective, Temporal Referring Modeling, which requires the model to identify temporal positions of events in video sequences. Our model outperforms previous work pre-trained on orders of magnitude larger datasets.
arXiv Detail & Related papers (2022-10-08T07:03:31Z)
Temporally Consistent Transformers for Video Generation [80.45230642225913]
To generate accurate videos, algorithms have to understand the spatial and temporal dependencies in the world. No established benchmarks on complex data exist for rigorously evaluating video generation with long temporal dependencies. We introduce the Temporally Consistent Transformer (TECO), a generative model that substantially improves long-term consistency while also reducing sampling time.
arXiv Detail & Related papers (2022-10-05T17:15:10Z)
StyleVideoGAN: A Temporal Generative Model using a Pretrained StyleGAN [70.31913835035206]
We present a novel approach to the video synthesis problem that helps to greatly improve visual quality. We make use of a pre-trained StyleGAN network, the latent space of which allows control over the appearance of the objects it was trained for. Our temporal architecture is then trained not on sequences of RGB frames, but on sequences of StyleGAN latent codes.
arXiv Detail & Related papers (2021-07-15T09:58:15Z)
ST-HOI: A Spatial-Temporal Baseline for Human-Object Interaction Detection in Videos [91.29436920371003]
We propose a simple yet effective architecture named Spatial-Temporal HOI Detection (ST-HOI) We use temporal information such as human and object trajectories, correctly-localized visual features, and spatial-temporal masking pose features. We construct a new video HOI benchmark dubbed VidHOI where our proposed approach serves as a solid baseline.
arXiv Detail & Related papers (2021-05-25T07:54:35Z)
Temporal Interlacing Network [8.876132549551738]
temporal interlacing network (TIN) is a simple yet powerful operator for learning temporal features. TIN fuses the two kinds of information by interlacing spatial representations from the past to the future. TIN wins the $1st$ in the ICCV19 - Multi Moments in Time challenge.
arXiv Detail & Related papers (2020-01-17T19:06:05Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.