Stress Tests REVEAL Fragile Temporal and Visual Grounding in Video-Language Models
- URL: http://arxiv.org/abs/2602.11244v1
- Date: Wed, 11 Feb 2026 17:39:14 GMT
- Title: Stress Tests REVEAL Fragile Temporal and Visual Grounding in Video-Language Models
- Authors: Sethuraman T, Savya Khosla, Aditi Tiwari, Vidya Ganesh, Rakshana Jayaprakash, Aditya Jain, Vignesh Srinivasakumar, Onkar Kishor Susladkar, Srinidhi Sunkara, Aditya Shanmugham, Rakesh Vaideeswaran, Abbaas Alif Mohamed Nishar, Simon Jenni, Derek Hoiem
- Abstract summary: This work asks whether Video-Language Models (VidLMs) robustly account for video content, temporal sequence, and motion. We introduce REVEAL, a diagnostic benchmark that probes fundamental weaknesses of contemporary VidLMs. We find that these models confidently describe reversed scenes as forward, answer questions while neglecting video content, agree with false claims, struggle with basic camera motion, and fail to aggregate temporal information amidst simple spatiotemporal masking.
- Score: 18.243585941034116
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This work investigates a fundamental question: Do Video-Language Models (VidLMs) robustly account for video content, temporal sequence, and motion? Our investigation shows that, surprisingly, they often do not. We introduce REVEAL, a diagnostic benchmark that probes fundamental weaknesses of contemporary VidLMs through five controlled stress tests assessing temporal expectation bias, reliance on language-only shortcuts, video sycophancy, camera motion sensitivity, and robustness to spatiotemporal occlusion. We test leading open- and closed-source VidLMs and find that these models confidently describe reversed scenes as forward, answer questions while neglecting video content, agree with false claims, struggle with basic camera motion, and fail to aggregate temporal information amidst simple spatiotemporal masking. Humans, on the other hand, succeed at these tasks with ease. Alongside our benchmark, we provide a data pipeline that automatically generates diagnostic examples for our stress tests, enabling broader and more scalable evaluation. We will release our benchmark and code to support future research.
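To make the stress tests concrete, below is a minimal sketch (not the authors' released pipeline) of how two REVEAL-style diagnostic probes could be generated: a time-reversed clip for the temporal-expectation test and a drifting-occluder clip for the spatiotemporal-occlusion test. It assumes OpenCV and NumPy; `clip.mp4` and the output paths are placeholders.

```python
# Hypothetical sketch of two REVEAL-style probes (not the authors' code):
# (1) temporal expectation bias: play a clip backwards and check whether a
#     model still describes the action as running forward;
# (2) spatiotemporal occlusion: a gray patch drifts across the clip, so no
#     single frame reveals the full scene and answers require aggregating
#     information over time.

import cv2
import numpy as np

def load_frames(path: str) -> list[np.ndarray]:
    """Read all frames of a video into memory."""
    cap = cv2.VideoCapture(path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(frame)
    cap.release()
    return frames

def reversed_probe(frames: list[np.ndarray]) -> list[np.ndarray]:
    """Temporal-expectation probe: the same clip, time-reversed."""
    return frames[::-1]

def occlusion_probe(frames, patch: float = 0.25, seed: int = 0):
    """Spatiotemporal-occlusion probe: a flat gray square that takes a
    random walk across the clip, hiding a different region each frame."""
    rng = np.random.default_rng(seed)
    h, w = frames[0].shape[:2]
    ph, pw = int(h * patch), int(w * patch)
    y, x = int(rng.integers(0, h - ph)), int(rng.integers(0, w - pw))
    out = []
    for f in frames:
        f = f.copy()
        f[y:y + ph, x:x + pw] = 128  # paint the occluder
        y = int(np.clip(y + rng.integers(-5, 6), 0, h - ph))
        x = int(np.clip(x + rng.integers(-5, 6), 0, w - pw))
        out.append(f)
    return out

def write_video(frames, path: str, fps: float = 30.0) -> None:
    h, w = frames[0].shape[:2]
    vw = cv2.VideoWriter(path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))
    for f in frames:
        vw.write(f)
    vw.release()

frames = load_frames("clip.mp4")                     # placeholder input
write_video(reversed_probe(frames), "clip_rev.mp4")  # temporal probe
write_video(occlusion_probe(frames), "clip_occ.mp4") # occlusion probe
```

A model with genuine temporal grounding should describe clip_rev.mp4 as reversed and should need to integrate several frames of clip_occ.mp4 to answer content questions; per the abstract, current VidLMs often fail both.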
Related papers
- VideoSTF: Stress-Testing Output Repetition in Video Large Language Models [18.707229361658527]
We introduce VideoSTF, the first framework for systematically measuring and stress-testing output repetition in VideoLLMs. We find that output repetition is widespread and, critically, highly sensitive to temporal perturbations of video inputs. Our results reveal output repetition as a fundamental stability issue in modern VideoLLMs and motivate stability-aware evaluation for video-language systems (a toy repetition metric is sketched after this list).
arXiv Detail & Related papers (2026-02-11T08:40:48Z) - Video-BrowseComp: Benchmarking Agentic Video Research on Open Web [64.53060049124961]
Video-BrowseComp is a benchmark comprising 210 questions tailored for open-web agentic video reasoning. It enforces a mandatory dependency on temporal visual evidence, ensuring answers cannot be derived solely through text search. As the first open-web video research benchmark, Video-BrowseComp advances the field beyond passive perception toward proactive video reasoning.
arXiv Detail & Related papers (2025-12-28T19:08:27Z) - Breaking Down Video LLM Benchmarks: Knowledge, Spatial Perception, or True Temporal Understanding? [27.128582163847]
We identify two major limitations that obscure whether higher scores truly indicate stronger understanding of the dynamic content in videos. We propose VBenchComp, an automated pipeline that categorizes questions into different domains: LLM-Answerable, Semantic, and Temporal.
arXiv Detail & Related papers (2025-05-20T13:07:55Z) - Video SimpleQA: Towards Factuality Evaluation in Large Video Language Models [77.96693360763925]
Video SimpleQA is the first comprehensive benchmark tailored for factuality evaluation in video contexts. Our work differs from existing video benchmarks through the following key features: knowledge required (demanding integration of external knowledge beyond the video's explicit narrative) and short-form definitive answers (crafted to be unambiguous and definitively correct in a short format with minimal scoring variance).
arXiv Detail & Related papers (2025-03-24T17:46:09Z) - On the Consistency of Video Large Language Models in Temporal Comprehension [57.985769348320616]
Video large language models (Video-LLMs) can temporally ground language queries and retrieve video moments. We conduct a study on prediction consistency, a key indicator for robustness and trustworthiness of temporal grounding.
arXiv Detail & Related papers (2024-11-20T00:47:17Z) - Lost in Time: A New Temporal Benchmark for VideoLLMs [48.71203934876828]
We show that the most widely used video-language benchmarks can be solved without requiring much temporal reasoning. We propose TVBench, a novel open-source video multiple-choice question-answering benchmark.
arXiv Detail & Related papers (2024-10-10T09:28:36Z) - Temporal Reasoning Transfer from Text to Video [51.68487044397409]
Video Large Language Models (Video LLMs) struggle with tracking temporal changes and reasoning about temporal relationships.
We introduce the Textual Temporal reasoning Transfer (T3) to transfer temporal reasoning abilities from text to video domains.
The LongVA-7B model achieves competitive performance on comprehensive video benchmarks.
arXiv Detail & Related papers (2024-10-08T16:10:29Z) - VideoQA in the Era of LLMs: An Empirical Study [108.37456450182054]
Video Large Language Models (Video-LLMs) are flourishing and have advanced many video-intuitive tasks. This work conducts a timely and comprehensive study of Video-LLMs' behavior in VideoQA. Our analyses demonstrate that Video-LLMs excel in VideoQA; they can correlate contextual cues and generate plausible responses to questions about varied video content. However, the models falter in handling video temporality, both in reasoning about temporal content ordering and in grounding QA-relevant temporal moments.
arXiv Detail & Related papers (2024-08-08T05:14:07Z)
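As referenced in the VideoSTF entry above, one toy way to quantify output repetition is the fraction of duplicated n-grams in a generated caption. The sketch below is an illustrative metric in that spirit, not the VideoSTF framework itself.

```python
# Toy repetition metric (illustrative, not VideoSTF): the fraction of
# n-grams in a generated caption that occur more than once. Values near
# 1.0 indicate highly repetitive output; near 0.0, little repetition.

from collections import Counter

def repeated_ngram_ratio(text: str, n: int = 3) -> float:
    tokens = text.split()
    if len(tokens) < n:
        return 0.0
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    counts = Counter(ngrams)
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(ngrams)

# e.g. repeated_ngram_ratio("the cat sat the cat sat the cat sat") == 1.0
```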