GLIMPSE: Do Large Vision-Language Models Truly Think With Videos or Just Glimpse at Them?
- URL: http://arxiv.org/abs/2507.09491v1
- Date: Sun, 13 Jul 2025 04:44:57 GMT
- Title: GLIMPSE: Do Large Vision-Language Models Truly Think With Videos or Just Glimpse at Them?
- Authors: Yiyang Zhou, Linjie Li, Shi Qiu, Zhengyuan Yang, Yuyang Zhao, Siwei Han, Yangfan He, Kangqi Li, Haonian Ji, Zihao Zhao, Haibo Tong, Lijuan Wang, Huaxiu Yao
- Abstract summary: GLIMPSE consists of 3,269 videos and over 4,342 highly visual-centric questions across 11 categories.
All questions are carefully crafted by human annotators and require watching the entire video and reasoning over the full video context.
In human evaluations, GLIMPSE achieves 94.82% accuracy, but current LVLMs face significant challenges.
- Score: 76.67205289006795
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing video benchmarks often resemble image-based benchmarks, with question types like "What actions does the person perform throughout the video?" or "What color is the woman's dress in the video?" For these, models can often answer by scanning just a few key frames, without deep temporal reasoning. This limits our ability to assess whether large vision-language models (LVLMs) can truly think with videos rather than perform superficial frame-level analysis. To address this, we introduce GLIMPSE, a benchmark specifically designed to evaluate whether LVLMs can genuinely think with videos. Unlike prior benchmarks, GLIMPSE emphasizes comprehensive video understanding beyond static image cues. It consists of 3,269 videos and over 4,342 highly visual-centric questions across 11 categories, including Trajectory Analysis, Temporal Reasoning, and Forensics Detection. All questions are carefully crafted by human annotators and require watching the entire video and reasoning over the full video context; this is what we mean by thinking with videos. These questions cannot be answered by scanning selected frames or relying on text alone. In human evaluations, GLIMPSE achieves 94.82% accuracy, but current LVLMs face significant challenges. Even the best-performing model, GPT-o3, reaches only 66.43%, highlighting that LVLMs still struggle to move beyond surface-level reasoning to truly think with videos.
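For concreteness, below is a minimal sketch of how an LVLM might be scored on a GLIMPSE-style multiple-choice benchmark, reporting overall and per-category accuracy. The annotation fields (`video`, `question`, `options`, `answer`, `category`) and the `model.answer(...)` callable are illustrative assumptions, not the authors' released format or evaluation code.

```python
# Minimal sketch of scoring an LVLM on a GLIMPSE-style video QA benchmark.
# Assumptions (not from the paper): a JSON annotation file with the fields
# "video", "question", "options", "answer", "category", and a `model` object
# exposing answer(video=..., question=..., options=...) -> chosen option.
import json
from collections import defaultdict

def evaluate(model, annotation_path: str) -> dict:
    """Return overall and per-category accuracy for multiple-choice video QA."""
    with open(annotation_path) as f:
        items = json.load(f)

    correct = defaultdict(int)
    total = defaultdict(int)
    for item in items:
        # The model is expected to reason over the full video,
        # not just a handful of sampled key frames.
        prediction = model.answer(
            video=item["video"],
            question=item["question"],
            options=item["options"],
        )
        total[item["category"]] += 1
        if prediction == item["answer"]:
            correct[item["category"]] += 1

    per_category = {c: correct[c] / total[c] for c in total}
    overall = sum(correct.values()) / sum(total.values())
    return {"overall": overall, "per_category": per_category}
```

Under a harness like this, the quoted human (94.82%) and GPT-o3 (66.43%) figures would correspond to the returned overall accuracy.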
Related papers
- Towards Video Thinking Test: A Holistic Benchmark for Advanced Video Reasoning and Understanding [39.41651859086456]
We introduce the Video Thinking Test (Video-TT) to assess if video large language models (video LLMs) can interpret real-world videos as effectively as humans.
Video-TT reflects genuine gaps in understanding complex visual narratives, and evaluates robustness against natural adversarial questions.
Our evaluation shows a significant gap between video LLMs and human performance.
arXiv Detail & Related papers (2025-07-20T16:30:33Z)
- Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning? [56.06537213958482]
We present Video-Holmes, a benchmark designed to evaluate the complex video reasoning capabilities of MLLMs.
Video-Holmes consists of 1,837 questions derived from 270 manually annotated suspense short films.
Our comprehensive evaluation of state-of-the-art MLLMs reveals that, while these models generally excel at visual perception, they encounter substantial difficulties with integrating information.
arXiv Detail & Related papers (2025-05-27T16:05:01Z)
- Breaking Down Video LLM Benchmarks: Knowledge, Spatial Perception, or True Temporal Understanding? [27.128582163847]
We identify two major limitations that obscure whether higher scores truly indicate stronger understanding of the dynamic content in videos.
We propose VBenchComp, an automated pipeline that categorizes questions into different domains: LLM-Answerable, Semantic, and Temporal.
arXiv Detail & Related papers (2025-05-20T13:07:55Z)
- CG-Bench: Clue-grounded Question Answering Benchmark for Long Video Understanding [43.858197893052115]
CG-Bench is a novel benchmark for clue-grounded question answering in long videos.
It features 1,219 manually curated videos categorized by a granular system with 14 primary categories, 171 secondary categories, and 638 tertiary categories.
The benchmark includes 12,129 QA pairs in three major question types: perception, reasoning, and hallucination.
arXiv Detail & Related papers (2024-12-16T18:46:45Z)
- VideoQA in the Era of LLMs: An Empirical Study [108.37456450182054]
Video Large Language Models (Video-LLMs) are flourishing and have advanced many video-intuitive tasks.
This work conducts a timely and comprehensive study of Video-LLMs' behavior in VideoQA.
Our analyses demonstrate that Video-LLMs excel in VideoQA; they can correlate contextual cues and generate plausible responses to questions about varied video contents.
However, models falter in handling video temporality, both in reasoning about temporal content ordering and grounding QA-relevant temporal moments.
arXiv Detail & Related papers (2024-08-08T05:14:07Z)
- MMBench-Video: A Long-Form Multi-Shot Benchmark for Holistic Video Understanding [67.56182262082729]
We introduce MMBench-Video, a quantitative benchmark to rigorously evaluate large vision-language models (LVLMs) in video understanding.
MMBench-Video incorporates lengthy videos from YouTube and employs free-form questions, mirroring practical use cases.
The benchmark is meticulously crafted to probe the models' temporal reasoning skills, with all questions human-annotated according to a carefully constructed ability taxonomy.
arXiv Detail & Related papers (2024-06-20T17:26:01Z)
- VELOCITI: Benchmarking Video-Language Compositional Reasoning with Strict Entailment [19.313541287648473]
We introduce VELOCITI, a benchmark to study Video-LLMs by disentangling and assessing the comprehension of agents.
We adopt the Video-Language Entailment setup and propose StrictVLE, which requires correct classification (rather than ranking) of the positive and negative caption; a toy sketch of this distinction follows the related-papers list below.
Results show that action understanding lags behind agent understanding, and negative captions created using entities appearing in the video perform worse than those obtained from pure text manipulation.
arXiv Detail & Related papers (2024-06-16T10:42:21Z)
- VideoPrism: A Foundational Visual Encoder for Video Understanding [90.01845485201746]
VideoPrism is a general-purpose video encoder that tackles diverse video understanding tasks with a single frozen model.
We pretrain VideoPrism on a heterogeneous corpus containing 36M high-quality video-caption pairs and 582M video clips with noisy parallel text.
We extensively test VideoPrism on four broad groups of video understanding tasks, from web video question answering to CV for science, achieving state-of-the-art performance on 31 out of 33 video understanding benchmarks.
arXiv Detail & Related papers (2024-02-20T18:29:49Z)
- FunQA: Towards Surprising Video Comprehension [64.58663825184958]
We introduce FunQA, a challenging video question-answering dataset.
FunQA covers three previously unexplored types of surprising videos: HumorQA, CreativeQA, and MagicQA.
In total, the FunQA benchmark consists of 312K free-text QA pairs derived from 4.3K video clips.
arXiv Detail & Related papers (2023-06-26T17:59:55Z)
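As referenced in the VELOCITI entry above, the sketch below contrasts ranking-based evaluation with strict entailment classification in the spirit of StrictVLE. The `Scorer` signature (video, caption -> entailment probability) and the 0.5 decision threshold are assumptions for illustration, not the benchmark's actual interface.

```python
# Toy sketch: ranking-based vs. strict entailment evaluation of caption pairs.
# Assumption: `score` returns the probability that a caption is entailed by the video.
from typing import Callable

Scorer = Callable[[str, str], float]  # (video_path, caption) -> entailment probability

def ranking_correct(score: Scorer, video: str, pos: str, neg: str) -> bool:
    # Lenient criterion: the positive caption only needs to outscore the negative one.
    return score(video, pos) > score(video, neg)

def strict_correct(score: Scorer, video: str, pos: str, neg: str, threshold: float = 0.5) -> bool:
    # Strict criterion: the positive caption must be classified as entailed AND
    # the negative caption as not entailed, each judged independently.
    return score(video, pos) >= threshold and score(video, neg) < threshold
```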