Assessing Situational and Spatial Awareness of VLMs with Synthetically Generated Video
- URL: http://arxiv.org/abs/2601.15780v1
- Date: Thu, 22 Jan 2026 09:14:11 GMT
- Title: Assessing Situational and Spatial Awareness of VLMs with Synthetically Generated Video
- Authors: Pascal Benschop, Justin Dauwels, Jan van Gemert
- Abstract summary: We introduce a synthetic benchmark that probes two complementary skills: situational awareness and spatial awareness. We test three challenges: distinguishing violence from benign activity, binding assailant roles across viewpoints, and judging fine-grained trajectory alignment. Results show performance only slightly above chance across tasks.
- Score: 18.381850705061
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Spatial reasoning in vision language models (VLMs) remains fragile when semantics hinge on subtle temporal or geometric cues. We introduce a synthetic benchmark that probes two complementary skills: situational awareness (recognizing whether an interaction is harmful or benign) and spatial awareness (tracking who does what to whom, and reasoning about relative positions and motion). Through minimal video pairs, we test three challenges: distinguishing violence from benign activity, binding assailant roles across viewpoints, and judging fine-grained trajectory alignment. While we evaluate recent VLMs in a training-free setting, the benchmark is applicable to any video classification model. Results show performance only slightly above chance across tasks. A simple aid, stable color cues, partly reduces assailant role confusions but does not resolve the underlying weakness. By releasing data and code, we aim to provide reproducible diagnostics and seed exploration of lightweight spatial priors to complement large-scale pretraining.
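As a rough illustration of the training-free evaluation protocol described in the abstract, the sketch below runs a VLM over minimal video pairs and scores binary accuracy against chance. It is a minimal sketch under assumed interfaces: `load_pairs`, `query_vlm`, and the prompt text are hypothetical stand-ins, not the released benchmark code.

```python
# Minimal sketch of a training-free minimal-pair evaluation loop.
# `pairs` and `query_vlm` are hypothetical stand-ins; the released
# benchmark code may structure the tasks and prompts differently.

def evaluate_minimal_pairs(pairs, query_vlm):
    """pairs: list of (video_a, video_b, label) tuples, where label is
    'a' or 'b' and marks the clip containing the harmful interaction."""
    prompt = "Which clip shows a harmful interaction? Answer 'a' or 'b'."
    correct = 0
    for video_a, video_b, label in pairs:
        answer = query_vlm([video_a, video_b], prompt).strip().lower()
        correct += int(answer == label)
    return correct / len(pairs)  # chance level for a binary minimal pair is 0.5


# Hypothetical usage:
# accuracy = evaluate_minimal_pairs(load_pairs("situational"), query_vlm)
# print(f"situational-awareness accuracy: {accuracy:.3f}")
```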
Related papers
- Stress Tests REVEAL Fragile Temporal and Visual Grounding in Video-Language Models [18.243585941034116]
Video-Language Models (VidLMs) are expected to robustly account for video content, temporal sequence, and motion. We introduce REVEAL, a diagnostic benchmark that probes fundamental weaknesses of contemporary VidLMs. We find that these models confidently describe reversed scenes as forward, answer questions while neglecting video content, agree with false claims, struggle with basic camera motion, and fail to aggregate temporal information.
arXiv Detail & Related papers (2026-02-11T17:39:14Z) - VideoVeritas: AI-Generated Video Detection via Perception Pretext Reinforcement Learning [42.22791607763693]
VideoVeritas is a framework for fine-grained perception and fact-based reasoning. It uses Joint Perception Preference and Perception Pretext Reinforcement Learning.
arXiv Detail & Related papers (2026-02-09T16:00:01Z) - NOAH: Benchmarking Narrative Prior driven Hallucination and Omission in Video Large Language Models [8.6767620170781]
Video large language models (Video LLMs) have recently achieved strong performance on tasks such as captioning, summarization, and question answering. Many models and training methods explicitly encourage continuity across events to enhance narrative coherence. We identify this bias, which we call narrative prior, as a key driver of two errors: hallucinations, where non-existent events are introduced or existing ones are misinterpreted, and omissions, where factual events are suppressed because they are misaligned with surrounding context.
arXiv Detail & Related papers (2025-11-09T17:41:11Z) - A Unified Reasoning Framework for Holistic Zero-Shot Video Anomaly Analysis [64.42659342276117]
Most video-anomaly research stops at frame-wise detection, offering little insight into why an event is abnormal. Recent video anomaly localization and video anomaly understanding methods improve explainability but remain data-dependent and task-specific. We propose a unified reasoning framework that bridges the gap between temporal detection, spatial localization, and textual explanation.
arXiv Detail & Related papers (2025-11-02T14:49:08Z) - Beyond Single Models: Mitigating Multimodal Hallucinations via Adaptive Token Ensemble Decoding [41.828387997311474]
Large Vision-Language Models (LVLMs) have recently achieved impressive results in multimodal tasks such as image captioning and visual question answering. They remain prone to object hallucination, generating descriptions of nonexistent or misidentified objects. We propose Adaptive Token Ensemble Decoding (ATED), a training-free, token-level ensemble framework that mitigates hallucination by aggregating predictions from multiple LVLMs during inference (a minimal sketch of this decoding idea appears after the list of related papers below).
arXiv Detail & Related papers (2025-10-21T06:11:24Z) - From Sight to Insight: Unleashing Eye-Tracking in Weakly Supervised Video Salient Object Detection [60.11169426478452]
This paper aims to introduce fixation information to assist the detection of salient objects under weak supervision. We propose a Position and Semantic Embedding (PSE) module to provide location and semantic guidance during the feature learning process. An Intra-Inter Mixed Contrastive (MCII) model improves the temporal modeling capabilities under weak supervision.
arXiv Detail & Related papers (2025-06-30T05:01:40Z) - ViCrit: A Verifiable Reinforcement Learning Proxy Task for Visual Perception in VLMs [98.27348724529257]
We introduce ViCrit (Visual Caption Hallucination Critic), an RL proxy task that trains VLMs to localize a subtle, synthetic visual hallucination injected into paragraphs of human-written image captions. Models trained with the ViCrit task exhibit substantial gains across a variety of vision-language model benchmarks.
arXiv Detail & Related papers (2025-06-11T19:16:54Z) - Video SimpleQA: Towards Factuality Evaluation in Large Video Language Models [77.96693360763925]
Video SimpleQA is the first comprehensive benchmark tailored for factuality evaluation in video contexts. Our work differs from existing video benchmarks through key features such as the knowledge required (demanding integration of external knowledge beyond the video's explicit narrative) and short-form definitive answers (crafted to be unambiguous and definitively correct in a short format with minimal scoring variance).
arXiv Detail & Related papers (2025-03-24T17:46:09Z) - Unveiling the Tapestry of Consistency in Large Vision-Language Models [25.106467574467448]
We provide a benchmark ConBench to intuitively analyze how LVLMs perform when the solution space of a prompt revolves around a knowledge point.
Based on the ConBench tool, we are the first to reveal this tapestry of consistency and report several findings.
We hope this paper will accelerate the research community in better evaluating their models and encourage future advancements in the consistency domain.
arXiv Detail & Related papers (2024-05-23T04:08:23Z) - Visual Spatial Reasoning [35.5155400193075]
We present a dataset containing more than 10k natural text-image pairs with 66 types of spatial relations in English.
We show that the dataset includes challenging linguistic phenomena, such as varying reference frames.
We demonstrate a large gap between human and model performance: the human ceiling is above 95%, while state-of-the-art models only achieve around 70%.
arXiv Detail & Related papers (2022-04-30T23:03:49Z) - Weakly-Supervised Video Object Grounding via Causal Intervention [82.68192973503119]
We target the task of weakly-supervised video object grounding (WSVOG), where only video-sentence annotations are available during model learning.
It aims to localize objects described in the sentence to visual regions in the video, which is a fundamental capability needed in pattern analysis and machine learning.
arXiv Detail & Related papers (2021-12-01T13:13:03Z) - Self-Regulated Learning for Egocentric Video Activity Anticipation [147.9783215348252]
Self-Regulated Learning (SRL) aims to regulate the intermediate representation consecutively to produce a representation that emphasizes the novel information in the frame at the current time stamp.
SRL sharply outperforms the existing state of the art in most cases on two egocentric video datasets and two third-person video datasets.
arXiv Detail & Related papers (2021-11-23T03:29:18Z)
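As a concrete illustration of one of the inference-time ideas above, the sketch below shows token-level ensemble decoding in the spirit of ATED: next-token distributions from several models are averaged before each token is picked. The model interface (`next_token_probs`) and the uniform weighting are assumptions for illustration; ATED itself adapts the per-model weights during inference.

```python
import numpy as np

# Minimal sketch of token-level ensemble decoding across several models,
# in the spirit of ATED. Each model is assumed to expose a hypothetical
# next_token_probs(tokens) method returning a distribution over a shared
# vocabulary; uniform averaging stands in for ATED's adaptive weights.

def ensemble_decode(models, prompt_tokens, max_new_tokens, eos_id):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        # Average the next-token distributions from all models.
        dists = np.stack([m.next_token_probs(tokens) for m in models])
        next_id = int(dists.mean(axis=0).argmax())  # greedy pick
        tokens.append(next_id)
        if next_id == eos_id:
            break
    return tokens
```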