Evaluation of Vision-LLMs in Surveillance Video
- URL: http://arxiv.org/abs/2510.23190v1
- Date: Mon, 27 Oct 2025 10:27:02 GMT
- Title: Evaluation of Vision-LLMs in Surveillance Video
- Authors: Pascal Benschop, Cristian Meo, Justin Dauwels, Jelte P. Mense
- Abstract summary: This paper investigates the spatial reasoning of vision-language models (VLMs). It addresses the embodied perception challenge of interpreting dynamic 3D scenes from sparse 2D video. We evaluate four open models on UCF-Crime and RWF-2000 under prompting and privacy-preserving conditions.
- Score: 8.750453732584491
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The widespread use of cameras in our society has created an overwhelming amount of video data, far exceeding the capacity for human monitoring. This presents a critical challenge for public safety and security, as the timely detection of anomalous or criminal events is crucial for effective response and prevention. The ability of an embodied agent to recognize unexpected events is fundamentally tied to its capacity for spatial reasoning. This paper investigates the spatial reasoning of vision-language models (VLMs) by framing anomalous action recognition as a zero-shot, language-grounded task, addressing the embodied perception challenge of interpreting dynamic 3D scenes from sparse 2D video. Specifically, we investigate whether small, pre-trained vision-LLMs can act as spatially grounded, zero-shot anomaly detectors by converting video into text descriptions and scoring labels via textual entailment. We evaluate four open models on UCF-Crime and RWF-2000 under prompting and privacy-preserving conditions. Few-shot exemplars can improve accuracy for some models but may increase false positives, and privacy filters, especially full-body GAN transforms, introduce inconsistencies that degrade accuracy. These results chart where current vision-LLMs succeed (simple, spatially salient events) and where they falter (noisy spatial cues, identity obfuscation). Looking forward, we outline concrete paths to strengthen spatial grounding without task-specific training: structure-aware prompts, lightweight spatial memory across clips, scene-graph or 3D-pose priors during description, and privacy methods that preserve action-relevant geometry. This positions zero-shot, language-grounded pipelines as adaptable building blocks for embodied, real-world video understanding. Our implementation for evaluating VLMs is publicly available at: https://github.com/pascalbenschopTU/VLLM_AnomalyRecognition
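A minimal sketch of the describe-then-entail pipeline outlined in the abstract: caption a clip with an open vision-language model, then rank candidate event labels by textual entailment with an NLI model. The specific checkpoints and label set below are illustrative assumptions, not the paper's exact configuration (see the linked repository for the actual implementation).

```python
# Illustrative sketch only; model names and labels are assumptions.
from transformers import pipeline

# Hypothetical captioner: any open image/video captioning checkpoint could be swapped in.
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
# NLI-based zero-shot classifier: scores how strongly the description entails each label.
entailer = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

# Example label set; UCF-Crime uses a larger anomaly taxonomy.
CANDIDATE_LABELS = ["normal activity", "fighting", "robbery", "vandalism", "road accident"]

def score_clip(frames):
    """Caption sampled frames, join the captions, and rank candidate labels by entailment."""
    captions = [captioner(frame)[0]["generated_text"] for frame in frames]
    description = " ".join(captions)
    result = entailer(description, candidate_labels=CANDIDATE_LABELS)
    return dict(zip(result["labels"], result["scores"]))

# Usage: pass a list of PIL images sampled from a clip; a clip is flagged as anomalous
# when the top-scoring non-"normal" label exceeds a chosen threshold.
```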
Related papers
- Video Spatial Reasoning with Object-Centric 3D Rollout [58.12446467377404]
We propose Object-Centric 3D Rollout (OCR) to enable robust video spatial reasoning. OCR introduces structured perturbations to the 3D geometry of selected objects during training. OCR compels the model to reason holistically across the entire scene.
arXiv Detail & Related papers (2025-11-17T09:53:41Z)
- Privacy Beyond Pixels: Latent Anonymization for Privacy-Preserving Video Understanding [56.369026347458835]
We introduce a novel formulation of visual privacy preservation for video foundation models that operates entirely in the latent space. Current privacy preservation methods based on input-pixel-level anonymization require retraining the entire utility video model. A lightweight Anonym Adapter Module (AAM) removes private information from video features while retaining general task utility.
arXiv Detail & Related papers (2025-11-11T18:56:27Z)
- Flashback: Memory-Driven Zero-shot, Real-time Video Anomaly Detection [11.197888893266535]
Flashback is a zero-shot and real-time video anomaly detection paradigm. Inspired by the human cognitive mechanism of instantly judging anomalies, Flashback operates in two stages: Recall and Respond. By eliminating all LLM calls at inference time, Flashback delivers real-time VAD even on a consumer-grade GPU.
arXiv Detail & Related papers (2025-05-21T07:32:29Z)
- Vision language models are unreliable at trivial spatial cognition [0.2902243522110345]
Vision language models (VLMs) are designed to extract relevant visuospatial information from images. We develop a benchmark dataset, TableTest, whose images depict 3D scenes of objects arranged on a table, and use it to evaluate state-of-the-art VLMs. Results show that performance can be degraded by minor variations of prompts that use equivalent descriptions.
arXiv Detail & Related papers (2025-04-22T17:38:01Z)
- Mind the Gap: Benchmarking Spatial Reasoning in Vision-Language Models [14.442394137843923]
We present a detailed analysis that first delineates the core elements of spatial reasoning. We then assess the performance of these models on both synthetic and real-world images.
arXiv Detail & Related papers (2025-03-25T14:34:06Z)
- Spatio-temporal Transformers for Action Unit Classification with Event Cameras [28.98336123799572]
We present FACEMORPHIC, a temporally synchronized multimodal face dataset composed of RGB videos and event streams.
We show how temporal synchronization can allow effective neuromorphic face analysis without the need to manually annotate videos.
arXiv Detail & Related papers (2024-10-29T11:23:09Z)
- SIGMA: Sinkhorn-Guided Masked Video Modeling [69.31715194419091]
Sinkhorn-guided Masked Video Modelling (SIGMA) is a novel video pretraining method.
We distribute features of space-time tubes evenly across a limited number of learnable clusters; a minimal Sinkhorn normalization sketch is given after this list.
Experimental results on ten datasets validate the effectiveness of SIGMA in learning more performant, temporally-aware, and robust video representations.
arXiv Detail & Related papers (2024-07-22T08:04:09Z)
- Find n' Propagate: Open-Vocabulary 3D Object Detection in Urban Environments [67.83787474506073]
We tackle the limitations of current LiDAR-based 3D object detection systems.
We introduce a universal Find n' Propagate approach for 3D open-vocabulary (OV) tasks.
We achieve up to a 3.97-fold increase in Average Precision (AP) for novel object classes.
arXiv Detail & Related papers (2024-03-20T12:51:30Z)
- FoV-Net: Field-of-View Extrapolation Using Self-Attention and Uncertainty [95.11806655550315]
We utilize information from a video sequence with a narrow field-of-view to infer the scene at a wider field-of-view.
We propose a temporally consistent field-of-view extrapolation framework, namely FoV-Net.
Experiments show that FoV-Net extrapolates the temporally consistent wide field-of-view scene better than existing alternatives.
arXiv Detail & Related papers (2022-04-04T06:24:03Z)
- Self-supervised Video Representation Learning by Uncovering Spatio-temporal Statistics [74.6968179473212]
This paper proposes a novel pretext task to address the self-supervised learning problem.
We compute a series of spatio-temporal statistical summaries, such as the spatial location and dominant direction of the largest motion.
A neural network is built and trained to yield the statistical summaries given the video frames as inputs.
arXiv Detail & Related papers (2020-08-31T08:31:56Z)
- Spatio-Temporal Graph for Video Captioning with Knowledge Distillation [50.034189314258356]
We propose a graph model for video captioning that exploits object interactions in space and time.
Our model builds interpretable links and is able to provide explicit visual grounding.
To avoid correlations caused by the variable number of objects, we propose an object-aware knowledge distillation mechanism.
arXiv Detail & Related papers (2020-03-31T03:58:11Z)
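To make the Sinkhorn-guided clustering mentioned in the SIGMA entry above concrete, the following is a generic Sinkhorn-Knopp sketch, an illustrative assumption rather than SIGMA's actual implementation: it converts feature-to-prototype similarities into soft assignments whose mass is spread evenly across a fixed number of clusters.

```python
# Generic Sinkhorn-Knopp normalization (illustrative; not SIGMA's published code).
# It balances soft cluster assignments so every cluster receives roughly equal mass.
import numpy as np

def sinkhorn_assignments(similarities, epsilon=0.05, n_iters=3):
    """similarities: (n_features, n_clusters) dot products with learnable prototypes."""
    q = np.exp((similarities - similarities.max()) / epsilon)  # temperature-scaled scores
    q /= q.sum()                                               # joint distribution over features x clusters
    n, k = q.shape
    for _ in range(n_iters):
        q /= q.sum(axis=0, keepdims=True); q /= k              # each cluster gets total mass 1/k
        q /= q.sum(axis=1, keepdims=True); q /= n              # each feature gets total mass 1/n
    return q * n                                               # rows sum to 1: per-feature soft assignments

# Example: balanced soft assignments for 8 space-time tube features and 4 clusters.
assignments = sinkhorn_assignments(np.random.randn(8, 4))
```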