Zero-Shot Action Recognition in Surveillance Videos
- URL: http://arxiv.org/abs/2410.21113v2
- Date: Tue, 18 Mar 2025 13:30:27 GMT
- Title: Zero-Shot Action Recognition in Surveillance Videos
- Authors: Joao Pereira, Vasco Lopes, David Semedo, Joao Neves
- Abstract summary: Current AI-based video surveillance systems rely on core computer vision models that require extensive finetuning. VideoLLaMA2 represents a significant leap in zero-shot performance, with a 20% boost over the baseline. Self-ReS further increases zero-shot action recognition performance to 44.6%.
- Score: 5.070026408553652
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The growing demand for surveillance in public spaces presents significant challenges due to the shortage of human resources. Current AI-based video surveillance systems rely heavily on core computer vision models that require extensive finetuning, which is particularly difficult in surveillance settings due to limited datasets and challenging conditions (unusual viewpoints, low quality, etc.). In this work, we propose leveraging Large Vision-Language Models (LVLMs), known for their strong zero- and few-shot generalization, to tackle video understanding tasks in surveillance. Specifically, we explore VideoLLaMA2, a state-of-the-art LVLM, and an improved token-level sampling method, Self-Reflective Sampling (Self-ReS). Our experiments on the UCF-Crime dataset show that VideoLLaMA2 represents a significant leap in zero-shot performance, with a 20% boost over the baseline. Self-ReS additionally increases zero-shot action recognition performance to 44.6%. These results highlight the potential of LVLMs, paired with improved sampling techniques, for advancing surveillance video analysis in diverse scenarios.
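The zero-shot recipe described in the abstract reduces to two steps: prompt the video LLM with the closed set of UCF-Crime action classes, then map its free-form answer back to a label. The sketch below illustrates this in Python; `ask_video_llm` is a hypothetical stand-in for the actual VideoLLaMA2 inference call (the real API lives in the official repository), and only the prompt construction and label-mapping logic are meant literally.

```python
# Minimal sketch of zero-shot action recognition with a video LLM.
# NOTE: `ask_video_llm` is a hypothetical placeholder; substitute the
# real VideoLLaMA2 inference call from its official repository.
import difflib

# The 13 anomaly classes of UCF-Crime, plus "Normal".
UCF_CRIME_CLASSES = [
    "Abuse", "Arrest", "Arson", "Assault", "Burglary", "Explosion",
    "Fighting", "RoadAccidents", "Robbery", "Shooting", "Shoplifting",
    "Stealing", "Vandalism", "Normal",
]

def build_prompt(classes):
    """Closed-vocabulary prompt: force the model to pick one class."""
    return ("Watch the surveillance clip and answer with exactly one of "
            "the following action categories: " + ", ".join(classes) + ".")

def ask_video_llm(video_path, prompt):
    """Hypothetical stand-in for the LVLM call (returns a canned answer)."""
    return "The clip shows two people fighting near a parked car."

def to_label(answer, classes):
    """Map a free-form answer back to the closest class name."""
    lower = answer.lower()
    for c in classes:                      # exact substring match first
        if c.lower() in lower:
            return c
    for c in classes:                      # fuzzy word-level fallback
        if difflib.get_close_matches(c.lower(), lower.split(), n=1, cutoff=0.8):
            return c
    return "Normal"                        # default when nothing matches

prompt = build_prompt(UCF_CRIME_CLASSES)
answer = ask_video_llm("clip_0001.mp4", prompt)
print(to_label(answer, UCF_CRIME_CLASSES))  # -> "Fighting"
```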
Related papers
- VideoVeritas: AI-Generated Video Detection via Perception Pretext Reinforcement Learning [42.22791607763693]
VideoVeritas is a framework for fine-grained perception and fact-based reasoning. It uses Joint Perception Preference and Perception Pretext Reinforcement Learning.
arXiv Detail & Related papers (2026-02-09T16:00:01Z) - Video-MSR: Benchmarking Multi-hop Spatial Reasoning Capabilities of MLLMs [21.346216484639225]
Video-MSR is the first benchmark designed to evaluate Multi-hop Spatial Reasoning in dynamic video scenarios. Our benchmark comprises 3,052 high-quality video instances with 4,993 question-answer pairs, constructed via a scalable, visually-grounded pipeline. Our results underscore the efficacy of multi-hop spatial instruction data and establish Video-MSR as a vital foundation for future research.
arXiv Detail & Related papers (2026-01-14T12:24:47Z) - MVU-Eval: Towards Multi-Video Understanding Evaluation for Multimodal LLMs [61.70050081221131]
MVU-Eval is the first comprehensive benchmark for evaluating Multi-Video Understanding for MLLMs. It assesses eight core competencies through 1,824 meticulously curated question-answer pairs spanning 4,959 videos. These capabilities are rigorously aligned with real-world applications such as multi-sensor synthesis in autonomous systems and cross-angle sports analytics.
arXiv Detail & Related papers (2025-11-10T16:02:33Z) - Response Wide Shut? Surprising Observations in Basic Vision Language Model Capabilities [54.94982467313341]
Vision-language Models (VLMs) have emerged as general-purpose tools for addressing a variety of complex computer vision problems. We set out to understand the limitations of SoTA VLMs on fundamental visual tasks by constructing a series of tests that probe which design components, specifically, may be lacking.
arXiv Detail & Related papers (2025-07-10T15:26:41Z) - TRIM: A Self-Supervised Video Summarization Framework Maximizing Temporal Relative Information and Representativeness [9.374702244811303]
We introduce a self-supervised video summarization model that captures both spatial and temporal dependencies without the overhead of attention, RNNs, or transformers. Our framework integrates a novel set of Markov process-driven loss metrics and a two-stage self-supervised learning paradigm that ensures both performance and efficiency.
arXiv Detail & Related papers (2025-06-25T16:27:38Z) - Reinforcement Learning Tuning for VideoLLMs: Reward Design and Data Efficiency [56.475612147721264]
We propose a dual-reward formulation that supervises both semantic and temporal reasoning through discrete and continuous reward signals. We evaluate our approach across eight representative video understanding tasks, including VideoQA, Temporal Video Grounding, and Grounded VideoQA. Results underscore the importance of reward design and data selection in advancing reasoning-centric video understanding with MLLMs.
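The summary specifies the shape of the objective but not its exact form. As a hedged illustration, a dual reward combining a discrete exact-match check with a continuous temporal-IoU term might look like the sketch below; the matching rule and the weighting `w` are assumptions, not the paper's design.

```python
def temporal_iou(pred_span, gold_span):
    """Continuous reward: IoU between predicted and gold (start, end) times."""
    inter = max(0.0, min(pred_span[1], gold_span[1]) - max(pred_span[0], gold_span[0]))
    union = max(pred_span[1], gold_span[1]) - min(pred_span[0], gold_span[0])
    return inter / union if union > 0 else 0.0

def dual_reward(pred_answer, gold_answer, pred_span, gold_span, w=0.5):
    """Discrete semantic reward plus continuous temporal reward."""
    discrete = 1.0 if pred_answer.strip().lower() == gold_answer.strip().lower() else 0.0
    return (1 - w) * discrete + w * temporal_iou(pred_span, gold_span)

# Example: correct answer, partially overlapping span -> 0.833...
print(dual_reward("fighting", "Fighting", (4.0, 9.0), (5.0, 10.0)))
```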
arXiv Detail & Related papers (2025-06-02T17:28:26Z) - ReAgent-V: A Reward-Driven Multi-Agent Framework for Video Understanding [71.654781631463]
ReAgent-V is a novel agentic video understanding framework. It integrates efficient frame selection with real-time reward generation during inference. Extensive experiments on 12 datasets demonstrate significant gains in generalization and reasoning.
arXiv Detail & Related papers (2025-06-02T04:23:21Z) - mRAG: Elucidating the Design Space of Multi-modal Retrieval-Augmented Generation [5.647319807077936]
Large Vision-Language Models (LVLMs) have made remarkable strides in multimodal tasks such as visual question answering, visual grounding, and complex reasoning. Retrieval-Augmented Generation (RAG) offers a practical way to mitigate their remaining knowledge limitations by allowing LVLMs to access large-scale knowledge databases via retrieval mechanisms.
arXiv Detail & Related papers (2025-05-29T23:32:03Z) - Towards Generalized Video Quality Assessment: A Weak-to-Strong Learning Paradigm [76.63001244080313]
Video quality assessment (VQA) seeks to predict the perceptual quality of a video in alignment with human visual perception. The dominant VQA paradigm relies on supervised training with human-labeled datasets. We explore weak-to-strong (W2S) learning as a new paradigm for advancing VQA without reliance on large-scale human-labeled datasets.
arXiv Detail & Related papers (2025-05-06T15:29:32Z) - Benchmarking Large Vision-Language Models on Fine-Grained Image Tasks: A Comprehensive Evaluation [53.84282335629258]
We introduce FG-BMK, a comprehensive fine-grained evaluation benchmark comprising 3.49 million questions and 3.32 million images.
Our evaluation systematically examines LVLMs from both human-oriented and machine-oriented perspectives.
We uncover key findings regarding the influence of training paradigms, modality alignment, perturbation susceptibility, and fine-grained category reasoning on task performance.
arXiv Detail & Related papers (2025-04-21T09:30:41Z) - Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1 [53.894789613838654]
We introduce SEED-Bench-R1, a benchmark designed to evaluate post-training methods for MLLMs in video understanding.
It includes intricate real-world videos and complex everyday planning tasks in the format of multiple-choice questions.
Using Qwen2-VL-Instruct-7B as a base model, we compare RL with supervised fine-tuning (SFT).
Our detailed analysis reveals that RL enhances visual perception but often produces less coherent reasoning chains.
arXiv Detail & Related papers (2025-03-31T17:55:23Z) - LAVID: An Agentic LVLM Framework for Diffusion-Generated Video Detection [14.687867348598035]
Large Vision Language Model (LVLM) has become an emerging tool for AI-generated content detection.
We propose LAVID, a novel LVLM-based AI-generated video detection framework with explicit knowledge enhancement.
Our proposed pipeline automatically selects a set of explicit knowledge tools for detection, and then adaptively adjusts the structure prompt by self-rewriting.
arXiv Detail & Related papers (2025-02-20T19:34:58Z) - MomentSeeker: A Comprehensive Benchmark and A Strong Baseline For Moment Retrieval Within Long Videos [62.01402470874109]
We present MomentSeeker, a benchmark to evaluate retrieval models' performance in handling general long-video moment retrieval tasks.
It incorporates long videos of over 500 seconds on average, making it the first benchmark specialized for long-video moment retrieval.
It covers a wide range of task categories (including Moment Search, Caption Alignment, Image-conditioned Moment Search, and Video-conditioned Moment Search) and diverse application scenarios.
We further fine-tune an MLLM-based LVMR retriever on synthetic data, which demonstrates strong performance on our benchmark.
arXiv Detail & Related papers (2025-02-18T05:50:23Z) - VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM [81.15525024145697]
Video Large Language Models (Video LLMs) have recently exhibited remarkable capabilities in general video understanding.
However, they mainly focus on holistic comprehension and struggle with capturing fine-grained spatial and temporal details.
We introduce the VideoRefer Suite to empower Video LLMs with finer-level spatial-temporal video understanding.
arXiv Detail & Related papers (2024-12-31T18:56:46Z) - Elevating Visual Perception in Multimodal LLMs with Visual Embedding Distillation [109.5893580175657]
In recent times, the standard practice for developing MLLMs is to feed features from vision encoder(s) into the LLM and train with natural language supervision. This approach often causes models to lean towards language comprehension and undermines the rich visual perception signals present in the data. We propose VisPer-LM, the first approach that infuses visual perception knowledge from expert vision encoders into the LLM's hidden representations.
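As a rough illustration of what infusing expert embeddings into hidden representations can look like, the sketch below adds a cosine-distance distillation term between a projected LLM hidden state and an expert vision encoder's embedding. The projection head, layer choice, and loss are all assumptions for illustration, not the paper's actual objective.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Assumed dimensions: 4096-d LLM hidden states, 1024-d expert embeddings.
proj = nn.Linear(4096, 1024)

def visual_distill_loss(llm_hidden, expert_embed):
    """Pull projected LLM hidden states toward expert visual embeddings."""
    pred = proj(llm_hidden)  # (batch, 1024)
    return 1.0 - F.cosine_similarity(pred, expert_embed, dim=-1).mean()

# Toy usage with random tensors in place of real activations.
loss = visual_distill_loss(torch.randn(2, 4096), torch.randn(2, 1024))
print(loss.item())
```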
arXiv Detail & Related papers (2024-12-12T18:55:18Z) - VideoLights: Feature Refinement and Cross-Task Alignment Transformer for Joint Video Highlight Detection and Moment Retrieval [8.908777234657046]
Large language and vision-language models (LLMs/LVLMs) have gained prominence across various domains.
Here we propose VideoLights, a novel HD/MR framework addressing these limitations through (i) Convolutional Projection and Feature Refinement modules.
Comprehensive experiments on QVHighlights, TVSum, and Charades-STA benchmarks demonstrate state-of-the-art performance.
arXiv Detail & Related papers (2024-12-02T14:45:53Z) - AutoBench-V: Can Large Vision-Language Models Benchmark Themselves? [55.14033256706175]
Large Vision-Language Models (LVLMs) have become essential for advancing the integration of visual and linguistic information.
We introduce AutoBench-V, an automated framework for serving evaluation on demand.
Through an extensive evaluation of seven popular LVLMs across five user-specified evaluation capabilities, the framework demonstrates effectiveness and reliability.
arXiv Detail & Related papers (2024-10-28T17:55:08Z) - MissionGNN: Hierarchical Multimodal GNN-based Weakly Supervised Video Anomaly Recognition with Mission-Specific Knowledge Graph Generation [5.0923114224599555]
This paper introduces MissionGNN, a novel hierarchical graph neural network (GNN)-based model.
Our approach circumvents the limitations of previous methods by avoiding heavy gradient computations on large multimodal models.
Our model provides a practical and efficient solution for real-time video analysis without the constraints of previous segmentation-based or multimodal approaches.
arXiv Detail & Related papers (2024-06-27T01:09:07Z) - Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs [56.391404083287235]
We introduce Cambrian-1, a family of multimodal LLMs (MLLMs) designed with a vision-centric approach.
Our study uses LLMs and visual instruction tuning as an interface to evaluate various visual representations.
We provide model weights, code, supporting tools, datasets, and detailed instruction-tuning and evaluation recipes.
arXiv Detail & Related papers (2024-06-24T17:59:42Z) - Distilling Aggregated Knowledge for Weakly-Supervised Video Anomaly Detection [11.250490586786878]
Video anomaly detection aims to develop automated models capable of identifying abnormal events in surveillance videos.
We show that distilling knowledge from aggregated representations into a relatively simple model achieves state-of-the-art performance.
arXiv Detail & Related papers (2024-06-05T00:44:42Z) - How Good is my Video LMM? Complex Video Reasoning and Robustness Evaluation Suite for Video-LMMs [98.37571997794072]
We present the Complex Video Reasoning and Robustness Evaluation Suite (CVRR-ES).
CVRR-ES comprehensively assesses the performance of Video-LMMs across 11 diverse real-world video dimensions.
Our findings provide valuable insights for building the next generation of human-centric AI systems.
arXiv Detail & Related papers (2024-05-06T17:59:45Z) - Good Questions Help Zero-Shot Image Reasoning [110.1671684828904]
Question-Driven Visual Exploration (QVix) is a novel prompting strategy that enhances the exploratory capabilities of large vision-language models (LVLMs).
QVix enables a wider exploration of visual scenes, improving the LVLMs' reasoning accuracy and depth in tasks such as visual question answering and visual entailment.
Our evaluations on various challenging zero-shot vision-language benchmarks, including ScienceQA and fine-grained visual classification, demonstrate that QVix significantly outperforms existing methods.
arXiv Detail & Related papers (2023-12-04T03:18:51Z) - ViLMA: A Zero-Shot Benchmark for Linguistic and Temporal Grounding in Video-Language Models [28.305932427801682]
We present ViLMA (Video Language Model Assessment), a task-agnostic benchmark that places the assessment of fine-grained capabilities of VidLMs on a firm footing.
ViLMA offers a controlled evaluation suite that sheds light on the true potential of these models, as well as their performance gaps compared to human-level understanding.
We show that current VidLMs' grounding abilities are no better than those of vision-language models which use static images.
arXiv Detail & Related papers (2023-11-13T02:13:13Z) - Towards Surveillance Video-and-Language Understanding: New Dataset, Baselines, and Challenges [10.809558232493236]
We propose a new research direction of surveillance video-and-language understanding, and construct the first multimodal surveillance video dataset.
We manually annotate the real-world surveillance dataset UCF-Crime with fine-grained event content and timing.
We benchmark SOTA models for four multimodal tasks on this newly created dataset, which serve as new baselines for surveillance video-and-language understanding.
arXiv Detail & Related papers (2023-09-25T07:46:56Z) - Measuring and Improving Chain-of-Thought Reasoning in Vision-Language Models [61.28463542324576]
Vision-language models (VLMs) have recently demonstrated strong efficacy as visual assistants that can generate human-like outputs.
We evaluate existing state-of-the-art VLMs and find that even the best-performing model is unable to demonstrate strong visual reasoning capabilities and consistency.
We propose a two-stage training framework aimed at improving both the reasoning performance and consistency of VLMs.
arXiv Detail & Related papers (2023-09-08T17:49:44Z) - An Efficient Recurrent Adversarial Framework for Unsupervised Real-Time Video Enhancement [132.60976158877608]
We propose an efficient adversarial video enhancement framework that learns directly from unpaired video examples.
In particular, our framework introduces new recurrent cells that consist of interleaved local and global modules for implicit integration of spatial and temporal information.
The proposed design allows our recurrent cells to efficiently propagate temporal information across frames and reduces the need for high-complexity networks.
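To make "interleaved local and global modules" concrete, here is one plausible reading as a PyTorch cell: a convolutional (local) branch fused with the recurrent state, followed by a squeeze-and-excitation-style (global) channel gate. The specific layout is a guess for illustration, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class LocalGlobalRecurrentCell(nn.Module):
    """Hypothetical recurrent cell with a local conv branch and a global gate."""
    def __init__(self, channels):
        super().__init__()
        self.local = nn.Conv2d(channels * 2, channels, kernel_size=3, padding=1)
        self.global_fc = nn.Linear(channels, channels)
        self.to_hidden = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, frame_feat, hidden):
        # Local: fuse the current frame's features with the recurrent state.
        x = self.local(torch.cat([frame_feat, hidden], dim=1))
        # Global: pool spatial dims and reweight channels (SE-style gate).
        g = torch.sigmoid(self.global_fc(x.mean(dim=(2, 3))))
        x = x * g[:, :, None, None]
        # The new hidden state carries temporal information to the next frame.
        return x, torch.tanh(self.to_hidden(x))

cell = LocalGlobalRecurrentCell(64)
hidden = torch.zeros(1, 64, 32, 32)
for frame_feat in torch.randn(8, 1, 64, 32, 32):  # 8 frames
    out, hidden = cell(frame_feat, hidden)
```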
arXiv Detail & Related papers (2020-12-24T00:03:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences of its use.