Federated Learning for Video Violence Detection: Complementary Roles of Lightweight CNNs and Vision-Language Models for Energy-Efficient Use
- URL: http://arxiv.org/abs/2511.07171v1
- Date: Mon, 10 Nov 2025 15:01:51 GMT
- Title: Federated Learning for Video Violence Detection: Complementary Roles of Lightweight CNNs and Vision-Language Models for Energy-Efficient Use
- Authors: Sébastien Thuau, Siba Haidar, Rachid Chelouah
- Abstract summary: Federated learning preserves privacy, but deploying large vision-language models (VLMs) introduces major energy and sustainability challenges. We compare three strategies for federated violence detection under realistic non-IID splits on the RWF-2000 and RLVS datasets. All methods exceed 90% accuracy in binary violence detection.
- Score: 0.30586855806896035
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Deep learning-based video surveillance increasingly demands privacy-preserving architectures with low computational and environmental overhead. Federated learning preserves privacy but deploying large vision-language models (VLMs) introduces major energy and sustainability challenges. We compare three strategies for federated violence detection under realistic non-IID splits on the RWF-2000 and RLVS datasets: zero-shot inference with pretrained VLMs, LoRA-based fine-tuning of LLaVA-NeXT-Video-7B, and personalized federated learning of a 65.8M-parameter 3D CNN. All methods exceed 90% accuracy in binary violence detection. The 3D CNN achieves superior calibration (ROC AUC 92.59%) at roughly half the energy cost (240 Wh vs. 570 Wh) of federated LoRA, while VLMs provide richer multimodal reasoning. Hierarchical category grouping (based on semantic similarity and class exclusion) boosts VLM multiclass accuracy from 65.31% to 81% on the UCF-Crime dataset. To our knowledge, this is the first comparative simulation study of LoRA-tuned VLMs and personalized CNNs for federated violence detection, with explicit energy and CO2e quantification. Our results inform hybrid deployment strategies that default to efficient CNNs for routine inference and selectively engage VLMs for complex contextual reasoning.
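The hybrid deployment strategy in the abstract's last sentence reduces to a confidence-gated router: the cheap 3D CNN screens every clip, and the VLM is invoked only when the CNN is uncertain. Below is a minimal Python sketch of that pattern; the `cnn`/`vlm` callables, the 0.8 threshold, and the returned fields are illustrative assumptions, not the authors' implementation.
```python
def route_clip(clip, cnn, vlm, threshold=0.8):
    """Confidence-gated hybrid routing (illustrative sketch).

    `cnn(clip)` is assumed to return a calibrated probability of violence;
    `vlm(clip)` is assumed to return a dict with a label and a free-text
    rationale. Neither interface comes from the paper.
    """
    p = cnn(clip)
    confidence = max(p, 1.0 - p)
    if confidence >= threshold:
        # Routine case: stay on the energy-efficient CNN path.
        return {"label": "violence" if p >= 0.5 else "non-violence",
                "source": "cnn", "confidence": confidence}
    # Ambiguous case: spend VLM compute on contextual reasoning only here.
    result = vlm(clip)
    result["source"] = "vlm"
    return result
```
The threshold sets the energy/coverage trade-off: raising it escalates more clips to the VLM.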
Related papers
- VideoVeritas: AI-Generated Video Detection via Perception Pretext Reinforcement Learning [42.22791607763693]
VideoVeritas is a framework for fine-grained perception and fact-based reasoning. It is trained with Joint Perception Preference and Perception Pretext Reinforcement Learning.
arXiv Detail & Related papers (2026-02-09T16:00:01Z) - MMSI-Video-Bench: A Holistic Benchmark for Video-Based Spatial Intelligence [61.065486539729875]
MMSI-Video-Bench is a fully human-annotated benchmark for video-based spatial intelligence in MLLMs. It operationalizes a four-level framework (Perception, Planning, Prediction, and Cross-Video Reasoning) through 1,106 questions grounded in 1,278 clips. We evaluate 25 strong open-source and proprietary MLLMs, revealing a striking human-AI gap.
arXiv Detail & Related papers (2025-12-11T17:57:24Z) - VideoSSR: Video Self-Supervised Reinforcement Learning [62.25888935329454]
Reinforcement Learning with Verifiable Rewards (RLVR) has substantially advanced the video understanding capabilities of Multimodal Large Language Models (MLLMs). This work investigates a pivotal question: can the rich, intrinsic information within videos be harnessed to self-generate high-quality, verifiable training data?
arXiv Detail & Related papers (2025-11-09T08:36:40Z) - Frugal Federated Learning for Violence Detection: A Comparison of LoRA-Tuned VLMs and Personalized CNNs [0.27998963147546135]
We compare zero-shot and federated fine-tuning of vision-language models (VLMs) and personalized training of a compact 3D convolutional neural network (CNN3D). We evaluate accuracy, calibration, and energy usage under realistic non-IID settings. These findings support a hybrid model: lightweight CNNs for routine classification, with selective VLM activation for complex or descriptive scenarios.
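The personalized-CNN side of this comparison can be pictured as federated averaging on a shared backbone with per-client heads that never leave the device. A generic numpy sketch of that pattern follows, with random-noise stubs in place of real local training; none of it is taken from the paper's code.
```python
import numpy as np

def personalized_fedavg(clients, rounds=10, lr=0.01):
    """Generic personalized federated averaging (illustrative only).

    Each client dict holds a shared 'backbone' vector, aggregated by the
    server, and a private 'head' that stays local; keeping heads local is
    the usual personalization recipe under non-IID client data.
    """
    for _ in range(rounds):
        for c in clients:
            # Stub for local SGD on the client's own (non-IID) data.
            c["backbone"] += lr * np.random.randn(*c["backbone"].shape)
            c["head"] += lr * np.random.randn(*c["head"].shape)
        # Server: data-size-weighted average of shared weights only.
        total = sum(c["n_samples"] for c in clients)
        avg = sum(c["n_samples"] * c["backbone"] for c in clients) / total
        for c in clients:
            c["backbone"] = avg.copy()  # personal heads never aggregated
    return clients
```
A client here is just a dict such as `{"backbone": np.zeros(8), "head": np.zeros(2), "n_samples": 100}`; in a real system the backbone would be the CNN3D feature extractor's weights.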
arXiv Detail & Related papers (2025-10-20T15:26:43Z) - Video-Based MPAA Rating Prediction: An Attention-Driven Hybrid Architecture Using Contrastive Learning [0.1749935196721634]
We employ contrastive learning for improved discrimination and adaptability, exploring three frameworks: Instance Discrimination, Contextual Contrastive Learning, and Multi-View Contrastive Learning. Our hybrid architecture integrates an LRCN (CNN+LSTM) backbone with a Bahdanau attention mechanism, achieving state-of-the-art performance in the Contextual Contrastive Learning framework, with 88% accuracy and an F1 score of 0.8815. We evaluate the model's performance across various contrastive loss functions, including NT-Xent, NT-logistic, and triplet margin, demonstrating the robustness of our proposed architecture.
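Of the losses named above, NT-Xent has a standard closed form worth spelling out. Here is the generic SimCLR-style PyTorch version; the temperature 0.5 is an arbitrary choice, and this is not necessarily the exact variant the paper used.
```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, tau=0.5):
    """Standard NT-Xent (normalized temperature-scaled cross-entropy) loss.

    z1, z2: (N, d) embeddings of two augmented views of the same N clips.
    Generic SimCLR-style formulation; temperature 0.5 is illustrative.
    """
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)  # (2N, d), unit norm
    sim = z @ z.t() / tau                               # cosine similarity / tau
    sim.fill_diagonal_(float("-inf"))                   # exclude self-pairs
    # The positive for row i is its other view: i+N in the first half, i-N after.
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)
```
Typical use is `loss = nt_xent(encoder(aug1), encoder(aug2))`, where the two arguments are embeddings of two augmentations of the same batch.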
arXiv Detail & Related papers (2025-09-08T16:01:02Z) - Continual Learning for VLMs: A Survey and Taxonomy Beyond Forgetting [70.83781268763215]
Vision-language models (VLMs) have achieved impressive performance across diverse multimodal tasks by leveraging large-scale pre-training. In continual learning settings, however, VLMs face unique challenges such as cross-modal feature drift, parameter interference due to shared architectures, and zero-shot capability erosion. This survey aims to serve as a comprehensive and diagnostic reference for researchers developing lifelong vision-language systems.
arXiv Detail & Related papers (2025-08-06T09:03:10Z) - ViaRL: Adaptive Temporal Grounding via Visual Iterated Amplification Reinforcement Learning [68.76048244253582]
We introduce ViaRL, the first framework to leverage rule-based reinforcement learning (RL) for optimizing frame selection in video understanding. ViaRL utilizes the answer accuracy of a downstream model as a reward signal to train a frame selector through trial-and-error. ViaRL consistently delivers superior temporal grounding performance and robust generalization across diverse video understanding tasks.
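The reward described here is easy to make concrete: the selector earns credit only when the downstream model answers correctly from the frames it picked. A hypothetical sketch; `selector`, `vqa_model`, and the 0/1 reward shape are assumptions, not ViaRL's actual interfaces.
```python
def frame_selection_reward(selector, vqa_model, frames, question, answer):
    """Illustrative rule-based reward in the spirit of the blurb: the frame
    selector is rewarded when the downstream model answers correctly using
    only the frames it picked. All interfaces here are hypothetical.
    """
    picked = selector(frames, question)                 # indices of kept frames
    prediction = vqa_model([frames[i] for i in picked], question)
    return 1.0 if prediction == answer else 0.0         # verifiable 0/1 reward
```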
arXiv Detail & Related papers (2025-05-21T12:29:40Z) - SpaceR: Reinforcing MLLMs in Video Spatial Reasoning [70.7401015322983]
Video spatial reasoning poses a significant challenge for existing Multimodal Large Language Models (MLLMs). This limitation stems primarily from 1) the absence of high-quality datasets for this task, and 2) the lack of effective training strategies to develop spatial reasoning capabilities. Motivated by the success of Reinforcement Learning with Verifiable Reward (RLVR) in unlocking spatial reasoning abilities, this work aims to improve MLLMs in video spatial reasoning through the RLVR paradigm.
arXiv Detail & Related papers (2025-04-02T15:12:17Z) - BOLT: Boost Large Vision-Language Model Without Training for Long-form Video Understanding [51.49345400300556]
Large video-language models (VLMs) have demonstrated promising progress in various video understanding tasks. Traditional approaches, such as uniform frame sampling, often allocate resources to irrelevant content. We introduce BOLT, a method to BOost Large VLMs without additional Training through a comprehensive study of frame selection strategies.
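Two points in the frame-selection design space this blurb contrasts, written out as a toy sketch: uniform sampling versus keeping the frames a hypothetical relevance scorer ranks highest for the query. This is a generic illustration, not BOLT's actual method.
```python
import numpy as np

def select_frames(scores, k=8, uniform=False):
    """Illustrative frame selection: uniform vs. relevance-ranked.

    `scores` is a per-frame query-relevance score from some scorer
    (assumed, e.g. text-frame similarity); not taken from the paper.
    """
    n = len(scores)
    if uniform:
        # Baseline: spread the frame budget evenly across the video.
        return np.linspace(0, n - 1, k).astype(int)
    # Keep the k most query-relevant frames, restored to temporal order.
    return np.sort(np.argsort(scores)[-k:])
```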
arXiv Detail & Related papers (2025-03-27T13:18:40Z) - A Benchmark for Crime Surveillance Video Analysis with Large Models [22.683394427744616]
Anomaly analysis in surveillance videos is a crucial topic in computer vision. In recent years, multimodal large language models (MLLMs) have outperformed task-specific models in various domains. We propose UCVL, a benchmark for crime surveillance video analysis with large models.
arXiv Detail & Related papers (2025-02-13T13:38:17Z) - Analysis of Real-Time Hostile Activity Detection from Spatiotemporal Features Using Time Distributed Deep CNNs, RNNs and Attention-Based Mechanisms [0.0]
Real-time video surveillance through CCTV camera systems has become essential for ensuring public safety.
Deep learning video classification techniques can help us automate surveillance systems to detect violence as it happens.
arXiv Detail & Related papers (2023-02-21T22:02:39Z) - EfficientVLM: Fast and Accurate Vision-Language Models via Knowledge Distillation and Modal-adaptive Pruning [19.354515754130592]
We introduce a distilling-then-pruning framework to compress large vision-language models into smaller, faster, and more accurate ones.
We apply our framework to train EfficientVLM, a fast and accurate vision-language model consisting of 6 vision layers, 3 text layers, and 3 cross-modal fusion layers.
EfficientVLM retains 98.4% performance of the teacher model and accelerates its inference speed by 2.2x.
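The distillation half of such a pipeline typically minimizes a temperature-scaled KL term against the teacher plus the usual hard-label loss. A generic sketch, where the temperature and the blend weight are illustrative assumptions rather than EfficientVLM's reported settings:
```python
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Generic knowledge-distillation objective (illustrative settings).

    Blends a soft loss (KL between temperature-scaled teacher and student
    distributions, scaled by T^2 to keep gradient magnitudes comparable)
    with the usual hard cross-entropy against ground-truth labels.
    """
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```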
arXiv Detail & Related papers (2022-10-14T13:26:41Z) - Support-Set Based Cross-Supervision for Video Grounding [98.29089558426399]
The Support-set Based Cross-Supervision (Sscs) module can improve existing methods during the training phase without extra inference cost.
The proposed Sscs module contains two main components: a discriminative contrastive objective and a generative caption objective.
We extensively evaluate Sscs on three challenging datasets, and show that our method can improve current state-of-the-art methods by large margins.
arXiv Detail & Related papers (2021-08-24T08:25:26Z)