Related papers: Frugal Federated Learning for Violence Detection: A Comparison of LoRA-Tuned VLMs and Personalized CNNs

Frugal Federated Learning for Violence Detection: A Comparison of LoRA-Tuned VLMs and Personalized CNNs

URL: http://arxiv.org/abs/2510.17651v1
Date: Mon, 20 Oct 2025 15:26:43 GMT
Title: Frugal Federated Learning for Violence Detection: A Comparison of LoRA-Tuned VLMs and Personalized CNNs
Authors: Sébastien Thuau, Siba Haidar, Ayush Bajracharya, Rachid Chelouah,
Abstract summary: We compare zero-shot and federated fine-tuning of vision-language models (VLMs) and personalized training of a compact 3D convolutional neural network (CNN3D)<n>We evaluate accuracy, calibration, and energy usage under realistic non-IID settings.<n>These findings support a hybrid model: lightweight CNNs for routine classification, with selective VLM activation for complex or descriptive scenarios.
Score: 0.27998963147546135
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We examine frugal federated learning approaches to violence detection by comparing two complementary strategies: (i) zero-shot and federated fine-tuning of vision-language models (VLMs), and (ii) personalized training of a compact 3D convolutional neural network (CNN3D). Using LLaVA-7B and a 65.8M parameter CNN3D as representative cases, we evaluate accuracy, calibration, and energy usage under realistic non-IID settings. Both approaches exceed 90% accuracy. CNN3D slightly outperforms Low-Rank Adaptation(LoRA)-tuned VLMs in ROC AUC and log loss, while using less energy. VLMs remain favorable for contextual reasoning and multimodal inference. We quantify energy and CO$_2$ emissions across training and inference, and analyze sustainability trade-offs for deployment. To our knowledge, this is the first comparative study of LoRA-tuned vision-language models and personalized CNNs for federated violence detection, with an emphasis on energy efficiency and environmental metrics. These findings support a hybrid model: lightweight CNNs for routine classification, with selective VLM activation for complex or descriptive scenarios. The resulting framework offers a reproducible baseline for responsible, resource-aware AI in video surveillance, with extensions toward real-time, multimodal, and lifecycle-aware systems.

Related papers

Federated Learning for Video Violence Detection: Complementary Roles of Lightweight CNNs and Vision-Language Models for Energy-Efficient Use [0.30586855806896035]
Federated learning preserves privacy but deploying large vision-preserving models (VLMs) introduces major energy and sustainability challenges.<n>We compare three strategies for federated violence detection under realistic non-IID splits on the RWF-2000 and RLVS datasets.<n>All methods exceed 90% accuracy in binary violence detection.
arXiv Detail & Related papers (2025-11-10T15:01:51Z)
Dual-Stream Diffusion for World-Model Augmented Vision-Language-Action Model [62.889356203346985]
We propose DUal-STream diffusion (DUST), a world-model augmented VLA framework that handles the modality conflict.<n>DUST achieves up to 6% gains over a standard VLA baseline and implicit world-modeling methods.<n>On real-world tasks with the Franka Research 3, DUST outperforms baselines in success rate by 13%.
arXiv Detail & Related papers (2025-10-31T16:32:12Z)
Fine-Tuning Vision-Language Models for Neutrino Event Analysis in High-Energy Physics Experiments [41.33501105382656]
Vision-Language Model (VLM) for classifying neutrino interactions from pixelated detector images in high-energy physics experiments.<n>We benchmark its performance against an established CNN baseline used in experiments like NOvA and DUNE, evaluating metrics such as classification accuracy, precision, recall, and AUC-ROC.<n>Our results show that the VLM not only matches or exceeds CNN performance but also enables richer reasoning and better integration of auxiliary textual or semantic context.
arXiv Detail & Related papers (2025-08-26T19:12:28Z)
Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1 [53.894789613838654]
We introduce SEED-Bench-R1, a benchmark designed to evaluate post-training methods for MLLMs in video understanding.<n>It includes intricate real-world videos and complex everyday planning tasks in the format of multiple-choice questions.<n>Using Qwen2-VL-Instruct-7B as a base model, we compare RL with supervised fine-tuning (SFT)<n>Our detailed analysis reveals that RL enhances visual perception but often produces less coherent reasoning chains.
arXiv Detail & Related papers (2025-03-31T17:55:23Z)
SPARC: Score Prompting and Adaptive Fusion for Zero-Shot Multi-Label Recognition in Vision-Language Models [74.40683913645731]
Zero-shot multi-label recognition (MLR) with Vision-Language Models (VLMs) faces significant challenges without training data, model tuning, or architectural modifications.<n>Our work proposes a novel solution treating VLMs as black boxes, leveraging scores without training data or ground truth.<n>Analysis of these prompt scores reveals VLM biases and AND''/OR' signal ambiguities, notably that maximum scores are surprisingly suboptimal compared to second-highest scores.
arXiv Detail & Related papers (2025-02-24T07:15:05Z)
Learning Dynamic Local Context Representations for Infrared Small Target Detection [5.897465234102489]
Infrared small target detection (ISTD) is challenging due to complex backgrounds, low signal-to-clutter ratios, and varying target sizes and shapes.<n>We propose LCRNet, a novel method that learns dynamic local context representations for ISTD.<n>With only 1.65M parameters, LCRNet achieves state-of-the-art (SOTA) performance.
arXiv Detail & Related papers (2024-12-23T09:06:27Z)
Comprehensive and Comparative Analysis between Transfer Learning and Custom Built VGG and CNN-SVM Models for Wildfire Detection [1.8616107180090005]
This paper examines the efficiency and effectiveness of transfer learning in the context of wildfire detection. Three purpose-built models -- Visual Geometry Group (VGG)-7, VGG-10, and Convolutional Neural Network (CNN)-Support Vector Machine(SVM) CNN-SVM -- are rigorously compared. We trained and evaluated these models using a dataset that captures the complexities of wildfires.
arXiv Detail & Related papers (2024-11-12T20:30:23Z)
Shedding More Light on Robust Classifiers under the lens of Energy-based Models [3.953603590878949]
We offer a new take on the dynamics of adversarial training (AT) Our analysis of the energy landscape during AT reveals that untargeted attacks generate adversarial images much more in-distribution (lower energy) than the original data from the point of view of the model. Motivated by rigorous evidence, we propose Weighted Energy Adversarial Training (WEAT)
arXiv Detail & Related papers (2024-07-08T18:31:19Z)
Class Anchor Margin Loss for Content-Based Image Retrieval [97.81742911657497]
We propose a novel repeller-attractor loss that falls in the metric learning paradigm, yet directly optimize for the L2 metric without the need of generating pairs. We evaluate the proposed objective in the context of few-shot and full-set training on the CBIR task, by using both convolutional and transformer architectures.
arXiv Detail & Related papers (2023-06-01T12:53:10Z)
CONVIQT: Contrastive Video Quality Estimator [63.749184706461826]
Perceptual video quality assessment (VQA) is an integral component of many streaming and video sharing platforms. Here we consider the problem of learning perceptually relevant video quality representations in a self-supervised manner. Our results indicate that compelling representations with perceptual bearing can be obtained using self-supervised learning.
arXiv Detail & Related papers (2022-06-29T15:22:01Z)
From Environmental Sound Representation to Robustness of 2D CNN Models Against Adversarial Attacks [82.21746840893658]
This paper investigates the impact of different standard environmental sound representations (spectrograms) on the recognition performance and adversarial attack robustness of a victim residual convolutional neural network. We show that while the ResNet-18 model trained on DWT spectrograms achieves a high recognition accuracy, attacking this model is relatively more costly for the adversary.
arXiv Detail & Related papers (2022-04-14T15:14:08Z)

This list is automatically generated from the titles and abstracts of the papers in this site.