Towards Adaptive Human-centric Video Anomaly Detection: A Comprehensive Framework and A New Benchmark
- URL: http://arxiv.org/abs/2408.14329v2
- Date: Wed, 19 Mar 2025 18:13:10 GMT
- Title: Towards Adaptive Human-centric Video Anomaly Detection: A Comprehensive Framework and A New Benchmark
- Authors: Armin Danesh Pazho, Shanle Yao, Ghazal Alinezhad Noghre, Babak Rahimi Ardabili, Vinit Katariya, Hamed Tabkhi
- Abstract summary: Human-centric Video Anomaly Detection (VAD) aims to identify human behaviors that deviate from the norm. We introduce the HuVAD (Human-centric privacy-enhanced Video Anomaly Detection) dataset and a novel Unsupervised Continual Anomaly Learning framework.
- Score: 2.473948454680334
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Human-centric Video Anomaly Detection (VAD) aims to identify human behaviors that deviate from the norm. At its core, human-centric VAD faces substantial challenges, such as the complexity of diverse human behaviors, the rarity of anomalies, and ethical constraints. These challenges limit access to high-quality datasets and highlight the need for a dataset and framework supporting continual learning. Moving towards adaptive human-centric VAD, we introduce the HuVAD (Human-centric privacy-enhanced Video Anomaly Detection) dataset and a novel Unsupervised Continual Anomaly Learning (UCAL) framework. UCAL enables incremental learning, allowing models to adapt over time and bridging the gap between traditional training and real-world deployment. HuVAD prioritizes privacy by providing de-identified annotations and includes seven indoor/outdoor scenes, offering over 5x more pose-annotated frames than previous datasets. Our standard and continual benchmarks use a comprehensive set of metrics and demonstrate that UCAL-enhanced models achieve superior performance in 82.14% of cases, setting a new state-of-the-art (SOTA). The dataset can be accessed at https://github.com/TeCSAR-UNCC/HuVAD.
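To make the continual-learning setting concrete, the following is a minimal sketch of an unsupervised continual anomaly learning loop over pose clips: a small autoencoder scores clips by reconstruction error and is periodically re-fit on newly observed, presumably normal data. The architecture, clip length, and pseudo-normal filtering rule are illustrative assumptions, not the UCAL implementation from the paper.

```python
# Minimal sketch of unsupervised continual anomaly learning on pose clips.
# The autoencoder, clip length, and pseudo-normal filtering rule are
# illustrative assumptions, not the UCAL implementation from the paper.
import torch
import torch.nn as nn

WINDOW, JOINTS = 24, 17                       # frames per clip, COCO-style keypoints
FEATURES = WINDOW * JOINTS * 2                # flattened (x, y) coordinates

class PoseAutoencoder(nn.Module):
    def __init__(self, hidden=128, latent=32):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(FEATURES, hidden), nn.ReLU(), nn.Linear(hidden, latent))
        self.dec = nn.Sequential(nn.Linear(latent, hidden), nn.ReLU(), nn.Linear(hidden, FEATURES))

    def forward(self, x):
        return self.dec(self.enc(x))

model = PoseAutoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

def anomaly_score(clips):
    """Higher reconstruction error -> more anomalous behavior."""
    with torch.no_grad():
        return torch.mean((model(clips) - clips) ** 2, dim=-1)

def continual_update(clips, steps=10):
    """Adapt the model on newly observed (assumed mostly normal) clips."""
    for _ in range(steps):
        loss = torch.mean((model(clips) - clips) ** 2)
        opt.zero_grad()
        loss.backward()
        opt.step()

# Example: score incoming clips, then adapt on the least anomalous half.
clips = torch.randn(64, FEATURES)             # placeholder pose clips
scores = anomaly_score(clips)
continual_update(clips[scores < scores.median()])
```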
Related papers
- POET: Prompt Offset Tuning for Continual Human Action Adaptation [61.63831623094721]
We aim to provide users and developers with the capability to personalize their experience by adding new action classes to their device models continually.
We formalize this as privacy-aware few-shot continual action recognition.
We propose a novel spatio-temporal learnable prompt offset tuning approach, and are the first to apply such prompt tuning to Graph Neural Networks.
arXiv Detail & Related papers (2025-04-25T04:11:24Z) - A Data-Centric Revisit of Pre-Trained Vision Models for Robot Learning [67.72413262980272]
Pre-trained vision models (PVMs) are fundamental to modern robotics, yet their optimal configuration remains unclear.
We develop SlotMIM, a method that induces object-centric representations by introducing a semantic bottleneck.
Our approach achieves significant improvements over prior work in image recognition, scene understanding, and robot learning evaluations.
arXiv Detail & Related papers (2025-03-10T06:18:31Z) - AnyAnomaly: Zero-Shot Customizable Video Anomaly Detection with LVLM [1.7051307941715268]
Video anomaly detection (VAD) is crucial for video analysis and surveillance in computer vision.
Existing VAD models rely on learned normal patterns, which makes them difficult to apply to diverse environments.
This study proposes a customizable video anomaly detection (C-VAD) technique and the AnyAnomaly model.
arXiv Detail & Related papers (2025-03-06T14:52:34Z) - Human-Centric Video Anomaly Detection Through Spatio-Temporal Pose Tokenization and Transformer [2.3349787245442966]
Video Anomaly Detection (VAD) presents a significant challenge in computer vision.
Human-centric VAD faces additional complexities, including variations in human behavior, potential biases in data, and privacy concerns related to human subjects.
Recent advancements have focused on pose-based VAD, which leverages human pose as a high-level feature to mitigate privacy concerns, reduce appearance biases, and minimize background interference.
In this paper, we introduce SPARTA, a novel transformer-based architecture designed specifically for human-centric pose-based VAD.
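As a rough illustration of pose-based VAD with a transformer (not SPARTA's actual tokenization or heads), the sketch below embeds each frame's keypoints as one token, encodes the sequence, and scores anomalies by reconstruction error.

```python
# Rough sketch of pose-based VAD with a transformer: each frame's keypoints are
# embedded as one token and the sequence is scored by reconstruction error.
# The tokenization and head below are generic assumptions, not SPARTA's design.
import torch
import torch.nn as nn

class PoseTransformerVAD(nn.Module):
    def __init__(self, joints=17, frames=24, d_model=64):
        super().__init__()
        self.tokenize = nn.Linear(joints * 2, d_model)          # (x, y) per joint -> token
        self.pos = nn.Parameter(torch.zeros(1, frames, d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, joints * 2)              # reconstruct keypoints

    def forward(self, poses):                                   # poses: (B, T, joints*2)
        tokens = self.tokenize(poses) + self.pos
        return self.head(self.encoder(tokens))

model = PoseTransformerVAD()
poses = torch.randn(8, 24, 17 * 2)                              # placeholder pose tracks
scores = ((model(poses) - poses) ** 2).mean(dim=(1, 2))         # per-track anomaly score
```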
arXiv Detail & Related papers (2024-08-27T16:40:14Z) - Self-Training with Pseudo-Label Scorer for Aspect Sentiment Quad Prediction [54.23208041792073]
Aspect Sentiment Quad Prediction (ASQP) aims to predict all quads (aspect term, aspect category, opinion term, sentiment polarity) for a given review.
A key challenge in the ASQP task is the scarcity of labeled data, which limits the performance of existing methods.
We propose a self-training framework with a pseudo-label scorer, wherein a scorer assesses the match between reviews and their pseudo-labels.
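A schematic version of such a scorer-gated self-training loop might look as follows; `model.fit`, `model.predict`, and `scorer.match_score` are placeholder interfaces, not the paper's API.

```python
# Schematic scorer-gated self-training loop: pseudo-labeled reviews are added to
# the training set only if the scorer rates the review/quad match highly enough.
# model.fit, model.predict, and scorer.match_score are placeholder interfaces.
def self_train(model, scorer, labeled, unlabeled, rounds=3, threshold=0.8):
    train_set = list(labeled)
    for _ in range(rounds):
        model.fit(train_set)                              # retrain on current set
        for review in unlabeled:
            quads = model.predict(review)                 # pseudo-label: sentiment quads
            if scorer.match_score(review, quads) >= threshold:
                train_set.append((review, quads))         # keep only confident pairs
    return model
```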
arXiv Detail & Related papers (2024-06-26T05:30:21Z) - PoseBench: Benchmarking the Robustness of Pose Estimation Models under Corruptions [57.871692507044344]
Pose estimation aims to accurately identify anatomical keypoints in humans and animals using monocular images.
Current models are typically trained and tested on clean data, potentially overlooking corruptions that occur during real-world deployment.
We introduce PoseBench, a benchmark designed to evaluate the robustness of pose estimation models against real-world corruption.
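The general evaluation pattern behind such a robustness benchmark can be sketched as below: apply a synthetic corruption at increasing severities and compare keypoint error against the clean baseline. The Gaussian-noise corruption and the error metric are simple stand-ins for the benchmark's actual protocol.

```python
# Illustrative robustness check: apply a synthetic corruption at increasing
# severity and compare keypoint error against the clean baseline. The Gaussian
# noise and error metric are simple stand-ins for the benchmark's protocol.
import numpy as np

def gaussian_noise(image, severity):
    noisy = image + np.random.normal(0, 10 * severity, image.shape)
    return np.clip(noisy, 0, 255).astype(image.dtype)

def mean_keypoint_error(model, images, gt_keypoints):
    preds = np.stack([model(img) for img in images])      # (N, J, 2) predicted keypoints
    return float(np.linalg.norm(preds - gt_keypoints, axis=-1).mean())

def robustness_report(model, images, gt_keypoints):
    report = {"clean": mean_keypoint_error(model, images, gt_keypoints)}
    for severity in range(1, 6):
        corrupted = [gaussian_noise(img, severity) for img in images]
        report[f"noise_s{severity}"] = mean_keypoint_error(model, corrupted, gt_keypoints)
    return report
```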
arXiv Detail & Related papers (2024-06-20T14:40:17Z) - Federated Face Forgery Detection Learning with Personalized Representation [63.90408023506508]
Deep generator technology can produce high-quality fake videos that are indistinguishable from real ones, posing a serious social threat.
Traditional forgery detection methods rely on directly centralizing training data.
The paper proposes a novel federated face forgery detection learning with personalized representation.
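A toy sketch of federated training with a personalized component is given below: clients average a shared backbone while each keeps its own local head. This is a generic FedAvg-style illustration under assumed interfaces, not the paper's exact algorithm.

```python
# Toy sketch of federated forgery detection with a personalized component:
# clients share and average a common backbone while each keeps its own local
# head. A generic FedAvg-style illustration, not the paper's exact algorithm.
import copy
import torch
import torch.nn as nn

def local_train(backbone, head, loader, epochs=1, lr=1e-3):
    opt = torch.optim.SGD(list(backbone.parameters()) + list(head.parameters()), lr=lr)
    loss_fn = nn.BCEWithLogitsLoss()                  # real-vs-fake labels in {0, 1}
    for _ in range(epochs):
        for x, y in loader:
            loss = loss_fn(head(backbone(x)).squeeze(-1), y.float())
            opt.zero_grad()
            loss.backward()
            opt.step()
    return backbone.state_dict()

def federated_round(global_backbone, clients):
    """clients: list of (local_backbone, local_head, data_loader) tuples."""
    states = []
    for backbone, head, loader in clients:
        backbone.load_state_dict(global_backbone.state_dict())   # sync shared part
        states.append(local_train(backbone, head, loader))       # head never leaves the client
    averaged = copy.deepcopy(states[0])
    for key in averaged:                                          # average shared weights
        averaged[key] = torch.stack([s[key].float() for s in states]).mean(dim=0)
    global_backbone.load_state_dict(averaged)
    return global_backbone
```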
arXiv Detail & Related papers (2024-06-17T02:20:30Z) - BaboonLand Dataset: Tracking Primates in the Wild and Automating Behaviour Recognition from Drone Videos [0.8074955699721389]
This study presents a novel dataset from drone videos for baboon detection, tracking, and behavior recognition.
The baboon detection dataset was created by manually annotating all baboons in drone videos with bounding boxes.
The behavior recognition dataset was generated by converting tracks into mini-scenes, a video subregion centered on each animal.
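A mini-scene can be extracted, for example, by cropping a fixed-size window around each tracked bounding-box center in every frame; the crop size and zero-padding below are illustrative assumptions.

```python
# Illustrative mini-scene extraction: crop a fixed-size window centered on each
# tracked animal's bounding-box center in every frame, yielding a per-individual
# video snippet. Crop size and zero-padding at frame borders are assumptions.
import numpy as np

def extract_mini_scene(frames, track, size=256):
    """frames: list of HxWx3 arrays; track: list of (cx, cy) box centers per frame."""
    half = size // 2
    clip = []
    for frame, (cx, cy) in zip(frames, track):
        h, w = frame.shape[:2]
        x0, y0 = int(cx) - half, int(cy) - half
        xs = slice(max(x0, 0), min(x0 + size, w))
        ys = slice(max(y0, 0), min(y0 + size, h))
        patch = np.zeros((size, size, 3), dtype=frame.dtype)    # zero padding off-frame
        patch[ys.start - y0 : ys.stop - y0, xs.start - x0 : xs.stop - x0] = frame[ys, xs]
        clip.append(patch)
    return np.stack(clip)                                        # (T, size, size, 3)
```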
arXiv Detail & Related papers (2024-05-27T23:09:37Z) - Data Augmentation in Human-Centric Vision [54.97327269866757]
This survey presents a comprehensive analysis of data augmentation techniques in human-centric vision tasks.
It delves into a wide range of research areas including person ReID, human parsing, human pose estimation, and pedestrian detection.
Our work categorizes data augmentation methods into two main types: data generation and data perturbation.
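A tiny example from the data perturbation family (random flip plus occlusion-style erasing) is sketched below; the parameters are arbitrary choices for illustration.

```python
# Tiny example of the "data perturbation" family: random horizontal flip plus
# occlusion-style random erasing applied to a person crop. Parameters are
# arbitrary choices for illustration.
import numpy as np

def perturb(image, rng=np.random.default_rng()):
    out = image.copy()
    if rng.random() < 0.5:
        out = out[:, ::-1].copy()                   # horizontal flip
    h, w = out.shape[:2]
    eh, ew = max(1, h // 5), max(1, w // 5)         # size of the erased patch
    y = int(rng.integers(0, h - eh + 1))
    x = int(rng.integers(0, w - ew + 1))
    out[y:y + eh, x:x + ew] = 0                     # random erasing (occlusion)
    return out
```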
arXiv Detail & Related papers (2024-03-13T16:05:18Z) - Learning Human Action Recognition Representations Without Real Humans [66.61527869763819]
We present a benchmark that leverages real-world videos with humans removed and synthetic data containing virtual humans to pre-train a model.
We then evaluate the transferability of the representation learned on this data to a diverse set of downstream action recognition benchmarks.
Our approach outperforms previous baselines by up to 5%.
arXiv Detail & Related papers (2023-11-10T18:38:14Z) - EGOFALLS: A visual-audio dataset and benchmark for fall detection using egocentric cameras [0.16317061277456998]
Falls are significant and often fatal for vulnerable populations such as the elderly.
Previous works have addressed the detection of falls by relying on data captured by a single sensor, such as images or accelerometers.
In this work, we rely on multimodal descriptors extracted from videos captured by egocentric cameras.
arXiv Detail & Related papers (2023-09-08T20:14:25Z) - ADG-Pose: Automated Dataset Generation for Real-World Human Pose Estimation [2.4956060473718407]
This article presents ADG-Pose, a method for automatically generating datasets for real-world human pose estimation.
arXiv Detail & Related papers (2022-02-01T20:51:58Z) - Beyond Tracking: Using Deep Learning to Discover Novel Interactions in Biological Swarms [3.441021278275805]
We propose training deep network models to predict system-level states directly from generic graphical features from the entire view.
Because the resulting predictive models are not based on human-understood predictors, we use explanatory modules.
This represents an example of augmented intelligence in behavioral ecology -- knowledge co-creation in a human-AI team.
arXiv Detail & Related papers (2021-08-20T22:50:41Z) - Vision-based Behavioral Recognition of Novelty Preference in Pigs [1.837722971703011]
Behavioral scoring of research data is crucial for extracting domain-specific metrics but is bottlenecked by the reliance on human labor to analyze enormous volumes of information.
Deep learning is widely viewed as a key advancement to relieve this bottleneck.
We identify one such domain, where deep learning can be leveraged to alleviate the process of manual scoring.
arXiv Detail & Related papers (2021-06-23T06:10:34Z) - Continual Learning for Blind Image Quality Assessment [80.55119990128419]
Blind image quality assessment (BIQA) models fail to continually adapt to subpopulation shift.
Recent work suggests training BIQA methods on the combination of all available human-rated IQA datasets.
We formulate continual learning for BIQA, where a model learns continually from a stream of IQA datasets.
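One plausible shape of such a continual loop is sketched below: the model is fine-tuned on each IQA dataset in turn, with a small replay buffer of earlier samples to limit forgetting. The replay strategy is an illustrative assumption, not the paper's method.

```python
# Minimal sketch of continual learning over a stream of IQA datasets: fine-tune
# on each dataset in turn, mixing in a small replay buffer of earlier samples to
# limit forgetting. The replay strategy is an illustrative assumption.
import random
import torch
import torch.nn as nn

def continual_biqa(model, dataset_stream, buffer_size=512, epochs=1, lr=1e-4):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()                            # regress mean opinion scores
    replay = []
    for dataset in dataset_stream:                    # each dataset: list of (image_tensor, mos)
        mixed = list(dataset) + replay
        for _ in range(epochs):
            random.shuffle(mixed)
            for image, mos in mixed:
                pred = model(image.unsqueeze(0)).squeeze()
                loss = loss_fn(pred, torch.tensor(float(mos)))
                opt.zero_grad()
                loss.backward()
                opt.step()
        replay = (replay + random.sample(list(dataset), min(len(dataset), 64)))[-buffer_size:]
    return model
```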
arXiv Detail & Related papers (2021-02-19T03:07:01Z) - Towards Accurate Human Pose Estimation in Videos of Crowded Scenes [134.60638597115872]
We focus on improving human pose estimation in videos of crowded scenes from the perspectives of exploiting temporal context and collecting new data.
For each frame, we propagate historical poses forward from previous frames and future poses backward from subsequent frames to the current frame, leading to stable and accurate human pose estimation in videos.
In this way, our model achieves the best performance on 7 out of 13 videos and 56.33 average w_AP on the test set of the HIE challenge.
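A stripped-down illustration of fusing forward and backward temporal context is shown below; real systems warp poses using motion cues, whereas this sketch simply averages neighbouring estimates.

```python
# Stripped-down illustration of fusing temporal context: poses from previous and
# subsequent frames are averaged with the current estimate. Real systems warp
# poses using motion cues; a plain average is used here only to show the idea.
import numpy as np

def temporally_smoothed_pose(poses, t, window=2):
    """poses: (T, J, 2) per-frame keypoints; returns a refined pose for frame t."""
    lo, hi = max(0, t - window), min(len(poses), t + window + 1)
    return poses[lo:hi].mean(axis=0)                  # forward + backward context fused
```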
arXiv Detail & Related papers (2020-10-16T13:19:11Z) - The AVA-Kinetics Localized Human Actions Video Dataset [124.41706958756049]
This paper describes the AVA-Kinetics localized human actions video dataset.
The dataset is collected by annotating videos from the Kinetics-700 dataset using the AVA annotation protocol.
The dataset contains over 230k clips annotated with the 80 AVA action classes for each of the humans in key-frames.
arXiv Detail & Related papers (2020-05-01T04:17:14Z) - DeeperForensics-1.0: A Large-Scale Dataset for Real-World Face Forgery Detection [93.24684159708114]
DeeperForensics-1.0 is the largest face forgery detection dataset to date, comprising 60,000 videos with a total of 17.6 million frames.
The quality of generated videos outperforms those in existing datasets, validated by user studies.
The benchmark features a hidden test set, which contains manipulated videos achieving high deceptive scores in human evaluations.
arXiv Detail & Related papers (2020-01-09T14:37:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information and is not responsible for any consequences of its use.