Related papers: Human-Centric Video Anomaly Detection Through Spatio-Temporal Pose Tokenization and Transformer

URL: http://arxiv.org/abs/2408.15185v2
Date: Mon, 17 Mar 2025 14:05:49 GMT
Title: Human-Centric Video Anomaly Detection Through Spatio-Temporal Pose Tokenization and Transformer
Authors: Ghazal Alinezhad Noghre, Armin Danesh Pazho, Hamed Tabkhi
Abstract summary: Video Anomaly Detection (VAD) presents a significant challenge in computer vision. Human-centric VAD faces additional complexities, including variations in human behavior, potential biases in data, and privacy concerns related to human subjects. Recent advancements have focused on pose-based VAD, which leverages human pose as a high-level feature to mitigate privacy concerns, reduce appearance biases, and minimize background interference. In this paper, we introduce SPARTA, a novel transformer-based architecture designed specifically for human-centric pose-based VAD.
Score: 2.3349787245442966
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Video Anomaly Detection (VAD) presents a significant challenge in computer vision, particularly due to the unpredictable and infrequent nature of anomalous events, coupled with the diverse and dynamic environments in which they occur. Human-centric VAD, a specialized area within this domain, faces additional complexities, including variations in human behavior, potential biases in data, and substantial privacy concerns related to human subjects. These issues complicate the development of models that are both robust and generalizable. To address these challenges, recent advancements have focused on pose-based VAD, which leverages human pose as a high-level feature to mitigate privacy concerns, reduce appearance biases, and minimize background interference. In this paper, we introduce SPARTA, a novel transformer-based architecture designed specifically for human-centric pose-based VAD. SPARTA introduces an innovative Spatio-Temporal Pose and Relative Pose (ST-PRP) tokenization method that produces an enriched representation of human motion over time. This approach ensures that the transformer's attention mechanism captures both spatial and temporal patterns simultaneously, rather than focusing on only one aspect. The addition of the relative pose further emphasizes subtle deviations from normal human movements. The architecture's core, a novel Unified Encoder Twin Decoders (UETD) transformer, significantly improves the detection of anomalous behaviors in video data. Extensive evaluations across multiple benchmark datasets demonstrate that SPARTA consistently outperforms existing methods, establishing a new state-of-the-art in pose-based VAD.

Related papers

Exploring Mutual Cross-Modal Attention for Context-Aware Human Affordance Generation [18.73832646369506]
We propose a novel cross-attention mechanism to encode the scene context for affordance prediction in 2D scenes. First, we sample a probable location for a person within the scene using a variational autoencoder conditioned on the global scene context encoding. Next, we predict a potential pose template from a set of existing human pose candidates using a classifier on the local context encoding.
arXiv Detail & Related papers (2025-02-19T11:24:45Z)
Towards Robust and Realistic Human Pose Estimation via WiFi Signals [85.60557095666934]
WiFi-based human pose estimation is a challenging task that bridges discrete and subtle WiFi signals to human skeletons. This paper revisits this problem and reveals two critical yet overlooked issues: 1) a cross-domain gap, caused by significant variations between source and target domain pose distributions; and 2) a structural fidelity gap, in which predicted skeletal poses manifest distorted topology. This paper fills these gaps by reformulating the task into a novel two-phase framework dubbed DT-Pose: Domain-consistent representation learning and Topology-constrained Pose decoding.
arXiv Detail & Related papers (2025-01-16T09:38:22Z)
Towards Adaptive Human-centric Video Anomaly Detection: A Comprehensive Framework and A New Benchmark [2.473948454680334]
Human-centric Video Anomaly Detection (VAD) aims to identify human behaviors that deviate from normal. We introduce the HuVAD (Human-centric privacy-enhanced Video Anomaly Detection) dataset and a novel Unsupervised Continual Anomaly Learning framework.
arXiv Detail & Related papers (2024-08-26T14:55:23Z)
EgoGaussian: Dynamic Scene Understanding from Egocentric Video with 3D Gaussian Splatting [95.44545809256473]
EgoGaussian is a method capable of simultaneously reconstructing 3D scenes and dynamically tracking 3D object motion from RGB egocentric input alone. We show significant improvements in terms of both dynamic object and background reconstruction quality compared to the state-of-the-art.
arXiv Detail & Related papers (2024-06-28T10:39:36Z)
A Reliable Framework for Human-in-the-Loop Anomaly Detection in Time Series [17.08674819906415]
We introduce HILAD, a novel framework designed to foster a dynamic and bidirectional collaboration between humans and AI. Through our visual interface, HILAD empowers domain experts to detect, interpret, and correct unexpected model behaviors at scale.
arXiv Detail & Related papers (2024-05-06T07:44:07Z)
Closely Interactive Human Reconstruction with Proxemics and Physics-Guided Adaption [64.07607726562841]
Existing multi-person human reconstruction approaches mainly focus on recovering accurate poses or avoiding penetration. In this work, we tackle the task of reconstructing closely interactive humans from a monocular video. We propose to leverage knowledge from proxemic behavior and physics to compensate for the lack of visual information.
arXiv Detail & Related papers (2024-04-17T11:55:45Z)
Within the Dynamic Context: Inertia-aware 3D Human Modeling with Pose Sequence [47.16903508897047]
In this study, we elucidate that variations in human appearance depend not only on the current frame's pose condition but also on past pose states. We introduce Dyco, a novel method utilizing the delta pose sequence representation for non-rigid deformations. In addition, our inertia-aware 3D human method can unprecedentedly simulate appearance changes caused by inertia at different velocities.
arXiv Detail & Related papers (2024-03-28T06:05:14Z)
DMAT: A Dynamic Mask-Aware Transformer for Human De-occlusion [5.901982216097867]
Human de-occlusion aims to infer the appearance of invisible human parts from an occluded image. This paper proposes a dynamic mask-aware transformer (DMAT), which dynamically augments information from human regions. Experiments on the AHP dataset demonstrate its superior performance compared to recent state-of-the-art methods.
arXiv Detail & Related papers (2024-02-07T03:36:41Z)
Towards Precise 3D Human Pose Estimation with Multi-Perspective Spatial-Temporal Relational Transformers [28.38686299271394]
We propose a framework for 3D sequence-to-sequence (seq2seq) human pose detection. Firstly, the spatial module represents the human pose feature by intra-image content, while the frame-image relation module extracts temporal relationships. Our method is evaluated on Human3.6M, a popular 3D human pose detection dataset.
arXiv Detail & Related papers (2024-01-30T03:00:25Z)
PoseExaminer: Automated Testing of Out-of-Distribution Robustness in Human Pose and Shape Estimation [15.432266117706018]
We develop a simulator that can be controlled in a fine-grained manner to explore the manifold of images of human pose. We introduce a learning-based testing method, termed PoseExaminer, that automatically diagnoses HPS algorithms. We show that our PoseExaminer discovers a variety of limitations in current state-of-the-art models that are relevant in real-world scenarios.
arXiv Detail & Related papers (2023-03-13T17:58:54Z)
Pose-Oriented Transformer with Uncertainty-Guided Refinement for 2D-to-3D Human Pose Estimation [51.00725889172323]
We propose a Pose-Oriented Transformer (POT) with uncertainty-guided refinement for 3D human pose estimation. We first develop a novel pose-oriented self-attention mechanism and distance-related position embedding for POT to explicitly exploit the human skeleton topology. We present an Uncertainty-Guided Refinement Network (UGRN) to refine pose predictions from POT, especially for the difficult joints.
arXiv Detail & Related papers (2023-02-15T00:22:02Z)
Progressive Multi-view Human Mesh Recovery with Self-Supervision [68.60019434498703]
Existing solutions typically suffer from poor generalization performance to new settings. We propose a novel simulation-based training pipeline for multi-view human mesh recovery.
arXiv Detail & Related papers (2022-12-10T06:28:29Z)
Differentiable Frequency-based Disentanglement for Aerial Video Action Recognition [56.91538445510214]
We present a learning algorithm for human activity recognition in videos. Our approach is designed for UAV videos, which are mainly acquired from obliquely placed dynamic cameras. We conduct extensive experiments on the UAV Human dataset and the NEC Drone dataset.
arXiv Detail & Related papers (2022-09-15T22:16:52Z)
HiT-DVAE: Human Motion Generation via Hierarchical Transformer Dynamical VAE [37.23381308240617]
We propose Hierarchical Transformer Dynamical Variational Autoencoder, HiT-DVAE, which implements auto-regressive generation with transformer-like attention mechanisms. We evaluate the proposed method on HumanEva-I and Human3.6M with various evaluation methods, and outperform the state-of-the-art methods on most of the metrics.
arXiv Detail & Related papers (2022-04-04T15:12:34Z)
HSPACE: Synthetic Parametric Humans Animated in Complex Environments [67.8628917474705]
We build a large-scale photo-realistic dataset, Human-SPACE, of animated humans placed in complex indoor and outdoor environments. We combine a hundred diverse individuals of varying ages, gender, proportions, and ethnicity, with hundreds of motions and scenes, in order to generate an initial dataset of over 1 million frames. Assets are generated automatically, at scale, and are compatible with existing real time rendering and game engines.
arXiv Detail & Related papers (2021-12-23T22:27:55Z)
Estimating 3D Motion and Forces of Human-Object Interactions from Internet Videos [49.52070710518688]
We introduce a method to reconstruct the 3D motion of a person interacting with an object from a single RGB video. Our method estimates the 3D poses of the person together with the object pose, the contact positions and the contact forces on the human body.
arXiv Detail & Related papers (2021-11-02T13:40:18Z)
TRiPOD: Human Trajectory and Pose Dynamics Forecasting in the Wild [77.59069361196404]
TRiPOD is a novel method for predicting body dynamics based on graph attentional networks. To incorporate a real-world challenge, we learn an indicator representing whether an estimated body joint is visible or invisible at each frame. Our evaluation shows that TRiPOD outperforms all prior work and state-of-the-art methods specifically designed for each of the trajectory and pose forecasting tasks.
arXiv Detail & Related papers (2021-04-08T20:01:00Z)

This list is automatically generated from the titles and abstracts of the papers on this site.