Advancing Video Anomaly Detection: A Bi-Directional Hybrid Framework for Enhanced Single- and Multi-Task Approaches
- URL: http://arxiv.org/abs/2504.14753v1
- Date: Sun, 20 Apr 2025 22:27:24 GMT
- Title: Advancing Video Anomaly Detection: A Bi-Directional Hybrid Framework for Enhanced Single- and Multi-Task Approaches
- Authors: Guodong Shen, Yuqi Ouyang, Junru Lu, Yixuan Yang, Victor Sanchez
- Abstract summary: We introduce an effective hybrid framework designed to generate accurate predictions for normal frames and flawed predictions for abnormal frames. We develop a convolutional temporal transformer that efficiently associates feature maps from all context frames to generate attention-based predictions for target frames. Anomalies are eventually identified by scrutinizing the discrepancies between target frames and their corresponding predictions.
- Score: 16.96592682625058
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite the prevailing transition from single-task to multi-task approaches in video anomaly detection, we observe that many adopt sub-optimal frameworks for individual proxy tasks. Motivated by this, we contend that optimizing single-task frameworks can advance both single- and multi-task approaches. Accordingly, we leverage middle-frame prediction as the primary proxy task, and introduce an effective hybrid framework designed to generate accurate predictions for normal frames and flawed predictions for abnormal frames. This hybrid framework is built upon a bi-directional structure that seamlessly integrates both vision transformers and ConvLSTMs. Specifically, we utilize this bi-directional structure to fully analyze the temporal dimension by predicting frames in both forward and backward directions, significantly boosting the detection stability. Given the transformer's capacity to model long-range contextual dependencies, we develop a convolutional temporal transformer that efficiently associates feature maps from all context frames to generate attention-based predictions for target frames. Furthermore, we devise a layer-interactive ConvLSTM bridge that facilitates the smooth flow of low-level features across layers and time-steps, thereby strengthening predictions with fine details. Anomalies are eventually identified by scrutinizing the discrepancies between target frames and their corresponding predictions. Several experiments conducted on public benchmarks affirm the efficacy of our hybrid framework, whether used as a standalone single-task approach or integrated as a branch in a multi-task approach. These experiments also underscore the advantages of merging vision transformers and ConvLSTMs for video anomaly detection.
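The detection step admits a short illustration: score each target frame by how badly it is predicted from its temporal context in both directions. Below is a hedged PyTorch sketch of that idea; the paper's exact error metric (PSNR-based scoring is common in video anomaly detection) and its fusion of the two directions may differ, and `model_fwd`/`model_bwd` are hypothetical predictor names.

```python
import torch
import torch.nn.functional as F

def anomaly_score(target, pred_fwd, pred_bwd):
    """Score one target frame by its bi-directional prediction error.

    target, pred_fwd, pred_bwd: (C, H, W) tensors. Higher score = more anomalous.
    """
    err_fwd = F.mse_loss(pred_fwd, target)
    err_bwd = F.mse_loss(pred_bwd, target)
    # Average the two directions: normal frames should be well predicted
    # from both past and future context, abnormal frames from neither.
    return 0.5 * (err_fwd + err_bwd)

# Usage sketch with hypothetical forward/backward predictors:
# score_t = anomaly_score(frames[t], model_fwd(frames[t-k:t]), model_bwd(frames[t+1:t+k+1]))
```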
Related papers
- DyTTP: Trajectory Prediction with Normalization-Free Transformers [0.0]
Transformer-based architectures have demonstrated significant promise in capturing complex spatiotemporal dependencies.
We present a two-fold approach to address these challenges.
First, we integrate DynamicTanh (DyT), a recently proposed normalization-free technique for transformers, into the backbone, replacing traditional layer normalization.
We are the first to apply DyT to the trajectory prediction task.
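For reference, DyT replaces each normalization layer with a learnable element-wise squashing, DyT(x) = gamma * tanh(alpha * x) + beta; a minimal PyTorch sketch, with parameter shapes following the usual LayerNorm convention:

```python
import torch
import torch.nn as nn

class DynamicTanh(nn.Module):
    """Drop-in replacement for LayerNorm: DyT(x) = gamma * tanh(alpha * x) + beta."""

    def __init__(self, dim: int, alpha_init: float = 0.5):
        super().__init__()
        self.alpha = nn.Parameter(torch.full((1,), alpha_init))  # learnable scalar
        self.gamma = nn.Parameter(torch.ones(dim))                # per-channel scale
        self.beta = nn.Parameter(torch.zeros(dim))                # per-channel shift

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # No mean/variance statistics are computed, unlike LayerNorm.
        return self.gamma * torch.tanh(self.alpha * x) + self.beta
```

Because it avoids computing per-token statistics, DyT removes the reduction operations that layer normalization requires.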
arXiv Detail & Related papers (2025-04-07T09:26:25Z)
- BCTR: Bidirectional Conditioning Transformer for Scene Graph Generation [4.977568882858193]
We propose a novel bidirectional conditioning factorization in a semantic-aligned space for Scene Graph Generation (SGG).
We introduce an end-to-end scene graph generation model, the Bidirectional Conditioning Transformer (BCTR).
BCTR consists of two key modules. First, the Bidirectional Conditioning Generator (BCG) performs multi-stage interactive feature augmentation between entities and predicates, enabling mutual enhancement between these predictions.
Second, Random Feature Alignment (RFA) is presented to regularize the feature space by distilling multi-modal knowledge from pre-trained models. Within this regularized feature space, BCG can capture entity-predicate interactions more reliably.
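The distillation component of RFA can be pictured as aligning the model's features with those of a frozen pre-trained encoder. A generic sketch of such an alignment loss, not necessarily the paper's exact RFA objective (`student_feats`/`teacher_feats` are assumed names):

```python
import torch.nn.functional as F

def feature_alignment_loss(student_feats, teacher_feats):
    # Cosine-distance alignment to frozen pre-trained (teacher) features;
    # gradients through the teacher are blocked so only the student moves.
    s = F.normalize(student_feats, dim=-1)
    t = F.normalize(teacher_feats.detach(), dim=-1)
    return (1.0 - (s * t).sum(dim=-1)).mean()
```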
arXiv Detail & Related papers (2024-07-26T13:02:48Z)
- S^2Former-OR: Single-Stage Bi-Modal Transformer for Scene Graph Generation in OR [50.435592120607815]
Scene graph generation (SGG) of surgical procedures is crucial for enhancing holistic cognitive intelligence in the operating room (OR).
Previous works have primarily relied on multi-stage learning, where the generated semantic scene graphs depend on intermediate processes such as pose estimation and object detection.
In this study, we introduce a novel single-stage bi-modal transformer framework for SGG in the OR, termed S2Former-OR.
arXiv Detail & Related papers (2024-02-22T11:40:49Z)
- AMT: All-Pairs Multi-Field Transforms for Efficient Frame Interpolation [80.33846577924363]
We present All-Pairs Multi-Field Transforms (AMT), a new network architecture for video frame interpolation.
It is based on two essential designs. First, we build bidirectional correlation volumes for all pairs of pixels, and use the predicted bilateral flows to retrieve correlations.
Second, we derive multiple groups of fine-grained flow fields from one pair of updated coarse flows for performing backward warping on the input frames separately.
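Backward warping, the operation named above, samples the source frame at positions displaced by a flow field; a minimal PyTorch sketch:

```python
import torch
import torch.nn.functional as F

def backward_warp(frame, flow):
    """Sample `frame` at locations displaced by `flow` (backward warping).

    frame: (B, C, H, W); flow: (B, 2, H, W) in pixel units (x then y).
    """
    b, _, h, w = frame.shape
    # Base sampling grid in pixel coordinates.
    ys, xs = torch.meshgrid(
        torch.arange(h, device=frame.device, dtype=frame.dtype),
        torch.arange(w, device=frame.device, dtype=frame.dtype),
        indexing="ij",
    )
    grid_x = xs.unsqueeze(0) + flow[:, 0]  # (B, H, W)
    grid_y = ys.unsqueeze(0) + flow[:, 1]
    # Normalize to [-1, 1] as grid_sample expects.
    grid = torch.stack(
        (2.0 * grid_x / (w - 1) - 1.0, 2.0 * grid_y / (h - 1) - 1.0), dim=-1
    )
    return F.grid_sample(frame, grid, align_corners=True)
```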
arXiv Detail & Related papers (2023-04-19T16:18:47Z)
- A Hierarchical Hybrid Learning Framework for Multi-agent Trajectory Prediction [4.181632607997678]
We propose a hierarchical hybrid framework of deep learning (DL) and reinforcement learning (RL) for multi-agent trajectory prediction.
In the DL stage, the traffic scene is divided into multiple intermediate-scale heterogeneous graphs, based on which Transformer-style GNNs are adopted to encode heterogeneous interactions.
In the RL stage, we divide the traffic scene into local sub-scenes utilizing the key future points predicted in the DL stage.
arXiv Detail & Related papers (2023-03-22T02:47:42Z)
- Understanding and Constructing Latent Modality Structures in Multi-modal Representation Learning [53.68371566336254]
We argue that the key to better performance lies in meaningful latent modality structures instead of perfect modality alignment.
Specifically, we design 1) a deep feature separation loss for intra-modality regularization; 2) a Brownian-bridge loss for inter-modality regularization; and 3) a geometric consistency loss for both intra- and inter-modality regularization.
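Of the three, the geometric consistency term is the easiest to picture: it asks the two modalities to share similarity structure. A hedged sketch under that reading (the paper's exact formulation may differ):

```python
import torch
import torch.nn.functional as F

def geometric_consistency_loss(img_emb, txt_emb):
    # Match intra-modality similarity structure across modalities:
    # batch items that are close as images should be close as texts.
    img = F.normalize(img_emb, dim=-1)  # (B, D)
    txt = F.normalize(txt_emb, dim=-1)  # (B, D)
    return F.mse_loss(img @ img.t(), txt @ txt.t())
```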
arXiv Detail & Related papers (2023-03-10T14:38:49Z)
- Embracing Consistency: A One-Stage Approach for Spatio-Temporal Video Grounding [35.73830796500975]
We present an end-to-end one-stage framework, termed Spatio-Temporal Consistency-Aware Transformer (STCAT).
To generate the above template under sufficient video-text perception, an encoder-decoder architecture is proposed for effective global context modeling.
Our method outperforms previous state-of-the-art methods by clear margins on two challenging video benchmarks.
arXiv Detail & Related papers (2022-09-27T11:13:04Z)
- Collaborative Uncertainty Benefits Multi-Agent Multi-Modal Trajectory Forecasting [61.02295959343446]
This work first proposes a novel concept, collaborative uncertainty (CU), which models the uncertainty resulting from interaction modules. We build a general CU-aware regression framework with an original permutation-equivariant uncertainty estimator to perform both regression and uncertainty estimation. We apply the proposed framework to current SOTA multi-agent trajectory forecasting systems as a plug-in module.
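A common way to train regression and uncertainty estimation jointly, in trajectory forecasting and elsewhere, is a distributional negative log-likelihood. A sketch under a Laplace assumption; the paper's permutation-equivariant estimator is more involved:

```python
import torch

def laplace_nll(pred, scale, target):
    # Negative log-likelihood of Laplace(pred, scale); learning `scale`
    # jointly with `pred` yields a per-output uncertainty estimate.
    scale = scale.clamp(min=1e-6)
    return (torch.log(2.0 * scale) + (target - pred).abs() / scale).mean()
```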
arXiv Detail & Related papers (2022-07-11T21:17:41Z)
- Consistency Regularization for Deep Face Anti-Spoofing [69.70647782777051]
Face anti-spoofing (FAS) plays a crucial role in securing face recognition systems.
Motivated by this exciting observation, we conjecture that encouraging feature consistency of different views may be a promising way to boost FAS models.
We propose Embedding-level and Prediction-level Consistency Regularization (EPCR) for FAS.
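The two consistency levels can be sketched generically: pull together the embeddings and the predictions that a model produces for two augmented views of the same face. A minimal illustration, assuming `model` returns an (embedding, logits) pair (not the paper's exact losses):

```python
import torch.nn.functional as F

def consistency_losses(model, view1, view2):
    # Two augmented views of the same face should agree at both levels.
    emb1, logits1 = model(view1)
    emb2, logits2 = model(view2)
    emb_loss = F.mse_loss(F.normalize(emb1, dim=-1), F.normalize(emb2, dim=-1))
    pred_loss = F.mse_loss(logits1.softmax(dim=-1), logits2.softmax(dim=-1))
    return emb_loss, pred_loss
```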
arXiv Detail & Related papers (2021-11-24T08:03:48Z)
- THOMAS: Trajectory Heatmap Output with learned Multi-Agent Sampling [2.424910201171407]
We present a unified model architecture for fast and simultaneous agent future heatmap estimation.
Generating scene-consistent predictions goes beyond the mere generation of collision-free trajectories.
We report our results on the Interaction multi-agent prediction challenge and rank 1st on the online test leaderboard.
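Heatmap-based trajectory output can be pictured as reading off the most likely future positions from a spatial probability map; a simplified stand-in for the paper's learned sampling (`heatmap` is an assumed (H, W) tensor):

```python
import torch

def top_k_endpoints(heatmap, k=6):
    # Return the k most likely (row, col) endpoints and their scores
    # from an (H, W) probability heatmap.
    _, w = heatmap.shape
    scores, idx = heatmap.flatten().topk(k)
    rows = idx.div(w, rounding_mode="floor")
    cols = idx % w
    return torch.stack((rows, cols), dim=-1), scores  # (k, 2), (k,)
```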
arXiv Detail & Related papers (2021-10-13T10:05:47Z)
- Visual Saliency Transformer [127.33678448761599]
We develop a novel unified model based on a pure transformer, Visual Saliency Transformer (VST), for both RGB and RGB-D salient object detection (SOD).
It takes image patches as inputs and leverages the transformer to propagate global contexts among image patches.
Experimental results show that our model outperforms existing state-of-the-art results on both RGB and RGB-D SOD benchmark datasets.
arXiv Detail & Related papers (2021-04-25T08:24:06Z)
- Generalized Adversarially Learned Inference [42.40405470084505]
We develop methods for inferring latent variables in GANs by adversarially training an image generator along with an encoder to match two joint distributions of image and latent vector pairs.
We incorporate multiple layers of feedback on reconstructions, self-supervision, and other forms of supervision based on prior or learned knowledge about the desired solutions.
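The joint-distribution matching described here follows the ALI/BiGAN recipe: a discriminator scores (image, latent) pairs, seeing real images with encoded latents versus generated images with sampled latents. A minimal sketch of that base objective; the paper adds further feedback and supervision on top (`disc`, `enc`, `gen` are assumed callables):

```python
import torch
import torch.nn.functional as F

def ali_discriminator_loss(disc, enc, gen, x, z):
    # The discriminator separates real pairs (x, E(x)) from generated
    # pairs (G(z), z); matching the two joint distributions makes the
    # encoder approximately invert the generator.
    real_logits = disc(x, enc(x))
    fake_logits = disc(gen(z), z)
    real = F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits))
    fake = F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits))
    return real + fake
```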
arXiv Detail & Related papers (2020-06-15T02:18:13Z)