Memory-augmented Online Video Anomaly Detection
- URL: http://arxiv.org/abs/2302.10719v2
- Date: Wed, 27 Sep 2023 13:14:41 GMT
- Title: Memory-augmented Online Video Anomaly Detection
- Authors: Leonardo Rossi, Vittorio Bernuzzi, Tomaso Fontanini, Massimo Bertozzi,
Andrea Prati
- Abstract summary: This paper presents a system capable of working in an online fashion, exploiting only the videos captured by a dash-mounted camera.
MOVAD reaches an AUC score of 82.17%, surpassing the current state-of-the-art by +2.87 AUC.
- Score: 2.269915940890348
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: The ability to understand the surrounding scene is of paramount importance
for Autonomous Vehicles (AVs). This paper presents a system capable of working
in an online fashion, giving an immediate response as anomalies arise around
the AV, exploiting only the videos captured by a dash-mounted
camera. Our architecture, called MOVAD, relies on two main modules: a
Short-Term Memory Module to extract information related to the ongoing action,
implemented by a Video Swin Transformer (VST), and a Long-Term Memory Module
injected inside the classifier that also considers remote past information and
action context thanks to the use of a Long Short-Term Memory (LSTM) network.
The strengths of MOVAD are not only linked to its excellent performance, but
also to its straightforward and modular architecture, trained in an end-to-end
fashion with only RGB frames and as few assumptions as possible, which makes
it easy to implement and play with. We evaluated the performance of our method
on the Detection of Traffic Anomaly (DoTA) dataset, a challenging collection of
dash-mounted camera videos of accidents. After an extensive ablation study,
MOVAD reaches an AUC score of 82.17%, surpassing the current
state-of-the-art by +2.87 AUC. Our code will be available at
https://github.com/IMPLabUniPr/movad/tree/movad_vad
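To make the two-memory design described above concrete, the sketch below shows how a short-term clip encoder and an LSTM-based long-term classifier can be chained to score a dash-cam stream frame by frame. It is only an illustrative reading of the abstract, not the authors' implementation (which lives in the linked repository): the class names, feature sizes, window length, and the tiny 3D-conv encoder standing in for the Video Swin Transformer are all assumptions made for the sake of a self-contained example.

```python
# Illustrative sketch of a short-term / long-term memory pipeline for online
# video anomaly detection. NOT the MOVAD code: module names, dimensions, and
# the 3D-CNN stand-in for the Video Swin Transformer are assumptions.
import torch
import torch.nn as nn


class ShortTermMemory(nn.Module):
    """Encodes a short sliding window of recent RGB frames into one clip feature.

    In MOVAD this role is played by a Video Swin Transformer (VST); a tiny
    3D CNN is used here only to keep the sketch runnable.
    """

    def __init__(self, feat_dim: int = 256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, stride=(1, 2, 2), padding=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(32, feat_dim, kernel_size=3, stride=(1, 2, 2), padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d(1),          # -> (B, feat_dim, 1, 1, 1)
        )

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        # clip: (B, 3, T, H, W) -- only the last few frames seen so far
        return self.encoder(clip).flatten(1)  # (B, feat_dim)


class LongTermClassifier(nn.Module):
    """LSTM-based classifier that carries remote past context across clips."""

    def __init__(self, feat_dim: int = 256, hidden: int = 128):
        super().__init__()
        self.lstm = nn.LSTMCell(feat_dim, hidden)
        self.head = nn.Linear(hidden, 1)

    def forward(self, feat, state=None):
        h, c = self.lstm(feat, state)
        score = torch.sigmoid(self.head(h)).squeeze(-1)  # anomaly score in [0, 1]
        return score, (h, c)


def online_inference(frames, window: int = 4):
    """Scores a video stream frame by frame, keeping only a short frame buffer."""
    stm, ltm = ShortTermMemory().eval(), LongTermClassifier().eval()
    state, buffer, scores = None, [], []
    with torch.no_grad():
        for frame in frames:                              # frame: (3, H, W)
            buffer = (buffer + [frame])[-window:]         # short-term sliding window
            clip = torch.stack(buffer, dim=1).unsqueeze(0)  # (1, 3, T, H, W)
            score, state = ltm(stm(clip), state)          # long-term state persists
            scores.append(score.item())
    return scores


if __name__ == "__main__":
    dummy_stream = [torch.rand(3, 64, 64) for _ in range(8)]
    print(online_inference(dummy_stream))
```

The essential point the sketch tries to capture is that the LSTM hidden state is carried across steps, so each per-frame anomaly score reflects both the current short window of frames and the remote past, which is what allows the system to respond online as each new frame arrives.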
Related papers
- MoSAM: Motion-Guided Segment Anything Model with Spatial-Temporal Memory Selection [21.22536962888316]
We present MoSAM, incorporating two key strategies to integrate object motion cues into the model and establish more reliable feature memory. MoSAM achieves state-of-the-art results compared to other competitors.
arXiv Detail & Related papers (2025-04-30T02:19:31Z) - Scoring, Remember, and Reference: Catching Camouflaged Objects in Videos [24.03405963900272]
Video Camouflaged Object Detection aims to segment objects whose appearances closely resemble their surroundings.
Existing vision models often struggle in such scenarios due to the indistinguishable appearance of camouflaged objects.
We propose an end-to-end framework inspired by human memory-recognition.
arXiv Detail & Related papers (2025-03-21T11:08:14Z) - Online Episodic Memory Visual Query Localization with Egocentric Streaming Object Memory [17.956367558818076]
Episodic memory retrieval aims to equip wearable devices with the ability to recollect, from past video observations, objects or events that have been observed.
Current task formulations are based on the "offline" assumption that the full video history can be accessed when the user makes a query.
We introduce the novel task of Online Episodic Memory Visual Query Localization (OEM-VQL), in which models are required to work in an online fashion, observing video frames only once and relying on past computations to answer user queries.
arXiv Detail & Related papers (2024-11-25T21:07:25Z) - ReWind: Understanding Long Videos with Instructed Learnable Memory [8.002949551539297]
Vision-Language Models (VLMs) are crucial for applications requiring integrated understanding of textual and visual information.
We introduce ReWind, a novel memory-based VLM designed for efficient long video understanding while preserving temporal fidelity.
We empirically demonstrate ReWind's superior performance in visual question answering (VQA) and temporal grounding tasks, surpassing previous methods on long video benchmarks.
arXiv Detail & Related papers (2024-11-23T13:23:22Z) - Spatio-temporal Prompting Network for Robust Video Feature Extraction [74.54597668310707]
Frame quality deterioration is one of the main challenges in the field of video understanding.
Recent approaches exploit transformer-based integration modules to obtain spatio-temporal information.
We present a neat and unified framework called Spatio-Temporal Prompting Network (STPN).
It can efficiently extract video features by adjusting the input features in the network backbone.
arXiv Detail & Related papers (2024-02-04T17:52:04Z) - Multi-grained Temporal Prototype Learning for Few-shot Video Object Segmentation [156.4142424784322]
Few-Shot Video Object Segmentation (FSVOS) aims to segment objects in a query video with the same category defined by a few annotated support images.
We propose to leverage multi-grained temporal guidance information for handling the temporal correlation nature of video data.
Our proposed video IPMT model significantly outperforms previous models on two benchmark datasets.
arXiv Detail & Related papers (2023-09-20T09:16:34Z) - Recurrent Dynamic Embedding for Video Object Segmentation [54.52527157232795]
We propose a Recurrent Dynamic Embedding (RDE) to build a memory bank of constant size.
We propose an unbiased guidance loss during the training stage, which makes SAM more robust in long videos.
We also design a novel self-correction strategy so that the network can repair the embeddings of masks with different qualities in the memory bank.
arXiv Detail & Related papers (2022-05-08T02:24:43Z) - Adversarial Memory Networks for Action Prediction [95.09968654228372]
Action prediction aims to infer the forthcoming human action with partially-observed videos.
We propose adversarial memory networks (AMemNet) to generate the "full video" feature conditioning on a partial video query.
arXiv Detail & Related papers (2021-12-18T08:16:21Z) - MUNet: Motion Uncertainty-aware Semi-supervised Video Object Segmentation [31.100954335785026]
We advocate the return of the motion information and propose a motion uncertainty-aware framework (MUNet) for semi-supervised video object segmentation.
We introduce a motion-aware spatial attention module to effectively fuse the motion feature with the semantic feature.
We achieve 76.5% J&F using only DAVIS17 for training, which significantly outperforms the SOTA methods under the low-data protocol.
arXiv Detail & Related papers (2021-11-29T16:01:28Z) - Learning Comprehensive Motion Representation for Action Recognition [124.65403098534266]
2D CNN-based methods are efficient but may yield redundant features due to applying the same 2D convolution kernel to each frame.
Recent efforts attempt to capture motion information by establishing inter-frame connections while still suffering the limited temporal receptive field or high latency.
We propose a Channel-wise Motion Enhancement (CME) module to adaptively emphasize the channels related to dynamic information with a channel-wise gate vector.
We also propose a Spatial-wise Motion Enhancement (SME) module to focus on the regions with the critical target in motion, according to the point-to-point similarity between adjacent feature maps.
arXiv Detail & Related papers (2021-03-23T03:06:26Z) - Local Memory Attention for Fast Video Semantic Segmentation [157.7618884769969]
We propose a novel neural network module that transforms an existing single-frame semantic segmentation model into a video semantic segmentation pipeline.
Our approach aggregates a rich representation of the semantic information in past frames into a memory module.
We observe an improvement in segmentation performance on Cityscapes and CamVid by 1.7% and 2.1% in mIoU respectively, while increasing inference time of ERFNet by only 1.5ms.
arXiv Detail & Related papers (2021-01-05T18:57:09Z)