Efficient Robustness Assessment via Adversarial Spatial-Temporal Focus
on Videos
- URL: http://arxiv.org/abs/2301.00896v2
- Date: Mon, 27 Mar 2023 01:57:56 GMT
- Title: Efficient Robustness Assessment via Adversarial Spatial-Temporal Focus
on Videos
- Authors: Wei Xingxing and Wang Songping and Yan Huanqian
- Abstract summary: We design the novel Adversarial spatial-temporal Focus (AstFocus) attack on videos, which performs attacks on the simultaneously focused key frames and key regions.
By continuously querying, the reduced searching space composed of key frames and key regions becomes increasingly precise.
Experiments on four mainstream video recognition models and three widely used action recognition datasets demonstrate that the proposed AstFocus attack outperforms the SOTA methods.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Adversarial robustness assessment for video recognition models has raised
concerns owing to their wide applications in safety-critical tasks. Compared
with images, videos have a much higher dimensionality, which incurs huge computational
costs when generating adversarial videos. This is especially serious for the
query-based black-box attacks where gradient estimation for the threat models
is usually utilized, and high dimensions will lead to a large number of
queries. To mitigate this issue, we propose to simultaneously eliminate the
temporal and spatial redundancy within the video to achieve an effective and
efficient gradient estimation on the reduced searching space, so that the
number of queries decreases. To implement this idea, we design the novel Adversarial
spatial-temporal Focus (AstFocus) attack on videos, which performs attacks on
the simultaneously focused key frames and key regions from the inter-frames and
intra-frames in the video. AstFocus attack is based on the cooperative
Multi-Agent Reinforcement Learning (MARL) framework. One agent is responsible
for selecting key frames, and another agent is responsible for selecting key
regions. These two agents are jointly trained by the common rewards received
from the black-box threat models to perform a cooperative prediction. By
continuously querying, the reduced searching space composed of key frames and
key regions becomes increasingly precise, and the total query number becomes less than
that on the original video. Extensive experiments on four mainstream video
recognition models and three widely used action recognition datasets
demonstrate that the proposed AstFocus attack outperforms the SOTA methods,
being simultaneously superior in fooling rate, query number, time, and perturbation
magnitude.
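The abstract's core mechanism can be illustrated with a minimal sketch: two cooperating agents, one holding per-frame selection probabilities (temporal focus) and one holding per-region selection probabilities (spatial focus), sample binary masks, restrict the perturbation to the focused frames and regions, query the black-box model once, and update both policies with the shared reward (the confidence drop). This is a toy illustration, not the authors' implementation; `black_box_score` is a hypothetical stand-in for querying the real threat model, and the simple probability update replaces the paper's full MARL training.

```python
import numpy as np

rng = np.random.default_rng(0)

def black_box_score(video, true_label):
    # Hypothetical stand-in for one query to the black-box threat model:
    # returns the confidence on the true label (lower = better attack).
    # Toy surrogate: confidence decays with total perturbation energy.
    return float(np.exp(-np.abs(video - 0.5).sum() / video.size))

T, H, W = 16, 8, 8                  # frames, height, width (toy sizes)
video = np.full((T, H, W), 0.5)     # toy "clean" video
true_label = 0

# Agent 1: per-frame selection probabilities (temporal focus).
frame_probs = np.full(T, 0.5)
# Agent 2: per-patch selection probabilities on a 2x2 spatial grid.
patch_probs = np.full((2, 2), 0.5)

lr = 0.1
baseline = black_box_score(video, true_label)

for step in range(200):
    # Each agent samples a binary focus mask from its current policy.
    frame_mask = rng.random(T) < frame_probs
    patch_mask = rng.random((2, 2)) < patch_probs

    # Restrict the perturbation to focused frames x focused regions.
    noise = rng.normal(0.0, 0.1, size=video.shape)
    region = np.kron(patch_mask, np.ones((H // 2, W // 2)))
    noise *= frame_mask[:, None, None] * region[None, :, :]

    # One query; the shared (cooperative) reward is the confidence drop.
    score = black_box_score(video + noise, true_label)
    reward = baseline - score

    # Both agents move their policies toward masks that earned reward.
    frame_probs += lr * reward * (frame_mask - frame_probs)
    patch_probs += lr * reward * (patch_mask - patch_probs)
    frame_probs = frame_probs.clip(0.05, 0.95)
    patch_probs = patch_probs.clip(0.05, 0.95)
```

Because the reward is shared, frames and regions whose masks co-occur with large confidence drops are sampled more often, which is the query-saving intuition behind the joint spatial-temporal focus.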
Related papers
- Adversarial Attacks on Video Object Segmentation with Hard Region
Discovery [31.882369005280793]
Video object segmentation has been applied to various computer vision tasks, such as video editing, autonomous driving, and human-robot interaction.
Deep neural networks are vulnerable to adversarial examples, which are the inputs attacked by almost human-imperceptible perturbations.
This raises security concerns in such demanding tasks, because small perturbations to the input video can pose real attack risks.
arXiv Detail & Related papers (2023-09-25T03:52:15Z) - Inter-frame Accelerate Attack against Video Interpolation Models [73.28751441626754]
We apply adversarial attacks to VIF models and find that the VIF models are very vulnerable to adversarial examples.
We propose a novel attack method named Inter-frame Accelerate Attack (IAA), which accelerates the iterations by reusing the perturbation of the previous adjacent frame.
It is shown that our method can improve attack efficiency greatly while achieving comparable attack performance with traditional methods.
arXiv Detail & Related papers (2023-05-11T03:08:48Z) - Efficient Decision-based Black-box Patch Attacks on Video Recognition [33.5640770588839]
This work first explores decision-based patch attacks on video models.
To achieve a query-efficient attack, we propose a spatial-temporal differential evolution framework.
STDE has demonstrated state-of-the-art performance in terms of threat, efficiency and imperceptibility.
arXiv Detail & Related papers (2023-03-21T15:08:35Z) - Deep Unsupervised Key Frame Extraction for Efficient Video
Classification [63.25852915237032]
This work presents an unsupervised method to retrieve the key frames, which combines a Convolutional Neural Network (CNN) and Temporal Segment Density Peaks Clustering (TSDPC).
The proposed TSDPC is a generic and powerful framework with two advantages over previous works; one is that it can calculate the number of key frames automatically.
Furthermore, a Long Short-Term Memory network (LSTM) is added on the top of the CNN to further elevate the performance of classification.
arXiv Detail & Related papers (2022-11-12T20:45:35Z) - E^2TAD: An Energy-Efficient Tracking-based Action Detector [78.90585878925545]
This paper presents a tracking-based solution to accurately and efficiently localize predefined key actions.
It won first place in the UAV-Video Track of the 2021 Low-Power Computer Vision Challenge (LPCVC).
arXiv Detail & Related papers (2022-04-09T07:52:11Z) - Fast Online Video Super-Resolution with Deformable Attention Pyramid [172.16491820970646]
Video super-resolution (VSR) has many applications that pose strict causal, real-time, and latency constraints, including video streaming and TV.
We propose a recurrent VSR architecture based on a deformable attention pyramid (DAP).
arXiv Detail & Related papers (2022-02-03T17:49:04Z) - Video Salient Object Detection via Contrastive Features and Attention
Modules [106.33219760012048]
We propose a network with attention modules to learn contrastive features for video salient object detection.
A co-attention formulation is utilized to combine the low-level and high-level features.
We show that the proposed method requires less computation, and performs favorably against the state-of-the-art approaches.
arXiv Detail & Related papers (2021-11-03T17:40:32Z) - Event and Activity Recognition in Video Surveillance for Cyber-Physical
Systems [0.0]
We show that long-term motion patterns alone play a pivotal role in the task of recognizing an event.
Only the temporal features are exploited using a hybrid Convolutional Neural Network (CNN) + Recurrent Neural Network (RNN) architecture.
arXiv Detail & Related papers (2021-11-03T08:30:38Z) - Attacking Video Recognition Models with Bullet-Screen Comments [79.53159486470858]
We introduce a novel adversarial attack, which attacks video recognition models with bullet-screen comment (BSC) attacks.
BSCs can be regarded as a kind of meaningful patch; adding one to a clean video will not affect people's understanding of the video content, nor will it arouse people's suspicion.
arXiv Detail & Related papers (2021-10-29T08:55:50Z) - Reinforcement Learning Based Sparse Black-box Adversarial Attack on
Video Recognition Models [3.029434408969759]
Black-box adversarial attacks are only performed on selected key regions and key frames.
We propose a reinforcement learning based frame selection strategy to speed up the attack process.
A range of empirical results on real datasets demonstrate the effectiveness and efficiency of the proposed method.
arXiv Detail & Related papers (2021-08-29T12:22:40Z) - Sparse Black-box Video Attack with Reinforcement Learning [14.624074868199287]
We formulate the black-box video attacks into a Reinforcement Learning framework.
The environment in RL is set as the recognition model, and the agent in RL plays the role of frame selection.
We conduct a series of experiments with two mainstream video recognition models.
arXiv Detail & Related papers (2020-01-11T14:09:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.