LocalStyleFool: Regional Video Style Transfer Attack Using Segment Anything Model
- URL: http://arxiv.org/abs/2403.11656v2
- Date: Wed, 27 Mar 2024 09:34:44 GMT
- Title: LocalStyleFool: Regional Video Style Transfer Attack Using Segment Anything Model
- Authors: Yuxin Cao, Jinghao Li, Xi Xiao, Derui Wang, Minhui Xue, Hao Ge, Wei Liu, Guangwu Hu,
- Abstract summary: LocalStyleFool is an improved black-box video adversarial attack that superimposes regional style-transfer-based perturbations on videos.
We demonstrate that LocalStyleFool can improve both intra-frame and inter-frame naturalness through a human-assessed survey.
- Score: 19.37714374680383
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Previous work has shown that well-crafted adversarial perturbations can threaten the security of video recognition systems. Attackers can invade such models with a low query budget when the perturbations are semantic-invariant, such as StyleFool. Despite the query efficiency, the naturalness of the minutia areas still requires amelioration, since StyleFool leverages style transfer to all pixels in each frame. To close the gap, we propose LocalStyleFool, an improved black-box video adversarial attack that superimposes regional style-transfer-based perturbations on videos. Benefiting from the popularity and scalably usability of Segment Anything Model (SAM), we first extract different regions according to semantic information and then track them through the video stream to maintain the temporal consistency. Then, we add style-transfer-based perturbations to several regions selected based on the associative criterion of transfer-based gradient information and regional area. Perturbation fine adjustment is followed to make stylized videos adversarial. We demonstrate that LocalStyleFool can improve both intra-frame and inter-frame naturalness through a human-assessed survey, while maintaining competitive fooling rate and query efficiency. Successful experiments on the high-resolution dataset also showcase that scrupulous segmentation of SAM helps to improve the scalability of adversarial attacks under high-resolution data.
Related papers
- RegionRoute: Regional Style Transfer with Diffusion Model [31.189878461660115]
We propose an attention-supervised diffusion framework that teaches the model where to apply a given style by aligning the attention scores of style tokens with object masks during training.<n>A modular LoRA-MoE design further enables efficient and scalable multi-style adaptation.<n> Experiments show that our method achieves mask-free, single-object style transfer at inference.
arXiv Detail & Related papers (2026-02-22T16:11:07Z) - SAGE: Style-Adaptive Generalization for Privacy-Constrained Semantic Segmentation Across Domains [13.393232074517387]
textbfSAGE improves the generalization of frozen models under privacy constraints.<n>We first utilize style transfer to construct a diverse style representation of the source domain.<n>Then, the model adaptively fuses these style cues according to the visual context of each input, forming a dynamic prompt.
arXiv Detail & Related papers (2025-12-02T03:20:22Z) - Towards Robust and Generalizable Continuous Space-Time Video Super-Resolution with Events [71.2439653098351]
Continuous space-time video super-STVSR has garnered increasing interest for its capability to reconstruct high-resolution and high-frame-rate videos at arbitrary temporal scales.<n>We present EvEnhancer, a novel approach that marries unique properties of high temporal and high dynamic range encapsulated in event streams.<n>Our method achieves state-of-the-art performance on both synthetic and real-world datasets, while maintaining generalizability at OOD scales.
arXiv Detail & Related papers (2025-10-04T15:23:07Z) - MagicTryOn: Harnessing Diffusion Transformer for Garment-Preserving Video Virtual Try-on [16.0505428363005]
We propose MagicTryOn, a video virtual try-on framework built upon the large-scale video diffusion Transformer.<n>We replace the U-Net architecture with a diffusion Transformer and combine full self-attention to model the garment consistency of videos.<n>Our method outperforms existing SOTA methods in comprehensive evaluations and generalizes to in-the-wild scenarios.
arXiv Detail & Related papers (2025-05-27T15:22:02Z) - SVasP: Self-Versatility Adversarial Style Perturbation for Cross-Domain Few-Shot Learning [21.588320570295835]
Cross-Domain Few-Shot Learning aims to transfer knowledge from seen source domains to unseen target domains.
Recent studies focus on utilizing visual styles to bridge the domain gap between different domains.
This paper proposes a novel crop-global style method, called underlinetextbfSelf-underlinetextbfVersatility.
arXiv Detail & Related papers (2024-12-12T08:58:42Z) - UniVST: A Unified Framework for Training-free Localized Video Style Transfer [66.69471376934034]
This paper presents UniVST, a unified framework for localized video style transfer.
It operates without the need for training, offering a distinct advantage over existing methods that transfer style across entire videos.
arXiv Detail & Related papers (2024-10-26T05:28:02Z) - Boosting Adversarial Transferability with Learnable Patch-wise Masks [16.46210182214551]
Adversarial examples have attracted widespread attention in security-critical applications because of their transferability across different models.
In this paper, we argue that the model-specific discriminative regions are a key factor causing overfitting to the source model, and thus reducing the transferability to the target model.
To accurately localize these regions, we present a learnable approach to automatically optimize the mask.
arXiv Detail & Related papers (2023-06-28T05:32:22Z) - A Unified Framework for Event-based Frame Interpolation with Ad-hoc Deblurring in the Wild [72.0226493284814]
We propose a unified framework for event-based frame that performs deblurring ad-hoc.
Our network consistently outperforms previous state-of-the-art methods on frame, single image deblurring, and the joint task of both.
arXiv Detail & Related papers (2023-01-12T18:19:00Z) - Intra-Source Style Augmentation for Improved Domain Generalization [21.591831983223997]
We propose an intra-source style augmentation (ISSA) method to improve domain generalization in semantic segmentation.
ISSA is model-agnostic and straightforwardly applicable with CNNs and Transformers.
It is also complementary to other domain generalization techniques, e.g., it improves the recent state-of-the-art solution RobustNet by $3%$ mIoU in Cityscapes to Dark Z"urich.
arXiv Detail & Related papers (2022-10-18T21:33:25Z) - Enhancing the Self-Universality for Transferable Targeted Attacks [88.6081640779354]
Our new attack method is proposed based on the observation that highly universal adversarial perturbations tend to be more transferable for targeted attacks.
Instead of optimizing the perturbations on different images, optimizing on different regions to achieve self-universality can get rid of using extra data.
With the feature similarity loss, our method makes the features from adversarial perturbations to be more dominant than that of benign images.
arXiv Detail & Related papers (2022-09-08T11:21:26Z) - Adversarial Style Augmentation for Domain Generalized Urban-Scene
Segmentation [120.96012935286913]
We propose a novel adversarial style augmentation approach, which can generate hard stylized images during training.
Experiments on two synthetic-to-real semantic segmentation benchmarks demonstrate that AdvStyle can significantly improve the model performance on unseen real domains.
arXiv Detail & Related papers (2022-07-11T14:01:25Z) - StyleFool: Fooling Video Classification Systems via Style Transfer [28.19682215735232]
StyleFool is a black-box video adversarial attack via style transfer to fool the video classification system.
StyleFool outperforms the state-of-the-art adversarial attacks in terms of the number of queries and the robustness against existing defenses.
arXiv Detail & Related papers (2022-03-30T02:18:16Z) - Video Frame Interpolation Transformer [86.20646863821908]
We propose a Transformer-based video framework that allows content-aware aggregation weights and considers long-range dependencies with the self-attention operations.
To avoid the high computational cost of global self-attention, we introduce the concept of local attention into video.
In addition, we develop a multi-scale frame scheme to fully realize the potential of Transformers.
arXiv Detail & Related papers (2021-11-27T05:35:10Z) - Video Salient Object Detection via Contrastive Features and Attention
Modules [106.33219760012048]
We propose a network with attention modules to learn contrastive features for video salient object detection.
A co-attention formulation is utilized to combine the low-level and high-level features.
We show that the proposed method requires less computation, and performs favorably against the state-of-the-art approaches.
arXiv Detail & Related papers (2021-11-03T17:40:32Z) - Wide and Narrow: Video Prediction from Context and Motion [54.21624227408727]
We propose a new framework to integrate these complementary attributes to predict complex pixel dynamics through deep networks.
We present global context propagation networks that aggregate the non-local neighboring representations to preserve the contextual information over the past frames.
We also devise local filter memory networks that generate adaptive filter kernels by storing the motion of moving objects in the memory.
arXiv Detail & Related papers (2021-10-22T04:35:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.