Enhancing Diffusion-based Unrestricted Adversarial Attacks via Adversary Preferences Alignment
- URL: http://arxiv.org/abs/2506.01511v1
- Date: Mon, 02 Jun 2025 10:18:09 GMT
- Title: Enhancing Diffusion-based Unrestricted Adversarial Attacks via Adversary Preferences Alignment
- Authors: Kaixun Jiang, Zhaoyu Chen, Haijing Guo, Jinglun Li, Jiyuan Fu, Pinxue Guo, Hao Tang, Bo Li, Wenqiang Zhang
- Abstract summary: APA (Adversary Preferences Alignment) is a two-stage framework that decouples conflicting preferences and optimizes each with differentiable rewards. APA achieves significantly better attack transferability while maintaining high visual consistency, inspiring further research to approach adversarial attacks from an alignment perspective.
- Score: 26.95607772298534
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Preference alignment in diffusion models has primarily focused on benign human preferences (e.g., aesthetic). In this paper, we propose a novel perspective: framing unrestricted adversarial example generation as a problem of aligning with adversary preferences. Unlike benign alignment, adversarial alignment involves two inherently conflicting preferences: visual consistency and attack effectiveness, which often lead to unstable optimization and reward hacking (e.g., reducing visual quality to improve attack success). To address this, we propose APA (Adversary Preferences Alignment), a two-stage framework that decouples conflicting preferences and optimizes each with differentiable rewards. In the first stage, APA fine-tunes LoRA to improve visual consistency using rule-based similarity reward. In the second stage, APA updates either the image latent or prompt embedding based on feedback from a substitute classifier, guided by trajectory-level and step-wise rewards. To enhance black-box transferability, we further incorporate a diffusion augmentation strategy. Experiments demonstrate that APA achieves significantly better attack transferability while maintaining high visual consistency, inspiring further research to approach adversarial attacks from an alignment perspective. Code will be available at https://github.com/deep-kaixun/APA.
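As a rough illustration of the second APA stage described in the abstract, the sketch below performs gradient-guided updates of an image latent against a substitute classifier while penalizing drift from the clean image. It collapses the paper's trajectory-level and step-wise rewards into a single differentiable objective for brevity; the helper names (`decode_latent`, `substitute_clf`), step count, and weighting are hypothetical stand-ins, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def align_latent(latent, target_class, decode_latent, substitute_clf,
                 steps=50, lr=0.01, consistency_weight=1.0):
    """Sketch of adversary-preference alignment on an image latent.

    decode_latent: maps a latent to an image tensor (e.g., a frozen
        diffusion decoder) -- hypothetical helper, not the paper's API.
    substitute_clf: white-box surrogate classifier providing attack feedback.
    """
    reference = decode_latent(latent).detach()  # anchor for visual consistency
    latent = latent.clone().requires_grad_(True)
    opt = torch.optim.Adam([latent], lr=lr)

    for _ in range(steps):
        image = decode_latent(latent)
        logits = substitute_clf(image)
        # Attack preference: push the surrogate's prediction toward the target.
        attack_loss = F.cross_entropy(logits, target_class)
        # Consistency preference: keep the decoded image near the clean reference.
        consistency_loss = F.mse_loss(image, reference)
        loss = attack_loss + consistency_weight * consistency_loss
        opt.zero_grad()
        loss.backward()
        opt.step()
    return latent.detach()
```

In the paper itself, visual consistency is handled separately in stage one by fine-tuning LoRA with a rule-based similarity reward; the sketch only conveys the structure of the two conflicting preferences.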
Related papers
- Make the Most of Everything: Further Considerations on Disrupting Diffusion-based Customization [11.704329867109237]
We propose Dual Anti-Diffusion (DADiff), a two-stage adversarial attack targeting diffusion customization. Experimental results on various mainstream facial datasets demonstrate 10%-30% improvements in cross-prompt, keyword-mismatch, cross-model, and cross-mechanism anti-customization.
arXiv Detail & Related papers (2025-03-18T06:22:03Z)
- Two Heads Are Better Than One: Averaging along Fine-Tuning to Improve Targeted Transferability [20.46894437876869]
Fine-tuning an adversarial example (AE) in feature space can efficiently boost its targeted transferability. Existing fine-tuning schemes only utilize the endpoint and ignore the valuable information in the fine-tuning trajectory. We propose averaging over the fine-tuning trajectory to pull the crafted AE towards a more centered region (a minimal sketch of this averaging idea appears after this list).
arXiv Detail & Related papers (2024-12-30T09:01:27Z)
- Token Preference Optimization with Self-Calibrated Visual-Anchored Rewards for Hallucination Mitigation [29.667702981248205]
We propose a novel Token Preference Optimization model with self-calibrated rewards (dubbed TPO). Specifically, we introduce a token-level visual-anchored reward, defined as the difference between the logistic distributions of generated tokens conditioned on the raw image and on a corrupted one. To highlight the informative visual-anchored tokens, a visual-aware training objective is proposed to enable more accurate token-level optimization.
arXiv Detail & Related papers (2024-12-19T03:21:01Z)
- Query-Efficient Video Adversarial Attack with Stylized Logo [17.268709979991996]
Video classification systems based on Deep Neural Networks (DNNs) are highly vulnerable to adversarial examples.
We propose a novel black-box video attack framework, called Stylized Logo Attack (SLA).
SLA proceeds in three steps. The first step builds a style reference set for logos, which not only makes the generated examples look more natural but also carries more target-class features in targeted attacks.
arXiv Detail & Related papers (2024-08-22T03:19:09Z)
- Improving Adversarial Robustness via Decoupled Visual Representation Masking [65.73203518658224]
In this paper, we highlight two novel properties of robust features from the feature distribution perspective.
We find that state-of-the-art defense methods do not address both of these issues well.
Specifically, we propose a simple but effective defense based on decoupled visual representation masking.
arXiv Detail & Related papers (2024-06-16T13:29:41Z)
- A Dense Reward View on Aligning Text-to-Image Diffusion with Preference [54.43177605637759]
We propose a tractable alignment objective that emphasizes the initial steps of the T2I reverse chain.
In experiments on single and multiple prompt generation, our method is competitive with strong relevant baselines.
arXiv Detail & Related papers (2024-02-13T07:37:24Z)
- Preference Poisoning Attacks on Reward Model Learning [47.00395978031771]
We investigate the nature and extent of a vulnerability in learning reward models from pairwise comparisons.
We propose two classes of algorithmic approaches for these attacks: a gradient-based framework, and several variants of rank-by-distance methods.
We find that the best attacks are often highly successful, achieving, in the most extreme case, a 100% success rate with only 0.3% of the data poisoned.
arXiv Detail & Related papers (2024-02-02T21:45:24Z)
- Mutual-modality Adversarial Attack with Semantic Perturbation [81.66172089175346]
We propose a novel approach that generates adversarial attacks in a mutual-modality optimization scheme.
Our approach outperforms state-of-the-art attack methods and can be readily deployed as a plug-and-play solution.
arXiv Detail & Related papers (2023-12-20T05:06:01Z)
- Improving Adversarial Transferability via Intermediate-level Perturbation Decay [79.07074710460012]
We develop a novel intermediate-level method that crafts adversarial examples within a single stage of optimization.
Experimental results show that it outperforms the state of the art by large margins in attacking various victim models.
arXiv Detail & Related papers (2023-04-26T09:49:55Z)
- Logit Margin Matters: Improving Transferable Targeted Adversarial Attack by Logit Calibration [85.71545080119026]
The Cross-Entropy (CE) loss function is insufficient for learning transferable targeted adversarial examples.
We propose two simple and effective logit calibration methods, achieved by downscaling the logits with a temperature factor and by an adaptive margin (a temperature-scaling sketch appears after this list).
Experiments conducted on the ImageNet dataset validate the effectiveness of the proposed methods.
arXiv Detail & Related papers (2023-03-07T06:42:52Z)
- Transferable Sparse Adversarial Attack [62.134905824604104]
We introduce a generator architecture to alleviate the overfitting issue and thus efficiently craft transferable sparse adversarial examples.
Our method achieves superior inference speed, 700× faster than other optimization-based methods.
arXiv Detail & Related papers (2021-05-31T06:44:58Z)
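As a minimal sketch of the trajectory-averaging idea from "Two Heads Are Better Than One" above: rather than keeping only the endpoint of fine-tuning, the crafted AE is averaged across the fine-tuning iterates. The `grad_step` callback is an illustrative assumption standing in for one feature-space fine-tuning update, not the paper's exact procedure.

```python
import torch

def finetune_with_averaging(x_adv, grad_step, iters=10):
    """Average the adversarial example over its fine-tuning trajectory
    instead of returning only the endpoint (illustrative sketch)."""
    trajectory = []
    x = x_adv.clone()
    for _ in range(iters):
        x = grad_step(x)              # one fine-tuning update on the AE
        trajectory.append(x.clone())
    # The mean of the iterates pulls the AE toward a more centered region.
    return torch.stack(trajectory).mean(dim=0)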
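And a minimal sketch of temperature-based logit calibration from "Logit Margin Matters" above: downscaling the logits before the CE loss keeps the gradient from saturating once the target class dominates. The temperature value here is an illustrative choice, not the paper's tuned setting.

```python
import torch
import torch.nn.functional as F

def calibrated_targeted_loss(logits, target, temperature=5.0):
    """Targeted-attack CE loss on temperature-downscaled logits.

    Dividing by T > 1 flattens the softmax, so the loss keeps widening
    the target class's logit margin instead of vanishing early
    (sketch of the calibration idea, not the paper's exact formulation).
    Usage: loss = calibrated_targeted_loss(model(x_adv), target_labels)
    """
    return F.cross_entropy(logits / temperature, target)
```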