Enhancing Diffusion-based Unrestricted Adversarial Attacks via Adversary Preferences Alignment
- URL: http://arxiv.org/abs/2506.01511v1
- Date: Mon, 02 Jun 2025 10:18:09 GMT
- Title: Enhancing Diffusion-based Unrestricted Adversarial Attacks via Adversary Preferences Alignment
- Authors: Kaixun Jiang, Zhaoyu Chen, Haijing Guo, Jinglun Li, Jiyuan Fu, Pinxue Guo, Hao Tang, Bo Li, Wenqiang Zhang
- Abstract summary: APA (Adversary Preferences Alignment) is a two-stage framework that decouples conflicting preferences and optimizes each with differentiable rewards. APA achieves significantly better attack transferability while maintaining high visual consistency, inspiring further research to approach adversarial attacks from an alignment perspective.
- Score: 26.95607772298534
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Preference alignment in diffusion models has primarily focused on benign human preferences (e.g., aesthetic). In this paper, we propose a novel perspective: framing unrestricted adversarial example generation as a problem of aligning with adversary preferences. Unlike benign alignment, adversarial alignment involves two inherently conflicting preferences: visual consistency and attack effectiveness, which often lead to unstable optimization and reward hacking (e.g., reducing visual quality to improve attack success). To address this, we propose APA (Adversary Preferences Alignment), a two-stage framework that decouples conflicting preferences and optimizes each with differentiable rewards. In the first stage, APA fine-tunes LoRA to improve visual consistency using rule-based similarity reward. In the second stage, APA updates either the image latent or prompt embedding based on feedback from a substitute classifier, guided by trajectory-level and step-wise rewards. To enhance black-box transferability, we further incorporate a diffusion augmentation strategy. Experiments demonstrate that APA achieves significantly better attack transferability while maintaining high visual consistency, inspiring further research to approach adversarial attacks from an alignment perspective. Code will be available at https://github.com/deep-kaixun/APA.
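As a rough illustration of the second APA stage described in the abstract, the sketch below performs gradient-guided updates of an image latent against a substitute classifier while penalizing drift from the clean image. It collapses the paper's trajectory-level and step-wise rewards into a single differentiable objective for brevity; the helper names (`decode_latent`, `substitute_clf`), step count, and weighting are hypothetical stand-ins, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def align_latent(latent, target_class, decode_latent, substitute_clf,
                 steps=50, lr=0.01, consistency_weight=1.0):
    """Sketch of adversary-preference alignment on an image latent.

    decode_latent: maps a latent to an image tensor (e.g., a frozen
        diffusion decoder) -- hypothetical helper, not the paper's API.
    substitute_clf: white-box surrogate classifier providing attack feedback.
    """
    reference = decode_latent(latent).detach()  # anchor for visual consistency
    latent = latent.clone().requires_grad_(True)
    opt = torch.optim.Adam([latent], lr=lr)

    for _ in range(steps):
        image = decode_latent(latent)
        logits = substitute_clf(image)
        # Attack preference: push the surrogate's prediction toward the target.
        attack_loss = F.cross_entropy(logits, target_class)
        # Consistency preference: keep the decoded image near the clean reference.
        consistency_loss = F.mse_loss(image, reference)
        loss = attack_loss + consistency_weight * consistency_loss
        opt.zero_grad()
        loss.backward()
        opt.step()
    return latent.detach()
```

In the paper itself, visual consistency is handled separately in stage one by fine-tuning LoRA with a rule-based similarity reward; the sketch only conveys the structure of the two conflicting preferences.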
Related papers
- Make the Most of Everything: Further Considerations on Disrupting Diffusion-based Customization [11.704329867109237]
We propose Dual Anti-Diffusion (DADiff), a two-stage adversarial attack targeting diffusion customization. Experimental results on various mainstream facial datasets demonstrate 10%-30% improvements in cross-prompt, keyword-mismatch, cross-model, and cross-mechanism anti-customization.
arXiv Detail & Related papers (2025-03-18T06:22:03Z)
- Two Heads Are Better Than One: Averaging along Fine-Tuning to Improve Targeted Transferability [20.46894437876869]
Fine-tuning an adversarial example (AE) in feature space can efficiently boost its targeted transferability. Existing fine-tuning schemes only utilize the endpoint and ignore the valuable information in the fine-tuning trajectory. We propose averaging over the fine-tuning trajectory to pull the crafted AE towards a more centered region (a minimal sketch of this averaging idea appears after this list).
arXiv Detail & Related papers (2024-12-30T09:01:27Z)
- Token Preference Optimization with Self-Calibrated Visual-Anchored Rewards for Hallucination Mitigation [29.667702981248205]
We propose a novel Token Preference Optimization model with self-calibrated rewards (dubbed TPO). Specifically, we introduce a token-level visual-anchored reward, defined as the difference between the logistic distributions of generated tokens conditioned on the raw image and on a corrupted one. To highlight the informative visual-anchored tokens, a visual-aware training objective is proposed to enable more accurate token-level optimization.
arXiv Detail & Related papers (2024-12-19T03:21:01Z)
- Query-Efficient Video Adversarial Attack with Stylized Logo [17.268709979991996]
Video classification systems based on Deep Neural Networks (DNNs) are highly vulnerable to adversarial examples.
We propose a novel black-box video attack framework, called Stylized Logo Attack (SLA).
SLA proceeds in three steps. The first step builds a style reference set for logos, which not only makes the generated examples look more natural but also carries more target-class features in targeted attacks.
arXiv Detail & Related papers (2024-08-22T03:19:09Z)
- Improving Adversarial Robustness via Decoupled Visual Representation Masking [65.73203518658224]
In this paper, we highlight two novel properties of robust features from the feature distribution perspective.
We find that state-of-the-art defense methods do not address both of these issues well.
Specifically, we propose a simple but effective defense based on decoupled visual representation masking.
arXiv Detail & Related papers (2024-06-16T13:29:41Z)
- A Dense Reward View on Aligning Text-to-Image Diffusion with Preference [54.43177605637759]
We propose a tractable alignment objective that emphasizes the initial steps of the T2I reverse chain.
In experiments on single and multiple prompt generation, our method is competitive with strong relevant baselines.
arXiv Detail & Related papers (2024-02-13T07:37:24Z)
- Preference Poisoning Attacks on Reward Model Learning [47.00395978031771]
We investigate the nature and extent of a vulnerability in learning reward models from pairwise comparisons.
We propose two classes of algorithmic approaches for these attacks: a gradient-based framework, and several variants of rank-by-distance methods.
We find that the best attacks are often highly successful, achieving, in the most extreme case, a 100% success rate with only 0.3% of the data poisoned.
arXiv Detail & Related papers (2024-02-02T21:45:24Z)
- Mutual-modality Adversarial Attack with Semantic Perturbation [81.66172089175346]
We propose a novel approach that generates adversarial attacks in a mutual-modality optimization scheme.
Our approach outperforms state-of-the-art attack methods and can be readily deployed as a plug-and-play solution.
arXiv Detail & Related papers (2023-12-20T05:06:01Z)
- Improving Adversarial Transferability via Intermediate-level Perturbation Decay [79.07074710460012]
We develop a novel intermediate-level method that crafts adversarial examples within a single stage of optimization.
Experimental results show that it outperforms the state of the art by large margins in attacking various victim models.
arXiv Detail & Related papers (2023-04-26T09:49:55Z)
- Logit Margin Matters: Improving Transferable Targeted Adversarial Attack by Logit Calibration [85.71545080119026]
The Cross-Entropy (CE) loss function is insufficient for learning transferable targeted adversarial examples.
We propose two simple and effective logit calibration methods, achieved by downscaling the logits with a temperature factor and by an adaptive margin (a temperature-scaling sketch appears after this list).
Experiments conducted on the ImageNet dataset validate the effectiveness of the proposed methods.
arXiv Detail & Related papers (2023-03-07T06:42:52Z)
- Transferable Sparse Adversarial Attack [62.134905824604104]
We introduce a generator architecture to alleviate the overfitting issue and thus efficiently craft transferable sparse adversarial examples.
Our method achieves superior inference speed, 700× faster than other optimization-based methods.
arXiv Detail & Related papers (2021-05-31T06:44:58Z)
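As a minimal sketch of the trajectory-averaging idea from "Two Heads Are Better Than One" above: rather than keeping only the endpoint of fine-tuning, the crafted AE is averaged across the fine-tuning iterates. The `grad_step` callback is an illustrative assumption standing in for one feature-space fine-tuning update, not the paper's exact procedure.

```python
import torch

def finetune_with_averaging(x_adv, grad_step, iters=10):
    """Average the adversarial example over its fine-tuning trajectory
    instead of returning only the endpoint (illustrative sketch)."""
    trajectory = []
    x = x_adv.clone()
    for _ in range(iters):
        x = grad_step(x)              # one fine-tuning update on the AE
        trajectory.append(x.clone())
    # The mean of the iterates pulls the AE toward a more centered region.
    return torch.stack(trajectory).mean(dim=0)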
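And a minimal sketch of temperature-based logit calibration from "Logit Margin Matters" above: downscaling the logits before the CE loss keeps the gradient from saturating once the target class dominates. The temperature value here is an illustrative choice, not the paper's tuned setting.

```python
import torch
import torch.nn.functional as F

def calibrated_targeted_loss(logits, target, temperature=5.0):
    """Targeted-attack CE loss on temperature-downscaled logits.

    Dividing by T > 1 flattens the softmax, so the loss keeps widening
    the target class's logit margin instead of vanishing early
    (sketch of the calibration idea, not the paper's exact formulation).
    Usage: loss = calibrated_targeted_loss(model(x_adv), target_labels)
    """
    return F.cross_entropy(logits / temperature, target)
```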