Revisiting the Robust Alignment of Circuit Breakers
- URL: http://arxiv.org/abs/2407.15902v2
- Date: Fri, 2 Aug 2024 12:02:47 GMT
- Title: Revisiting the Robust Alignment of Circuit Breakers
- Authors: Leo Schwinn, Simon Geisler
- Abstract summary: We show that the robustness claims of "Improving Alignment and Robustness with Circuit Breakers" may be overestimated.
Specifically, we demonstrate that by implementing a few simple changes to embedding space attacks, we achieve 100% attack success rate (ASR) against circuit breaker models.
- Score: 10.852294343899487
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Over the past decade, adversarial training has emerged as one of the few reliable methods for enhancing model robustness against adversarial attacks [Szegedy et al., 2014, Madry et al., 2018, Xhonneux et al., 2024], while many alternative approaches have failed to withstand rigorous subsequent evaluations. Recently, an alternative defense mechanism, namely "circuit breakers" [Zou et al., 2024], has shown promising results for aligning LLMs. In this report, we show that the robustness claims of "Improving Alignment and Robustness with Circuit Breakers" [Zou et al., 2024] against unconstrained continuous attacks in the embedding space of the input tokens may be overestimated. Specifically, we demonstrate that by implementing a few simple changes to embedding space attacks [Schwinn et al., 2024a,b], we achieve 100% attack success rate (ASR) against circuit breaker models. Without conducting any further hyperparameter tuning, these adjustments increase the ASR by more than 80% compared to the original evaluation. Code is accessible at: https://github.com/SchwinnL/circuit-breakers-eval
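For intuition, here is a minimal PyTorch sketch of a generic embedding-space attack of the kind evaluated in the report: a continuous perturbation of the prompt's input embeddings is optimized by gradient descent to maximize the likelihood of a fixed harmful target continuation. This is an illustrative reconstruction under common Hugging Face conventions, not the authors' implementation (see the linked repository for that); all names and hyperparameters are placeholders.

    import torch

    def embedding_attack(model, prompt_ids, target_ids, steps=100, lr=1e-3):
        # Continuous attack: optimize the prompt embeddings directly,
        # bypassing the discrete token vocabulary entirely.
        emb = model.get_input_embeddings()
        prompt_emb = emb(prompt_ids).detach()
        target_emb = emb(target_ids).detach()
        delta = torch.zeros_like(prompt_emb, requires_grad=True)
        optim = torch.optim.Adam([delta], lr=lr)
        n = target_ids.shape[1]
        for _ in range(steps):
            inputs = torch.cat([prompt_emb + delta, target_emb], dim=1)
            logits = model(inputs_embeds=inputs).logits
            # Cross-entropy of the target tokens given the perturbed prompt;
            # token i is predicted from the logits at position i - 1.
            loss = torch.nn.functional.cross_entropy(
                logits[:, -n - 1:-1].flatten(0, 1), target_ids.flatten()
            )
            optim.zero_grad()
            loss.backward()
            optim.step()
        return (prompt_emb + delta).detach()

Because the perturbation lives in embedding space, it is unconstrained in the sense used above: no projection back onto valid token embeddings is performed.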
Related papers
- Regularized Robustly Reliable Learners and Instance Targeted Attacks [11.435833538081557]
Balcan et al. (2022) proposed addressing the challenge of instance-targeted attacks by defining a notion of robustly-reliable learners.
We show that, at least in certain interesting cases, we can design algorithms that produce their outputs in time sublinear in training time.
arXiv Detail & Related papers (2024-10-14T14:49:32Z)
- MOAT: Securely Mitigating Rowhammer with Per-Row Activation Counters [0.3580891736370874]
DDR5 specifications have been extended to support Per-Row Activation Counting (PRAC), with counters inlined with each row, and ALERT-Back-Off (ABO) to stop the memory controller if the DRAM needs more time to mitigate.
Although PRAC+ABO represents a strong advance in Rowhammer protection, it is only a framework, and the actual security depends on the implementation.
We propose MOAT, a provably secure design, which uses two internal thresholds: ETH, an "Eligibility Threshold" for mitigating a row, and ATH, an "ALERT Threshold" for initiating an ALERT-Back-Off (ABO).
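To make the two-threshold policy concrete, a toy Python simulation of the logic described above might look as follows; the threshold values are placeholders, not the paper's parameters.

    # Toy model of MOAT's two internal thresholds (values illustrative).
    ETH = 256  # Eligibility Threshold: row becomes a mitigation candidate
    ATH = 512  # ALERT Threshold: DRAM raises ALERT-Back-Off (ABO)

    counters = {}  # per-row activation counters, kept in-DRAM under PRAC

    def on_activation(row):
        counters[row] = counters.get(row, 0) + 1
        if counters[row] >= ATH:
            return "ALERT"     # ask the memory controller to back off
        if counters[row] >= ETH:
            return "ELIGIBLE"  # mitigate (refresh neighbors) when possible
        return "OK"

    def mitigate(row):
        counters[row] = 0      # counter reset after the row is mitigated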
arXiv Detail & Related papers (2024-07-13T20:28:02Z)
- Towards Robust Domain Generation Algorithm Classification [1.4542411354617986]
We implement 32 white-box attacks, 19 of which are very effective and induce a false-negative rate (FNR) of approximately 100% on unhardened classifiers.
We propose a novel training scheme that leverages adversarial latent space vectors and discretized adversarial domains to significantly improve robustness.
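As a rough sketch of what training on adversarial latent-space vectors can look like (the paper's exact scheme, including the discretization back to valid domain strings, is not reproduced here), a PGD-style perturbation can be applied to the encoder's latent output before classification:

    import torch

    def latent_pgd(encoder, classifier, x, y, eps=0.1, alpha=0.02, steps=5):
        # Perturb the latent representation, not the raw domain string.
        z = encoder(x).detach()
        delta = torch.zeros_like(z, requires_grad=True)
        for _ in range(steps):
            loss = torch.nn.functional.cross_entropy(classifier(z + delta), y)
            loss.backward()
            with torch.no_grad():
                delta += alpha * delta.grad.sign()  # gradient ascent step
                delta.clamp_(-eps, eps)             # stay in the l_inf ball
            delta.grad.zero_()
        return (z + delta).detach()

The classifier is then trained on these perturbed latents alongside clean ones, the usual adversarial-training recipe.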
arXiv Detail & Related papers (2024-04-09T11:56:29Z)
- FLIP: A Provable Defense Framework for Backdoor Mitigation in Federated Learning [66.56240101249803]
We study how hardening benign clients can affect the global model (and the malicious clients).
We propose a trigger reverse-engineering based defense and show that our method achieves improvement with guaranteed robustness.
Our results against eight competing SOTA defense methods show the empirical superiority of our method on both single-shot and continuous FL backdoor attacks.
arXiv Detail & Related papers (2022-10-23T22:24:03Z)
- Sparse and Imperceptible Adversarial Attack via a Homotopy Algorithm [93.80082636284922]
Sparse adversarial attacks can fool deep neural networks (DNNs) by perturbing only a few pixels.
Recent efforts combine sparsity with an additional l_infty constraint on the perturbation magnitudes.
We propose a homotopy algorithm to jointly tackle the sparsity constraint and the perturbation bound in one unified framework.
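For intuition, the joint constraint can be sketched as a projection that keeps only the k largest-magnitude entries of the perturbation (the sparsity side) and clamps them to eps (the l_infty bound); the homotopy schedule itself, which gradually tightens the sparsity penalty, is omitted here and the function name is illustrative:

    import torch

    def project_sparse_linf(delta, k, eps):
        # Keep the k most important coordinates, zero the rest (l_0),
        # then clamp the surviving magnitudes (l_infty).
        flat = delta.flatten()
        mask = torch.zeros_like(flat)
        mask[flat.abs().topk(k).indices] = 1.0
        return (flat * mask).clamp(-eps, eps).view_as(delta)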
arXiv Detail & Related papers (2021-06-10T20:11:36Z)
- Adaptive Feature Alignment for Adversarial Training [56.17654691470554]
CNNs are typically vulnerable to adversarial attacks, which pose a threat to security-sensitive applications.
We propose adaptive feature alignment (AFA) to generate features of arbitrary attacking strengths.
Our method is trained to automatically align features across arbitrary attacking strengths.
arXiv Detail & Related papers (2021-05-31T17:01:05Z)
- Adversarial Robustness by Design through Analog Computing and Synthetic Gradients [80.60080084042666]
We propose a new defense mechanism against adversarial attacks inspired by an optical co-processor.
In the white-box setting, our defense works by obfuscating the parameters of the random projection.
We find the combination of a random projection and binarization in the optical system also improves robustness against various types of black-box attacks.
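The core transform, as summarized, amounts to a fixed random projection followed by binarization; in the paper the projection is realized physically by the optical co-processor, whose parameters stay hidden from a white-box attacker, whereas this toy version just uses an explicit random matrix (dimensions illustrative):

    import torch

    d_in, d_out = 784, 1024
    R = torch.randn(d_out, d_in)  # stands in for the (hidden) optical projection

    def project_and_binarize(x):
        # x: (batch, d_in) -> binary codes; the sign-like nonlinearity has
        # zero gradient almost everywhere, motivating synthetic gradients
        # for training the layers above it.
        return (x @ R.t() > 0).float()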
arXiv Detail & Related papers (2021-01-06T16:15:29Z)
- A Self-supervised Approach for Adversarial Robustness [105.88250594033053]
Adversarial examples can cause catastrophic mistakes in Deep Neural Network (DNN) based vision systems.
This paper proposes a self-supervised adversarial training mechanism in the input space.
It provides significant robustness against unseen adversarial attacks.
arXiv Detail & Related papers (2020-06-08T20:42:39Z)
- Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks [65.20660287833537]
In this paper we propose two extensions of the PGD attack that overcome failures due to suboptimal step size and problems of the objective function.
We then combine our novel attacks with two complementary existing ones to form a parameter-free, computationally affordable and user-independent ensemble of attacks to test adversarial robustness.
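Operationally, such an ensemble reduces robustness evaluation to a per-sample worst case across attacks: an input counts as robust only if every attack fails on it. A sketch, with the attacks as opaque callables rather than the paper's specific four:

    import torch

    def robust_accuracy(model, attacks, x, y):
        correct = torch.ones(len(x), dtype=torch.bool)
        for attack in attacks:
            x_adv = attack(model, x, y)     # adversarial examples from one attack
            pred = model(x_adv).argmax(dim=1)
            correct &= pred.eq(y).cpu()     # worst case: all attacks must fail
        return correct.float().mean().item()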
arXiv Detail & Related papers (2020-03-03T18:15:55Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences of its use.