Defending Against Unforeseen Failure Modes with Latent Adversarial Training
- URL: http://arxiv.org/abs/2403.05030v4
- Date: Thu, 22 Aug 2024 00:24:50 GMT
- Title: Defending Against Unforeseen Failure Modes with Latent Adversarial Training
- Authors: Stephen Casper, Lennart Schulze, Oam Patel, Dylan Hadfield-Menell
- Abstract summary: Red-teaming and adversarial training (AT) are commonly used to improve robustness.
In this work, we utilize latent adversarial training (LAT) to defend against vulnerabilities without leveraging knowledge of what they are.
We show in image classification, text classification, and text generation tasks that LAT usually improves both robustness to novel attacks and performance on clean data relative to AT.
- Score: 7.141982906162117
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite extensive diagnostics and debugging by developers, AI systems sometimes exhibit harmful unintended behaviors. Finding and fixing these is challenging because the attack surface is so large -- it is not tractable to exhaustively search for inputs that may elicit harmful behaviors. Red-teaming and adversarial training (AT) are commonly used to improve robustness, however, they empirically struggle to fix failure modes that differ from the attacks used during training. In this work, we utilize latent adversarial training (LAT) to defend against vulnerabilities without leveraging knowledge of what they are or using inputs that elicit them. LAT makes use of the compressed, abstract, and structured latent representations of concepts that the network actually uses for prediction. Here, we use it to defend against failure modes without examples that elicit them. Specifically, we use LAT to remove trojans and defend against held-out classes of adversarial attacks. We show in image classification, text classification, and text generation tasks that LAT usually improves both robustness to novel attacks and performance on clean data relative to AT. This suggests that LAT can be a promising tool for defending against failure modes that are not explicitly identified by developers.
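The core mechanism can be illustrated with a minimal PyTorch-style sketch: split the network at a hidden layer, search for a norm-bounded perturbation of the latent activations that maximizes the task loss, then train on the perturbed forward pass. The L2 budget, step sizes, and layer split below are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn.functional as F

def latent_adversarial_step(encoder, head, x, y, eps=1.0, steps=5, lr=0.1):
    """One LAT training step: attack a latent activation, not the input.

    encoder: layers below the chosen split point
    head:    layers above it
    The perturbation budget `eps` is an L2 ball in latent space (an assumption;
    the paper considers different layers and norms).
    """
    with torch.no_grad():
        h = encoder(x)                      # clean latent activations
    delta = torch.zeros_like(h, requires_grad=True)

    # Inner loop: find a latent perturbation that maximizes the task loss.
    for _ in range(steps):
        loss = F.cross_entropy(head(h + delta), y)
        grad, = torch.autograd.grad(loss, delta)
        with torch.no_grad():
            delta += lr * grad
            # Project back onto the L2 ball of radius eps.
            norm = delta.flatten(1).norm(dim=1).clamp(min=1e-12)
            factor = (eps / norm).clamp(max=1.0)
            delta *= factor.view(-1, *[1] * (delta.dim() - 1))

    # Outer step: train both halves on the adversarially perturbed latent.
    return F.cross_entropy(head(encoder(x) + delta.detach()), y)
```

A training loop would call this once per batch and backpropagate the returned loss through both halves of the network.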
Related papers
- Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs [13.03032975937872]
Large language models (LLMs) can often be made to behave in undesirable ways that they were explicitly fine-tuned to avoid.
Recent work on red-teaming, model editing, and interpretability suggests that this challenge stems from how (adversarial) fine-tuning largely serves to suppress rather than remove undesirable capabilities.
arXiv Detail & Related papers (2024-07-22T11:19:14Z)
- Improving Alignment and Robustness with Circuit Breakers [40.4558948850276]
We present an approach that uses "circuit breakers" to interrupt models as they begin to produce harmful outputs.
As an alternative to refusal training and adversarial training, circuit-breaking directly controls the representations that are responsible for harmful outputs.
We extend our approach to AI agents, demonstrating considerable reductions in the rate of harmful actions when they are under attack.
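A loose sketch of this representation-control idea, under the assumption of a frozen reference copy of the model: push hidden states on harmful data away from the directions the original model used, while pinning hidden states on benign data to the reference. The cosine and L2 terms and all names below are assumptions, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def circuit_breaker_losses(model_hidden, frozen_hidden, is_harmful):
    """Representation-level control losses, sketched.

    model_hidden:  hidden states from the model being trained, shape (B, T, D)
    frozen_hidden: hidden states from a frozen copy of the original model
    is_harmful:    boolean mask over the batch, shape (B,)
    """
    cur = model_hidden[is_harmful]
    ref = frozen_hidden[is_harmful]
    # "Break the circuit": push harmful-data representations away from the
    # directions the original model used (drive cosine similarity toward <= 0).
    reroute = F.relu(F.cosine_similarity(cur, ref, dim=-1)).mean()

    keep_cur = model_hidden[~is_harmful]
    keep_ref = frozen_hidden[~is_harmful]
    # Retain term: keep benign-data representations close to the original model's.
    retain = (keep_cur - keep_ref).pow(2).mean()
    return reroute, retain
```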
arXiv Detail & Related papers (2024-06-06T17:57:04Z)
- Language Guided Adversarial Purification [3.9931474959554496]
Adversarial purification using generative models demonstrates strong adversarial defense performance.
We introduce a new framework, Language Guided Adversarial Purification (LGAP), which utilizes pre-trained diffusion models and caption generators.
arXiv Detail & Related papers (2023-09-19T06:17:18Z)
- RelaxLoss: Defending Membership Inference Attacks without Losing Utility [68.48117818874155]
We propose a novel training framework based on a relaxed loss with a more achievable learning target.
RelaxLoss is applicable to any classification model with added benefits of easy implementation and negligible overhead.
Our approach consistently outperforms state-of-the-art defense mechanisms in terms of resilience against MIAs.
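The relaxed-target idea can be sketched as follows: descend on the loss only while it remains above a target level alpha, and ascend gently once it drops below, so the training loss never collapses to the near-zero values that membership inference attacks exploit. This omits the posterior-flattening step and other details of the full method.

```python
import torch
import torch.nn.functional as F

def relaxed_loss_step(model, optimizer, x, y, alpha=1.0):
    """One training step with a relaxed loss target (simplified sketch)."""
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x), y)
    if loss.item() >= alpha:
        loss.backward()          # normal descent while above the target
    else:
        (-loss).backward()       # gentle ascent once below the target
    optimizer.step()
    return loss.item()
```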
arXiv Detail & Related papers (2022-07-12T19:34:47Z)
- Adversarial Visual Robustness by Causal Intervention [56.766342028800445]
Adversarial training is the de facto most promising defense against adversarial examples.
Yet, its passive nature inevitably prevents it from being immune to unknown attackers.
We provide a causal viewpoint of adversarial vulnerability: the cause is the confounder ubiquitously existing in learning.
arXiv Detail & Related papers (2021-06-17T14:23:54Z)
- Universal Adversarial Training with Class-Wise Perturbations [78.05383266222285]
Adversarial training is the most widely used method for defending against adversarial attacks.
In this work, we find that a universal adversarial perturbation (UAP) does not attack all classes equally.
We improve state-of-the-art universal adversarial training (UAT) by utilizing class-wise UAPs during adversarial training.
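A minimal sketch of universal adversarial training with class-wise perturbations: maintain one persistent perturbation per class, update each class's perturbation on its own samples, and train the model against the result. Budgets, step sizes, and the L-infinity projection are assumptions.

```python
import torch
import torch.nn.functional as F

def classwise_uap_step(model, optimizer, x, y, uaps, eps=8/255, step=1/255):
    """One universal-adversarial-training step with class-wise perturbations.

    uaps: tensor of shape (num_classes, C, H, W), one universal perturbation
    per class, persisted across batches.
    """
    delta = uaps[y].clone().requires_grad_(True)   # each sample's class UAP
    loss = F.cross_entropy(model((x + delta).clamp(0, 1)), y)
    grad, = torch.autograd.grad(loss, delta)

    with torch.no_grad():
        # Ascend each class's UAP using the averaged sign gradient of its samples.
        for c in y.unique():
            g = grad[y == c].mean(dim=0)
            uaps[c] = (uaps[c] + step * g.sign()).clamp(-eps, eps)

    # Train the model against the updated class-wise UAPs.
    optimizer.zero_grad()
    adv_loss = F.cross_entropy(model((x + uaps[y]).clamp(0, 1)), y)
    adv_loss.backward()
    optimizer.step()
    return adv_loss.item()
```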
arXiv Detail & Related papers (2021-04-07T09:05:49Z)
- Proper Network Interpretability Helps Adversarial Robustness in Classification [91.39031895064223]
We show that with a proper measurement of interpretation, it is difficult to prevent prediction-evasion adversarial attacks from causing interpretation discrepancy.
We develop an interpretability-aware defensive scheme built only on promoting robust interpretation.
We show that our defense achieves both robust classification and robust interpretation, outperforming state-of-the-art adversarial training methods against attacks of large perturbation.
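The idea of promoting robust interpretation can be sketched by penalizing how much a simple input-gradient saliency map changes between clean and perturbed inputs; the paper uses CAM-style interpretations and a different regularizer, so everything below is a stand-in.

```python
import torch
import torch.nn.functional as F

def saliency(model, x, y):
    """Input-gradient saliency for the true class (a simple stand-in for the
    CAM-style interpretations used in the paper)."""
    x = x.clone().requires_grad_(True)
    score = model(x).gather(1, y.view(-1, 1)).sum()
    grad, = torch.autograd.grad(score, x, create_graph=True)
    return grad.abs()

def interpretation_discrepancy_loss(model, x, x_adv, y, lam=1.0):
    """Classification loss plus a penalty on how much the interpretation
    changes between clean and perturbed inputs (the exact regularizer in the
    paper differs)."""
    task = F.cross_entropy(model(x_adv), y)
    disc = (saliency(model, x, y) - saliency(model, x_adv, y)).abs().mean()
    return task + lam * disc
```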
arXiv Detail & Related papers (2020-06-26T01:31:31Z)
- A Self-supervised Approach for Adversarial Robustness [105.88250594033053]
Adversarial examples can cause catastrophic mistakes in Deep Neural Network (DNN)-based vision systems.
This paper proposes a self-supervised adversarial training mechanism in the input space.
It provides significant robustness against unseen adversarial attacks.
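The label-free attack at the heart of this approach can be sketched as maximizing feature distortion: perturb the input so that its features move as far as possible from the clean features, with no class labels involved. Budgets and step sizes below are assumptions, and the purifier network trained against these perturbations is omitted.

```python
import torch

def self_supervised_perturbation(feature_extractor, x, eps=8/255, steps=10, step=2/255):
    """Craft a label-free adversarial example by maximizing feature distortion."""
    with torch.no_grad():
        clean_feats = feature_extractor(x)
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        distortion = (feature_extractor((x + delta).clamp(0, 1)) - clean_feats).pow(2).mean()
        grad, = torch.autograd.grad(distortion, delta)
        with torch.no_grad():
            delta += step * grad.sign()   # ascend the feature distortion
            delta.clamp_(-eps, eps)       # stay inside the L-inf budget
    return (x + delta.detach()).clamp(0, 1)
```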
arXiv Detail & Related papers (2020-06-08T20:42:39Z)
- Adversarial Feature Desensitization [12.401175943131268]
We propose a novel approach to adversarial robustness that builds on insights from the domain adaptation field.
Our method, called Adversarial Feature Desensitization (AFD), aims at learning features that are invariant towards adversarial perturbations of the inputs.
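A minimal sketch of the domain-adaptation framing: a discriminator learns to tell clean features from adversarial ones, and the feature extractor is trained both to classify correctly and to fool that discriminator, so adversarial features become indistinguishable from clean ones. Architectures, weights, and names are assumptions.

```python
import torch
import torch.nn.functional as F

def afd_losses(feats_clean, feats_adv, y, classifier, discriminator):
    """Adversarial Feature Desensitization losses, sketched.

    classifier:    maps features to task logits
    discriminator: maps features to a 2-way clean(0)-vs-adversarial(1) logit
    """
    # Task loss: the features should still support correct classification.
    task = F.cross_entropy(classifier(feats_clean), y) \
         + F.cross_entropy(classifier(feats_adv), y)

    # Discriminator loss: learn to tell clean features from adversarial ones.
    feats = torch.cat([feats_clean, feats_adv]).detach()
    domain = torch.cat([feats_clean.new_zeros(len(feats_clean)),
                        feats_adv.new_ones(len(feats_adv))]).long()
    d_loss = F.cross_entropy(discriminator(feats), domain)

    # Desensitization loss: train the feature extractor to fool the
    # discriminator, making adversarial features look clean.
    fool = F.cross_entropy(discriminator(feats_adv),
                           feats_adv.new_zeros(len(feats_adv)).long())
    return task, d_loss, fool
```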
arXiv Detail & Related papers (2020-06-08T14:20:02Z)
- Testing Robustness Against Unforeseen Adversaries [54.75108356391557]
Adversarial robustness research primarily focuses on L_p perturbations.
In real-world applications developers are unlikely to have access to the full range of attacks or corruptions their system will face.
We introduce ImageNet-UA, a framework for evaluating model robustness against a range of unforeseen adversaries.
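The evaluation pattern can be sketched as measuring accuracy under a suite of held-out corruptions and reporting the worst case; the toy corruptions below are stand-ins for the far stronger unforeseen attacks in ImageNet-UA.

```python
import torch

def gaussian_noise(x, sigma=0.1):
    return (x + sigma * torch.randn_like(x)).clamp(0, 1)

def brightness_shift(x, delta=0.3):
    return (x + delta).clamp(0, 1)

def channel_drop(x):
    out = x.clone()
    out[:, 0] = 0.0                      # zero out one color channel
    return out

UNFORESEEN = {"gaussian_noise": gaussian_noise,
              "brightness": brightness_shift,
              "channel_drop": channel_drop}

@torch.no_grad()
def unforeseen_robustness(model, loader, device="cpu"):
    """Report accuracy under each held-out corruption plus the worst case."""
    model.eval()
    acc = {}
    for name, attack in UNFORESEEN.items():
        correct, total = 0, 0
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            pred = model(attack(x)).argmax(dim=1)
            correct += (pred == y).sum().item()
            total += y.numel()
        acc[name] = correct / total
    acc["worst_case"] = min(acc.values())
    return acc
```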
arXiv Detail & Related papers (2019-08-21T17:36:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented (including this list) and is not responsible for any consequences arising from its use.