Related papers: Defending Against Unforeseen Failure Modes with Latent Adversarial Training

URL: http://arxiv.org/abs/2403.05030v3
Date: Mon, 1 Apr 2024 21:32:18 GMT
Title: Defending Against Unforeseen Failure Modes with Latent Adversarial Training
Authors: Stephen Casper, Lennart Schulze, Oam Patel, Dylan Hadfield-Menell
Abstract summary: Red-teaming and adversarial training (AT) are commonly used to improve robustness. In this work, we utilize latent adversarial training (LAT) to defend against vulnerabilities without generating inputs that elicit them.
Score: 7.141982906162117
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Despite extensive diagnostics and debugging by developers, AI systems sometimes exhibit harmful unintended behaviors. Finding and fixing these is challenging because the attack surface is so large -- it is not tractable to exhaustively search for inputs that may elicit harmful behaviors. Red-teaming and adversarial training (AT) are commonly used to improve robustness; however, they empirically struggle to fix failure modes that differ from the attacks used during training. In this work, we utilize latent adversarial training (LAT) to defend against vulnerabilities without generating inputs that elicit them. LAT leverages the compressed, abstract, and structured latent representations of concepts that the network actually uses for prediction. We use it to remove trojans and defend against held-out classes of adversarial attacks. We show in image classification, text classification, and text generation tasks that LAT usually improves both robustness to novel attacks and performance on clean data relative to AT. This suggests that LAT can be a promising tool for defending against failure modes that are not explicitly identified by developers.
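
The idea in the abstract, perturbing latent representations rather than inputs, can be illustrated with a toy sketch. This is not the authors' implementation: the two-layer tanh network, the choice of the hidden layer as the attacked latent, the L2 perturbation budget `eps`, and all function names (`latent_attack`, `train_lat`, `project_l2`) are assumptions made for illustration only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def project_l2(delta, eps):
    """Project each row of delta onto the L2 ball of radius eps."""
    norms = np.linalg.norm(delta, axis=1, keepdims=True)
    return delta * np.minimum(1.0, eps / np.maximum(norms, 1e-12))

def latent_attack(h, y, w2, b2, eps, steps=5, lr=0.5):
    """Inner maximization: projected gradient ascent on the binary
    cross-entropy loss with respect to a perturbation of the latent h."""
    delta = np.zeros_like(h)
    for _ in range(steps):
        p = sigmoid((h + delta) @ w2 + b2)
        grad = (p - y)[:, None] * w2[None, :]   # dLoss/d(delta), per example
        gnorm = np.linalg.norm(grad, axis=1, keepdims=True)
        step = lr * eps * grad / np.maximum(gnorm, 1e-12)
        delta = project_l2(delta + step, eps)
    return delta

def train_lat(X, y, hidden=8, eps=0.5, epochs=200, step=0.1, seed=0):
    """Outer minimization: train on adversarially perturbed latents.
    Hypothetical toy setup (assumption): x -> tanh(x W1 + b1) -> logit."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(0.0, 0.5, (X.shape[1], hidden))
    b1 = np.zeros(hidden)
    w2 = rng.normal(0.0, 0.5, hidden)
    b2 = 0.0
    n = len(y)
    for _ in range(epochs):
        h = np.tanh(X @ W1 + b1)                 # latent activations
        delta = latent_attack(h, y, w2, b2, eps) # treated as a constant below
        h_adv = h + delta
        p = sigmoid(h_adv @ w2 + b2)
        g_logit = (p - y) / n                    # gradient of mean BCE w.r.t. logits
        g_w2 = h_adv.T @ g_logit
        g_b2 = g_logit.sum()
        g_h = g_logit[:, None] * w2[None, :]     # flows back through h only
        g_pre = g_h * (1.0 - h ** 2)             # tanh derivative
        W1 -= step * (X.T @ g_pre)
        b1 -= step * g_pre.sum(axis=0)
        w2 -= step * g_w2
        b2 -= step * g_b2
    return W1, b1, w2, b2
```

Note that no adversarial input is ever constructed here: the attack acts only on the hidden activations, which is the property the abstract emphasizes. Which layer to perturb and how large to make `eps` are design choices this sketch does not settle.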

Related papers

Improving Large Language Model Safety with Contrastive Representation Learning [92.79965952162298]
Large Language Models (LLMs) are powerful tools with profound societal impacts. Their ability to generate responses to diverse and uncontrolled inputs leaves them vulnerable to adversarial attacks. We propose a defense framework that formulates model defense as a contrastive representation learning problem.
arXiv Detail & Related papers (2025-06-13T16:42:09Z)
Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs [13.03032975937872]
Large language models (LLMs) can often be made to behave in undesirable ways that they are explicitly fine-tuned not to. Recent work on red-teaming, model editing, and interpretability suggests that this challenge stems from how (adversarial) fine-tuning largely serves to suppress rather than remove undesirable capabilities.
arXiv Detail & Related papers (2024-07-22T11:19:14Z)
Improving Alignment and Robustness with Circuit Breakers [40.4558948850276]
We present an approach that interrupts the models as they respond with harmful outputs with "circuit breakers". As an alternative to refusal training and adversarial training, circuit-breaking directly controls the representations that are responsible for harmful outputs. We extend our approach to AI agents, demonstrating considerable reductions in the rate of harmful actions when they are under attack.
arXiv Detail & Related papers (2024-06-06T17:57:04Z)
Language Guided Adversarial Purification [3.9931474959554496]
Adversarial purification using generative models demonstrates strong adversarial defense performance. The paper introduces a new framework, Language Guided Adversarial Purification (LGAP), that uses pre-trained diffusion models and caption generators.
arXiv Detail & Related papers (2023-09-19T06:17:18Z)
Adversary Aware Continual Learning [3.3439097577935213]
An adversary can introduce a small amount of misinformation to the model to cause deliberate forgetting of a specific task or class at test time. We use the attacker's primary strength (hiding the backdoor pattern by making it imperceptible to humans) against it, and propose to learn a perceptible (stronger) pattern that can overpower the attacker's imperceptible pattern. We show that our proposed defensive framework considerably improves the performance of class incremental learning algorithms with no knowledge of the attacker's target task, attacker's target class, and attacker's imperceptible pattern.
arXiv Detail & Related papers (2023-04-27T19:49:50Z)
RelaxLoss: Defending Membership Inference Attacks without Losing Utility [68.48117818874155]
We propose a novel training framework based on a relaxed loss with a more achievable learning target. RelaxLoss is applicable to any classification model with added benefits of easy implementation and negligible overhead. Our approach consistently outperforms state-of-the-art defense mechanisms in terms of resilience against MIAs.
arXiv Detail & Related papers (2022-07-12T19:34:47Z)
Adversarial Visual Robustness by Causal Intervention [56.766342028800445]
Adversarial training is the de facto most promising defense against adversarial examples. Yet, its passive nature inevitably prevents it from being immune to unknown attackers. We provide a causal viewpoint of adversarial vulnerability: the cause is the confounder ubiquitously existing in learning.
arXiv Detail & Related papers (2021-06-17T14:23:54Z)
Universal Adversarial Training with Class-Wise Perturbations [78.05383266222285]
Adversarial training is the most widely used method for defending against adversarial attacks. In this work, we find that a universal adversarial perturbation (UAP) does not attack all classes equally. We improve the state-of-the-art (SOTA) universal adversarial training (UAT) by proposing to utilize class-wise UAPs during adversarial training.
arXiv Detail & Related papers (2021-04-07T09:05:49Z)
Proper Network Interpretability Helps Adversarial Robustness in Classification [91.39031895064223]
We show that with a proper measurement of interpretation, it is difficult to prevent prediction-evasion adversarial attacks from causing interpretation discrepancy. We develop an interpretability-aware defensive scheme built only on promoting robust interpretation. We show that our defense achieves both robust classification and robust interpretation, outperforming state-of-the-art adversarial training methods against attacks of large perturbation.
arXiv Detail & Related papers (2020-06-26T01:31:31Z)
A Self-supervised Approach for Adversarial Robustness [105.88250594033053]
Adversarial examples can cause catastrophic mistakes in Deep Neural Network (DNN) based vision systems. This paper proposes a self-supervised adversarial training mechanism in the input space. It provides significant robustness against unseen adversarial attacks.
arXiv Detail & Related papers (2020-06-08T20:42:39Z)
Adversarial Feature Desensitization [12.401175943131268]
We propose a novel approach to adversarial robustness, which builds upon the insights from the domain adaptation field. Our method, called Adversarial Feature Desensitization (AFD), aims at learning features that are invariant towards adversarial perturbations of the inputs.
arXiv Detail & Related papers (2020-06-08T14:20:02Z)
Testing Robustness Against Unforeseen Adversaries [54.75108356391557]
Adversarial robustness research primarily focuses on L_p perturbations. In real-world applications, developers are unlikely to have access to the full range of attacks or corruptions their system will face. We introduce ImageNet-UA, a framework for evaluating model robustness against a range of unforeseen adversaries.
arXiv Detail & Related papers (2019-08-21T17:36:48Z)

This list is automatically generated from the titles and abstracts of the papers on this site.