Latent Adversarial Training Improves the Representation of Refusal
- URL: http://arxiv.org/abs/2504.18872v1
- Date: Sat, 26 Apr 2025 09:40:31 GMT
- Title: Latent Adversarial Training Improves the Representation of Refusal
- Authors: Alexandra Abbas, Nora Petrova, Helios Ael Lyons, Natalia Perez-Campanero,
- Abstract summary: We show how Latent Adversarial Training (LAT) reorganizes the refusal behavior in the model's latent space.<n>LAT significantly alters the refusal representation, concentrating it in the first two SVD components.<n>Our findings suggest that LAT's training perturbations enable a more comprehensive representation of refusal behavior.
- Score: 41.94295877935867
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent work has shown that language models' refusal behavior is primarily encoded in a single direction in their latent space, making it vulnerable to targeted attacks. Although Latent Adversarial Training (LAT) attempts to improve robustness by introducing noise during training, a key question remains: How does this noise-based training affect the underlying representation of refusal behavior? Understanding this encoding is crucial for evaluating LAT's effectiveness and limitations, just as the discovery of linear refusal directions revealed vulnerabilities in traditional supervised safety fine-tuning (SSFT). Through the analysis of Llama 2 7B, we examine how LAT reorganizes the refusal behavior in the model's latent space compared to SSFT and embedding space adversarial training (AT). By computing activation differences between harmful and harmless instruction pairs and applying Singular Value Decomposition (SVD), we find that LAT significantly alters the refusal representation, concentrating it in the first two SVD components which explain approximately 75 percent of the activation differences variance - significantly higher than in reference models. This concentrated representation leads to more effective and transferable refusal vectors for ablation attacks: LAT models show improved robustness when attacked with vectors from reference models but become more vulnerable to self-generated vectors compared to SSFT and AT. Our findings suggest that LAT's training perturbations enable a more comprehensive representation of refusal behavior, highlighting both its potential strengths and vulnerabilities for improving model safety.
Related papers
- Improving LLM Safety Alignment with Dual-Objective Optimization [65.41451412400609]
Existing training-time safety alignment techniques for large language models (LLMs) remain vulnerable to jailbreak attacks.<n>We propose an improved safety alignment that disentangles DPO objectives into two components: (1) robust refusal training, which encourages refusal even when partial unsafe generations are produced, and (2) targeted unlearning of harmful knowledge.
arXiv Detail & Related papers (2025-03-05T18:01:05Z) - Mitigating Adversarial Perturbations for Deep Reinforcement Learning via Vector Quantization [18.56608399174564]
Well-performing reinforcement learning (RL) agents often lack resilience against adversarial perturbations during deployment.
This highlights the importance of building a robust agent before deploying it in the real world.
In this work, we study an input transformation-based defense for RL.
arXiv Detail & Related papers (2024-10-04T12:41:54Z) - Perturbation-Invariant Adversarial Training for Neural Ranking Models:
Improving the Effectiveness-Robustness Trade-Off [107.35833747750446]
adversarial examples can be crafted by adding imperceptible perturbations to legitimate documents.
This vulnerability raises significant concerns about their reliability and hinders the widespread deployment of NRMs.
In this study, we establish theoretical guarantees regarding the effectiveness-robustness trade-off in NRMs.
arXiv Detail & Related papers (2023-12-16T05:38:39Z) - Learn from the Past: A Proxy Guided Adversarial Defense Framework with
Self Distillation Regularization [53.04697800214848]
Adversarial Training (AT) is pivotal in fortifying the robustness of deep learning models.
AT methods, relying on direct iterative updates for target model's defense, frequently encounter obstacles such as unstable training and catastrophic overfitting.
We present a general proxy guided defense framework, LAST' (bf Learn from the Pbf ast)
arXiv Detail & Related papers (2023-10-19T13:13:41Z) - PEARL: Preprocessing Enhanced Adversarial Robust Learning of Image
Deraining for Semantic Segmentation [42.911517493220664]
We present the first attempt to improve the robustness of semantic segmentation tasks by simultaneously handling different types of degradation factors.
Our approach effectively handles both rain streaks and adversarial perturbation by transferring the robustness of the segmentation model to the image derain model.
As opposed to the commonly used Negative Adversarial Attack (NAA), we design the Auxiliary Mirror Attack (AMA) to introduce positive information prior to the training of the PEARL framework.
arXiv Detail & Related papers (2023-05-25T04:44:17Z) - Consistent Valid Physically-Realizable Adversarial Attack against
Crowd-flow Prediction Models [4.286570387250455]
deep learning (DL) models can effectively learn city-wide crowd-flow patterns.
DL models have been known to perform poorly on inconspicuous adversarial perturbations.
arXiv Detail & Related papers (2023-03-05T13:30:25Z) - Improving Adversarial Robustness to Sensitivity and Invariance Attacks
with Deep Metric Learning [80.21709045433096]
A standard method in adversarial robustness assumes a framework to defend against samples crafted by minimally perturbing a sample.
We use metric learning to frame adversarial regularization as an optimal transport problem.
Our preliminary results indicate that regularizing over invariant perturbations in our framework improves both invariant and sensitivity defense.
arXiv Detail & Related papers (2022-11-04T13:54:02Z) - Enhancing Adversarial Training with Feature Separability [52.39305978984573]
We introduce a new concept of adversarial training graph (ATG) with which the proposed adversarial training with feature separability (ATFS) enables to boost the intra-class feature similarity and increase inter-class feature variance.
Through comprehensive experiments, we demonstrate that the proposed ATFS framework significantly improves both clean and robust performance.
arXiv Detail & Related papers (2022-05-02T04:04:23Z) - Adversarial Momentum-Contrastive Pre-Training [20.336258934272102]
Adversarial self-supervised pre-training is helpful to extract the invariant representations under both data augmentations and adversarial perturbations.
This paper proposes a novel adversarial momentum-contrastive (AMOC) pre-training approach.
Compared with the existing self-supervised pre-training approaches, AMOC can use a smaller batch size and fewer training epochs but learn more robust features.
arXiv Detail & Related papers (2020-12-24T07:49:10Z) - Robust Pre-Training by Adversarial Contrastive Learning [120.33706897927391]
Recent work has shown that, when integrated with adversarial training, self-supervised pre-training can lead to state-of-the-art robustness.
We improve robustness-aware self-supervised pre-training by learning representations consistent under both data augmentations and adversarial perturbations.
arXiv Detail & Related papers (2020-10-26T04:44:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.