The Blessing and Curse of Dimensionality in Safety Alignment
- URL: http://arxiv.org/abs/2507.20333v1
- Date: Sun, 27 Jul 2025 15:51:23 GMT
- Title: The Blessing and Curse of Dimensionality in Safety Alignment
- Authors: Rachel S. Y. Teo, Laziz U. Abdullaev, Tan M. Nguyen
- Abstract summary: We show that the curse of high-dimensional representations uniquely impacts large language models (LLMs). We also show that projecting the representations of the model onto a lower-dimensional subspace can preserve sufficient information for alignment while avoiding those linear structures.
- Score: 1.9224072957714322
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The focus on safety alignment in large language models (LLMs) has increased significantly due to their widespread adoption across different domains. The scale of LLMs plays a contributing role in their success, and growth in parameter count comes with larger hidden dimensions. In this paper, we hypothesize that while the increase in dimensions has been a key advantage, it may lead to emergent problems as well. These problems emerge because the linear structures in the activation space can be exploited, in the form of activation engineering, to circumvent a model's safety alignment. Through detailed visualizations of linear subspaces associated with different concepts, such as safety, across various model scales, we show that the curse of high-dimensional representations uniquely impacts LLMs. Further substantiating our claim, we demonstrate that projecting the representations of the model onto a lower-dimensional subspace can preserve sufficient information for alignment while avoiding those linear structures. Empirical results confirm that such dimensionality reduction significantly reduces susceptibility to jailbreaking through representation engineering. Building on our empirical validations, we provide theoretical insights into these linear jailbreaking methods relative to a model's hidden dimensions. Broadly speaking, our work posits that the high dimensions of a model's internal representations can be both a blessing and a curse in safety alignment.
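As a rough illustration of the projection idea, the sketch below (in PyTorch, with made-up dimensions and a random orthonormal subspace rather than whatever subspace the paper actually uses) restricts activations to a lower-dimensional subspace, so that only the component of a full-dimensional steering vector lying inside that subspace survives.

```python
import torch

# Minimal sketch, not the paper's implementation: restrict hidden states to a
# lower-dimensional subspace so that only the component of a steering vector
# lying inside the subspace survives. The subspace here is a random orthonormal
# basis and all dimensions are toy values.

def make_projection(hidden_dim: int, target_dim: int, seed: int = 0) -> torch.Tensor:
    """Return a (hidden_dim, target_dim) matrix with orthonormal columns."""
    g = torch.Generator().manual_seed(seed)
    a = torch.randn(hidden_dim, target_dim, generator=g)
    q, _ = torch.linalg.qr(a)  # orthonormal basis of a random target_dim subspace
    return q

def restrict_to_subspace(h: torch.Tensor, proj: torch.Tensor) -> torch.Tensor:
    """Project activations h (..., hidden_dim) onto the subspace and map back,
    discarding any component orthogonal to the subspace."""
    return (h @ proj) @ proj.T

# Toy demonstration: a "jailbreak" steering vector added in the full space is
# largely removed once activations are restricted to the subspace.
hidden_dim, target_dim = 4096, 256
proj = make_projection(hidden_dim, target_dim)
h = torch.randn(8, hidden_dim)
steer = torch.randn(hidden_dim)
steered = h + 5.0 * steer
shift_full = torch.norm(steered - h)
shift_projected = torch.norm(restrict_to_subspace(steered, proj) - restrict_to_subspace(h, proj))
print(f"shift in full space: {shift_full.item():.1f}, after projection: {shift_projected.item():.1f}")
```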
Related papers
- Large Language Models Encode Semantics in Low-Dimensional Linear Subspaces [31.401762286885656]
Understanding the space geometry of large language models (LLMs) is key to interpreting their behavior and improving alignment. We investigate to what extent LLMs internally organize representations related to semantic understanding.
arXiv Detail & Related papers (2025-07-13T17:03:25Z) - Re-Emergent Misalignment: How Narrow Fine-Tuning Erodes Safety Alignment in LLMs [0.0]
We show that fine-tuning on insecure code induces internal changes that oppose alignment. We identify a shared latent dimension in the model's activation space that governs alignment behavior.
arXiv Detail & Related papers (2025-07-04T15:36:58Z) - Does Representation Intervention Really Identify Desired Concepts and Elicit Alignment? [73.80382983108997]
Representation intervention aims to locate and modify the representations that encode the underlying concepts in Large Language Models. If the interventions are faithful, the intervened LLMs should erase the harmful concepts and be robust to both in-distribution adversarial prompts and out-of-distribution jailbreaks. We propose Concept Concentration (COCA), which simplifies the decision boundary between harmful and benign representations.
arXiv Detail & Related papers (2025-05-24T12:23:52Z) - Do We Really Need Curated Malicious Data for Safety Alignment in Multi-modal Large Language Models? [83.53005932513155]
Multi-modal large language models (MLLMs) have made significant progress, yet their safety alignment remains limited. We propose fine-tuning MLLMs on a small set of benign instruction-following data with responses replaced by simple, clear rejection sentences.
arXiv Detail & Related papers (2025-04-14T09:03:51Z) - Fundamental Safety-Capability Trade-offs in Fine-tuning Large Language Models [92.38300626647342]
Fine-tuning Large Language Models (LLMs) on task-specific datasets has been a primary use of LLMs. This paper presents a theoretical framework for understanding the interplay between safety and capability in two primary safety-aware LLM fine-tuning strategies.
arXiv Detail & Related papers (2025-03-24T20:41:57Z) - The Hidden Dimensions of LLM Alignment: A Multi-Dimensional Analysis of Orthogonal Safety Directions [20.522881564776434]
We find that safety-aligned behavior is jointly controlled by multi-dimensional directions. By studying directions in this space, we first find that a dominant direction governs the model's refusal behavior. We then measure how different directions promote or suppress the dominant direction.
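One common way to make such direction analyses concrete (an illustrative construction, not necessarily the procedure used in this paper) is to take the dominant refusal direction as a difference of mean activations and score other candidate directions by their cosine similarity with it:

```python
import torch
import torch.nn.functional as F

# Hedged sketch: extract a dominant "refusal" direction as the difference of mean
# activations between harmful and harmless prompts, then summarize how other
# candidate directions relate to it via cosine similarity.

def dominant_direction(h_harmful: torch.Tensor, h_harmless: torch.Tensor) -> torch.Tensor:
    """h_* : (num_prompts, hidden_dim) activations collected at some layer."""
    d = h_harmful.mean(dim=0) - h_harmless.mean(dim=0)
    return F.normalize(d, dim=0)

def alignment_with_dominant(directions: torch.Tensor, dominant: torch.Tensor) -> torch.Tensor:
    """Cosine similarity of each candidate direction (rows of `directions`) with the
    dominant one; positive values suggest promotion, negative values suppression."""
    return F.normalize(directions, dim=1) @ dominant
```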
arXiv Detail & Related papers (2025-02-13T06:39:22Z) - Preventing Dimensional Collapse in Self-Supervised Learning via Orthogonality Regularization [9.823816643319448]
Self-supervised learning (SSL) has rapidly advanced in recent years, approaching the performance of its supervised counterparts.
Dimensional collapse, where a few large eigenvalues dominate the eigenspace, poses a significant obstacle for SSL.
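A minimal sketch of both ideas, assuming batched feature vectors and a generic projection weight matrix (the paper's exact regularizer may differ): the covariance spectrum diagnoses collapse, and a soft orthogonality penalty discourages it.

```python
import torch

# Hedged sketch: diagnose dimensional collapse from the eigenvalue spectrum of the
# feature covariance, and discourage it with a soft orthogonality penalty.

def covariance_spectrum(features: torch.Tensor) -> torch.Tensor:
    """features: (batch, dim). Returns covariance eigenvalues in descending order;
    a few dominant values indicate collapse."""
    z = features - features.mean(dim=0, keepdim=True)
    cov = z.T @ z / (features.shape[0] - 1)
    return torch.linalg.eigvalsh(cov).flip(dims=[0])

def soft_orthogonality_penalty(w: torch.Tensor) -> torch.Tensor:
    """w: (dim_in, dim_out) weight matrix. Penalize deviation of W^T W from the
    identity so that columns stay near-orthonormal."""
    wtw = w.T @ w
    eye = torch.eye(wtw.shape[0], device=w.device, dtype=w.dtype)
    return ((wtw - eye) ** 2).sum()
```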
arXiv Detail & Related papers (2024-11-01T06:39:18Z) - Characterizing Truthfulness in Large Language Model Generations with Local Intrinsic Dimension [63.330262740414646]
We study how to characterize and predict the truthfulness of texts generated from large language models (LLMs).
We suggest investigating internal activations and quantifying LLM's truthfulness using the local intrinsic dimension (LID) of model activations.
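As a hedged sketch of how LID could be estimated from activations, the snippet below implements the standard Levina-Bickel maximum-likelihood estimator over k nearest neighbors; the paper's specific estimator, layer choice, and value of k may differ.

```python
import torch

# Hedged sketch: maximum-likelihood local intrinsic dimension (Levina & Bickel, 2005)
# estimated per point from distances to its k nearest neighbors.

def lid_mle(activations: torch.Tensor, k: int = 20) -> torch.Tensor:
    """activations: (n, dim) vectors. Returns a per-point LID estimate of shape (n,)."""
    dists = torch.cdist(activations, activations)  # (n, n) pairwise Euclidean distances
    dists_sorted, _ = torch.sort(dists, dim=1)     # ascending per row
    knn = dists_sorted[:, 1:k + 1]                 # drop the zero self-distance at index 0
    r_max = knn[:, -1:]                            # distance to the k-th neighbor
    log_ratios = torch.log(r_max / knn[:, :-1])    # (n, k-1) log distance ratios
    return log_ratios.mean(dim=1).reciprocal()     # MLE: inverse mean log ratio
```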
arXiv Detail & Related papers (2024-02-28T04:56:21Z) - On the Embedding Collapse when Scaling up Recommendation Models [53.66285358088788]
We identify the embedding collapse phenomenon as the inhibition of scalability, wherein the embedding matrix tends to occupy a low-dimensional subspace.
We propose a simple yet effective multi-embedding design incorporating embedding-set-specific interaction modules to learn embedding sets with large diversity.
arXiv Detail & Related papers (2023-10-06T17:50:38Z) - Visual Adversarial Examples Jailbreak Aligned Large Language Models [66.53468356460365]
We show that the continuous and high-dimensional nature of the visual input makes it a weak link against adversarial attacks.
We exploit visual adversarial examples to circumvent the safety guardrail of aligned LLMs with integrated vision.
Our study underscores the escalating adversarial risks associated with the pursuit of multimodality.
arXiv Detail & Related papers (2023-06-22T22:13:03Z) - Explicitly Encouraging Low Fractional Dimensional Trajectories Via Reinforcement Learning [6.548580592686076]
We show that the dimensionality of trajectories induced by model-free reinforcement learning agents can be influenced by adding a post-processing function to the agent's reward signal.
We verify that the dimensionality reduction is robust to noise added to the system and show that the modified agents are more robust to noise and push disturbances in general for the systems we examined.
arXiv Detail & Related papers (2020-12-21T20:09:17Z)