The Geometry of Harmfulness in LLMs through Subconcept Probing
- URL: http://arxiv.org/abs/2507.21141v1
- Date: Wed, 23 Jul 2025 07:56:05 GMT
- Title: The Geometry of Harmfulness in LLMs through Subconcept Probing
- Authors: McNair Shah, Saleena Angeline, Adhitya Rajendra Kumar, Naitik Chheda, Kevin Zhu, Vasu Sharma, Sean O'Brien, Will Cai
- Abstract summary: We introduce a multidimensional framework for probing and steering harmful content in language models. For each of 55 distinct harmfulness subconcepts, we learn a linear probe, yielding 55 interpretable directions in activation space. We then test ablation of the entire subspace from model internals, as well as steering and ablation in the subspace's dominant direction.
- Score: 3.6335172274433414
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in large language models (LLMs) have intensified the need to understand and reliably curb their harmful behaviours. We introduce a multidimensional framework for probing and steering harmful content in model internals. For each of 55 distinct harmfulness subconcepts (e.g., racial hate, employment scams, weapons), we learn a linear probe, yielding 55 interpretable directions in activation space. Collectively, these directions span a harmfulness subspace that we show is strikingly low-rank. We then test ablation of the entire subspace from model internals, as well as steering and ablation in the subspace's dominant direction. We find that dominant direction steering allows for near elimination of harmfulness with a low decrease in utility. Our findings advance the emerging view that concept subspaces provide a scalable lens on LLM behaviour and offer practical tools for the community to audit and harden future generations of language models.
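A minimal sketch of the recipe the abstract describes: one linear probe per subconcept, an SVD of the stacked probe directions to expose a low-rank harmfulness subspace, and ablation or steering along the dominant direction. The activations, labels, subconcept count handling, and steering coefficient below are synthetic placeholders, not the authors' exact setup.

```python
# Sketch: fit one linear probe per harmfulness subconcept, stack the probe
# directions, take the dominant direction via SVD, then steer/ablate along it.
# Activations and labels are synthetic stand-ins; in practice they would come
# from a chosen transformer layer and labelled harmful/benign prompts.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d_model, n_subconcepts, n_samples = 256, 55, 400

directions = []
for _ in range(n_subconcepts):
    X = rng.normal(size=(n_samples, d_model))       # placeholder activations
    y = rng.integers(0, 2, size=n_samples)          # placeholder subconcept labels
    probe = LogisticRegression(max_iter=1000).fit(X, y)
    w = probe.coef_[0]
    directions.append(w / np.linalg.norm(w))

D = np.stack(directions)                            # (55, d_model) probe directions
U, S, Vt = np.linalg.svd(D, full_matrices=False)
dominant = Vt[0]                                    # dominant direction of the subspace
print("non-negligible singular values:", int((S > 0.1 * S[0]).sum()))

def ablate_direction(h, v):
    """Remove the component of activations h along the unit direction v."""
    return h - np.outer(h @ v, v)

def steer(h, v, alpha=-4.0):
    """Shift activations along v; a negative alpha pushes away from harmfulness."""
    return h + alpha * v

h = rng.normal(size=(8, d_model))                   # a batch of activations
h_ablated = ablate_direction(h, dominant)
h_steered = steer(h, dominant)
```

With real data, the singular-value count above is where the claimed low-rank structure of the harmfulness subspace would show up.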
Related papers
- Decomposing Representation Space into Interpretable Subspaces with Unsupervised Learning [6.652200654829215]
We learn non-basis-aligned subspaces in an unsupervised manner. Results show that the information encoded in the obtained subspaces tends to reflect the same abstract concept across different inputs. We also provide evidence of scalability to 2B-parameter models by finding separate subspaces that mediate context versus parametric knowledge routing.
arXiv Detail & Related papers (2025-08-03T20:59:29Z)
- Re-Emergent Misalignment: How Narrow Fine-Tuning Erodes Safety Alignment in LLMs [0.0]
We show that fine-tuning on insecure code induces internal changes that oppose alignment. We identify a shared latent dimension in the model's activation space that governs alignment behavior.
arXiv Detail & Related papers (2025-07-04T15:36:58Z)
- SAE-SSV: Supervised Steering in Sparse Representation Spaces for Reliable Control of Language Models [41.553639748766784]
Large language models (LLMs) have demonstrated impressive capabilities in natural language understanding and generation. This paper introduces a novel supervised steering approach that operates in sparse, interpretable representation spaces.
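A rough sketch of what steering in a sparse representation space can look like: encode an activation with a sparse autoencoder, edit a chosen feature, and decode only the difference back into the activation. The SAE weights, feature index, and edit strength here are illustrative assumptions, not the SAE-SSV method itself.

```python
# Sketch of steering via a sparse autoencoder (SAE): encode an activation into
# sparse features, shift one target feature, and add the decoded difference
# back to the activation. All weights and the chosen feature are placeholders.
import torch

d_model, d_sae = 256, 2048
torch.manual_seed(0)
W_enc = torch.randn(d_model, d_sae) / d_model**0.5
W_dec = torch.randn(d_sae, d_model) / d_sae**0.5
b_enc = torch.zeros(d_sae)

def sae_steer(h, feature_idx, delta):
    """Encode h, shift one sparse feature, and apply only the decoded edit."""
    f = torch.relu(h @ W_enc + b_enc)        # sparse feature activations
    f_edit = f.clone()
    f_edit[..., feature_idx] += delta        # supervised edit of a target feature
    return h + (f_edit - f) @ W_dec          # write the difference back into h

h = torch.randn(4, d_model)
h_steered = sae_steer(h, feature_idx=123, delta=5.0)
```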
arXiv Detail & Related papers (2025-05-22T03:46:57Z)
- Safety Subspaces are Not Distinct: A Fine-Tuning Case Study [4.724646466332421]
We study whether safety-relevant behavior is concentrated in specific subspaces. We find no evidence of a subspace that selectively governs safety. This suggests that subspace-based defenses may face fundamental limitations.
arXiv Detail & Related papers (2025-05-20T10:41:49Z)
- The Hidden Dimensions of LLM Alignment: A Multi-Dimensional Analysis of Orthogonal Safety Directions [20.522881564776434]
We find that safety-aligned behavior is jointly controlled by multi-dimensional directions. By studying directions in this space, we first find that a dominant direction governs the model's refusal behavior. We then measure how different directions promote or suppress the dominant direction.
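One simple way to quantify how other directions promote or suppress a dominant one is via signed cosine projections; the directions below are random stand-ins for probe- or SVD-derived safety directions, used only to illustrate the kind of analysis described.

```python
# Sketch: given several safety-related directions, measure how strongly each
# aligns with (promotes) or opposes (suppresses) the dominant refusal direction.
import numpy as np

rng = np.random.default_rng(1)
d_model, n_dirs = 256, 8
dirs = rng.normal(size=(n_dirs, d_model))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)

dominant = dirs[0]
cosines = dirs[1:] @ dominant          # signed alignment with the dominant direction
for i, c in enumerate(cosines, start=1):
    label = "promotes" if c > 0 else "suppresses"
    print(f"direction {i}: cos = {c:+.3f} ({label})")
```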
arXiv Detail & Related papers (2025-02-13T06:39:22Z)
- Decoding Diffusion: A Scalable Framework for Unsupervised Analysis of Latent Space Biases and Representations Using Natural Language Prompts [68.48103545146127]
This paper proposes a novel framework for unsupervised exploration of diffusion latent spaces.
We directly leverage natural language prompts and image captions to map latent directions.
Our method provides a more scalable and interpretable understanding of the semantic knowledge encoded within diffusion models.
arXiv Detail & Related papers (2024-10-25T21:44:51Z)
- Subspace Defense: Discarding Adversarial Perturbations by Learning a Subspace for Clean Signals [52.123343364599094]
Adversarial attacks place carefully crafted perturbations on normal examples to fool deep neural networks (DNNs).
We first show empirically that the features of clean signals and of adversarial perturbations are redundant, each spanning a low-dimensional linear subspace, with minimal overlap between the two.
This makes it possible for DNNs to learn a subspace where only features of clean signals exist while those of perturbations are discarded.
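A toy illustration of the defense idea: fit a low-rank subspace to clean features (here with PCA) and project incoming features onto it, discarding whatever lies outside. The dimensions, rank, and synthetic data are assumptions for the sketch, not the paper's configuration.

```python
# Sketch: estimate a low-rank subspace from clean features with PCA, then
# project possibly perturbed features onto it, discarding off-subspace energy.
# Real features would come from an intermediate DNN layer.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
d_feat, n_clean, rank = 128, 1000, 16

# Synthetic clean features that genuinely lie near a low-dimensional subspace.
basis = rng.normal(size=(rank, d_feat))
clean = rng.normal(size=(n_clean, rank)) @ basis

pca = PCA(n_components=rank).fit(clean)

def project_to_clean_subspace(x):
    """Keep only the component of x that lies in the learned clean subspace."""
    return pca.inverse_transform(pca.transform(x))

perturbed = clean[:8] + 0.5 * rng.normal(size=(8, d_feat))   # perturbation stand-in
denoised = project_to_clean_subspace(perturbed)
```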
arXiv Detail & Related papers (2024-03-24T14:35:44Z)
- Representation Surgery: Theory and Practice of Affine Steering [72.61363182652853]
Language models often exhibit undesirable behavior, e.g., generating toxic or gender-biased text. One natural (and common) approach to prevent the model from exhibiting undesirable behavior is to steer the model's representations. This paper investigates the formal and empirical properties of steering functions.
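The paper concerns steering functions of the affine form h ↦ Wh + b. The sketch below instantiates a common special case (identity W, mean-difference b), which is one affine steering choice rather than the paper's specific construction; the data is synthetic.

```python
# Sketch of an affine steering function h -> W h + b. Here W is the identity
# and b shifts representations from the "undesired" mean toward the "desired"
# mean -- a simple mean-shift instance of affine steering on synthetic data.
import numpy as np

rng = np.random.default_rng(0)
d_model = 256
h_bad = rng.normal(loc=0.3, size=(500, d_model))    # representations of undesired text
h_good = rng.normal(loc=-0.2, size=(500, d_model))  # representations of desired text

W = np.eye(d_model)
b = h_good.mean(axis=0) - h_bad.mean(axis=0)        # shift toward the desired mean

def affine_steer(h):
    return h @ W.T + b

steered = affine_steer(h_bad[:4])
```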
arXiv Detail & Related papers (2024-02-15T00:20:30Z)
- METRA: Scalable Unsupervised RL with Metric-Aware Abstraction [69.90741082762646]
Metric-Aware Abstraction (METRA) is a novel unsupervised reinforcement learning objective.
By learning to move in every direction in the latent space, METRA obtains a tractable set of diverse behaviors.
We show that METRA can discover a variety of useful behaviors even in complex, pixel-based environments.
arXiv Detail & Related papers (2023-10-13T06:43:11Z)
- A Geometric Notion of Causal Probing [85.49839090913515]
The linear subspace hypothesis states that, in a language model's representation space, all information about a concept such as verbal number is encoded in a linear subspace. We give a set of intrinsic criteria which characterize an ideal linear concept subspace. We find that, for at least one concept across two language models, the concept subspace can be used to manipulate the concept value of the generated word with precision.
arXiv Detail & Related papers (2023-07-27T17:57:57Z)
- Visual Adversarial Examples Jailbreak Aligned Large Language Models [66.53468356460365]
We show that the continuous and high-dimensional nature of the visual input makes it a weak link against adversarial attacks.
We exploit visual adversarial examples to circumvent the safety guardrail of aligned LLMs with integrated vision.
Our study underscores the escalating adversarial risks associated with the pursuit of multimodality.
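A schematic of the attack surface described here: a projected-gradient-style perturbation of a continuous image input under an L-infinity budget. The `model` and `attack_loss` arguments are placeholders; in the jailbreak setting the loss would reward a harmful target completion from the aligned vision-language model.

```python
# Sketch of a PGD-style visual adversarial perturbation: repeatedly step the
# image along the gradient of an attacker objective and clip to an L-inf budget.
# `model` and `attack_loss` are placeholders for a vision-language model and an
# attacker objective (e.g., likelihood of a harmful target string).
import torch

def pgd_attack(model, attack_loss, image, eps=8 / 255, step=1 / 255, iters=100):
    adv = image.clone().detach()
    for _ in range(iters):
        adv.requires_grad_(True)
        loss = attack_loss(model(adv))          # attacker's objective on model output
        grad = torch.autograd.grad(loss, adv)[0]
        with torch.no_grad():
            adv = adv + step * grad.sign()      # ascend the attacker's objective
            adv = image + (adv - image).clamp(-eps, eps)   # stay within the budget
            adv = adv.clamp(0.0, 1.0)           # keep a valid image
    return adv.detach()
```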
arXiv Detail & Related papers (2023-06-22T22:13:03Z)
- Unsupervised Discriminative Embedding for Sub-Action Learning in Complex Activities [54.615003524001686]
This paper proposes a novel approach for unsupervised sub-action learning in complex activities.
The proposed method maps both visual and temporal representations to a latent space where the sub-actions are learnt discriminatively.
We show that the proposed combination of visual-temporal embedding and discriminative latent concepts allows robust action representations to be learned in an unsupervised setting.
arXiv Detail & Related papers (2021-04-30T20:07:27Z)
- Where and What? Examining Interpretable Disentangled Representations [96.32813624341833]
Capturing interpretable variations has long been one of the goals in disentanglement learning.
Unlike the independence assumption, interpretability has rarely been exploited to encourage disentanglement in the unsupervised setting.
In this paper, we examine the interpretability of disentangled representations by investigating two questions: where to be interpreted and what to be interpreted.
arXiv Detail & Related papers (2021-04-07T11:22:02Z)