Sparse Concept Anchoring for Interpretable and Controllable Neural Representations
- URL: http://arxiv.org/abs/2512.12469v1
- Date: Sat, 13 Dec 2025 21:43:17 GMT
- Title: Sparse Concept Anchoring for Interpretable and Controllable Neural Representations
- Authors: Sandy Fraser, Patryk Wielopolski
- Abstract summary: We introduce Sparse Concept Anchoring, a method that biases the latent space to position a targeted subset of concepts while allowing others to self-organize. The anchored geometry enables two practical interventions: reversible behavioral steering that projects out a concept's latent component at inference, and permanent removal via targeted weight ablation.
- Score: 0.9831489366502301
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce Sparse Concept Anchoring, a method that biases the latent space to position a targeted subset of concepts while allowing others to self-organize, using only minimal supervision (labels for <0.1% of examples per anchored concept). Training combines activation normalization, a separation regularizer, and anchor or subspace regularizers that attract rare labeled examples to predefined directions or axis-aligned subspaces. The anchored geometry enables two practical interventions: reversible behavioral steering that projects out a concept's latent component at inference, and permanent removal via targeted weight ablation of anchored dimensions. Experiments on structured autoencoders show selective attenuation of targeted concepts with negligible impact on orthogonal features, and complete elimination with reconstruction error approaching theoretical bounds. Sparse Concept Anchoring therefore provides a practical pathway to interpretable, steerable behavior in learned representations.
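The abstract describes the two interventions precisely enough to sketch them. The NumPy sketch below is our illustration, not the authors' code: `steer` implements the reversible projection-out intervention, `ablate_dims` the permanent weight ablation of axis-aligned anchored dimensions, and `anchor_loss` one plausible form of the anchor regularizer. All function names, and the cosine-pull form of the loss, are our assumptions.

```python
import numpy as np

def steer(z, anchor, strength=1.0):
    """Reversible steering: remove a concept's component along its anchored
    direction from latents z of shape (n, d); the weights stay untouched."""
    u = anchor / np.linalg.norm(anchor)           # unit anchor direction
    return z - strength * (z @ u)[:, None] * u    # project out the component

def ablate_dims(W_dec, anchored_dims):
    """Permanent removal: zero the decoder columns that read from the
    anchored latent dimensions (axis-aligned subspace case)."""
    W_dec = W_dec.copy()
    W_dec[:, anchored_dims] = 0.0
    return W_dec

def anchor_loss(z_labeled, anchor):
    """Assumed anchor regularizer: pull the few labeled, normalized latents
    toward the predefined unit direction (cosine attraction)."""
    u = anchor / np.linalg.norm(anchor)
    z_hat = z_labeled / np.linalg.norm(z_labeled, axis=-1, keepdims=True)
    return float(np.mean(1.0 - z_hat @ u))
```

Note the asymmetry the abstract emphasizes: steering edits only the activation and is therefore reversible, while ablation permanently removes the anchored dimensions from the weights.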
Related papers
- MeGU: Machine-Guided Unlearning with Target Feature Disentanglement [73.49657372882082]
We propose a novel framework that guides unlearning through concept-aware re-alignment. MeGU enables controlled and selective forgetting, effectively mitigating both under-unlearning and over-unlearning.
arXiv Detail & Related papers (2026-02-19T05:20:31Z)
- Rethinking Transferable Adversarial Attacks on Point Clouds from a Compact Subspace Perspective [55.919842734983156]
CoSA is a transferable attack framework that operates within a shared low-dimensional semantic space. CoSA consistently outperforms state-of-the-art transferable attacks.
arXiv Detail & Related papers (2026-01-30T15:48:11Z)
- Universal Refusal Circuits Across LLMs: Cross-Model Transfer via Trajectory Replay and Concept-Basis Reconstruction [0.0]
We introduce Trajectory Replay via Concept-Basis Reconstruction, a framework that transfers refusal interventions from donor to target models. By aligning layers via concept fingerprints and reconstructing refusal directions using a shared "recipe" of concept atoms, we map the donor's ablation trajectory into the target's semantic space. Our evaluation confirms that these transferred recipes consistently attenuate refusal while maintaining performance.
arXiv Detail & Related papers (2026-01-22T15:08:28Z)
- Sparse Attention Post-Training for Mechanistic Interpretability [55.030850996535776]
We introduce a simple post-training method that makes transformer attention sparse without sacrificing performance. Applying a flexible sparsity regularisation under a constrained-loss objective, we show on models up to 1B parameters that it is possible to retain the original pretraining loss while reducing attention connectivity to $\approx 0.3\%$ of its edges.
arXiv Detail & Related papers (2025-12-05T16:40:08Z)
- FACE: Faithful Automatic Concept Extraction [4.417419748257645]
FACE (Faithful Automatic Concept Extraction) is a novel framework that augments Non-negative Matrix Factorization (NMF) with a Kullback-Leibler (KL) divergence regularization term to ensure alignment between the model's original and concept-based predictions. We provide theoretical guarantees showing that minimizing the KL divergence bounds the deviation in predictive distributions, thereby promoting faithful local linearity in the learned concept space. (A hedged sketch of this objective follows below.)
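As we read the summary, the objective combines an NMF reconstruction of the activations with a KL term that keeps predictions from the concept reconstruction close to the originals. The sketch below is our assumption of that composition, not FACE's published loss; `head`, `beta`, and the function name are hypothetical.

```python
import torch
import torch.nn.functional as F

def face_style_objective(A, W, H, head, beta=1.0):
    """NMF-style reconstruction of activations A plus a KL penalty keeping
    predictions from the concept reconstruction close to the originals."""
    A_hat = W.clamp(min=0) @ H.clamp(min=0)   # enforce non-negative factors
    recon = F.mse_loss(A_hat, A)              # NMF reconstruction term
    p = F.log_softmax(head(A), dim=-1)        # original predictions
    q = F.log_softmax(head(A_hat), dim=-1)    # concept-based predictions
    kl = F.kl_div(q, p, reduction="batchmean", log_target=True)
    return recon + beta * kl
```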
arXiv Detail & Related papers (2025-10-13T17:44:45Z)
- Discovering Concept Directions from Diffusion-based Counterfactuals via Latent Clustering [4.891597567642704]
Concept-based explanations have emerged as an effective approach within Explainable Artificial Intelligence. This work introduces Concept Directions via Latent Clustering (CDLC), which extracts global, class-specific concept directions. This approach is validated on a real-world skin lesion dataset.
arXiv Detail & Related papers (2025-05-11T17:53:02Z)
- Post-Hoc Concept Disentanglement: From Correlated to Isolated Concept Representations [12.072112471560716]
Concept Activation Vectors (CAVs) are widely used to model human-understandable concepts. They are trained by identifying directions from the activations of concept samples to those of non-concept samples. This method produces similar, non-orthogonal directions for correlated concepts, such as "beard" and "necktie". This entanglement complicates the interpretation of concepts in isolation and can lead to undesired effects in CAV applications. (The standard CAV recipe is sketched below.)
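For context, the standard CAV construction this summary refers to can be sketched as a linear probe; this is the usual TCAV-style recipe, not the disentanglement method the paper itself proposes, and the function name is ours.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def concept_activation_vector(acts_concept, acts_random):
    """Standard CAV recipe: fit a linear probe separating concept from
    non-concept activations; the unit normal of its boundary is the CAV."""
    X = np.vstack([acts_concept, acts_random])
    y = np.concatenate([np.ones(len(acts_concept)),
                        np.zeros(len(acts_random))])
    w = LogisticRegression(max_iter=1000).fit(X, y).coef_[0]
    return w / np.linalg.norm(w)
```

Because correlated concepts co-occur in the training activations, two such probes can return nearly parallel vectors; that is exactly the entanglement the paper addresses.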
arXiv Detail & Related papers (2025-03-07T15:45:43Z)
- Toward a Flexible Framework for Linear Representation Hypothesis Using Maximum Likelihood Estimation [3.515066520628763]
We introduce a new notion of binary concepts as unit vectors in a canonical representation space. Our method, Sum of Activation-based Normalized Difference (SAND), formalizes the use of activation differences modeled as samples from a von Mises-Fisher distribution. (A minimal sketch is given below.)
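A minimal sketch of our reading of this summary: normalize paired activation differences onto the unit sphere, treat them as von Mises-Fisher samples, and take the maximum-likelihood mean direction. The function name and the pairing convention are our assumptions.

```python
import numpy as np

def sand_style_direction(acts_with, acts_without):
    """Concept direction as the vMF maximum-likelihood mean direction of
    normalized activation differences (assumed reading of the summary)."""
    diffs = acts_with - acts_without                              # paired differences
    diffs = diffs / np.linalg.norm(diffs, axis=1, keepdims=True)  # unit sphere
    mu = diffs.sum(axis=0)                                        # sum of unit samples
    return mu / np.linalg.norm(mu)                                # vMF MLE direction
```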
arXiv Detail & Related papers (2025-02-22T23:56:30Z)
- Beyond Scalars: Concept-Based Alignment Analysis in Vision Transformers [10.400355814467401]
Vision transformers (ViTs) can be trained using various learning paradigms, from fully supervised to self-supervised. We propose a concept-based alignment analysis of representations from four different ViTs, which reveals that increased supervision correlates with a reduction in the semantic structure of learned representations.
arXiv Detail & Related papers (2024-12-09T16:33:28Z)
- Causal Unsupervised Semantic Segmentation [60.178274138753174]
Unsupervised semantic segmentation aims to achieve high-quality semantic grouping without human-labeled annotations.
We propose a novel framework, CAusal Unsupervised Semantic sEgmentation (CAUSE), which leverages insights from causal inference.
arXiv Detail & Related papers (2023-10-11T10:54:44Z)
- Log-linear Guardedness and its Implications [116.87322784046926]
Concept-erasure methods that assume linearity have been found to be tractable and useful for removing human-interpretable concepts from neural representations.
This work formally defines the notion of log-linear guardedness as the inability of an adversary to predict the concept directly from the representation.
We show that, in the binary case, under certain assumptions, a downstream log-linear model cannot recover the erased concept.
arXiv Detail & Related papers (2022-10-18T17:30:02Z)
- Closed-Form Factorization of Latent Semantics in GANs [65.42778970898534]
A rich set of interpretable dimensions has been shown to emerge in the latent space of Generative Adversarial Networks (GANs) trained for synthesizing images.
In this work, we examine the internal representation learned by GANs to reveal the underlying variation factors in an unsupervised manner.
We propose a closed-form factorization algorithm for latent semantic discovery by directly decomposing the pre-trained weights (sketched after this entry).
arXiv Detail & Related papers (2020-07-13T18:05:36Z)
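The closed-form factorization above admits a compact sketch: for a generator layer with weight W, the latent directions that maximally perturb the layer's output are the leading eigenvectors of W^T W. This captures the spirit of the algorithm rather than a verbatim reimplementation, and the function name is ours.

```python
import numpy as np

def closed_form_directions(W, k=5):
    """Top-k latent directions n maximizing ||W n|| subject to ||n|| = 1,
    i.e., the leading eigenvectors of W^T W, computed in closed form."""
    _, vecs = np.linalg.eigh(W.T @ W)   # eigh returns ascending eigenvalues
    return vecs[:, ::-1][:, :k].T       # leading eigenvectors as rows
```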
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.