AlignSAE: Concept-Aligned Sparse Autoencoders
- URL: http://arxiv.org/abs/2512.02004v1
- Date: Mon, 01 Dec 2025 18:58:22 GMT
- Title: AlignSAE: Concept-Aligned Sparse Autoencoders
- Authors: Minglai Yang, Xinyu Guo, Mihai Surdeanu, Liangming Pan
- Abstract summary: We introduce AlignSAE, a method that aligns SAE features with a defined ontology through a "pre-train, then post-train" curriculum. After an initial unsupervised training phase, we apply supervised post-training to bind specific concepts to dedicated latent slots. This separation creates an interpretable interface where specific relations can be inspected and controlled without interference from unrelated features.
- Score: 47.18866175760984
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) encode factual knowledge within hidden parametric spaces that are difficult to inspect or control. While Sparse Autoencoders (SAEs) can decompose hidden activations into more fine-grained, interpretable features, they often struggle to reliably align these features with human-defined concepts, resulting in entangled and distributed feature representations. To address this, we introduce AlignSAE, a method that aligns SAE features with a defined ontology through a "pre-train, then post-train" curriculum. After an initial unsupervised training phase, we apply supervised post-training to bind specific concepts to dedicated latent slots while preserving the remaining capacity for general reconstruction. This separation creates an interpretable interface where specific relations can be inspected and controlled without interference from unrelated features. Empirical results demonstrate that AlignSAE enables precise causal interventions, such as reliable "concept swaps", by targeting single, semantically aligned slots.
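The "pre-train, then post-train" recipe described in the abstract can be sketched minimally. The NumPy sketch below is illustrative, not the authors' implementation: it reserves the first K latents as dedicated concept slots and uses a supervised hinge term to bind each labelled concept to its slot during post-training; all names, dimensions, and the exact loss form are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

D, H, K = 16, 64, 4   # model hidden dim, SAE width, number of ontology concepts
# Hypothetical parameters: slots 0..K-1 are reserved for the K concepts;
# the remaining H-K latents keep general reconstruction capacity.
W_enc = rng.normal(0, 0.1, (D, H))
b_enc = np.zeros(H)
W_dec = rng.normal(0, 0.1, (H, D))

def encode(x):
    # ReLU encoder producing a sparse latent code z.
    return np.maximum(0.0, x @ W_enc + b_enc)

def decode(z):
    return z @ W_dec

def pretrain_loss(x, l1=1e-3):
    # Phase 1 (unsupervised): reconstruction plus L1 sparsity, as in standard SAEs.
    z = encode(x)
    recon = decode(z)
    return np.mean((x - recon) ** 2) + l1 * np.mean(np.abs(z))

def align_loss(x, concept_ids, margin=1.0):
    # Phase 2 (supervised post-training), sketched as a hinge objective:
    # the slot bound to the labelled concept should fire above `margin`,
    # while the other reserved slots stay silent for that example. The
    # actual AlignSAE objective may differ; this is a stand-in.
    z = encode(x)[:, :K]                  # activations of the reserved slots
    target = np.eye(K)[concept_ids]       # one-hot slot assignment per example
    pos = np.maximum(0.0, margin - z[target == 1])
    neg = z[target == 0]
    return pos.mean() + neg.mean()
```

In post-training the two terms would be combined, e.g. `pretrain_loss(x) + lam * align_loss(x, concept_ids)`, so general reconstruction capacity is preserved while the reserved slots align with the ontology; a "concept swap" then amounts to zeroing one slot and activating another before decoding.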
Related papers
- LUCID-SAE: Learning Unified Vision-Language Sparse Codes for Interpretable Concept Discovery [14.222802170483739]
LUCID is a vision-language sparse autoencoder that learns a shared latent dictionary for image patch and text token representations. LUCID yields interpretable shared features that support patch-level grounding, establish cross-modal neuron correspondence, and enhance robustness against the concept clustering problem. Our analysis reveals that LUCID's shared features capture diverse semantic categories beyond objects, including actions, attributes, and abstract concepts.
arXiv Detail & Related papers (2026-02-07T02:01:25Z)
- Rethinking Multi-Condition DiTs: Eliminating Redundant Attention via Position-Alignment and Keyword-Scoping [61.459927600301654]
Multi-condition control is bottlenecked by the conventional "concatenate-and-attend" strategy. Our analysis reveals that much of this cross-modal interaction is spatially or semantically redundant. We propose Position-aligned and Keyword-scoped Attention (PKA), a highly efficient framework designed to eliminate these redundancies.
arXiv Detail & Related papers (2026-02-06T16:39:10Z)
- Visual Exploration of Feature Relationships in Sparse Autoencoders with Curated Concepts [8.768503486874623]
We propose a focused exploration framework that prioritizes curated concepts and their corresponding SAE features over attempts to visualize all available features simultaneously. We present an interactive visualization system that combines topology-based visual encoding with dimensionality reduction to faithfully represent both local and global relationships among selected features.
arXiv Detail & Related papers (2025-11-08T15:36:57Z)
- LAVA: Explainability for Unsupervised Latent Embeddings [0.0]
Locality-Aware Variable Associations (LAVA) is designed to explain local embedding organization through its relationship with the input features. Based on UMAP embeddings of MNIST and a single-cell kidney dataset, we show that LAVA captures relevant feature associations.
arXiv Detail & Related papers (2025-09-25T13:38:17Z)
- Semantic Concentration for Self-Supervised Dense Representations Learning [103.10708947415092]
Image-level self-supervised learning (SSL) has made significant progress, yet learning dense representations for patches remains challenging. This work reveals that image-level SSL avoids over-dispersion by involving implicit semantic concentration.
arXiv Detail & Related papers (2025-09-11T13:12:10Z) - LIRA: Inferring Segmentation in Large Multi-modal Models with Local Interleaved Region Assistance [54.683384204063934]
Large multi-modal models (LMMs) struggle with inaccurate segmentation and hallucinated comprehension. We propose LIRA, a framework that capitalizes on the complementary relationship between visual comprehension and segmentation. LIRA achieves state-of-the-art performance in both segmentation and comprehension tasks.
arXiv Detail & Related papers (2025-07-08T07:46:26Z) - SVIP: Semantically Contextualized Visual Patches for Zero-Shot Learning [38.507994878183474]
We introduce Semantically contextualized VIsual Patches (SVIP) for zero-shot learning (ZSL). We propose a self-supervised patch selection mechanism that preemptively learns to identify semantically unrelated patches in the input space. SVIP achieves state-of-the-art performance results while providing more interpretable and semantically rich feature representations.
arXiv Detail & Related papers (2025-03-13T10:59:51Z) - Disentangling Dense Embeddings with Sparse Autoencoders [0.0]
Sparse autoencoders (SAEs) have shown promise in extracting interpretable features from complex neural networks.
We present one of the first applications of SAEs to dense text embeddings from large language models.
We show that the resulting sparse representations maintain semantic fidelity while offering interpretability.
arXiv Detail & Related papers (2024-08-01T15:46:22Z)
- Beyond Prototypes: Semantic Anchor Regularization for Better Representation Learning [82.29761875805369]
One of the ultimate goals of representation learning is to achieve compactness within a class and well-separability between classes.
We propose a novel perspective to use pre-defined class anchors serving as feature centroid to unidirectionally guide feature learning.
The proposed Semantic Anchor Regularization (SAR) can be used in a plug-and-play manner in the existing models.
arXiv Detail & Related papers (2023-12-19T05:52:38Z)
- Closed-Form Factorization of Latent Semantics in GANs [65.42778970898534]
A rich set of interpretable dimensions has been shown to emerge in the latent space of the Generative Adversarial Networks (GANs) trained for synthesizing images.
In this work, we examine the internal representation learned by GANs to reveal the underlying variation factors in an unsupervised manner.
We propose a closed-form factorization algorithm for latent semantic discovery by directly decomposing the pre-trained weights.
arXiv Detail & Related papers (2020-07-13T18:05:36Z)
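The closed-form factorization in the last entry is compact enough to sketch: its core idea is that the semantic directions are the top eigenvectors of A^T A, where A is the pre-trained generator's first affine-layer weight mapping latent codes to features. A minimal NumPy sketch with illustrative names, not the paper's exact code:

```python
import numpy as np

def closed_form_directions(A, k=5):
    """Top-k semantic directions for a generator whose first affine layer
    has weight A (feature_dim x latent_dim): the unit latent directions
    that most change the layer's output, i.e. the eigenvectors of A^T A
    with the largest eigenvalues. Returns a (latent_dim, k) matrix."""
    eigvals, eigvecs = np.linalg.eigh(A.T @ A)  # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:k]       # pick the k largest
    return eigvecs[:, order]
```

Editing an image then amounts to moving a latent code along one of the returned columns, with no training or sampling required, which is what makes the method "closed-form".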
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.