Decomposing Representation Space into Interpretable Subspaces with Unsupervised Learning
- URL: http://arxiv.org/abs/2508.01916v1
- Date: Sun, 03 Aug 2025 20:59:29 GMT
- Title: Decomposing Representation Space into Interpretable Subspaces with Unsupervised Learning
- Authors: Xinting Huang, Michael Hahn
- Abstract summary: We learn non-basis-aligned subspaces in an unsupervised manner. Results show that the information encoded in the obtained subspaces tends to share the same abstract concept across different inputs. We also provide evidence of scalability to 2B-parameter models by finding separate subspaces mediating the routing of context and parametric knowledge.
- Score: 6.652200654829215
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Understanding internal representations of neural models is a core interest of mechanistic interpretability. Due to its large dimensionality, the representation space can encode various aspects about inputs. To what extent are different aspects organized and encoded in separate subspaces? Is it possible to find these "natural" subspaces in a purely unsupervised way? Somewhat surprisingly, we can indeed achieve this and find interpretable subspaces by a seemingly unrelated training objective. Our method, neighbor distance minimization (NDM), learns non-basis-aligned subspaces in an unsupervised manner. Qualitative analysis shows subspaces are interpretable in many cases, and encoded information in obtained subspaces tends to share the same abstract concept across different inputs, making such subspaces similar to "variables" used by the model. We also conduct quantitative experiments using known circuits in GPT-2; results show a strong connection between subspaces and circuit variables. We also provide evidence showing scalability to 2B models by finding separate subspaces mediating context and parametric knowledge routing. Viewed more broadly, our findings offer a new perspective on understanding model internals and building circuits.
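The abstract does not spell out the NDM objective, but a rough, hypothetical picture is: project hidden states onto a learned k-dimensional subspace and minimize each projected point's distance to its nearest neighbor in a batch. The sketch below is only an illustration under that assumption; the loss form, QR orthonormalization, and shapes are placeholders, and the actual method likely includes constraints this toy version omits.

```python
# Hedged sketch of a neighbor-distance-minimization-style objective (assumed
# form, not the authors' implementation): project hidden states onto a learned
# subspace and minimize each point's distance to its nearest batch neighbor.
import torch

def ndm_loss(reps, basis):
    """Mean distance from each projected point to its nearest other point."""
    q, _ = torch.linalg.qr(basis)                 # orthonormalize the learned basis
    proj = reps @ q                               # (batch, k) subspace coordinates
    dists = torch.cdist(proj, proj)               # pairwise Euclidean distances
    dists = dists + torch.eye(len(proj)) * 1e9    # mask out self-distances
    return dists.min(dim=1).values.mean()

reps = torch.randn(256, 768)                      # hypothetical GPT-2-sized hidden states
basis = torch.randn(768, 32, requires_grad=True)  # learnable 32-dim subspace
optimizer = torch.optim.Adam([basis], lr=1e-3)
loss = ndm_loss(reps, basis)
loss.backward()
optimizer.step()
```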
Related papers
- Large Language Models Encode Semantics in Low-Dimensional Linear Subspaces [31.401762286885656]
Understanding the space geometry of large language models (LLMs) is key to interpreting their behavior and improving alignment. We investigate to what extent LLMs internally organize representations related to semantic understanding.
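As a rough illustration of the "low-dimensional linear subspace" claim, one could estimate how many principal components are needed to explain most of the variance of a set of hidden states. The threshold and synthetic data below are placeholders, not the paper's analysis.

```python
# Hedged sketch: estimate the effective dimensionality of hidden states via
# PCA explained variance. The 90% threshold and random "hidden states" are
# illustrative assumptions only.
import numpy as np

def effective_dim(hidden_states, var_threshold=0.90):
    """Number of principal components needed to reach `var_threshold` variance."""
    centered = hidden_states - hidden_states.mean(0, keepdims=True)
    svals = np.linalg.svd(centered, compute_uv=False)
    var_ratio = np.cumsum(svals ** 2) / np.sum(svals ** 2)
    return int(np.searchsorted(var_ratio, var_threshold) + 1)

states = np.random.randn(2000, 4096)      # hypothetical LLM hidden states
print(effective_dim(states))              # small value => low-dimensional subspace
```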
arXiv Detail & Related papers (2025-07-13T17:03:25Z) - SliderSpace: Decomposing the Visual Capabilities of Diffusion Models [50.82362500995365]
SliderSpace is a framework for automatically decomposing the visual capabilities of diffusion models. It discovers multiple interpretable and diverse directions simultaneously from a single text prompt. Our method produces more diverse and useful variations compared to baselines.
arXiv Detail & Related papers (2025-02-03T18:59:55Z) - Unsupervised Panoptic Interpretation of Latent Spaces in GANs Using Space-Filling Vector Quantization [9.181917968017258]
Generative adversarial networks (GANs) learn a latent space whose samples can be mapped to real-world images. Some earlier supervised methods aim to create an interpretable latent space or discover interpretable directions. We propose using a modification of vector quantization called space-filling vector quantization (SFVQ), which quantizes the data on a piecewise-linear curve.
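To make "quantizing on a piecewise-linear curve" concrete, here is a minimal NumPy sketch that snaps points to the nearest point on a polyline passing through an ordered codebook. The codebook values and nearest-segment rule are illustrative assumptions; learning the codebook itself is omitted.

```python
# Hedged sketch: map each point to its closest point on the piecewise-linear
# curve defined by an ordered codebook (training of the codebook is not shown).
import numpy as np

def quantize_on_polyline(x, codebook):
    """Snap each row of `x` to the nearest point on the polyline through `codebook`."""
    a, b = codebook[:-1], codebook[1:]                          # segment endpoints, (m-1, d)
    ab = b - a                                                  # segment directions
    diff = x[:, None, :] - a[None, :, :]                        # (n, m-1, d)
    t = np.einsum('nmd,md->nm', diff, ab) / (ab * ab).sum(-1)   # projection parameters
    t = np.clip(t, 0.0, 1.0)                                    # stay within each segment
    proj = a[None] + t[..., None] * ab[None]                    # candidate projections
    d2 = ((x[:, None, :] - proj) ** 2).sum(-1)                  # squared distance per segment
    best = d2.argmin(axis=1)
    return proj[np.arange(len(x)), best]

# Usage: quantize 2-D latents onto a curve through four ordered codebook points.
codebook = np.array([[0., 0.], [1., 0.], [1., 1.], [2., 1.]])
latents = np.random.randn(5, 2)
print(quantize_on_polyline(latents, codebook))
```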
arXiv Detail & Related papers (2024-10-27T19:56:02Z) - Decoding Diffusion: A Scalable Framework for Unsupervised Analysis of Latent Space Biases and Representations Using Natural Language Prompts [68.48103545146127]
This paper proposes a novel framework for unsupervised exploration of diffusion latent spaces.
We directly leverage natural language prompts and image captions to map latent directions.
Our method provides a more scalable and interpretable understanding of the semantic knowledge encoded within diffusion models.
arXiv Detail & Related papers (2024-10-25T21:44:51Z) - Input Space Mode Connectivity in Deep Neural Networks [5.8470747480006695]
We extend the concept of loss landscape mode connectivity to the input space of deep neural networks.
We present theoretical and empirical evidence of its presence in the input space of deep networks.
We exploit mode connectivity to obtain new insights about adversarial examples and demonstrate its potential for adversarial detection.
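A simple way to picture mode connectivity in input space is to evaluate the loss along a path between two inputs that receive the same label. The sketch below checks only a straight line; finding genuinely low-loss curved paths is the harder part, and all names here are placeholders.

```python
# Hedged sketch: loss profile along a straight line between two inputs the
# model assigns the same label. A flat, low-loss profile would suggest (linear)
# mode connectivity in input space; model, inputs, and label are placeholders.
import torch
import torch.nn.functional as F

def linear_path_losses(model, x0, x1, label, steps=11):
    """Cross-entropy of `model` at evenly spaced points between x0 and x1."""
    losses = []
    with torch.no_grad():
        for alpha in torch.linspace(0.0, 1.0, steps):
            x = (1 - alpha) * x0 + alpha * x1        # point on the linear path
            logits = model(x.unsqueeze(0))           # add batch dimension
            losses.append(F.cross_entropy(logits, label.view(1)).item())
    return losses
```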
arXiv Detail & Related papers (2024-09-09T17:03:43Z) - More than Correlation: Do Large Language Models Learn Causal Representations of Space? [6.293100288400849]
This study focused on uncovering the causal role of spatial representations in large language models.
Experiments showed that the spatial representations influenced the model's performance on next word prediction and a downstream task that relies on geospatial information.
arXiv Detail & Related papers (2023-12-26T01:27:29Z) - Occlusion Sensitivity Analysis with Augmentation Subspace Perturbation in Deep Feature Space [7.021872917042116]
We introduce the Occlusion Sensitivity Analysis with Deep Feature Augmentation Subspace (OSA-DAS), a novel perturbation-based interpretability approach for computer vision.
Our method utilizes the output vector of a DNN to build low-dimensional subspaces within the deep feature vector space, offering a more precise explanation of the model prediction.
We test extensively on ImageNet-1k, and our class- and model-agnostic approach outperforms commonly used interpreters.
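The deep-feature-space idea can be sketched generically: perturb a feature vector by removing a low-dimensional subspace and measure the change in the class score. The random basis below is only a stand-in for the paper's augmentation-derived subspaces.

```python
# Hedged sketch of perturbation-based sensitivity in deep feature space:
# project a feature vector onto the orthogonal complement of a low-dimensional
# subspace and record the drop in the target-class logit. The random basis is
# a placeholder for augmentation-derived subspaces.
import torch

def subspace_sensitivity(classifier_head, feat, basis, target_class):
    """Change in the target-class logit after removing span(basis) from `feat`."""
    q, _ = torch.linalg.qr(basis)                 # orthonormal basis, (d, k)
    perturbed = feat - (feat @ q) @ q.T           # strip the subspace component
    original = classifier_head(feat)[:, target_class]
    altered = classifier_head(perturbed)[:, target_class]
    return (original - altered).item()

head = torch.nn.Linear(2048, 1000)                # hypothetical ImageNet classifier head
feat = torch.randn(1, 2048)                       # one deep feature vector
basis = torch.randn(2048, 8)                      # hypothetical 8-dim subspace
print(subspace_sensitivity(head, feat, basis, target_class=207))
```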
arXiv Detail & Related papers (2023-11-25T13:26:40Z) - A Geometric Notion of Causal Probing [85.49839090913515]
The linear subspace hypothesis states that, in a language model's representation space, all information about a concept such as verbal number is encoded in a linear subspace. We give a set of intrinsic criteria which characterize an ideal linear concept subspace. We find that, for at least one concept across two language models, the concept subspace can be used to manipulate the concept value of the generated word with precision.
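One way to read "using a subspace to manipulate a concept value" is as a projection-based intervention on the hidden state: erase or overwrite the in-subspace component. The sketch below assumes that reading; the basis and target coordinates are hypothetical placeholders, not the paper's estimator.

```python
# Hedged sketch of intervening on a linear concept subspace: erase the
# component of a hidden state lying in span(basis), or overwrite it with
# target coordinates. Basis and targets are hypothetical placeholders.
import torch

def erase_concept(h, basis):
    """Project hidden states onto the orthogonal complement of span(basis)."""
    q, _ = torch.linalg.qr(basis)                 # orthonormal basis, (d, k)
    return h - (h @ q) @ q.T

def set_concept(h, basis, target_coords):
    """Replace the in-subspace component of `h` with `target_coords`."""
    q, _ = torch.linalg.qr(basis)
    return h - (h @ q) @ q.T + target_coords @ q.T

h = torch.randn(4, 768)                           # hypothetical hidden states
basis = torch.randn(768, 2)                       # hypothetical 2-dim concept subspace
steered = set_concept(h, basis, torch.tensor([[1.5, 0.0]]))
```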
arXiv Detail & Related papers (2023-07-27T17:57:57Z) - Exploring the Common Principal Subspace of Deep Features in Neural Networks [50.37178960258464]
We find that different Deep Neural Networks (DNNs) trained with the same dataset share a common principal subspace in latent spaces.
Specifically, we design a new metric, the $\mathcal{P}$-vector, to represent the principal subspace of deep features learned in a DNN.
Small angles (with cosine close to $1.0$) have been found in the comparisons between any two DNNs trained with different algorithms/architectures.
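The comparison can be pictured with a simple stand-in for the $\mathcal{P}$-vector: take the top principal direction of each model's feature matrix and measure the cosine of the angle between them. The SVD-based construction and random features below are assumptions, not the paper's exact metric.

```python
# Hedged sketch: use the top principal direction of a centered feature matrix
# as a stand-in for the paper's P-vector and compare two models via the cosine
# of the angle between their directions. Random features are placeholders.
import numpy as np

def top_principal_direction(features):
    """First right singular vector of the centered (samples x dims) matrix."""
    centered = features - features.mean(0, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[0]                                  # unit-norm direction

feats_a = np.random.randn(1000, 512)              # hypothetical deep features, model A
feats_b = np.random.randn(1000, 512)              # hypothetical deep features, model B
cosine = abs(top_principal_direction(feats_a) @ top_principal_direction(feats_b))
print(cosine)                                     # close to 1.0 => shared subspace
```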
arXiv Detail & Related papers (2021-10-06T15:48:32Z) - EigenGAN: Layer-Wise Eigen-Learning for GANs [84.33920839885619]
EigenGAN is able to mine interpretable and controllable dimensions from different generator layers in an unsupervised manner.
By traversing the coefficient of a specific eigen-dimension, the generator can produce samples with continuous changes corresponding to a specific semantic attribute.
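Traversal itself is simple to sketch: add a scaled basis direction to the latent code and generate along the sweep. The generator, basis, and injection point below are hypothetical placeholders rather than EigenGAN's actual layer-wise architecture.

```python
# Hedged sketch of traversing one learned "eigen-dimension": sweep the
# coefficient of a single basis direction added to the latent code and collect
# the generated samples. Generator, basis, and latent shape are placeholders.
import torch

def traverse_dimension(generator, z, basis, dim, alphas):
    """Generate samples while sweeping the coefficient of basis[:, dim]."""
    direction = basis[:, dim] / basis[:, dim].norm()   # unit direction
    with torch.no_grad():
        return [generator(z + alpha * direction) for alpha in alphas]

# Usage with a hypothetical generator taking a 128-dim latent:
# samples = traverse_dimension(gen, torch.randn(1, 128), basis, dim=3,
#                              alphas=torch.linspace(-3.0, 3.0, 7))
```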
arXiv Detail & Related papers (2021-04-26T11:14:37Z) - Joint and Progressive Subspace Analysis (JPSA) with Spatial-Spectral Manifold Alignment for Semi-Supervised Hyperspectral Dimensionality Reduction [48.73525876467408]
We propose a novel technique for hyperspectral subspace analysis, called joint and progressive subspace analysis (JPSA).
Experiments are conducted to demonstrate the superiority and effectiveness of the proposed JPSA on two widely-used hyperspectral datasets.
arXiv Detail & Related papers (2020-09-21T16:29:59Z) - Deep Metric Structured Learning For Facial Expression Recognition [58.7528672474537]
We propose a deep metric learning model to create embedded subspaces with a well-defined structure.
A new loss function that imposes Gaussian structures on the output space is introduced to create these subspaces.
We experimentally demonstrate that the learned embedding can be successfully used for various applications including expression retrieval and emotion recognition.
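A minimal way to "impose Gaussian structure" is to score each embedding under an isotropic Gaussian centered at its class mean. The loss below is a simplified stand-in for the paper's formulation, with the isotropic variance and all shapes assumed.

```python
# Hedged sketch of a loss encouraging class-conditional Gaussian structure:
# each embedding is penalized by its negative log-likelihood (up to constants)
# under an isotropic Gaussian at its class center. A simplified stand-in only.
import torch

def gaussian_structure_loss(embeddings, labels, centers, sigma=1.0):
    """Mean squared distance to the class center, scaled like a Gaussian NLL."""
    diffs = embeddings - centers[labels]          # (batch, d)
    return (diffs.pow(2).sum(dim=1) / (2 * sigma ** 2)).mean()

emb = torch.randn(32, 64)                         # hypothetical embeddings
labels = torch.randint(0, 7, (32,))               # 7 hypothetical expression classes
centers = torch.randn(7, 64)                      # learnable or fixed class centers
print(gaussian_structure_loss(emb, labels, centers))
```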
arXiv Detail & Related papers (2020-01-18T06:23:18Z)
This list is automatically generated from the titles and abstracts of the papers on this site.