Large Language Models Encode Semantics in Low-Dimensional Linear Subspaces
- URL: http://arxiv.org/abs/2507.09709v1
- Date: Sun, 13 Jul 2025 17:03:25 GMT
- Title: Large Language Models Encode Semantics in Low-Dimensional Linear Subspaces
- Authors: Baturay Saglam, Paul Kassianik, Blaine Nelson, Sajana Weerawardhena, Yaron Singer, Amin Karbasi
- Abstract summary: Understanding the latent space geometry of large language models (LLMs) is key to interpreting their behavior and improving alignment. We investigate to what extent LLMs internally organize representations related to semantic understanding.
- Score: 31.401762286885656
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Understanding the latent space geometry of large language models (LLMs) is key to interpreting their behavior and improving alignment. However, it remains unclear to what extent LLMs internally organize representations related to semantic understanding. To investigate this, we conduct a large-scale empirical study of hidden states in transformer-based LLMs, analyzing 11 decoder-only models across 6 scientific topics and 12 layers each. We find that high-level semantic information consistently lies in low-dimensional subspaces that form linearly separable representations across distinct domains. This separability becomes more pronounced in deeper layers and under prompts that trigger structured reasoning or alignment behaviors, even when surface content is unchanged. This geometry enables simple yet effective causal interventions in hidden space; for example, reasoning patterns like chain-of-thought can be captured by a single vector direction. Together, these findings support the development of geometry-aware tools that operate directly on latent representations to detect and mitigate harmful or adversarial content, using methods such as transport-based defenses that leverage this separability. As a proof of concept, we demonstrate this potential by training a simple MLP classifier as a lightweight latent-space guardrail, which detects adversarial and malicious prompts with high precision.
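The abstract's main claims (low-dimensional, linearly separable topic subspaces; a single-direction view of reasoning behavior; an MLP guardrail on hidden states) can be illustrated with a short probing script. The sketch below is not the authors' code: the model ("gpt2"), the toy prompts, the layer choice, and the 10-dimensional PCA are all illustrative assumptions.

```python
# Minimal sketch (not the paper's implementation): probe whether hidden states
# from two topics are linearly separable in a low-dimensional subspace, and
# reuse the same features for a small MLP "guardrail"-style classifier.
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

MODEL = "gpt2"  # stand-in; the paper studies 11 decoder-only LLMs
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModel.from_pretrained(MODEL, output_hidden_states=True).eval()

def last_token_state(prompt: str, layer: int = -1) -> np.ndarray:
    """Hidden state of the final token at a chosen layer."""
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    return out.hidden_states[layer][0, -1].numpy()

# Toy prompts standing in for two scientific topics (assumption, not the paper's data).
physics = [f"Explain concept {i} in quantum mechanics." for i in range(40)]
biology = [f"Describe process {i} in cell biology." for i in range(40)]
X = np.stack([last_token_state(p) for p in physics + biology])
y = np.array([0] * len(physics) + [1] * len(biology))

# The "single vector direction" view: a difference-of-means direction between
# the two groups, usable for simple hidden-state interventions.
direction = X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0)
direction /= np.linalg.norm(direction)

# Project to a low-dimensional subspace, then test linear separability.
X_low = PCA(n_components=10).fit_transform(X)
Xtr, Xte, ytr, yte = train_test_split(X_low, y, test_size=0.25, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(Xtr, ytr)
print("linear probe accuracy:", probe.score(Xte, yte))

# Same spirit as the paper's guardrail: a small MLP on latent features.
guard = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0).fit(Xtr, ytr)
print("MLP classifier accuracy:", guard.score(Xte, yte))
```

Since the paper reports that separability grows in deeper layers, sweeping the `layer` argument over the model's layers would be the natural next step in such an experiment.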
Related papers
- Decomposing Representation Space into Interpretable Subspaces with Unsupervised Learning [6.652200654829215]
We learn non-basis-aligned subspaces in an unsupervised manner. Results show that the information encoded in the obtained subspaces tends to share the same abstract concept across different inputs. We also provide evidence of scalability to 2B models by finding separate subspaces mediating context and parametric knowledge routing.
arXiv Detail & Related papers (2025-08-03T20:59:29Z) - Vector Ontologies as an LLM world view extraction method [0.0]
Large Language Models (LLMs) possess intricate internal representations of the world, yet these structures are notoriously difficult to interpret or repurpose beyond the original prediction task. A vector ontology defines a domain-specific vector space spanned by ontologically meaningful dimensions, allowing geometric analysis of concepts and relationships within a domain. Using GPT-4o-mini, we extract genre representations through multiple natural language prompts and analyze the consistency of these projections across linguistic variations and their alignment with ground-truth data.
arXiv Detail & Related papers (2025-06-16T08:49:21Z) - Factual Self-Awareness in Language Models: Representation, Robustness, and Scaling [56.26834106704781]
Factual incorrectness in generated content is one of the primary concerns in the ubiquitous deployment of large language models (LLMs). We provide evidence supporting the presence of an internal compass in LLMs that dictates the correctness of factual recall at the time of generation. Scaling experiments across model sizes and training dynamics highlight that self-awareness emerges rapidly during training and peaks in intermediate layers.
arXiv Detail & Related papers (2025-05-27T16:24:02Z) - The Origins of Representation Manifolds in Large Language Models [52.68554895844062]
We show that cosine similarity in representation space may encode the intrinsic geometry of a feature through shortest, on-manifold paths. The critical assumptions and predictions of the theory are validated on text embeddings and token activations of large language models.
arXiv Detail & Related papers (2025-05-23T13:31:22Z) - Bridging the Dimensional Chasm: Uncover Layer-wise Dimensional Reduction in Transformers through Token Correlation [2.5976894391099625]
We develop a framework that tracks token dynamics across Transformer layers. This work advances interpretability by reframing Transformer layers as projectors between high-dimensional and low-dimensional semantics.
arXiv Detail & Related papers (2025-03-28T15:47:30Z) - Unraveling the Localized Latents: Learning Stratified Manifold Structures in LLM Embedding Space with Sparse Mixture-of-Experts [3.9426000822656224]
We conjecture that in large language models, the embeddings lie on local manifold structures whose dimensions depend on the perplexities and domains of the input data. By incorporating an attention-based soft-gating network, we verify that our model learns specialized sub-manifolds for an ensemble of input data sources.
arXiv Detail & Related papers (2025-02-19T09:33:16Z) - Decoding Diffusion: A Scalable Framework for Unsupervised Analysis of Latent Space Biases and Representations Using Natural Language Prompts [68.48103545146127]
This paper proposes a novel framework for unsupervised exploration of diffusion latent spaces.
We directly leverage natural language prompts and image captions to map latent directions.
Our method provides a more scalable and interpretable understanding of the semantic knowledge encoded within diffusion models.
arXiv Detail & Related papers (2024-10-25T21:44:51Z) - SpaRC and SpaRP: Spatial Reasoning Characterization and Path Generation for Understanding Spatial Reasoning Capability of Large Language Models [70.01883340129204]
Spatial reasoning is a crucial component of both biological and artificial intelligence.
We present a comprehensive study of the capability of current state-of-the-art large language models (LLMs) on spatial reasoning.
arXiv Detail & Related papers (2024-06-07T01:06:34Z) - Characterizing Truthfulness in Large Language Model Generations with Local Intrinsic Dimension [63.330262740414646]
We study how to characterize and predict the truthfulness of texts generated from large language models (LLMs).
We suggest investigating internal activations and quantifying an LLM's truthfulness using the local intrinsic dimension (LID) of model activations (a minimal LID estimator is sketched after this list).
arXiv Detail & Related papers (2024-02-28T04:56:21Z) - The Low-Dimensional Linear Geometry of Contextualized Word Representations [27.50785941238007]
We study the linear geometry of contextualized word representations in ELMo and BERT.
We show that a variety of linguistic features are encoded in low-dimensional subspaces.
arXiv Detail & Related papers (2021-05-15T00:58:08Z) - Introducing Orthogonal Constraint in Structural Probes [0.2538209532048867]
We decompose a linear projection of the language vector space into an isomorphic space rotation and linear scaling directions.
We experimentally show that our approach can be performed in a multitask setting.
arXiv Detail & Related papers (2020-12-30T17:14:25Z) - Closed-Form Factorization of Latent Semantics in GANs [65.42778970898534]
A rich set of interpretable dimensions has been shown to emerge in the latent space of Generative Adversarial Networks (GANs) trained for synthesizing images.
In this work, we examine the internal representation learned by GANs to reveal the underlying variation factors in an unsupervised manner.
We propose a closed-form factorization algorithm for latent semantic discovery by directly decomposing the pre-trained weights (the basic linear-algebra step is sketched at the end of this list).
arXiv Detail & Related papers (2020-07-13T18:05:36Z)
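Since the truthfulness entry above quantifies truthfulness via the local intrinsic dimension (LID) of activations, here is a minimal sketch of the standard Levina-Bickel maximum-likelihood LID estimator. The random "activations", the neighborhood size `k`, and the estimator variant are assumptions, not that paper's implementation.

```python
# Minimal sketch of a maximum-likelihood local intrinsic dimension estimator.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def local_intrinsic_dimension(points: np.ndarray, k: int = 20) -> np.ndarray:
    """Levina-Bickel MLE estimate of LID at each point from its k nearest neighbors."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(points)
    dists, _ = nn.kneighbors(points)   # column 0 is each point's distance to itself (0)
    dists = dists[:, 1:]               # shape (n, k), ascending neighbor distances
    # m(x) = [ (1/(k-1)) * sum_{j<k} log(T_k / T_j) ]^{-1}  (one common variant)
    log_ratio = np.log(dists[:, -1:] / dists[:, :-1])
    return (k - 1) / log_ratio.sum(axis=1)

acts = np.random.randn(500, 768)       # placeholder standing in for LLM activations
lid = local_intrinsic_dimension(acts, k=20)
print("mean LID:", lid.mean())
```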
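The closed-form factorization entry above finds semantic directions by decomposing pre-trained generator weights. The sketch below shows only the basic linear-algebra step as I understand it (top eigenvectors of A^T A, where A is the first weight matrix acting on the latent code); a random matrix stands in for real GAN weights.

```python
# Minimal sketch of the closed-form factorization idea: treat the first affine
# layer mapping a latent code z to features as y = A z + b, and take the top
# eigenvectors of A^T A as candidate semantic directions in latent space.
import numpy as np

A = np.random.randn(1024, 512)              # placeholder for a generator's first-layer weight
eigvals, eigvecs = np.linalg.eigh(A.T @ A)  # symmetric matrix, so eigh is appropriate
order = np.argsort(eigvals)[::-1]           # sort eigenvalues in descending order
directions = eigvecs[:, order[:5]].T        # top-5 unit directions, one per row
print(directions.shape)                     # (5, 512): each row is a direction for z
```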